sklift.datasets.fetch_criteo

sklift.datasets.datasets.fetch_criteo(target_col='visit', treatment_col='treatment', data_home=None, dest_subdir=None, download_if_missing=True, percent10=True, return_X_y_t=False, as_frame=True)

Load and return the Criteo Uplift Prediction Dataset (classification).

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising.

Major columns:

  • treatment (binary): treatment

  • exposure (binary): treatment

  • visit (binary): target

  • conversion (binary): target

  • f0, ... , f11 (float): feature values

Read more in the docs.

Parameters
  • target_col (string, 'visit' or 'conversion', default='visit') – Selects which dataset column is used as the target.

  • treatment_col (string, 'treatment' or 'exposure', default='treatment') – Selects which dataset column is used as the treatment.

  • data_home (string, default=None) – Specify a download and cache folder for the datasets.

  • dest_subdir (string, default=None) – The name of the folder in which the dataset is stored.

  • download_if_missing (bool, default=True) – If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

  • percent10 (bool, default=True) – Whether to load only 10 percent of the data.

  • return_X_y_t (bool, default=False) – If True, returns (data, target, treatment) instead of a Bunch object.

  • as_frame (bool, default=True) – If True, returns a pandas DataFrame or Series for the data, target and treatment objects in the returned Bunch; the Bunch will also have a frame member.
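The parameters above can be combined as in the following minimal sketch. The wrapper name load_criteo and its up-front argument checks are hypothetical additions for illustration and are not part of scikit-uplift; only the fetch_criteo arguments listed above are used.

```python
# Allowed values taken from the parameter descriptions above.
ALLOWED_TARGETS = ("visit", "conversion")
ALLOWED_TREATMENTS = ("treatment", "exposure")


def load_criteo(target_col="visit", treatment_col="treatment", percent10=True):
    """Hypothetical helper: check the column choices, then call fetch_criteo."""
    if target_col not in ALLOWED_TARGETS:
        raise ValueError(f"target_col must be one of {ALLOWED_TARGETS}")
    if treatment_col not in ALLOWED_TREATMENTS:
        raise ValueError(f"treatment_col must be one of {ALLOWED_TREATMENTS}")

    # Imported lazily so the checks above run even without scikit-uplift installed.
    from sklift.datasets import fetch_criteo

    # return_X_y_t=True asks for the (data, target, treatment) tuple
    # instead of the Bunch object.
    return fetch_criteo(
        target_col=target_col,
        treatment_col=treatment_col,
        percent10=percent10,
        return_X_y_t=True,
    )
```

A call such as X, y, t = load_criteo(target_col='conversion') would then fetch the 10 percent sample (downloading it if it is not cached) and unpack the three returned objects.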

Returns

The dataset, in one of two forms.

Bunch:

By default, a dictionary-like object with the following attributes:

  • data (ndarray or DataFrame object): Dataset without target and treatment.

  • target (Series object): Values of the target column.

  • treatment (Series object): Values of the treatment column.

  • DESCR (str): Description of the Criteo dataset.

  • feature_names (list): Names of the features.

  • target_name (str): Name of the target.

  • treatment_name (str): Name of the treatment.

Tuple:

tuple (data, target, treatment) if return_X_y_t is True

Return type

Bunch or tuple

References

“A Large Scale Benchmark for Uplift Modeling.” Eustache Diemert, Artem Betlei, Christophe Renaudin (Criteo AI Lab), Massih-Reza Amini (LIG, Grenoble INP).

Criteo Uplift Modeling Dataset

This is a copy of the Criteo AI Lab Uplift Prediction dataset.

Data description

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising.

Fields

Here is a detailed description of the fields (they are comma-separated in the file):

  • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)

  • treatment: treatment group. Flags whether a company participates in the RTB auction for a particular user (binary: 1 = treated, 0 = control)

  • exposure: treatment effect, whether the user has been effectively exposed. Flags whether a company wins the RTB auction for the user (binary)

  • conversion: whether a conversion occurred for this user (binary, label)

  • visit: whether a visit occurred for this user (binary, label)
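To illustrate how these fields fit together, here is a minimal sketch that parses a few comma-separated rows and compares the visit rate of the treated and control groups. The rows are made up for illustration and are not taken from the dataset, the sample is trimmed to two feature columns, and the helper name response_rate_difference is hypothetical.

```python
import csv
import io

# Synthetic rows for illustration only (NOT real Criteo data).
# Only f0 and f1 are shown; the real file carries f0 through f11.
SAMPLE_CSV = """
f0,f1,treatment,exposure,conversion,visit
12.6,10.1,1,1,1,1
13.2,10.1,1,0,0,1
12.9,10.4,1,0,0,0
14.0,10.1,1,1,0,0
12.6,10.7,0,0,0,1
13.5,10.1,0,0,0,0
12.8,10.2,0,0,0,0
13.1,10.1,0,0,0,0
""".strip()


def response_rate_difference(rows, target_col="visit", treatment_col="treatment"):
    """Mean target among treated rows minus mean target among control rows."""
    treated = [int(r[target_col]) for r in rows if r[treatment_col] == "1"]
    control = [int(r[target_col]) for r in rows if r[treatment_col] == "0"]
    return sum(treated) / len(treated) - sum(control) / len(control)


# DictReader keys each value by its column name from the header row.
rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
uplift = response_rate_difference(rows)
print(f"visit rate difference (treated - control): {uplift:.2f}")
```

Passing target_col='conversion' or treatment_col='exposure' to the helper switches the columns in the same way the target_col and treatment_col options of fetch_criteo do.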

Key figures

  • Format: CSV

  • Size: 297 MB (compressed), 3.2 GB (uncompressed)

  • Rows: 13,979,592

  • Average Visit Rate: 0.046992

  • Average Conversion Rate: 0.00292

  • Treatment Ratio: 0.85
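As a rough guide to scale, the key figures can be turned into approximate counts. The sketch below only multiplies the published numbers; because the rates (the treatment ratio in particular) are rounded, the results are estimates rather than exact dataset counts.

```python
# Key figures as listed above.
ROWS = 13_979_592
AVG_VISIT_RATE = 0.046992
AVG_CONVERSION_RATE = 0.00292
TREATMENT_RATIO = 0.85

# Approximate group sizes implied by the treatment ratio.
approx_treated = round(ROWS * TREATMENT_RATIO)
approx_control = ROWS - approx_treated

# Approximate number of positive labels implied by the average rates.
approx_visits = round(ROWS * AVG_VISIT_RATE)
approx_conversions = round(ROWS * AVG_CONVERSION_RATE)

print(f"treated ~ {approx_treated:,}, control ~ {approx_control:,}")
print(f"visits ~ {approx_visits:,}, conversions ~ {approx_conversions:,}")
```

The estimates show how imbalanced both labels are: fewer than one row in twenty records a visit, and conversions are rarer by more than an order of magnitude.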

This dataset is released along with the paper “A Large Scale Benchmark for Uplift Modeling” by Eustache Diemert, Artem Betlei, Christophe Renaudin (Criteo AI Lab) and Massih-Reza Amini (LIG, Grenoble INP). This work was published in the AdKDD 2018 Workshop, held in conjunction with KDD 2018.