sklift.datasets.fetch_hillstrom

sklift.datasets.datasets.fetch_hillstrom(target_col='visit', data_home=None, dest_subdir=None, download_if_missing=True, return_X_y_t=False)

Load and return the Kevin Hillstrom MineThatData dataset (classification or regression).

This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test.

Major columns:

  • visit (binary): target. 1/0 indicator, 1 = Customer visited website in the following two weeks.

  • conversion (binary): target. 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks.

  • spend (float): target. Actual dollars spent in the following two weeks.

  • segment (str): treatment. The e-mail campaign the customer received.

Read more in the docs.

Parameters
  • target_col (str, one of 'visit', 'conversion', 'spend' or 'all'; default='visit') – Selects which column of the dataset is used as the target. With 'all', all three target columns are returned.

  • data_home (str) – The path to the folder where datasets are stored.

  • dest_subdir (str) – The name of the folder in which the dataset is stored.

  • download_if_missing (bool) – Download the data if not present. Raises an IOError if False and data is missing.

  • return_X_y_t (bool, default=False) – If True, returns (data, target, treatment) instead of a Bunch object.

Returns

dataset.

Bunch:

By default, a dictionary-like object with the following attributes:

  • data (DataFrame object): Dataset without target and treatment.

  • target (Series or DataFrame object): Values of the target column or columns.

  • treatment (Series object): Values of the treatment column.

  • DESCR (str): Description of the Hillstrom dataset.

  • feature_names (list): Names of the features.

  • target_name (str or list): Name of the target.

  • treatment_name (str): Name of the treatment.

Tuple:

tuple (data, target, treatment) if return_X_y_t is True.

Return type

Bunch or tuple

References

https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html

Example:

from sklift.datasets import fetch_hillstrom


dataset = fetch_hillstrom(target_col='visit')
data, target, treatment = dataset.data, dataset.target, dataset.treatment

# alternative option
data, target, treatment = fetch_hillstrom(target_col='visit', return_X_y_t=True)

See also

fetch_lenta(): Load and return the Lenta dataset (classification).

fetch_x5(): Load and return the X5 RetailHero dataset (classification).

fetch_criteo(): Load and return the Criteo Uplift Prediction Dataset (classification).

fetch_megafon(): Load and return the MegaFon Uplift Competition dataset (classification).

Kevin Hillstrom Dataset: MineThatData

Data description

This is a copy of the MineThatData E-Mail Analytics And Data Mining Challenge dataset.

This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test.

  • 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise.

  • 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise.

  • 1/3 were randomly chosen to not receive an e-mail campaign.

During a period of two weeks following the e-mail campaign, results were tracked. Your job is to tell the world if the Mens or Womens e-mail campaign was successful.
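One way to approach this question is to compare the response rate of each e-mailed segment with that of the No E-Mail control group. The following is a minimal sketch in plain Python, not the library's own method: the six records are made up for illustration (they are not rows from the dataset), and the helper name `visit_uplift` is hypothetical.

```python
from collections import defaultdict

# Made-up records for illustration only; the real dataset has 64,000 rows.
records = [
    {"segment": "Mens E-Mail", "visit": 1},
    {"segment": "Mens E-Mail", "visit": 0},
    {"segment": "Womens E-Mail", "visit": 1},
    {"segment": "Womens E-Mail", "visit": 1},
    {"segment": "No E-Mail", "visit": 0},
    {"segment": "No E-Mail", "visit": 1},
]


def visit_uplift(rows, control="No E-Mail"):
    """Return {segment: visit rate minus control visit rate} for non-control segments."""
    visits = defaultdict(int)
    counts = defaultdict(int)
    for row in rows:
        visits[row["segment"]] += row["visit"]
        counts[row["segment"]] += 1
    # Visit rate per segment, including the control group.
    rates = {seg: visits[seg] / counts[seg] for seg in counts}
    return {seg: rate - rates[control] for seg, rate in rates.items() if seg != control}


uplift = visit_uplift(records)
print(uplift)
```

A positive value suggests that the campaign increased visits relative to sending no e-mail. On real data the difference should be judged together with its statistical uncertainty.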

Fields

Historical customer attributes at your disposal include:

  • Recency: Months since last purchase.

  • History_Segment: Categorization of dollars spent in the past year.

  • History: Actual dollar value spent in the past year.

  • Mens: 1/0 indicator, 1 = customer purchased Mens merchandise in the past year.

  • Womens: 1/0 indicator, 1 = customer purchased Womens merchandise in the past year.

  • Zip_Code: Classifies zip code as Urban, Suburban, or Rural.

  • Newbie: 1/0 indicator, 1 = New customer in the past twelve months.

  • Channel: Describes the channels the customer purchased from in the past year.

Another variable describes the e-mail campaign the customer received:

  • Segment

    • Mens E-Mail

    • Womens E-Mail

    • No E-Mail
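Many uplift models expect a binary treatment flag, whereas Segment takes three values. One common workaround, shown here as a sketch rather than as something the loader does for you, is to keep a single campaign plus the control group and map the labels to 1/0. The function name `to_binary_treatment` and the sample labels are illustrative assumptions.

```python
def to_binary_treatment(segments, treated="Womens E-Mail", control="No E-Mail"):
    """Return (kept_indices, flags) for rows in the treated or control segment.

    flags[i] is 1 for the treated segment and 0 for the control segment;
    rows from any other segment are dropped.
    """
    kept_indices, flags = [], []
    for i, seg in enumerate(segments):
        if seg == treated:
            kept_indices.append(i)
            flags.append(1)
        elif seg == control:
            kept_indices.append(i)
            flags.append(0)
    return kept_indices, flags


# Made-up labels for illustration only.
segments = ["Mens E-Mail", "Womens E-Mail", "No E-Mail", "Womens E-Mail"]
kept, flags = to_binary_treatment(segments)
print(kept, flags)
```

The returned indices can then be used to select the matching rows of the features and the target, so that all three stay aligned.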

Finally, we have a series of variables describing activity in the two weeks following delivery of the e-mail campaign:

  • Visit: 1/0 indicator, 1 = Customer visited website in the following two weeks.

  • Conversion: 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks.

  • Spend: Actual dollars spent in the following two weeks.

Key figures

  • Format: CSV

  • Size: 433 KB (compressed), 4,935 KB (uncompressed)

  • Rows: 64,000

  • Response Ratio:

    • Average visit rate: 0.15

    • Average conversion rate: 0.009

    • Values in the spend column are unevenly distributed from 0.0 to 499.0

  • Treatment Ratio: Customers are split evenly across the three segments.
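The treatment ratio can be checked by counting segment labels. The snippet below is a small sketch under stated assumptions: the labels are made up, and `segment_shares` is a hypothetical helper, not part of sklift.

```python
from collections import Counter


def segment_shares(segments):
    """Return {segment: fraction of rows} for an iterable of segment labels."""
    counts = Counter(segments)
    total = sum(counts.values())
    return {seg: n / total for seg, n in counts.items()}


# Made-up labels for illustration; with the real data, pass the treatment column.
labels = ["Mens E-Mail", "Womens E-Mail", "No E-Mail"] * 2
shares = segment_shares(labels)
print(shares)
```

Shares close to one third each are consistent with the even split described above.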