scikit-uplift

scikit-uplift (sklift) is an uplift modeling Python package that provides fast sklearn-style model implementations, evaluation metrics, and visualization tools.

The main idea is to provide an easy-to-use and fast Python package for uplift modeling. It exposes its models through the familiar scikit-learn API, so any popular estimator can be used (for instance, one from the CatBoost library).

Uplift modeling estimates a causal effect of treatment and uses it to effectively target customers that are most likely to respond to a marketing campaign.

Use cases for uplift modeling:

  • Target customers in a marketing campaign. This is especially useful when promoting a popular product, where a large share of customers take the target action by themselves without any influence. By modeling uplift you can find customers who are likely to take the target action (for instance, install an app) only when treated (for instance, after receiving a push notification).

  • Combine a churn model and an uplift model to offer some bonus to a group of customers who are likely to churn.

  • Select a tiny group of customers in the campaign where a price per customer is high.

Read more about the uplift modeling problem in the User Guide.

Articles in Russian on habr.com: Part 1, Part 2 and Part 3.

Why sklift

  • Comfortable and intuitive scikit-learn-like API;

  • More uplift metrics than you have ever seen in one place, including gems such as Area Under Uplift Curve (AUUC) and Area Under Qini Curve (Qini coefficient), with ideal cases;

  • Support for any estimator compatible with scikit-learn (e.g. XGBoost, LightGBM, CatBoost, etc.);

  • All approaches can be used in the sklearn.pipeline. See the example of usage on the Tutorials page;

  • Metrics are also compatible with the classes from sklearn.model_selection. See the example of usage on the Tutorials page;

  • Almost all implemented approaches solve classification and regression problems;

  • Nice and useful visualizations for analysing model performance.

The package currently supports the following methods:

  1. Solo Model (aka S-learner or Treatment Dummy, Treatment interaction) approach

  2. Class Transformation (aka Class Variable Transformation or Revert Label) approach

  3. Two Models (aka T-learner, or naïve approach, or difference score method, or double classifier approach) approach, including Dependent Data Representation

And the following metrics:

  1. Uplift@k

  2. Area Under Uplift Curve

  3. Area Under Qini Curve

  4. Weighted average uplift

Project info

Community

Sklift is being actively maintained and welcomes new contributors of all experience levels.

Thanks to all our contributors!

Contributors

If you have any questions, please contact us at team@uplift-modeling.com

Installation

Install the package by the following command from PyPI:

pip install scikit-uplift

Or install from source:

git clone https://github.com/maks-sh/scikit-uplift.git
cd scikit-uplift
python setup.py install

Quick Start

See the RetailHero tutorial notebook (available in English and in Russian, both openable in Colab) for details.

Train and predict your uplift model

Use the intuitive Python API to train uplift models with sklift.models.

# import approaches
from sklift.models import SoloModel, ClassTransformation
# import any estimator that adheres to scikit-learn conventions
from lightgbm import LGBMClassifier

# define the base estimator
estimator = LGBMClassifier(n_estimators=10)

# define the metamodel
slearner = SoloModel(estimator=estimator)

# fit the model
slearner.fit(
    X=X_tr,
    y=y_tr,
    treatment=trmnt_tr,
)

# predict uplift
uplift_slearner = slearner.predict(X_val)

Evaluate your uplift model

Uplift model evaluation metrics are available in sklift.metrics.

# import metrics to evaluate your model
from sklift.metrics import (
    uplift_at_k, uplift_auc_score, qini_auc_score, weighted_average_uplift
)


# Uplift@30% (stored under its own name so the imported function is not shadowed)
uplift_at_30 = uplift_at_k(y_true=y_val, uplift=uplift_slearner,
                           treatment=trmnt_val,
                           strategy='overall', k=0.3)

# Area Under Qini Curve
qini_coef = qini_auc_score(y_true=y_val, uplift=uplift_slearner,
                           treatment=trmnt_val)

# Area Under Uplift Curve
uplift_auc = uplift_auc_score(y_true=y_val, uplift=uplift_slearner,
                              treatment=trmnt_val)

# Weighted average uplift
wau = weighted_average_uplift(y_true=y_val, uplift=uplift_slearner,
                              treatment=trmnt_val)
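
To make the idea behind Uplift@k more concrete, here is a minimal NumPy sketch of one common definition: rank all objects by predicted uplift, keep the top k share, and compare the mean response of treated and control objects inside that slice. The function name `uplift_at_k_overall` and the toy arrays are illustrative assumptions; this is not the sklift implementation, and it is only meant to mirror the idea behind `strategy='overall'`.

```python
import numpy as np


def uplift_at_k_overall(y_true, uplift, treatment, k=0.3):
    """Treatment-minus-control response rate among the top-k share by predicted uplift.

    Illustrative helper (an assumption), not part of sklift.
    """
    y_true = np.asarray(y_true)
    uplift = np.asarray(uplift)
    treatment = np.asarray(treatment)

    # Indices sorted by predicted uplift, highest first
    order = np.argsort(-uplift, kind="stable")
    n_top = int(len(uplift) * k)
    top = order[:n_top]

    y_top, w_top = y_true[top], treatment[top]
    # Mean response of treated objects minus mean response of control objects
    return y_top[w_top == 1].mean() - y_top[w_top == 0].mean()


# Toy data: ten customers, already listed from highest to lowest score
y_toy = np.array([1, 0, 1, 0, 1, 0, 0, 1, 0, 0])
w_toy = np.array([1, 0, 1, 0, 0, 1, 1, 0, 1, 0])
scores_toy = np.array([0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.0])

toy_uplift_at_40 = uplift_at_k_overall(y_toy, scores_toy, w_toy, k=0.4)
print(toy_uplift_at_40)  # 1.0: both treated customers in the top 4 responded, both controls did not
```

The sketch assumes the top-k slice contains at least one treated and one control object; otherwise one of the means is undefined.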

Visualize the results

Visualize performance metrics with sklift.viz.

from sklift.viz import plot_qini_curve
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
ax.set_title('Qini curves')

# uplift_revert holds the predictions of a second (Revert Label) model
plot_qini_curve(
    y_test, uplift_slearner, trmnt_test,
    perfect=True, name='Slearner', ax=ax
)

plot_qini_curve(
    y_test, uplift_revert, trmnt_test,
    perfect=False, name='Revert label', ax=ax
)

Example of Qini curves for several models, with the perfect and random Qini curves

from sklift.viz import plot_uplift_curve
import matplotlib.pyplot as plt

fig, ax = plt.subplots(1, 1)
ax.set_title('Uplift curves')

plot_uplift_curve(
    y_test, uplift_slearner, trmnt_test,
    perfect=True, name='Slearner', ax=ax
)

plot_uplift_curve(
    y_test, uplift_revert, trmnt_test,
    perfect=False, name='Revert label', ax=ax
)

Example of uplift curves for several models, with the perfect and random uplift curves

from sklift.viz import plot_uplift_by_percentile

plot_uplift_by_percentile(y_true=y_val, uplift=uplift_preds,
                          treatment=treat_val, kind='bar')

Uplift by percentile visualization

User Guide

Cover of User Guide for uplift modeling and causal inference

Uplift modeling estimates the effect of a communication action on some customer outcome and makes it possible to efficiently target the customers who are most likely to respond to a marketing campaign. It is relatively easy to implement, but surprisingly poorly covered in machine learning courses and literature. This guide sheds some light on the essentials of causal inference and uplift modeling.

Introduction

Uplift vs other models

Companies use various channels to promote a product to a customer: it can be SMS, push notification, chatbot message in social networks, and many others. There are several ways to use machine learning to select customers for a marketing campaign:

Comparison with other models
  • The Look-alike model (or Positive Unlabeled Learning) evaluates a probability that the customer is going to accomplish a target action. A training dataset contains known positive objects (for instance, users who have installed an app) and random negative objects (a random subset of all other customers who have not installed the app). The model searches for customers who are similar to those who made the target action.

  • The Response model evaluates the probability that the customer is going to accomplish the target action if there was a communication (a.k.a. treatment). In this case, the training dataset is data collected after some interaction with the customers. In contrast to the first approach, we have confirmed positive and negative observations at our disposal (for instance, a customer who decides to take out a credit card or to decline the offer).

  • The Uplift model evaluates the net effect of communication by trying to select only those customers who will perform the target action only when they are exposed to the communication. The model predicts the difference between the customer’s behavior when there is a treatment (communication) and when there is no treatment (no communication).

When should we use uplift modeling?

Uplift modeling is used when the customer’s target action is likely to happen without any communication. For instance, we want to promote a popular product but we don’t want to spend our marketing budget on customers who will buy the product anyway, with or without communication. If the product is not popular and has to be promoted to be bought, the task becomes a response modeling task.

References

1️⃣ Radcliffe, N.J. (2007). Using control groups to target on predicted lift: Building and assessing uplift model. Direct Market J Direct Market Assoc Anal Council, 1:14–21, 2007.

Causal Inference: Basics

In a perfect world, we want to calculate the difference between a person’s reaction after receiving a communication and the same person’s reaction without receiving any communication. But there is a problem: we cannot both make a communication (send an e-mail) and not make it (no e-mail) at the same time.

Joke about Schrodinger's cat

Denote by \(Y_i^1\) person \(i\)’s outcome when they receive the treatment (a presence of the communication) and by \(Y_i^0\) person \(i\)’s outcome when they receive no treatment (control, no communication). The causal effect \(\tau_i\) of the treatment vis-a-vis no treatment is given by:

\[\tau_i = Y_i^1 - Y_i^0\]

Researchers are typically interested in estimating the Conditional Average Treatment Effect (CATE), that is, the expected causal effect of the treatment for a subgroup in the population:

\[CATE = E[Y_i^1 \vert X_i] - E[Y_i^0 \vert X_i]\]

where \(X_i\) is the feature vector describing the \(i\)-th person.

We can observe neither causal effect nor CATE for the \(i\)-th object, and, accordingly, we can’t optimize it. But we can estimate CATE or uplift of an object:

\[\textbf{uplift} = \widehat{CATE} = E[Y_i \vert X_i = x, W_i = 1] - E[Y_i \vert X_i = x, W_i = 0]\]

Where:

  • \(W_i \in \{0, 1\}\) - a binary variable: 1 if person \(i\) is assigned to the treatment group, and 0 if person \(i\) is assigned to the control group (no treatment);

  • \(Y_i\) - person \(i\)’s observed outcome, which is equal to:

\[\begin{split}Y_i = W_i * Y_i^1 + (1 - W_i) * Y_i^0 = \ \begin{cases} Y_i^1, & \mbox{if } W_i = 1 \\ Y_i^0, & \mbox{if } W_i = 0 \\ \end{cases}\end{split}\]

This won’t identify the CATE unless one is willing to assume that \(W_i\) is independent of \(Y_i^1\) and \(Y_i^0\) conditional on \(X_i\). This assumption is the so-called Unconfoundedness Assumption or the Conditional Independence Assumption (CIA) found in the social sciences and medical literature. This assumption holds true when treatment assignment is random conditional on \(X_i\). Briefly, this can be written as:

\[CIA : \{Y_i^0, Y_i^1\} \perp \!\!\! \perp W_i \vert X_i\]

Let us also introduce some additional useful notation: the propensity score, \(p(X_i) = P(W_i = 1| X_i)\), i.e. the probability of treatment given \(X_i\).
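
The role of random assignment can be illustrated with a small simulation. The sketch below is built on assumed, synthetic data (one binary feature, hand-picked response probabilities): it generates both potential outcomes for every person, reveals only one of them according to a randomly assigned \(W_i\), and then estimates uplift as the difference of observed group means within each feature value. None of the names come from sklift.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000

# One binary feature X_i (synthetic, for illustration only)
x = rng.integers(0, 2, size=n)

# Potential outcomes: the treatment raises the response rate only when x == 1
p_y0 = np.where(x == 1, 0.30, 0.10)   # P(Y^0 = 1 | x)
p_y1 = np.where(x == 1, 0.50, 0.10)   # P(Y^1 = 1 | x)
y0 = rng.random(n) < p_y0
y1 = rng.random(n) < p_y1

# Random assignment: W is independent of (Y^0, Y^1), so the CIA holds by design
w = rng.integers(0, 2, size=n)

# Only one potential outcome is observed per person: Y = W*Y^1 + (1-W)*Y^0
y = np.where(w == 1, y1, y0)


def uplift_hat(mask):
    """E[Y | X in mask, W = 1] - E[Y | X in mask, W = 0], estimated by sample means."""
    return y[mask & (w == 1)].mean() - y[mask & (w == 0)].mean()


uplift_x1 = uplift_hat(x == 1)   # true CATE is 0.20
uplift_x0 = uplift_hat(x == 0)   # true CATE is 0.00
print(round(uplift_x1, 3), round(uplift_x0, 3))
```

If the assignment depended on the potential outcomes (for example, if likely buyers were contacted more often), the same difference of means would no longer estimate the CATE.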

References

1️⃣ Gutierrez, P., & Gérardy, J. Y. (2017). Causal Inference and Uplift Modelling: A Review of the Literature. In International Conference on Predictive Applications and APIs (pp. 1-13).

Data collection

We need to evaluate a difference between two events that are mutually exclusive for a particular customer (either we communicate with a person, or we don’t; you can’t do both actions at the same time). This is why there are additional requirements for collecting data when building an uplift model.

There are a few additional steps compared with a standard data collection procedure. You should run an experiment:

  1. Randomly divide a representative part of the customer base into a treatment group (receiving communication) and a control group (receiving no communication);

  2. Evaluate the marketing experiment for the treatment group.

Data collected from the marketing experiment consists of the customer’s responses to the marketing offer (target).
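
The random split from step 1 can be sketched as follows. The customer base, the sample size, and the 50/50 split are assumptions chosen for illustration; in practice they depend on the campaign.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical customer base of 10,000 ids
customer_ids = np.arange(10_000)

# Draw a random (hence representative) part of the base for the experiment
experiment_ids = rng.choice(customer_ids, size=2_000, replace=False)

# Shuffle it and split into equally sized treatment and control groups
shuffled = rng.permutation(experiment_ids)
treatment_ids, control_ids = shuffled[:1_000], shuffled[1_000:]

print(len(treatment_ids), len(control_ids))  # 1000 1000
```

Only `treatment_ids` receive the communication; the responses of both groups are then recorded as the target.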

The only difference between the experiment and the future campaign driven by the uplift model is that in the first case we choose customers for the promotion at random, while in the second case the choice of customers to communicate with is based on the predicted value returned by the uplift model. If the marketing campaign significantly differs from the experiment used to collect data, the model will be less accurate.

There is a trick: before running the marketing campaign, it is recommended to randomly subset a small part of the customer base and divide it into a control and a treatment group again, similar to the previous experiment. Using this data, you will not only be able to accurately evaluate the effectiveness of the campaign but also collect additional data for a further model retraining.

Animation: Design of a train data collection experiment for uplift modeling

It is recommended to set up the development of the uplift model and the campaign launch as an iterative process: each iteration collects new training data, which should consist of a mix of a random customer subset and customers selected by the model.

References

1️⃣ Verbeke, Wouter & Baesens, Bart & Bravo, Cristián. (2018). Profit Driven Business Analytics: A Practitioner’s Guide to Transforming Big Data into Added Value.

Types of customers

We can determine 4 types of customers based on a response to treatment:

Classification of customers based on their response to a treatment
  • Do-Not-Disturbs (a.k.a. Sleeping-dogs) have a strong negative response to marketing communication. They are going to purchase if NOT treated and will NOT purchase IF treated. Targeting them not only wastes the marketing budget but also has a negative impact: for instance, targeted customers may reject current products or services. In terms of math: \(W_i = 1, Y_i = 0\) or \(W_i = 0, Y_i = 1\).

  • Lost Causes will NOT purchase the product NO MATTER whether they are contacted or not. The marketing budget in this case is also wasted because it has no effect. In terms of math: \(W_i = 1, Y_i = 0\) or \(W_i = 0, Y_i = 0\).

  • Sure Things will purchase ANYWAY, no matter whether they are contacted or not. There is no motivation to spend the budget because it also has no effect. In terms of math: \(W_i = 1, Y_i = 1\) or \(W_i = 0, Y_i = 1\).

  • Persuadables will always respond POSITIVELY to marketing communication. They are going to purchase ONLY if contacted (or sometimes they purchase MORE or EARLIER only if contacted). This customer type should be the only target for the marketing campaign. In terms of math: \(W_i = 0, Y_i = 0\) or \(W_i = 1, Y_i = 1\).

Because we can’t communicate and not communicate with the customer at the same time, we will never be able to observe exactly which type a particular customer belongs to.

Depending on the product characteristics and the customer base structure, some types may be absent. In addition, a customer’s response depends heavily on various characteristics of the campaign, such as the communication channel or the type and size of the marketing offer. To maximize profit, these parameters should be selected carefully.

Thus, when predicting the uplift score and selecting the segment with the highest scores, we are trying to find only one type: persuadables.
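
The four types can also be written down in terms of potential outcomes. The sketch below is an illustration (the dictionary layout is an assumption): it encodes each type as a pair \((Y^0, Y^1)\) and computes the individual effect \(\tau = Y^1 - Y^0\), which shows why only persuadables are worth targeting.

```python
# Potential outcomes (Y^0, Y^1) for each customer type:
# Y^0 is the purchase outcome without communication, Y^1 with communication
customer_types = {
    "Do-Not-Disturb": (1, 0),
    "Lost Cause": (0, 0),
    "Sure Thing": (1, 1),
    "Persuadable": (0, 1),
}

# Individual causal effect tau = Y^1 - Y^0
effects = {name: y1 - y0 for name, (y0, y1) in customer_types.items()}

for name, tau in effects.items():
    print(f"{name}: tau = {tau:+d}")

# The only type with a positive effect
best_type = max(effects, key=effects.get)
```

In real data only one of the two potential outcomes is observed per customer, so these labels cannot be assigned directly; an uplift model estimates the expected effect instead.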

References

1️⃣ Kane, K., V. S. Y. Lo, and J. Zheng. Mining for the Truly Responsive Customers and Prospects Using True-Lift Modeling: Comparison of New and Existing Methods. Journal of Marketing Analytics 2 (4): 218–238. 2014.

2️⃣ Verbeke, Wouter & Baesens, Bart & Bravo, Cristián. (2018). Profit Driven Business Analytics: A Practitioner’s Guide to Transforming Big Data into Added Value.

Models

Approach classification

Uplift modeling techniques can be grouped into data preprocessing and data processing approaches.

Classification of uplift modeling techniques: data preprocessing and data processing
Data preprocessing

In the preprocessing approaches, existing out-of-the-box learning methods are used, after pre- or post-processing of the data and outcomes.

A popular and generic data preprocessing approach is the flipped label approach, also called class transformation approach.

Other data preprocessing approaches extend the set of predictor variables to allow for the estimation of uplift. An example is the single model with treatment as feature.

Data processing

In the data processing approaches, new learning methods and methodologies are developed that aim to optimize expected uplift more directly.

Data processing techniques include two categories: indirect and direct estimation approaches.

Indirect estimation approaches include the two-model approach.

Direct estimation approaches are typically adaptations of decision tree algorithms. The adaptations include modified splitting criteria and dedicated pruning techniques.

References

1️⃣ Devriendt, Floris, Tias Guns and Wouter Verbeke. “Learning to rank for uplift modeling.” ArXiv abs/2002.05897 (2020): n. pag.

Single model approaches
Single model with treatment as feature

This is the most intuitive and simple uplift modeling technique. The training set consists of two groups: treatment samples and control samples. A binary treatment flag is added as a feature to the training set. After the model is trained, it is applied twice at scoring time: once with the treatment flag set to 1 and once with the treatment flag set to 0. Subtracting the second prediction from the first for each test sample gives an estimate of the uplift.

Solo model dummy method

Hint

In sklift this approach corresponds to the SoloModel class and the dummy method.
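
The procedure can be sketched directly with scikit-learn, without sklift. The synthetic data, the choice of a shallow decision tree, and all variable names below are assumptions made for illustration; the sketch is not the SoloModel implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
n = 4_000

# Synthetic experiment: the treatment helps only customers with X[:, 0] > 0
X = rng.normal(size=(n, 2))
w = rng.integers(0, 2, size=n)
y = (rng.random(n) < 0.2 + 0.5 * w * (X[:, 0] > 0)).astype(int)

# Train a single model on the features plus the treatment flag
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(np.column_stack([X, w]), y)

# Score new samples twice: with the flag set to 1 and with the flag set to 0
X_new = rng.normal(size=(1_000, 2))
p_treated = model.predict_proba(np.column_stack([X_new, np.ones(len(X_new))]))[:, 1]
p_control = model.predict_proba(np.column_stack([X_new, np.zeros(len(X_new))]))[:, 1]

# The difference of the two predictions is the uplift estimate
uplift_s = p_treated - p_control
```

A tree is used here because it can pick up the interaction between the treatment flag and the features on its own; a purely additive model would need explicit interaction terms.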

Treatment interaction

The single model approach has various modifications. For instance, we can extend the set of attributes in the training set by adding the product of each attribute and the treatment flag:

Solo model treatment interaction method

Hint

In sklift this approach corresponds to the SoloModel class and the treatment_interaction method.
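
The extended feature matrix can be built in one line with NumPy. The column order used below (original features, then feature-by-treatment products, then the flag itself) is an assumption for illustration and may differ from what sklift does internally.

```python
import numpy as np

# Three customers with two features each
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
w = np.array([1, 0, 1])  # treatment flags

# [features | features * treatment flag | treatment flag]
X_inter = np.column_stack([X, X * w[:, None], w])
print(X_inter)
# [[1. 2. 1. 2. 1.]
#  [3. 4. 0. 0. 0.]
#  [5. 6. 5. 6. 1.]]
```

The product columns are zero for control customers, which lets even a linear model learn a separate feature effect for the treated group.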

References

1️⃣ Lo, Victor. (2002). The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing. SIGKDD Explorations. 4. 78-86.

Examples using sklift.models.SoloModel
  1. The overview of the basic approaches to solving the Uplift Modeling problem

     In English 🇬🇧: Colab, nbviewer, github

     In Russian 🇷🇺: Colab, nbviewer, github

Class Transformation

Warning

This approach is only suitable for classification problems.

Simple yet powerful and mathematically proven uplift modeling method, presented in 2012. The main idea is to predict a slightly changed target \(Z_i\):

\[Z_i = Y_i \cdot W_i + (1 - Y_i) \cdot (1 - W_i),\]
where:

  • \(Z_i\) - a new target for the \(i\)-th customer;

  • \(Y_i\) - the original target for the \(i\)-th customer;

  • \(W_i\) - the treatment flag assigned to the \(i\)-th customer.

In other words, the new target equals 1 if the customer is in the treatment group and responded, or is in the control group and did not respond, and equals 0 otherwise:

\[\begin{split}Z_i = \begin{cases} 1, & \mbox{if } W_i = 1 \mbox{ and } Y_i = 1 \\ 1, & \mbox{if } W_i = 0 \mbox{ and } Y_i = 0 \\ 0, & \mbox{otherwise} \end{cases}\end{split}\]

Let’s go deeper and estimate the conditional probability of the target variable:

\[\begin{split}P(Z=1|X = x) = \\ = P(Z=1|X = x, W = 1) \cdot P(W = 1|X = x) + \\ + P(Z=1|X = x, W = 0) \cdot P(W = 0|X = x) = \\ = P(Y=1|X = x, W = 1) \cdot P(W = 1|X = x) + \\ + P(Y=0|X = x, W = 0) \cdot P(W = 0|X = x).\end{split}\]

We assume that \(W\) is independent of \(X\) by design (the treatment is assigned at random). Thus we have: \(P(W | X = x) = P(W)\) and

\[\begin{split}P(Z=1|X = x) = \\ = P^T(Y=1|X = x) \cdot P(W = 1) + \\ + P^C(Y=0|X = x) \cdot P(W = 0)\end{split}\]

Also, we assume that \(P(W = 1) = P(W = 0) = \frac{1}{2}\), which means that during the experiment the control and the treatment groups were divided in equal proportions. Then we get the following:

\[ \begin{align}\begin{aligned}\begin{split}P(Z=1|X = x) = \\ = P^T(Y=1|X = x) \cdot \frac{1}{2} + P^C(Y=0|X = x) \cdot \frac{1}{2} \Rightarrow \\\end{split}\\\begin{split}2 \cdot P(Z=1|X = x) = \\ = P^T(Y=1|X = x) + P^C(Y=0|X = x) = \\ = P^T(Y=1|X = x) + 1 - P^C(Y=1|X = x) \Rightarrow \\ \Rightarrow P^T(Y=1|X = x) - P^C(Y=1|X = x) = \\ = uplift = 2 \cdot P(Z=1|X = x) - 1\end{split}\end{aligned}\end{align} \]
Meme about class transformation approach for uplift modeling

Thus, by doubling the estimated probability of the new target \(Z\) and subtracting one, we get an estimate of the uplift:

\[uplift = 2 \cdot P(Z=1|X = x) - 1\]

This approach is based on the assumption \(P(W = 1) = P(W = 0) = \frac{1}{2}\). For this reason, it should be used only in cases where the number of treated customers (communication) is equal to the number of control customers (no communication).

Hint

In sklift this approach corresponds to the ClassTransformation class.
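
The whole method fits in a few lines of scikit-learn. The sketch below relies on assumed synthetic data with equally sized treatment and control groups; the helper name `transform_class` and the choice of a shallow decision tree are illustrative and do not reproduce the ClassTransformation implementation.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def transform_class(y, w):
    """Z = Y*W + (1-Y)*(1-W): 1 for treated responders and control non-responders."""
    y, w = np.asarray(y), np.asarray(w)
    return y * w + (1 - y) * (1 - w)


# Synthetic balanced experiment: the treatment helps only when X[:, 0] > 0
rng = np.random.default_rng(0)
n = 4_000
X = rng.normal(size=(n, 2))
w = rng.integers(0, 2, size=n)          # P(W = 1) = 1/2 by construction
y = (rng.random(n) < 0.2 + 0.5 * w * (X[:, 0] > 0)).astype(int)

# Fit an ordinary classifier on the transformed target
z = transform_class(y, w)
clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, z)

# uplift = 2 * P(Z = 1 | X = x) - 1
X_new = rng.normal(size=(1_000, 2))
uplift_ct = 2 * clf.predict_proba(X_new)[:, 1] - 1
```

Note that the treatment flag is used only to build \(Z\); it is not passed to the classifier as a feature.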

References

1️⃣ Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.

Examples using sklift.models.ClassTransformation
  1. The overview of the basic approaches to the Uplift Modeling problem

     In English 🇬🇧: Colab, nbviewer, github

     In Russian 🇷🇺: Colab, nbviewer, github

  2. The 2nd place solution of X5 RetailHero uplift contest by Kirill Liksakov

     In English 🇬🇧: nbviewer, github

Transformed Outcome

Let’s redefine the target variable so that it carries information about the treatment effect: the outcome is scaled up for treated objects and takes a negative sign for control objects:

\[Z = Y * \frac{(W - p)}{(p * (1 - p))}\]
where:

  • \(Y\) - target vector,

  • \(W\) - vector of binary communication flags, and

  • \(p\) is a propensity score (the probability that each object is assigned to the treatment group).

It is important to note here that it is possible to estimate \(p\) as the proportion of objects with \(W = 1\) in the sample. Alternatively, use the method from [2], which proposes to evaluate \(p\) as a function of \(X\) by training a classifier on the available data \(X = x\), taking the communication flag vector \(W\) as the target variable.

Transformation of the target in Transformed Outcome approach

After applying the formula, we get a new target variable \(Z_i\) and can train a regression model with the error functional \(MSE= \frac{1}{n}\sum_{i=1}^{n} (Z_i - \hat{Z_i})^2\). MSE is chosen because the predictions of a model that minimizes it estimate the conditional expectation of the target variable.

It can be proved that the conditional expectation of the transformed target \(Z_i\) is the desired conditional average treatment effect:

\[E[Z_i| X_i = x] = E[Y_i^1 - Y_i^0 | X_i = x] = CATE\]
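
A sketch of the argument, assuming the CIA holds and writing \(p = P(W_i = 1 \vert X_i = x)\): the transformed target equals \(Y_i / p\) for treated objects and \(-Y_i / (1 - p)\) for control objects, so averaging over the two groups gives

\[\begin{split}E[Z_i \vert X_i = x] = \\ = p \cdot E\left[\frac{Y_i}{p} \Big\vert X_i = x, W_i = 1\right] - (1 - p) \cdot E\left[\frac{Y_i}{1 - p} \Big\vert X_i = x, W_i = 0\right] = \\ = E[Y_i \vert X_i = x, W_i = 1] - E[Y_i \vert X_i = x, W_i = 0] = \\ = E[Y_i^1 \vert X_i = x] - E[Y_i^0 \vert X_i = x]\end{split}\]

The last step replaces observed outcomes with potential outcomes, which is exactly where the CIA is used.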

Hint

In sklift this approach corresponds to the ClassTransformationReg class.
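
The property can be checked numerically. The simulation below uses assumed synthetic data with a constant, known propensity of 0.3 and a constant true effect of 2.0; it shows that the sample mean of the transformed target is close to the true effect even though the groups are unbalanced.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 400_000

p = 0.3                                    # known constant propensity P(W = 1)
w = (rng.random(n) < p).astype(int)        # unbalanced random assignment

y0 = rng.normal(loc=1.0, scale=1.0, size=n)  # outcome without treatment
y1 = y0 + 2.0                                # outcome with treatment: true effect is 2.0
y = np.where(w == 1, y1, y0)                 # observed outcome

# Transformed outcome Z = Y * (W - p) / (p * (1 - p))
z = y * (w - p) / (p * (1 - p))

z_mean = z.mean()
print(round(z_mean, 2))  # close to the true effect of 2.0
```

Individual values of \(Z_i\) are very noisy (here roughly 10 for treated and -1.4 for control objects), which is why a regressor trained on \(Z\) usually needs a lot of data.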

References

1️⃣ Susan Athey and Guido W Imbens. Machine learning methods for estimating heterogeneous causal effects. stat, 1050:5, 2015.

2️⃣ P. Richard Hahn, Jared S. Murray, and Carlos Carvalho. Bayesian regression tree models for causal inference: regularization, confounding, and heterogeneous effects. 2019.

Two models approaches

The two models approach can be found in almost every uplift modeling paper. It is often used as a baseline model.

Two independent models

Hint

In sklift this approach corresponds to the sklift.models.TwoModels class and the vanilla method.

The main idea is to estimate the conditional probabilities of the treatment and control groups separately.

  1. Train the first model using the treatment set.

  2. Train the second model using the control set.

  3. Inference: subtract the control model scores from the treatment model scores.

Two independent models vanilla

The main disadvantage of this method is that if the uplift signal is weak, it can be lost, since both models focus on predicting the original response, not the uplift.
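
The three steps above can be sketched with two ordinary scikit-learn classifiers. The synthetic data and variable names are assumptions for illustration; this is not the TwoModels implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4_000

# Synthetic experiment: the treatment helps only customers with X[:, 0] > 0
X = rng.normal(size=(n, 2))
w = rng.integers(0, 2, size=n)
y = (rng.random(n) < 0.2 + 0.5 * w * (X[:, 0] > 0)).astype(int)

# 1. Train the first model on the treatment set
model_t = LogisticRegression().fit(X[w == 1], y[w == 1])
# 2. Train the second model on the control set
model_c = LogisticRegression().fit(X[w == 0], y[w == 0])

# 3. Inference: treatment model scores minus control model scores
X_new = rng.normal(size=(1_000, 2))
uplift_tm = model_t.predict_proba(X_new)[:, 1] - model_c.predict_proba(X_new)[:, 1]
```

Each model sees only half of the data and is tuned for its own response, so the errors of the two models add up in the difference.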

Two dependent models

The dependent data representation approach is based on the classifier chain method, originally developed for multi-label classification problems. The idea is that if there are \(L\) different labels, you can build \(L\) different classifiers, each of which solves a binary classification problem, and during training each subsequent classifier uses the predictions of the previous ones as additional features. The authors of this method proposed to use the same idea to solve the problem of uplift modeling in two stages.

Hint

In sklift this approach corresponds to the TwoModels class and the ddr_control method.

At the beginning, we train the classifier based on the control data:

\[P^C = P(Y=1| X, W = 0)\]

Next, we compute the \(P^C\) predictions and use them as a feature for the second classifier. It effectively reflects a dependency between treatment and control datasets:

\[P^T = P(Y=1| X, P^C(X), W = 1)\]

To get the uplift for each observation, calculate the difference:

\[uplift(x_i) = P^T (x_i, P^C(x_i)) - P^C(x_i)\]

Intuitively, the second classifier learns the difference between the expected probability in the treatment and the control sets which is the uplift.

Two independent models dependent data representation control
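
The two-stage procedure can be sketched with scikit-learn as follows. The synthetic data, the use of logistic regression for both stages, and the variable names are assumptions for illustration, not the TwoModels implementation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 4_000

# Synthetic experiment: the baseline response depends on X[:, 1],
# the treatment helps only customers with X[:, 0] > 0
X = rng.normal(size=(n, 2))
w = rng.integers(0, 2, size=n)
p_resp = 0.2 + 0.2 * (X[:, 1] > 0) + 0.4 * w * (X[:, 0] > 0)
y = (rng.random(n) < p_resp).astype(int)

X_c, y_c = X[w == 0], y[w == 0]
X_t, y_t = X[w == 1], y[w == 1]

# Stage 1: control classifier P^C
model_c = LogisticRegression().fit(X_c, y_c)

# Stage 2: treatment classifier P^T gets the P^C prediction as an extra feature
pc_on_treated = model_c.predict_proba(X_t)[:, 1]
model_t = LogisticRegression().fit(np.column_stack([X_t, pc_on_treated]), y_t)

# Inference: uplift(x) = P^T(x, P^C(x)) - P^C(x)
X_new = rng.normal(size=(1_000, 2))
pc_new = model_c.predict_proba(X_new)[:, 1]
pt_new = model_t.predict_proba(np.column_stack([X_new, pc_new]))[:, 1]
uplift_ddr = pt_new - pc_new
```

The same control predictions must be computed for new data before the treatment classifier can be applied, so the two models always have to be deployed together.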

Similarly, you can first train the \(P^T\) classifier and then use its predictions as a feature for the \(P^C\) classifier.

Hint

In sklift this approach corresponds to the TwoModels class and the ddr_treatment method.

There is an important remark about the data nature. It is important to calibrate the model’s scores into probabilities if treatment and control data have a different nature. Model calibration techniques are well described in the scikit-learn documentation.

References

1️⃣ Betlei, Artem & Diemert, Eustache & Amini, Massih-Reza. (2018). Uplift Prediction with Dependent Feature Representation in Imbalanced Treatment and Control Conditions: 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, December 13–16, 2018, Proceedings, Part V. 10.1007/978-3-030-04221-9_5.

2️⃣ Zhao, Yan & Fang, Xiao & Simchi-Levi, David. (2017). Uplift Modeling with Multiple Treatments and General Response Types. 10.1137/1.9781611974973.66.

Examples using sklift.models.TwoModels
  1. The overview of the basic approaches to solving the Uplift Modeling problem

     In English 🇬🇧: Colab, nbviewer, github

     In Russian 🇷🇺: Colab, nbviewer, github

Credits

Authors:

Acknowledgements:

Citations

If you find this User Guide useful for your research, please consider citing:

@misc{user-guide-for-uplift-modeling,
  author = {Maksim Shevchenko and Irina Elisova},
  title = {User Guide for uplift modeling and causal inference},
  year = {2020},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://www.uplift-modeling.com/en/latest/user_guide/index.html}}
}

API sklift

This is the modules reference of scikit-uplift.

sklift.models

See Models section of the User Guide for further details.

sklift.models.SoloModel
class sklift.models.models.SoloModel(estimator, method='dummy')[source]

aka Treatment Dummy approach, or Single model approach, or S-Learner.

Fit solo model on whole dataset with ‘treatment’ as an additional feature.

Each object from the test sample is scored twice: with the communication flag equal to 1 and equal to 0. Subtracting the probabilities for each observation, we get the uplift.

Return delta of predictions for each example.

Read more in the User Guide.

Parameters
  • estimator (estimator object implementing 'fit') – The object to use to fit the data.

  • method (string, ’dummy’ or ’treatment_interaction’, default='dummy') –

    Specifies the approach:

    • 'dummy':

      Single model;

    • 'treatment_interaction':

      Single model including treatment interactions.

trmnt_preds_

Estimator predictions on samples with the treatment flag set to 1.

Type

array-like, shape (n_samples, )

ctrl_preds_

Estimator predictions on samples with the treatment flag set to 0.

Type

array-like, shape (n_samples, )

Example:

# import approach
from sklift.models import SoloModel
# import any estimator that adheres to scikit-learn conventions
from catboost import CatBoostClassifier


sm = SoloModel(CatBoostClassifier(verbose=100, random_state=777))  # define approach
sm = sm.fit(X_train, y_train, treat_train, estimator_fit_params={'plot': True})  # fit the model
uplift_sm = sm.predict(X_val)  # predict uplift

References

Lo, Victor. (2002). The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing. SIGKDD Explorations. 4. 78-86.


fit(X, y, treatment, estimator_fit_params=None)[source]

Fit the model according to the given training data.

The estimator is fitted on the whole training set, with the treatment flag added as an additional feature.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape (n_samples,)) – Binary target vector relative to X.

  • treatment (array-like, shape (n_samples,)) – Binary treatment vector relative to X.

  • estimator_fit_params (dict, optional) – Parameters to pass to the fit method of the estimator.

Returns

self

Return type

object

predict(X)[source]

Predict uplift for samples in X.

Parameters

X (array-like, shape (n_samples, n_features)) – Samples to score, where n_samples is the number of samples and n_features is the number of features.

Returns

uplift

Return type

array (shape (n_samples,))

sklift.models.ClassTransformation
class sklift.models.models.ClassTransformation(estimator)[source]

aka Class Variable Transformation or Revert Label approach.

Redefine the target variable so that it equals 1 for treated objects that responded and for control objects that did not respond: Z = Y * W + (1 - Y)(1 - W),

where Y - target vector, W - vector of binary communication flags.

Then, Uplift ~ 2 * P(Z == 1) - 1

Returns only uplift predictions.

Read more in the User Guide.

Parameters

estimator (estimator object implementing 'fit') – The object to use to fit the data.

Example:

# import approach
from sklift.models import ClassTransformation
# import any estimator that adheres to scikit-learn conventions
from catboost import CatBoostClassifier


# define approach
ct = ClassTransformation(CatBoostClassifier(verbose=100, random_state=777))
# fit the model
ct = ct.fit(X_train, y_train, treat_train, estimator_fit_params={'plot': True})
# predict uplift
uplift_ct = ct.predict(X_val)

References

Maciej Jaskowski and Szymon Jaroszewicz. Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.


fit(X, y, treatment, estimator_fit_params=None)[source]

Fit the model according to the given training data.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape (n_samples,)) – Target vector relative to X.

  • treatment (array-like, shape (n_samples,)) – Binary treatment vector relative to X.

  • estimator_fit_params (dict, optional) – Parameters to pass to the fit method of the estimator.

Returns

self

Return type

object

predict(X)[source]

Predict uplift for samples in X.

Parameters

X (array-like, shape (n_samples, n_features)) – Samples to score, where n_samples is the number of samples and n_features is the number of features.

Returns

uplift

Return type

array (shape (n_samples,))

sklift.models.ClassTransformationReg
class sklift.models.models.ClassTransformationReg(estimator, propensity_val=None, propensity_estimator=None)[source]

aka CATE-generating (Conditional Average Treatment Effect) Transformation of the Outcome.

Redefine the target variable so that the outcome is scaled up for treated objects and takes a negative sign for control objects: Z = Y * (W - p)/(p * (1 - p)),

where Y - target vector, W - vector of binary communication flags, and p is a propensity score (the probability that each object is assigned to the treatment group).

Then, train a regressor on Z to predict uplift.

Returns uplift predictions and optionally propensity predictions.

The propensity score can be a scalar value (e.g. p = 0.5), which would mean that every subject has identical probability of being assigned to the treatment group.

Alternatively, the propensity can be learned using a Classifier model. In this case, the model predicts the probability that a given subject would be assigned to the treatment group.

Read more in the User Guide.
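The transformation above can be sketched in plain Python. This is an illustration of the formula only, not the library's implementation; the helper name transform_outcome and the sample values are assumptions made for the example. It accepts either a constant propensity or one value per subject.

```python
def transform_outcome(y, w, p):
    """Return Z = Y * (W - p) / (p * (1 - p)) for each sample.

    p may be a single float shared by all samples (constant propensity)
    or a sequence with one propensity value per sample.
    """
    if isinstance(p, (int, float)):
        p = [float(p)] * len(y)
    return [yi * (wi - pi) / (pi * (1.0 - pi)) for yi, wi, pi in zip(y, w, p)]


y = [1.0, 0.0, 1.0, 0.0]   # observed outcomes (hypothetical data)
w = [1, 1, 0, 0]           # 1 = treated, 0 = control

# Constant propensity: every subject is treated with probability 0.5.
z_const = transform_outcome(y, w, 0.5)

# Per-subject propensities, e.g. as predicted by a classifier.
z_learned = transform_outcome(y, w, [0.5, 0.5, 0.8, 0.8])
```

A regressor trained on Z then predicts uplift directly.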

Parameters
  • estimator (estimator object implementing 'fit') – The object to use to fit the data.

  • propensity_val (float) – A constant propensity value, which assumes every subject has equal probability of assignment to the treatment group.

  • propensity_estimator (estimator object with predict_proba) – The object used to predict the propensity score if propensity_val is not given.

Example:

# import approach
from sklift.models import ClassTransformationReg
# import any estimator that adheres to scikit-learn conventions
from sklearn.linear_model import LinearRegression, LogisticRegression


# define approach
ct = ClassTransformationReg(estimator=LinearRegression(), propensity_estimator=LogisticRegression())
# fit the model
ct = ct.fit(X_train, y_train, treat_train)
# predict uplift
uplift_ct = ct.predict(X_val)

References

Athey, Susan & Imbens, Guido & Ramachandra, Vikas. (2015). Machine Learning Methods for Estimating Heterogeneous Causal Effects.

fit(X, y, treatment, estimator_fit_params=None)[source]

Fit the model according to the given training data.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape (n_samples,)) – Target vector relative to X.

  • treatment (array-like, shape (n_samples,)) – Binary treatment vector relative to X.

  • estimator_fit_params (dict, optional) – Parameters to pass to the fit method of the estimator.

Returns

self

Return type

object

predict(X)[source]

Predict uplift for samples in X.

Parameters

X (array-like, shape (n_samples, n_features)) – Samples to score, where n_samples is the number of samples and n_features is the number of features.

Returns

uplift

Return type

array (shape (n_samples,))

predict_propensity(X)[source]

Predict propensity values.

Parameters

X (array-like, shape (n_samples, n_features)) – Samples to score, where n_samples is the number of samples and n_features is the number of features.

Returns

propensity

Return type

array (shape (n_samples,))

sklift.models.TwoModels
class sklift.models.models.TwoModels(estimator_trmnt, estimator_ctrl, method='vanilla')[source]

aka naïve approach, or difference score method, or double classifier approach.

Fit two separate models: on the treatment data and on the control data.

Read more in the User Guide.

Parameters
  • estimator_trmnt (estimator object implementing 'fit') – The object to use to fit the treatment data.

  • estimator_ctrl (estimator object implementing 'fit') – The object to use to fit the control data.

  • method (string, 'vanilla', 'ddr_control' or 'ddr_treatment', default='vanilla') –

    Specifies the approach:

    • 'vanilla':

      Two independent models;

    • 'ddr_control':

      Dependent data representation (First train control estimator).

    • 'ddr_treatment':

      Dependent data representation (First train treatment estimator).

trmnt_preds_

Estimator predictions on samples when treatment.

Type

array-like, shape (n_samples, )

ctrl_preds_

Estimator predictions on samples when control.

Type

array-like, shape (n_samples, )
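The 'vanilla' variant can be sketched as follows: fit one model on the treatment rows, another on the control rows, and subtract their predictions. This is a minimal illustration, not the library's code; MeanRate is a hypothetical toy estimator that stands in for any scikit-learn-style model, and two_models_uplift is a name invented for the example.

```python
class MeanRate:
    """Toy estimator (hypothetical): predicts the mean training outcome."""

    def fit(self, X, y):
        self.rate_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.rate_ for _ in X]


def two_models_uplift(X, y, treatment, X_new, est_trmnt, est_ctrl):
    """Fit one model per group and return the difference of predictions."""
    # Split the training data by treatment flag.
    X_t = [x for x, w in zip(X, treatment) if w == 1]
    y_t = [v for v, w in zip(y, treatment) if w == 1]
    X_c = [x for x, w in zip(X, treatment) if w == 0]
    y_c = [v for v, w in zip(y, treatment) if w == 0]

    est_trmnt.fit(X_t, y_t)
    est_ctrl.fit(X_c, y_c)

    # Score every new sample with both models; uplift is the delta.
    trmnt_preds = est_trmnt.predict(X_new)
    ctrl_preds = est_ctrl.predict(X_new)
    return [t - c for t, c in zip(trmnt_preds, ctrl_preds)]


X = [[0], [1], [2], [3]]
y = [1, 1, 1, 0]
treatment = [1, 1, 0, 0]
uplift = two_models_uplift(X, y, treatment, [[4], [5]], MeanRate(), MeanRate())
```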

Example:

# import approach
from sklift.models import TwoModels
# import any estimator that adheres to scikit-learn conventions
from catboost import CatBoostClassifier


estimator_trmnt = CatBoostClassifier(silent=True, thread_count=2, random_state=42)
estimator_ctrl = CatBoostClassifier(silent=True, thread_count=2, random_state=42)

# define approach
tm_ctrl = TwoModels(
    estimator_trmnt=estimator_trmnt,
    estimator_ctrl=estimator_ctrl,
    method='ddr_control'
)

# fit the models
tm_ctrl = tm_ctrl.fit(
    X_train, y_train, treat_train,
    estimator_trmnt_fit_params={'cat_features': cat_features},
    estimator_ctrl_fit_params={'cat_features': cat_features}
)
uplift_tm_ctrl = tm_ctrl.predict(X_val)  # predict uplift

References

Betlei, Artem & Diemert, Eustache & Amini, Massih-Reza. (2018). Uplift Prediction with Dependent Feature Representation in Imbalanced Treatment and Control Conditions: 25th International Conference, ICONIP 2018, Siem Reap, Cambodia, December 13–16, 2018, Proceedings, Part V. 10.1007/978-3-030-04221-9_5.

Zhao, Yan & Fang, Xiao & Simchi-Levi, David. (2017). Uplift Modeling with Multiple Treatments and General Response Types. 10.1137/1.9781611974973.66.

fit(X, y, treatment, estimator_trmnt_fit_params=None, estimator_ctrl_fit_params=None)[source]

Fit the model according to the given training data.

At prediction time, each test example is scored twice: by the treatment model and by the control model. Uplift is then calculated as the delta between these two predictions, and predict returns that delta for each example.

Parameters
  • X (array-like, shape (n_samples, n_features)) – Training vector, where n_samples is the number of samples and n_features is the number of features.

  • y (array-like, shape (n_samples,)) – Target vector relative to X.

  • treatment (array-like, shape (n_samples,)) – Binary treatment vector relative to X.

  • estimator_trmnt_fit_params (dict, optional) – Parameters to pass to the fit method of the treatment estimator.

  • estimator_ctrl_fit_params (dict, optional) – Parameters to pass to the fit method of the control estimator.

Returns

self

Return type

object

predict(X)[source]

Predict uplift for samples in X.

Parameters

X (array-like, shape (n_samples, n_features)) – Samples to score, where n_samples is the number of samples and n_features is the number of features.

Returns

uplift

Return type

array (shape (n_samples,))

sklift.metrics

sklift.metrics.uplift_at_k
sklift.metrics.metrics.uplift_at_k(y_true, uplift, treatment, strategy, k=0.3)[source]

Compute uplift at the first k observations of the total sample, ordered by predicted uplift.

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • k (float or int) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the computation of uplift. If int, represents the absolute number of samples.

  • strategy (string, ['overall', 'by_group']) –

    Determines the calculating strategy.

    • 'overall':

      Takes the first k observations of all test data ordered by uplift prediction (across both groups, control and treatment) and calculates conversions in the treatment and control groups only on them. Then the difference between these conversions is calculated.

    • 'by_group':

      Separately calculates conversions in the top k observations of each group (control and treatment) sorted by uplift predictions. Then the difference between these conversions is calculated.

Changed in version 0.1.0:

  • Added support for absolute values of the k parameter

  • Added the strategy parameter

Returns

Uplift score at first k observations of the total sample.

Return type

float
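The two strategies can be sketched in plain Python as below. This is a simplified illustration rather than the library's implementation: the function and helper names are invented for the example, both groups are assumed to be non-empty in the selected top, and ties and edge cases are not handled.

```python
def _top(indices, uplift, k):
    """Return the indices with the highest predicted uplift.

    A float k is read as a share of `indices`, an int k as a count.
    """
    n_top = int(len(indices) * k) if isinstance(k, float) else k
    return sorted(indices, key=lambda i: uplift[i], reverse=True)[:n_top]


def uplift_at_k_sketch(y_true, uplift, treatment, strategy, k=0.3):
    idx = list(range(len(y_true)))
    if strategy == 'overall':
        # Take the top k of the whole sample, then split it by group.
        top = _top(idx, uplift, k)
        y_trmnt = [y_true[i] for i in top if treatment[i] == 1]
        y_ctrl = [y_true[i] for i in top if treatment[i] == 0]
    elif strategy == 'by_group':
        # Take the top k inside each group separately.
        top_t = _top([i for i in idx if treatment[i] == 1], uplift, k)
        top_c = _top([i for i in idx if treatment[i] == 0], uplift, k)
        y_trmnt = [y_true[i] for i in top_t]
        y_ctrl = [y_true[i] for i in top_c]
    else:
        raise ValueError("strategy must be 'overall' or 'by_group'")
    # Difference between treatment and control conversions.
    return sum(y_trmnt) / len(y_trmnt) - sum(y_ctrl) / len(y_ctrl)


y_true = [1, 0, 1, 1, 0, 1]
uplift = [0.9, 0.8, 0.7, 0.6, 0.2, 0.1]
treatment = [1, 0, 1, 0, 1, 0]

overall = uplift_at_k_sketch(y_true, uplift, treatment, 'overall', k=0.5)
by_group = uplift_at_k_sketch(y_true, uplift, treatment, 'by_group', k=2)
```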

See also

uplift_auc_score(): Compute normalized Area Under the Uplift curve from prediction scores.

qini_auc_score(): Compute normalized Area Under the Qini Curve from prediction scores.

sklift.metrics.uplift_curve
sklift.metrics.metrics.uplift_curve(y_true, uplift, treatment)[source]

Compute Uplift curve.

For computing the area under the Uplift Curve, see uplift_auc_score().

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

Returns

Points on a curve.

Return type

array (shape = [>2]), array (shape = [>2])

See also

uplift_auc_score(): Compute normalized Area Under the Uplift curve from prediction scores.

perfect_uplift_curve(): Compute the perfect Uplift curve.

plot_uplift_curve(): Plot Uplift curves from predictions.

qini_curve(): Compute Qini curve.

References

Devriendt, F., Guns, T., & Verbeke, W. (2020). Learning to rank for uplift modeling. ArXiv, abs/2002.05897.

sklift.metrics.perfect_uplift_curve
sklift.metrics.metrics.perfect_uplift_curve(y_true, treatment)[source]

Compute the perfect (optimum) Uplift curve.

Returns the points of the curve. For computing the area under the Uplift Curve, see uplift_auc_score().

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • treatment (1d array-like) – Treatment labels.

Returns

Points on a curve.

Return type

array (shape = [>2]), array (shape = [>2])

See also

uplift_curve(): Compute Uplift curve.

uplift_auc_score(): Compute normalized Area Under the Uplift curve from prediction scores.

plot_uplift_curve(): Plot Uplift curves from predictions.

sklift.metrics.uplift_auc_score
sklift.metrics.metrics.uplift_auc_score(y_true, uplift, treatment)[source]

Compute normalized Area Under the Uplift Curve from prediction scores.

By computing the area under the Uplift curve, the curve information is summarized in one number. For binary outcomes, the score is the ratio of the area between the actual uplift curve and the diagonal to the corresponding area for the optimum Uplift Curve.

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

Returns

Area Under the Uplift Curve.

Return type

float
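The normalization described above can be sketched with the trapezoidal rule. This is an illustration under stated assumptions, not the library's code: the function names are invented, the curve points are hypothetical, and the diagonal is assumed to run from the origin to the last point of the perfect curve.

```python
def trapezoid_area(x, y):
    """Area under a piecewise-linear curve given by points (x, y)."""
    return sum(
        (x[i + 1] - x[i]) * (y[i] + y[i + 1]) / 2.0
        for i in range(len(x) - 1)
    )


def normalized_area(x_model, y_model, x_perfect, y_perfect):
    # Diagonal (random targeting): straight line from the origin
    # to the end point of the perfect curve.
    area_random = trapezoid_area([0, x_perfect[-1]], [0, y_perfect[-1]])
    gain_model = trapezoid_area(x_model, y_model) - area_random
    gain_perfect = trapezoid_area(x_perfect, y_perfect) - area_random
    return gain_model / gain_perfect


# Hypothetical curve points for a model and for the optimum.
score = normalized_area([0, 2, 4], [0, 2, 2], [0, 2, 4], [0, 3, 2])
```

The same ratio applies to the Qini curve when Qini curve points are passed in instead.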

See also

uplift_curve(): Compute Uplift curve.

perfect_uplift_curve(): Compute the perfect (optimum) Uplift curve.

plot_uplift_curve(): Plot Uplift curves from predictions.

qini_auc_score(): Compute normalized Area Under the Qini Curve from prediction scores.

sklift.metrics.qini_curve
sklift.metrics.metrics.qini_curve(y_true, uplift, treatment)[source]

Compute Qini curve.

For computing the area under the Qini Curve, see qini_auc_score().

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

Returns

Points on a curve.

Return type

array (shape = [>2]), array (shape = [>2])

See also

qini_auc_score(): Compute normalized Area Under the Qini Curve from prediction scores.

perfect_qini_curve(): Compute the perfect Qini curve.

plot_qini_curve(): Plot Qini curves from predictions.

uplift_curve(): Compute Uplift curve.

References

Nicholas J. Radcliffe. (2007). Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, (3):14–21.

Devriendt, F., Guns, T., & Verbeke, W. (2020). Learning to rank for uplift modeling. ArXiv, abs/2002.05897.

sklift.metrics.perfect_qini_curve
sklift.metrics.metrics.perfect_qini_curve(y_true, treatment, negative_effect=True)[source]

Compute the perfect (optimum) Qini curve.

For computing the area under the Qini Curve, see qini_auc_score().

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • treatment (1d array-like) – Treatment labels.

  • negative_effect (bool) – If True, optimum Qini Curve contains the negative effects (negative uplift because of campaign). Otherwise, optimum Qini Curve will not contain the negative effects.

Returns

Points on a curve.

Return type

array (shape = [>2]), array (shape = [>2])

See also

qini_curve(): Compute Qini curve.

qini_auc_score(): Compute the area under the Qini curve.

plot_qini_curve(): Plot Qini curves from predictions.

sklift.metrics.qini_auc_score
sklift.metrics.metrics.qini_auc_score(y_true, uplift, treatment, negative_effect=True)[source]

Compute normalized Area Under the Qini curve (aka Qini coefficient) from prediction scores.

By computing the area under the Qini curve, the curve information is summarized in one number. For binary outcomes, the score is the ratio of the area between the actual Qini curve and the diagonal to the corresponding area for the optimum Qini curve.

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • negative_effect (bool) –

    If True, optimum Qini Curve contains the negative effects (negative uplift because of campaign). Otherwise, optimum Qini Curve will not contain the negative effects.

    New in version 0.2.0.

Returns

Qini coefficient.

Return type

float

See also

qini_curve(): Compute Qini curve.

perfect_qini_curve(): Compute the perfect (optimum) Qini curve.

plot_qini_curve(): Plot Qini curves from predictions.

uplift_auc_score(): Compute normalized Area Under the Uplift curve from prediction scores.

References

Nicholas J. Radcliffe. (2007). Using control groups to target on predicted lift: Building and assessing uplift model. Direct Marketing Analytics Journal, (3):14–21.

sklift.metrics.weighted_average_uplift
sklift.metrics.metrics.weighted_average_uplift(y_true, uplift, treatment, strategy='overall', bins=10)[source]

Weighted average uplift.

It is an average of uplift by percentile. Weights are sizes of the treatment group by percentile.

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • strategy (string, ['overall', 'by_group']) –

    Determines the calculating strategy. Default is ‘overall’.

    • 'overall':

      Takes the first k observations of all test data ordered by uplift prediction (across both groups, control and treatment) and calculates conversions in the treatment and control groups only on them. Then the difference between these conversions is calculated.

    • 'by_group':

      Separately calculates conversions in the top k observations of each group (control and treatment) sorted by uplift predictions. Then the difference between these conversions is calculated.

  • bins (int) – Determines the number of bins (and the relative percentile) in the data. Default is 10.

Returns

Weighted average uplift.

Return type

float
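As a small illustration of the weighting described above, the sketch below averages per-percentile uplift values using treatment group sizes as weights. The function name and the input numbers are assumptions made for the example; the per-bin values would normally come from a per-percentile breakdown of the data.

```python
def weighted_average_uplift_sketch(uplift_by_bin, n_treatment_by_bin):
    """Average per-bin uplift, weighted by treatment group size per bin."""
    total = sum(n_treatment_by_bin)
    weighted = sum(u * n for u, n in zip(uplift_by_bin, n_treatment_by_bin))
    return weighted / total


# Two bins: uplift 0.25 with 30 treated customers, 0.5 with 10.
wau = weighted_average_uplift_sketch([0.25, 0.5], [30, 10])
```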

sklift.metrics.uplift_by_percentile
sklift.metrics.metrics.uplift_by_percentile(y_true, uplift, treatment, strategy='overall', bins=10, std=False, total=False, string_percentiles=True)[source]

Compute metrics: uplift, group size, group response rate, standard deviation at each percentile.

Metrics in columns and percentiles in rows of pandas DataFrame:

  • n_treatment, n_control - group sizes.

  • response_rate_treatment, response_rate_control - group response rates.

  • uplift - treatment response rate minus control response rate.

  • std_treatment, std_control - (optional) response rates standard deviation.

  • std_uplift - (optional) uplift standard deviation.

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • strategy (string, ['overall', 'by_group']) –

    Determines the calculating strategy. Default is ‘overall’.

    • 'overall':

      Takes the first k observations of all test data ordered by uplift prediction (across both groups, control and treatment) and calculates conversions in the treatment and control groups only on them. Then the difference between these conversions is calculated.

    • 'by_group':

      Separately calculates conversions in the top k observations of each group (control and treatment) sorted by uplift predictions. Then the difference between these conversions is calculated.

  • std (bool) – If True, add columns with the uplift standard deviation and the response rate standard deviation. Default is False.

  • total (bool) – If True, add a last row with the total values. Default is False. The total uplift is computed as the total treatment response rate minus the total control response rate, where a total response rate is the response rate over the full data.

  • bins (int) – Determines the number of bins (and the relative percentile) in the data. Default is 10.

  • string_percentiles (bool) – type of percentiles in the index: float or string. Default is True (string).

Returns

DataFrame where metrics are by columns and percentiles are by rows.

Return type

pandas.DataFrame
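The per-percentile breakdown can be sketched without pandas as follows. This is a simplified illustration of the 'overall' strategy only, not the library's implementation: the function names are invented, rows are plain dicts instead of a DataFrame, and every bin is assumed to contain both treated and control samples.

```python
def split_into_bins(items, bins):
    """Split items into `bins` consecutive chunks of near-equal size."""
    base, extra = divmod(len(items), bins)
    chunks, start = [], 0
    for b in range(bins):
        size = base + (1 if b < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


def uplift_by_bin(y_true, uplift, treatment, bins=10):
    # Order all samples by predicted uplift, highest first.
    order = sorted(range(len(y_true)), key=lambda i: uplift[i], reverse=True)
    rows = []
    for chunk in split_into_bins(order, bins):
        y_t = [y_true[i] for i in chunk if treatment[i] == 1]
        y_c = [y_true[i] for i in chunk if treatment[i] == 0]
        rate_t = sum(y_t) / len(y_t)
        rate_c = sum(y_c) / len(y_c)
        rows.append({
            'n_treatment': len(y_t),
            'n_control': len(y_c),
            'response_rate_treatment': rate_t,
            'response_rate_control': rate_c,
            'uplift': rate_t - rate_c,
        })
    return rows


y_true = [1, 1, 0, 1, 0, 0, 1, 0]
uplift = [0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1]
treatment = [1, 1, 0, 0, 1, 1, 0, 0]
rows = uplift_by_bin(y_true, uplift, treatment, bins=2)
```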

sklift.metrics.response_rate_by_percentile
sklift.metrics.metrics.response_rate_by_percentile(y_true, uplift, treatment, group, strategy='overall', bins=10)[source]

Compute response rate (target mean in the control or treatment group) at each percentile.

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • group (string, ['treatment', 'control']) –

    Group type for computing response rate: treatment or control.

    • 'treatment':

      Values equal 1 in the treatment column.

    • 'control':

      Values equal 0 in the treatment column.

  • strategy (string, ['overall', 'by_group']) –

    Determines the calculating strategy. Default is ‘overall’.

    • 'overall':

      Takes the first k observations of all test data ordered by uplift prediction (across both groups, control and treatment) and calculates conversions in the treatment and control groups only on them. Then the difference between these conversions is calculated.

    • 'by_group':

      Separately calculates conversions in the top k observations of each group (control and treatment) sorted by uplift predictions. Then the difference between these conversions is calculated.

  • bins (int) – Determines the number of bins (and relative percentile) in the data. Default is 10.

Returns

response rate at each percentile for control or treatment group, variance of the response rate at each percentile, group size at each percentile.

Return type

array (shape = [>2]), array (shape = [>2]), array (shape = [>2])
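For a single percentile, the three returned quantities can be illustrated as below. The sketch assumes the variance is the usual variance of a sample proportion, p * (1 - p) / n; that formula and the function name are assumptions for the example, not a statement of what the library computes.

```python
def response_rate_stats(y_group):
    """Return (response rate, variance of the rate, group size).

    y_group holds the binary targets of one group (treatment or control)
    inside one percentile. The variance formula is an assumption:
    the variance of a sample proportion, p * (1 - p) / n.
    """
    n = len(y_group)
    rate = sum(y_group) / n
    variance = rate * (1.0 - rate) / n
    return rate, variance, n


rate, variance, size = response_rate_stats([1, 0, 1, 1])
```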

sklift.metrics.treatment_balance_curve
sklift.metrics.metrics.treatment_balance_curve(uplift, treatment, winsize)[source]

Compute the treatment balance curve: proportion of treatment group in the ordered predictions.

Parameters
  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • winsize (int) – Size of the sliding window for calculating the balance between treatment and control.

Returns

Points on a curve.

Return type

array (shape = [>2]), array (shape = [>2])
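The idea of the balance curve can be sketched as a moving average of the treatment flag over samples ordered by predicted uplift. This is an illustration only: the function name is invented, only the balance values are returned, and the library's exact windowing may differ.

```python
def treatment_balance_sketch(uplift, treatment, winsize):
    """Share of treated samples in each sliding window of ordered predictions."""
    # Order the treatment flags by predicted uplift, highest first.
    order = sorted(range(len(uplift)), key=lambda i: uplift[i], reverse=True)
    ordered = [treatment[i] for i in order]
    # Mean of the flag inside every full window of length winsize.
    return [
        sum(ordered[start:start + winsize]) / winsize
        for start in range(len(ordered) - winsize + 1)
    ]


balance = treatment_balance_sketch(
    [0.5, 0.9, 0.7, 0.8, 0.6], [1, 1, 0, 1, 0], winsize=2
)
```

Values that stay close to the overall treatment share suggest the model does not simply separate treated from control samples.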

sklift.metrics.average_squared_deviation
sklift.metrics.metrics.average_squared_deviation(y_true_train, uplift_train, treatment_train, y_true_val, uplift_val, treatment_val, strategy='overall', bins=10)[source]

Compute the average squared deviation.

The average squared deviation (ASD) is a model stability metric that shows how much the model overfits the training data. Larger values of ASD mean greater overfit.

Parameters
  • y_true_train (1d array-like) – Correct (true) target values for training set.

  • uplift_train (1d array-like) – Predicted uplift for training set, as returned by a model.

  • treatment_train (1d array-like) – Treatment labels for training set.

  • y_true_val (1d array-like) – Correct (true) target values for validation set.

  • uplift_val (1d array-like) – Predicted uplift for validation set, as returned by a model.

  • treatment_val (1d array-like) – Treatment labels for validation set.

  • strategy (string, ['overall', 'by_group']) –

    Determines the calculating strategy. Default is ‘overall’.

    • 'overall':

      Takes the first k observations of all test data ordered by uplift prediction (across both groups, control and treatment) and calculates conversions in the treatment and control groups only on them. Then the difference between these conversions is calculated.

    • 'by_group':

      Separately calculates conversions in the top k observations of each group (control and treatment) sorted by uplift predictions. Then the difference between these conversions is calculated.

  • bins (int) – Determines the number of bins (and the relative percentile) in the data. Default is 10.

Returns

average squared deviation

Return type

float
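One way to read this metric is as the mean squared difference between per-percentile uplift on the training set and on the validation set. The sketch below follows that reading; the formula, the function name and the input values are assumptions made for illustration.

```python
def average_squared_deviation_sketch(uplift_bins_train, uplift_bins_val):
    """Mean squared difference between per-bin uplift on train and validation.

    Both inputs hold one uplift value per percentile, in the same order.
    """
    squared = [
        (t - v) ** 2 for t, v in zip(uplift_bins_train, uplift_bins_val)
    ]
    return sum(squared) / len(squared)


# Hypothetical per-bin uplift: the first bin looks better on train than on val.
asd = average_squared_deviation_sketch([0.5, 0.25], [0.25, 0.25])
```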

References

René Michel, Igor Schnakenburg, Tobias von Martens. Targeting Uplift. An Introduction to Net Scores.

sklift.metrics.max_prof_uplift
sklift.metrics.metrics.max_prof_uplift(df_sorted, treatment_name, churn_name, pos_outcome, benefit, c_incentive, c_contact, a_cost=0)[source]

Compute the maximum profit generated from a campaign targeted by an uplift model.

This can be visualised by plotting plt.plot(perc, cumulative_profit)

Parameters
  • df_sorted (pandas dataframe) – dataframe with descending uplift predictions for each customer (i.e. highest 1st)

  • treatment_name (string) – column name of treatment column (assuming 1 = treated)

  • churn_name (string) – column name of churn column

  • pos_outcome (int or float) – 1 or 0 value in churn column indicating a positive outcome (i.e. purchase = 1, whereas churn = 0)

  • benefit (int or float) – the benefit of retaining a customer (e.g., the average customer lifetime value)

  • c_incentive (int or float) – the cost of the incentive if a customer accepts the offer

  • c_contact (int or float) – the cost of contacting a customer regardless of conversion

  • a_cost (int or float) – the fixed administration cost for the campaign

Returns

1d array-like: the incremental increase in x, for plotting

1d array-like: the cumulative profit per customer

Return type

1d array-like, 1d array-like

References

Floris Devriendt, Jeroen Berrevoets, Wouter Verbeke. Why you should stop predicting customer churn and start using uplift models.

sklift.metrics.make_uplift_scorer
sklift.metrics.metrics.make_uplift_scorer(metric_name, treatment, **kwargs)[source]

Make uplift scorer which can be used with the same API as sklearn.metrics.make_scorer.

Parameters
  • metric_name (string) – Name of desirable uplift metric. Raise ValueError if invalid.

  • treatment (pandas.Series) – A Series from original DataFrame which contains original index and treatment group column.

  • kwargs (additional arguments) – Additional parameters to be passed to the metric function, for example negative_effect, strategy or k.

Returns

An uplift scorer with passed treatment variable (and kwargs, optionally) that returns a scalar score.

Return type

scorer (callable)

Raises
  • ValueError – if metric_name is not present in the metrics list.

  • ValueError – if treatment is not a pandas Series.

Example:

from sklearn.model_selection import cross_validate
from sklift.metrics import make_uplift_scorer

# define X_cv, y_cv, trmnt_cv and estimator

# Use make_uplift_scorer to initialize new `sklearn.metrics.make_scorer` object
qini_scorer = make_uplift_scorer("qini_auc_score", trmnt_cv)
# or pass additional parameters if necessary
uplift50_scorer = make_uplift_scorer("uplift_at_k", trmnt_cv, strategy='overall', k=0.5)

# Use this object in model selection functions
cross_validate(estimator,
   X=X_cv,
   y=y_cv,
   fit_params={'treatment': trmnt_cv},
   scoring=qini_scorer,
)

sklift.viz

sklift.viz.plot_uplift_preds
sklift.viz.base.plot_uplift_preds(trmnt_preds, ctrl_preds, log=False, bins=100)[source]

Plot histograms of treatment, control and uplift predictions.

Parameters
  • trmnt_preds (1d array-like) – Predictions for all observations if they are treatment.

  • ctrl_preds (1d array-like) – Predictions for all observations if they are control.

  • log (bool) – Logarithm of source samples. Default is False.

  • bins (integer or sequence) – Number of histogram bins to be used. If an integer is given, bins + 1 bin edges are calculated and returned. If bins is a sequence, it gives the bin edges, including the left edge of the first bin and the right edge of the last bin; in this case, bins is returned unmodified. Default is 100.

Returns

Object that stores computed values.

sklift.viz.plot_qini_curve
sklift.viz.base.plot_qini_curve(y_true, uplift, treatment, random=True, perfect=True, negative_effect=True, ax=None, name=None, **kwargs)[source]

Plot Qini curves from predictions.

Parameters
  • y_true (1d array-like) – Ground truth (correct) binary labels.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • random (bool) – Draw a random curve. Default is True.

  • perfect (bool) – Draw a perfect curve. Default is True.

  • negative_effect (bool) – If True, optimum Qini Curve contains the negative effects (negative uplift because of campaign). Otherwise, optimum Qini Curve will not contain the negative effects. Default is True.

  • ax (object) – The graph on which the function will be built. Default is None.

  • name (string) – The name of the function. Default is None.

Returns

Object that stores computed values.

Example:

from sklift.viz import plot_qini_curve


qini_disp = plot_qini_curve(
    y_test, uplift_predicted, trmnt_test,
    perfect=True, name='Model name'
);

qini_disp.figure_.suptitle("Qini curve");

sklift.viz.plot_uplift_curve
sklift.viz.base.plot_uplift_curve(y_true, uplift, treatment, random=True, perfect=True, ax=None, name=None, **kwargs)[source]

Plot Uplift curves from predictions.

Parameters
  • y_true (1d array-like) – Ground truth (correct) binary labels.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • random (bool) – Draw a random curve. Default is True.

  • perfect (bool) – Draw a perfect curve. Default is True.

  • ax (object) – The graph on which the function will be built. Default is None.

  • name (string) – The name of the function. Default is None.

Returns

Object that stores computed values.

Example:

from sklift.viz import plot_uplift_curve


uplift_disp = plot_uplift_curve(
    y_test, uplift_predicted, trmnt_test,
    perfect=True, name='Model name'
);

uplift_disp.figure_.suptitle("Uplift curve");

sklift.viz.plot_treatment_balance_curve
sklift.viz.base.plot_treatment_balance_curve(uplift, treatment, random=True, winsize=0.1)[source]

Plot Treatment Balance curve.

Parameters
  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • random (bool) – Draw a random curve. Default is True.

  • winsize (float) – Size of the sliding window to apply. Should be between 0 and 1, extremes excluded. Default is 0.1.

Returns

Object that stores computed values.

sklift.viz.plot_uplift_by_percentile
sklift.viz.base.plot_uplift_by_percentile(y_true, uplift, treatment, strategy='overall', kind='line', bins=10, string_percentiles=True)[source]

Plot uplift score, treatment response rate and control response rate at each percentile.

Treatment response rate is the target mean in the treatment group. Control response rate is the target mean in the control group. Uplift score is the difference between the treatment response rate and the control response rate.

Parameters
  • y_true (1d array-like) – Correct (true) binary target values.

  • uplift (1d array-like) – Predicted uplift, as returned by a model.

  • treatment (1d array-like) – Treatment labels.

  • strategy (string, ['overall', 'by_group']) –

    Determines the calculating strategy. Default is ‘overall’.

    • 'overall':

      Takes the first k observations of all test data ordered by uplift prediction (across both groups, control and treatment) and calculates conversions in the treatment and control groups only on them. Then the difference between these conversions is calculated.

    • 'by_group':

      Separately calculates conversions in the top k observations of each group (control and treatment) sorted by uplift predictions. Then the difference between these conversions is calculated.

  • kind (string, ['line', 'bar']) –

    The type of plot to draw. Default is ‘line’.

    • 'line':

      Generates a line plot.

    • 'bar':

      Generates a traditional bar-style plot.

  • bins (int) – Determines the number of bins (and the relative percentile) in the test data. Default is 10.

  • string_percentiles (bool) – type of xticks: float or string to plot. Default is True (string).

Returns

Object that stores computed values.

sklift.datasets

sklift.datasets.clear_data_dir
sklift.datasets.datasets.clear_data_dir(path=None)[source]

Delete all the content of the data home cache.

Parameters

path (str) – The path to scikit-uplift data dir

sklift.datasets.get_data_dir
sklift.datasets.datasets.get_data_dir()[source]

Return the path of the scikit-uplift data dir.

This folder is used by some large dataset loaders to avoid downloading the data several times.

By default the data dir is set to a folder named scikit-uplift-data in the user home folder.

Returns

The path to scikit-uplift data dir.

Return type

string
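The default location described above can be reproduced with pathlib. This is only a sketch of the stated convention (a folder named scikit-uplift-data in the user home folder); the helper name is hypothetical and the snippet does not call the library.

```python
from pathlib import Path


def default_data_dir():
    """Build the default path: <user home>/scikit-uplift-data."""
    return Path.home() / "scikit-uplift-data"


data_dir = default_data_dir()
```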

sklift.datasets.fetch_lenta
sklift.datasets.datasets.fetch_lenta(data_home=None, dest_subdir=None, download_if_missing=True, return_X_y_t=False)[source]

Load and return the Lenta dataset (classification).

An uplift modeling dataset containing data about Lenta’s customers’ grocery shopping and related marketing campaigns.

Major columns:

  • group (str): treatment/control group flag

  • response_att (binary): target

  • gender (str): customer gender

  • age (float): customer age

  • main_format (int): store type (1 - grocery store, 0 - superstore)

Read more in the docs.

Parameters
  • data_home (str) – The path to the folder where datasets are stored.

  • dest_subdir (str) – The name of the folder in which the dataset is stored.

  • download_if_missing (bool) – Download the data if not present. Raises an IOError if False and data is missing.

  • return_X_y_t (bool) – If True, returns (data, target, treatment) instead of a Bunch object.

Returns

dataset.

Bunch:

By default dictionary-like object, with the following attributes:

  • data (DataFrame object): Dataset without target and treatment.

  • target (Series object): Column target by values.

  • treatment (Series object): Column treatment by values.

  • DESCR (str): Description of the Lenta dataset.

  • feature_names (list): Names of the features.

  • target_name (str): Name of the target.

  • treatment_name (str): Name of the treatment.

Tuple:

tuple (data, target, treatment) if return_X_y_t is True

Return type

Bunch or tuple

Example:

from sklift.datasets import fetch_lenta


dataset = fetch_lenta()
data, target, treatment = dataset.data, dataset.target, dataset.treatment

# alternative option
data, target, treatment = fetch_lenta(return_X_y_t=True)

See also

fetch_x5(): Load and return the X5 RetailHero dataset (classification).

fetch_criteo(): Load and return the Criteo Uplift Prediction Dataset (classification).

fetch_hillstrom(): Load and return Kevin Hillstrom Dataset MineThatData (classification or regression).

fetch_megafon(): Load and return the MegaFon Uplift Competition dataset (classification).

Lenta Uplift Modeling Dataset
Data description

An uplift modeling dataset containing data about Lenta’s customers’ grocery shopping and related marketing campaigns.

Source: BigTarget Hackathon hosted by Lenta and Microsoft in summer 2020.

Fields

Major features:

  • group (str): treatment/control group flag

  • response_att (binary): target

  • gender (str): customer gender

  • age (float): customer age

  • main_format (int): store type (1 - grocery store, 0 - superstore)

Feature: Description

  • CardHolder: customer id

  • age: customer age

  • children: number of children

  • cheque_count_[3,6,12]m_g*: number of customer receipts collected within the last 3, 6, 12 months before the campaign; g* is a product group

  • crazy_purchases_cheque_count_[1,3,6,12]m: number of customer receipts with items purchased on the “crazy” marketing campaign, collected within the last 1, 3, 6, 12 months before the campaign

  • crazy_purchases_goods_count_[6,12]m: number of items purchased on the “crazy” marketing campaign, collected within the last 6, 12 months before the campaign

  • disc_sum_6m_g34: discount sum for the past 6 months on product group 34

  • food_share_[15d,1m]: food share in customer purchases for 15 days, 1 month

  • gender: customer gender

  • group: treatment/control group flag

  • k_var_cheque_[15d,3m]: coefficient of variation of the average check for 15 days, 3 months

  • k_var_cheque_category_width_15d: coefficient of variation of the average number of purchased categories (2nd level of the hierarchy) in one receipt for 15 days

  • k_var_cheque_group_width_15d: coefficient of variation of the average number of purchased groups (1st level of the hierarchy) in one receipt for 15 days

  • k_var_count_per_cheque_[15d,1m,3m,6m]_g*: coefficient of variation of unique product ids (SKU) for 15 days, 1, 3, 6 months for the g* product group

  • k_var_days_between_visits_[15d,1m,3m]: coefficient of variation of the average period between visits for 15 days, 1 month, 3 months

  • k_var_disc_per_cheque_15d: coefficient of variation of the discount sum for 15 days

  • k_var_disc_share_[15d,1m,3m,6m,12m]_g*: coefficient of variation of the discount amount for 15 days, 1 month, 3 months, 6 months, 12 months for the g* product group

  • k_var_discount_depth_[15d,1m]: coefficient of variation of the discount amount for 15 days, 1 month

  • k_var_sku_per_cheque_15d: coefficient of variation of the number of unique product ids (SKU) for 15 days

  • k_var_sku_price_12m_g*: coefficient of variation of the price for 15 days, 3, 6, 12 months for the g* product group

  • main_format: store type (1 - grocery store, 0 - superstore)

  • mean_discount_depth_15d: mean discount depth for 15 days

  • months_from_register: number of months since registration

  • perdelta_days_between_visits_15_30d: time delta in percent between visits during the first half of the month and visits during the second half of the month

  • promo_share_15d: promo goods share in the customer bucket

  • response_att: binary target variable = store visit

  • response_sms: share of customer responses to previous SMS. Response = store visit

  • response_viber: share of responses to previous Viber messages. Response = store visit

  • sale_count_[3,6,12]m_g*: number of purchased items from the group * for 3, 6, 12 months

  • sale_sum_[3,6,12]m_g*: sum of sales from the group * for 3, 6, 12 months

  • stdev_days_between_visits_15d: coefficient of variation of the days between visits for 15 days

  • stdev_discount_depth_[15d,1m]: coefficient of variation of the discount sum for 15 days, 1 month

Key figures
  • Format: CSV

  • Size: 153M (compressed) 567M (uncompressed)

  • Rows: 687,029

  • Response Ratio: .1

  • Treatment Ratio: .75
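
The two ratios above are the share of ones in the binary target and in the treatment flag. The following is a minimal sketch of how such figures could be computed. The helper `binary_ratio` and the toy values are illustrative assumptions, not part of sklift, and the treatment is assumed to be already encoded as 0/1 (in the raw data `group` is a string flag).

```python
def binary_ratio(flags):
    """Return the share of ones in a sequence of 0/1 flags."""
    flags = list(flags)
    if not flags:
        raise ValueError("flags must not be empty")
    return sum(flags) / len(flags)


# Toy data (made up): 1 = store visit / 1 = treatment group.
target = [0, 0, 1, 0, 0, 0, 0, 0]
treatment = [1, 1, 1, 0, 1, 1, 0, 1]

response_ratio = binary_ratio(target)      # share of customers who responded
treatment_ratio = binary_ratio(treatment)  # share of customers who were treated
```

The helper accepts any iterable of 0/1 values, so a column such as `dataset.target` could be passed in the same way.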

About Lenta

Lenta (Russian: Лентa) is a Russian super- and hypermarket chain. With 149 locations across the country, it is one of Russia’s largest retail chains and the country’s second largest hypermarket chain.

Link to the company’s website: https://www.lenta.com/

sklift.datasets.fetch_x5
sklift.datasets.datasets.fetch_x5(data_home=None, dest_subdir=None, download_if_missing=True)

Load and return the X5 RetailHero dataset (classification).

The dataset contains raw retail customer purchases, raw information about products and general info about customers.

Major columns:

  • treatment_flg (binary): treatment/control group flag

  • target (binary): target

  • customer_id (str): customer id - primary key for joining

Read more in the docs.

Parameters
  • data_home (str, unicode) – The path to the folder where datasets are stored.

  • dest_subdir (str, unicode) – The name of the folder in which the dataset is stored.

  • download_if_missing (bool) – Download the data if not present. Raises an IOError if False and data is missing.

Returns

dataset.

Dictionary-like object, with the following attributes.

  • data (Bunch object): dictionary-like object without target and treatment:

    • clients (ndarray or DataFrame object): General info about clients.

    • train (ndarray or DataFrame object): A subset of clients for training.

    • purchases (ndarray or DataFrame object): clients’ purchase history prior to communication.

  • target (Series object): Column target by values.

  • treatment (Series object): Column treatment by values.

  • DESCR (str): Description of the X5 dataset.

  • feature_names (Bunch object): Names of the features.

  • target_name (str): Name of the target.

  • treatment_name (str): Name of the treatment.

Return type

Bunch

References

https://ods.ai/competitions/x5-retailhero-uplift-modeling/data

Example:

from sklift.datasets import fetch_x5


dataset = fetch_x5()
data, target, treatment = dataset.data, dataset.target, dataset.treatment

# data - dictionary-like object
# data contains general info about clients:
clients = data.clients

# data contains a subset of clients for training:
train = data.train

# data contains the clients' purchase history prior to communication:
purchases = data.purchases

See also

fetch_lenta(): Load and return the Lenta dataset (classification).

fetch_criteo(): Load and return the Criteo Uplift Prediction Dataset (classification).

fetch_hillstrom(): Load and return Kevin Hillstrom Dataset MineThatData (classification or regression).

fetch_megafon(): Load and return the MegaFon Uplift Competition dataset (classification).

X5 RetailHero Uplift Modeling Dataset

The dataset was provided by X5 Retail Group for the RetailHero hackathon held in winter 2019.

The dataset contains raw retail customer purchases, raw information about products and general info about customers.

Machine learning competition website.

Data description

Data contains several parts:

  • train.csv: a subset of clients for training. The column treatment_flg indicates if there was a communication. The column target shows if there was a purchase afterward;

  • clients.csv: general info about clients;

  • purchases.csv: clients’ purchase history prior to communication.
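
Since `customer_id` is the key shared by these tables, client attributes can be attached to the labelled training rows by joining on it. Below is a plain-Python sketch of that join on made-up rows. The `age` column and the helper `join_on_customer_id` are assumptions for illustration only. With DataFrames the same step is usually done with a merge or an index-aligned join.

```python
def join_on_customer_id(train_rows, client_rows):
    """Left-join train rows with client attributes on customer_id."""
    clients_by_id = {row["customer_id"]: row for row in client_rows}
    joined = []
    for row in train_rows:
        client = clients_by_id.get(row["customer_id"], {})
        # Client attributes go first, so train columns win on a name clash.
        joined.append({**client, **row})
    return joined


# Toy rows (made up); "age" is a hypothetical client attribute.
clients = [
    {"customer_id": "a1", "age": 34},
    {"customer_id": "b2", "age": 51},
    {"customer_id": "c3", "age": 27},
]
train = [
    {"customer_id": "a1", "treatment_flg": 1, "target": 1},
    {"customer_id": "c3", "treatment_flg": 0, "target": 0},
]

rows = join_on_customer_id(train, clients)
```

Only clients present in the training subset appear in the result, which matches the description of `train` as a subset of clients.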

X5 table schema
Fields
  • treatment_flg (binary): information on performed communication

  • target (binary): customer purchasing

Key figures
  • Format: CSV

  • Size: 647M (compressed) 4.17GB (uncompressed)

  • Rows:

    • in ‘clients.csv’: 400,162

    • in ‘purchases.csv’: 45,786,568

    • in ‘uplift_train.csv’: 200,039

  • Response Ratio: .62

  • Treatment Ratio: .5

About X5

X5 Group is a leading Russian food retailer. The company operates several retail formats: proximity stores under the Pyaterochka brand, supermarkets under the Perekrestok brand and hypermarkets under the Karusel brand, as well as the Perekrestok.ru online market, the 5Post parcel service, and the Dostavka.Pyaterochka and Perekrestok.Bystro food delivery services.

Link to the company’s website: https://www.x5.ru/

sklift.datasets.fetch_criteo
sklift.datasets.datasets.fetch_criteo(target_col='visit', treatment_col='treatment', data_home=None, dest_subdir=None, download_if_missing=True, percent10=False, return_X_y_t=False)

Load and return the Criteo Uplift Prediction Dataset (classification).

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising.

Major columns:

  • treatment (binary): treatment

  • exposure (binary): treatment

  • visit (binary): target

  • conversion (binary): target

  • f0, ... , f11 (float): feature values

Read more in the docs.

Parameters
  • target_col (string, 'visit', 'conversion' or 'all', default='visit') – Selects which column of the dataset is used as the target. If 'all', returns a DataFrame with all target columns.

  • treatment_col (string, 'treatment', 'exposure' or 'all', default='treatment') – Selects which column of the dataset is used as the treatment. If 'all', returns a DataFrame with all treatment columns.

  • data_home (string) – Specify a download and cache folder for the datasets.

  • dest_subdir (string) – The name of the folder in which the dataset is stored.

  • download_if_missing (bool, default=True) – If False, raise an IOError if the data is not locally available instead of trying to download the data from the source site.

  • percent10 (bool, default=False) – Whether to load only 10 percent of the data.

  • return_X_y_t (bool, default=False) – If True, returns (data, target, treatment) instead of a Bunch object.

Returns

dataset.

Bunch:

By default dictionary-like object, with the following attributes:

  • data (DataFrame object): Dataset without target and treatment.

  • target (Series or DataFrame object): Column target by values.

  • treatment (Series or DataFrame object): Column treatment by values.

  • DESCR (str): Description of the Criteo dataset.

  • feature_names (list): Names of the features.

  • target_name (str or list): Name of the target.

  • treatment_name (str or list): Name of the treatment.

Tuple:

tuple (data, target, treatment) if return_X_y_t is True

Return type

Bunch or tuple

Example:

from sklift.datasets import fetch_criteo


dataset = fetch_criteo(target_col='conversion', treatment_col='exposure')
data, target, treatment = dataset.data, dataset.target, dataset.treatment

# alternative option
data, target, treatment = fetch_criteo(target_col='conversion', treatment_col='exposure', return_X_y_t=True)

References

Diemert Eustache, Betlei Artem et al. [2018]

Diemert Eustache, Betlei Artem, Christophe Renaudin, and Amini Massih-Reza. A large scale benchmark for uplift modeling. In Proceedings of the AdKDD and TargetAd Workshop, KDD, London, United Kingdom, August 20, 2018. ACM, 2018.

See also

fetch_lenta(): Load and return the Lenta dataset (classification).

fetch_x5(): Load and return the X5 RetailHero dataset (classification).

fetch_hillstrom(): Load and return Kevin Hillstrom Dataset MineThatData (classification or regression).

fetch_megafon(): Load and return the MegaFon Uplift Competition dataset (classification).

Criteo Uplift Modeling Dataset

This is a copy of Criteo AI Lab Uplift Prediction dataset.

Data description

This dataset is constructed by assembling data resulting from several incrementality tests, a particular randomized trial procedure where a random part of the population is prevented from being targeted by advertising.

Fields

Here is a detailed description of the fields (they are comma-separated in the file):

  • f0, f1, f2, f3, f4, f5, f6, f7, f8, f9, f10, f11: feature values (dense, float)

  • treatment: treatment group. Flag if a company participates in the RTB auction for a particular user (binary: 1 = treated, 0 = control)

  • exposure: treatment effect, whether the user has been effectively exposed. Flag if a company wins in the RTB auction for the user (binary)

  • conversion: whether a conversion occurred for this user (binary, label)

  • visit: whether a visit occurred for this user (binary, label)

Key figures
  • Format: CSV

  • Size: 297M (compressed) 3.2GB (uncompressed)

  • Rows: 13,979,592

  • Response Ratio:

    • Average Visit Rate: .046992

    • Average Conversion Rate: .00292

  • Treatment Ratio: .85
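
Because the treatment was assigned at random, a simple baseline for the average effect is the difference between the response rates of the treated and control groups. The sketch below illustrates this on made-up values. `average_uplift` is a hypothetical helper, not an sklift function, and the numbers are not taken from the Criteo data.

```python
def average_uplift(target, treatment):
    """Difference in mean response between treated (1) and control (0) rows."""
    treated = [y for y, t in zip(target, treatment) if t == 1]
    control = [y for y, t in zip(target, treatment) if t == 0]
    if not treated or not control:
        raise ValueError("both groups must be non-empty")
    return sum(treated) / len(treated) - sum(control) / len(control)


# Toy data (made up): visit flags and treatment flags for eight users.
visit = [1, 0, 1, 0, 0, 1, 0, 0]
treatment = [1, 1, 1, 1, 0, 0, 0, 0]

uplift = average_uplift(visit, treatment)  # 0.5 treated rate minus 0.25 control rate
```

This gives a single population-level number. Uplift models aim to estimate how this difference varies from user to user.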

This dataset is released along with the paper “A Large Scale Benchmark for Uplift Modeling” by Eustache Diemert, Artem Betlei, Christophe Renaudin (Criteo AI Lab), and Massih-Reza Amini (LIG, Grenoble INP). This work was published at the AdKDD 2018 Workshop, held in conjunction with KDD 2018.

About Criteo

Criteo is an advertising company that provides online display advertisements. The company was founded and is headquartered in Paris, France. Criteo’s product is a form of display advertising, which displays interactive banner advertisements, generated based on the online browsing preferences and behaviour for each customer. The solution operates on a pay per click/cost per click (CPC) basis.

Link to the company’s website: https://www.criteo.com/

sklift.datasets.fetch_hillstrom
sklift.datasets.datasets.fetch_hillstrom(target_col='visit', data_home=None, dest_subdir=None, download_if_missing=True, return_X_y_t=False)

Load and return Kevin Hillstrom Dataset MineThatData (classification or regression).

This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test.

Major columns:

  • visit (binary): target. 1/0 indicator, 1 = Customer visited website in the following two weeks.

  • conversion (binary): target. 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks.

  • spend (float): target. Actual dollars spent in the following two weeks.

  • segment (str): treatment. The e-mail campaign the customer received

Read more in the docs.

Parameters
  • target_col (string, 'visit', 'conversion', 'spend' or 'all', default='visit') – Selects which column of the dataset is used as the target.

  • data_home (str) – The path to the folder where datasets are stored.

  • dest_subdir (str) – The name of the folder in which the dataset is stored.

  • download_if_missing (bool) – Download the data if not present. Raises an IOError if False and data is missing.

  • return_X_y_t (bool, default=False) – If True, returns (data, target, treatment) instead of a Bunch object.

Returns

dataset.

Bunch:

By default dictionary-like object, with the following attributes:

  • data (DataFrame object): Dataset without target and treatment.

  • target (Series or DataFrame object): Column target by values.

  • treatment (Series object): Column treatment by values.

  • DESCR (str): Description of the Hillstrom dataset.

  • feature_names (list): Names of the features.

  • target_name (str or list): Name of the target.

  • treatment_name (str): Name of the treatment.

Tuple:

tuple (data, target, treatment) if return_X_y_t is True

Return type

Bunch or tuple

References

https://blog.minethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html

Example:

from sklift.datasets import fetch_hillstrom


dataset = fetch_hillstrom(target_col='visit')
data, target, treatment = dataset.data, dataset.target, dataset.treatment

# alternative option
data, target, treatment = fetch_hillstrom(target_col='visit', return_X_y_t=True)

See also

fetch_lenta(): Load and return the Lenta dataset (classification).

fetch_x5(): Load and return the X5 RetailHero dataset (classification).

fetch_criteo(): Load and return the Criteo Uplift Prediction Dataset (classification).

fetch_megafon(): Load and return the MegaFon Uplift Competition dataset (classification).

Kevin Hillstrom Dataset: MineThatData
Data description

This is a copy of MineThatData E-Mail Analytics And Data Mining Challenge dataset.

This dataset contains 64,000 customers who last purchased within twelve months. The customers were involved in an e-mail test.

  • 1/3 were randomly chosen to receive an e-mail campaign featuring Mens merchandise.

  • 1/3 were randomly chosen to receive an e-mail campaign featuring Womens merchandise.

  • 1/3 were randomly chosen to not receive an e-mail campaign.

During a period of two weeks following the e-mail campaign, results were tracked. Your job is to tell the world if the Mens or Womens e-mail campaign was successful.

Fields

Historical customer attributes at your disposal include:

  • Recency: Months since last purchase.

  • History_Segment: Categorization of dollars spent in the past year.

  • History: Actual dollar value spent in the past year.

  • Mens: 1/0 indicator, 1 = customer purchased Mens merchandise in the past year.

  • Womens: 1/0 indicator, 1 = customer purchased Womens merchandise in the past year.

  • Zip_Code: Classifies zip code as Urban, Suburban, or Rural.

  • Newbie: 1/0 indicator, 1 = New customer in the past twelve months.

  • Channel: Describes the channels the customer purchased from in the past year.

Another variable describes the e-mail campaign the customer received:

  • Segment

    • Mens E-Mail

    • Womens E-Mail

    • No E-Mail

Finally, we have a series of variables describing activity in the two weeks following delivery of the e-mail campaign:

  • Visit: 1/0 indicator, 1 = Customer visited website in the following two weeks.

  • Conversion: 1/0 indicator, 1 = Customer purchased merchandise in the following two weeks.

  • Spend: Actual dollars spent in the following two weeks.

Key figures
  • Format: CSV

  • Size: 433KB (compressed) 4,935KB (uncompressed)

  • Rows: 64,000

  • Response Ratio:

    • Average visit Rate: .15

    • Average conversion Rate: .009

    • the values in the spend column are unevenly distributed from 0.0 to 499.0

  • Treatment Ratio: The parts are distributed evenly between the three classes
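
Many uplift approaches expect a binary treatment, while Segment has three classes. One common way to handle this is to compare a single campaign against the control group and drop the rows of the other campaign. The following is a minimal sketch on made-up rows. The helper `binarize_segment` is hypothetical, and the label strings are the Segment values listed above.

```python
def binarize_segment(segments, treated_label, control_label="No E-Mail"):
    """Return (kept_indices, flags): rows of the chosen campaign get 1,
    control rows get 0, and all other rows are dropped."""
    kept, flags = [], []
    for i, seg in enumerate(segments):
        if seg == treated_label:
            kept.append(i)
            flags.append(1)
        elif seg == control_label:
            kept.append(i)
            flags.append(0)
    return kept, flags


# Toy data (made up) using the three Segment labels.
segments = ["Mens E-Mail", "No E-Mail", "Womens E-Mail", "No E-Mail", "Womens E-Mail"]
visit = [0, 0, 1, 1, 0]

kept, treatment = binarize_segment(segments, treated_label="Womens E-Mail")
target = [visit[i] for i in kept]  # keep the target aligned with the kept rows
```

Calling the helper once per campaign yields two separate binary problems, one for the Mens e-mail and one for the Womens e-mail.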

About Hillstrom

The dataset was provided by Kevin Hillstrom. Kevin is President of MineThatData, a consultancy that helps CEOs understand the complex relationship between Customers, Advertising, Products, Brands, and Channels.

Link to the blog: https://blog.minethatdata.com/

sklift.datasets.fetch_megafon
sklift.datasets.datasets.fetch_megafon(data_home=None, dest_subdir=None, download_if_missing=True, return_X_y_t=False)

Load and return the MegaFon Uplift Competition dataset (classification).

An uplift modeling dataset containing synthetic data generated by a telecom company and designed to resemble a real case the company encountered.

Major columns:

  • X_1...X_50 : anonymized feature set

  • conversion (binary): target

  • treatment_group (str): treatment/control group flag

Read more in the docs.

Parameters
  • data_home (str) – The path to the folder where datasets are stored.

  • dest_subdir (str) – The name of the folder in which the dataset is stored.

  • download_if_missing (bool) – Download the data if not present. Raises an IOError if False and data is missing.

  • return_X_y_t (bool) – If True, returns (data, target, treatment) instead of a Bunch object.

Returns

dataset.

Bunch:

By default dictionary-like object, with the following attributes:

  • data (DataFrame object): Dataset without target and treatment.

  • target (Series object): Column target by values.

  • treatment (Series object): Column treatment by values.

  • DESCR (str): Description of the Megafon dataset.

  • feature_names (list): Names of the features.

  • target_name (str): Name of the target.

  • treatment_name (str): Name of the treatment.

Tuple:

tuple (data, target, treatment) if return_X_y_t is True

Return type

Bunch or tuple

Example:

from sklift.datasets import fetch_megafon


dataset = fetch_megafon()
data, target, treatment = dataset.data, dataset.target, dataset.treatment

# alternative option
data, target, treatment = fetch_megafon(return_X_y_t=True)

See also

fetch_lenta(): Load and return the Lenta dataset (classification).

fetch_x5(): Load and return the X5 RetailHero dataset (classification).

fetch_criteo(): Load and return the Criteo Uplift Prediction Dataset (classification).

fetch_hillstrom(): Load and return Kevin Hillstrom Dataset MineThatData (classification or regression).

MegaFon Uplift Competition Dataset

Machine learning competition website.

Data description

The dataset was provided by MegaFon for the MegaFon Uplift Competition held in May 2021.

The dataset contains synthetic data generated to resemble a real case the company encountered.

Fields
  • X_1…X_50: anonymized feature set

  • treatment_group (str): treatment/control group flag

  • conversion (binary): customer purchasing

Key figures
  • Format: CSV

  • Size: 554M

  • Rows: 600,000

  • Response Ratio: .2

  • Treatment Ratio: .5

About MegaFon

MegaFon (Russian: МегаФон), previously known as North-West GSM, is the second largest mobile phone operator and the third largest telecom operator in Russia. It works with the GSM, UMTS and LTE standards. As of June 2012, the company served 62.1 million subscribers in Russia and 1.6 million in Tajikistan. It is headquartered in Moscow.

Link to the company’s website: https://megafon.ru/

Tutorials

Basic

It is best to start learning scikit-uplift with the basic tutorials.

An overview of the basic approaches to solving the uplift modeling problem

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

  • In Russian 🇷🇺: Open in Colab, nbviewer, GitHub

Uplift modeling metrics

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

Example of using a model from sklift.models in sklearn.pipeline

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

  • In Russian 🇷🇺: Open in Colab, nbviewer, GitHub

Example of using a model from sklift.models in sklearn.model_selection

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

Exploratory data analysis

The package contains various public datasets for uplift modeling. Below you will find Jupyter notebooks with EDA of these datasets and a simple baseline.

EDA of Lenta dataset

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

EDA of X5 dataset

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

EDA of Criteo dataset

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

EDA of Hillstrom dataset

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

EDA of Megafon dataset

  • In English 🇬🇧: Open in Colab, nbviewer, GitHub

Contributing to scikit-uplift

First off, thanks for taking the time to contribute! 🙌👍🎉

All development is done on GitHub: https://github.com/maks-sh/scikit-uplift.

Submitting a bug report or a feature request

We use GitHub issues to track all bugs and feature requests. Feel free to open an issue if you have found a bug or wish to see a feature implemented at https://github.com/maks-sh/scikit-uplift/issues.

Contributing code

How to contribute

The code in the master branch should match the current release, so please open pull requests against the dev branch.

  1. Fork the project repository.

  2. Clone your fork of the scikit-uplift repo from your GitHub account to your local disk:

    $ git clone https://github.com/YourName/scikit-uplift
    $ cd scikit-uplift
    
  3. Add the upstream remote. This saves a reference to the main scikit-uplift repository, which you can use to keep your repository synchronized with the latest changes:

    $ git remote add upstream https://github.com/maks-sh/scikit-uplift.git
    
  4. Synchronize your dev branch with the upstream dev branch:

    $ git checkout dev
    $ git pull upstream dev
    
  5. Create a feature branch to hold your development changes:

    $ git checkout -b feature/my_new_feature
    

    and start making changes. Always use a feature branch; it is good practice.

  6. Develop the feature on your feature branch on your computer, using Git for version control. When you’re done editing, add the changed files using git add . and then git commit. Then push the changes to your GitHub account with:

    $ git push -u origin feature/my_new_feature
    
  7. Create a pull request from your fork into the dev branch.

Styleguides
Python

We follow the PEP 8 style guide for Python. Docstrings follow the Google style.

Git Commit Messages
  • Use the present tense (“Add feature” not “Added feature”)

  • Use the imperative mood (“Move file to…” not “Moves file to…”)

  • Limit the first line to 72 characters or less

  • Reference issues and pull requests liberally after the first line

  • If you want to use emojis, use them at the beginning of the line.
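
Of these rules, the length limit is the easiest to check automatically. Here is a small sketch of such a check. `first_line_too_long` is a hypothetical helper, not something the project ships.

```python
MAX_SUBJECT_LENGTH = 72  # limit taken from the guideline above


def first_line_too_long(message, limit=MAX_SUBJECT_LENGTH):
    """Return True if the first line of a commit message exceeds the limit."""
    lines = message.splitlines()
    subject = lines[0] if lines else ""
    return len(subject) > limit


# Made-up commit messages for illustration.
ok = first_line_too_long("Add hash checker for datasets\n\nMore details in the body.")
too_long = first_line_too_long("Add " + "x" * 80)
```

Only the first line is measured, so the body of the message can be as long as needed.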

Release History

Legend for changelogs

  • 🔥 something big that you couldn’t do before.

  • 💥 something that you couldn’t do before.

  • 📝 a miscellaneous minor improvement.

  • 🔨 something that previously didn’t work as documented – or according to reasonable expectations – should now work.

  • ❗️ you will need to change your code to have the same effect in the future; or a feature will be removed in the future.

Version 0.5.1

sklift.models
sklift.datasets
User Guide

Version 0.5.0

sklift.models
sklift.metrics
sklift.datasets
  • 💥 Add checker based on hash for all datasets by @flashlight101

  • 📝 Add scheme of x5 dataframes.

Miscellaneous

Version 0.4.1

sklift.datasets
  • 🔨 Fix bug in dataset links.

  • 📝 Add an “About the company” section.

Version 0.4.0

sklift.metrics
sklift.viz
sklift.datasets
Miscellaneous

Version 0.3.2

sklift.datasets
sklift.metrics
sklift.viz
Miscellaneous

Version 0.3.1

sklift.datasets
sklift.metrics
Miscellaneous

Version 0.3.0

sklift.datasets
sklift.models
sklift.metrics
sklift.viz
User Guide
  • 📝 Fix typos

Version 0.2.0

User Guide
sklift.models
sklift.metrics
sklift.viz
Miscellaneous
  • 💥 Add contributors in main Readme and in main page of docs.

  • 💥 Add contributing guide.

  • 💥 Add code of conduct.

  • 📝 Reformat Tutorials page.

  • 📝 Add github buttons in docs.

  • 📝 Add logo compatibility with pypi.

Version 0.1.2

sklift.models
  • 🔨 Fix bugs in TwoModels for regression problem.

  • 📝 Minor code refactoring.

sklift.metrics
  • 📝 Minor code refactoring.

sklift.viz

Version 0.1.1

sklift.viz
sklift.metrics
Miscellaneous

Version 0.1.0

sklift.models
  • 📝 Fix typo in TwoModels docstring by @spiaz.

  • 📝 Improve docstrings and add references to all approaches.

sklift.metrics
sklift.viz
Miscellaneous
  • ❗️ Remove sklift.preprocess submodule.

  • 💥 Add compatibility of tutorials with colab and add colab buttons by @ElMaxuno.

  • 💥 Add Changelog.

  • 📝 Change the documentation structure. Add the following pages: Tutorials, Release History and Hall of Fame.

Hall of Fame

Here are links to competitions where scikit-uplift was used, along with the names of the winners and links to their solutions.

X5 Retail Hero: Uplift Modeling for Promotional Campaign

Predict how much the purchase probability could increase as a result of sending an advertising SMS.

  1. Kirill Liksakov

    solution


Papers and materials

  1. Gutierrez, P., & Gérardy, J. Y.

    Causal Inference and Uplift Modelling: A Review of the Literature. In International Conference on Predictive Applications and APIs (pp. 1-13).

  2. Artem Betlei, Criteo Research; Eustache Diemert, Criteo Research; Massih-Reza Amini, Univ. Grenoble Alpes

    Dependent and Shared Data Representations improve Uplift Prediction in Imbalanced Treatment Conditions FAIM’18 Workshop on CausalML.

  3. Eustache Diemert, Artem Betlei, Christophe Renaudin, and Massih-Reza Amini. 2018.

    A Large Scale Benchmark for Uplift Modeling. In Proceedings of AdKDD & TargetAd (ADKDD’18). ACM, New York, NY, USA, 6 pages.

  4. Athey, Susan, and Imbens, Guido. 2015.

    Machine learning methods for estimating heterogeneous causal effects. Preprint, arXiv:1504.01132.

  5. Oscar Mesalles Naranjo. 2012.

    Testing a New Metric for Uplift Models. Dissertation Presented for the Degree of MSc in Statistics and Operational Research.

  6. Kane, K., V. S. Y. Lo, and J. Zheng. 2014.

    Mining for the Truly Responsive Customers and Prospects Using True-Lift Modeling: Comparison of New and Existing Methods. Journal of Marketing Analytics 2 (4): 218–238.

  7. Maciej Jaskowski and Szymon Jaroszewicz.

    Uplift modeling for clinical trial data. ICML Workshop on Clinical Data Analysis, 2012.

  8. Lo, Victor. 2002.

    The True Lift Model - A Novel Data Mining Approach to Response Modeling in Database Marketing. SIGKDD Explorations. 4. 78-86.

  9. Zhao, Yan & Fang, Xiao & Simchi-Levi, David. 2017.

    Uplift Modeling with Multiple Treatments and General Response Types. 10.1137/1.9781611974973.66.

  10. Nicholas J Radcliffe. 2007.

    Using control groups to target on predicted lift: Building and assessing uplift models. Direct Marketing Analytics Journal, (3):14–21, 2007.

  11. Devriendt, F., Guns, T., & Verbeke, W. 2020.

    Learning to rank for uplift modeling. ArXiv, abs/2002.05897.


Tags

EN: uplift modeling, uplift modelling, causal inference, causal effect, causality, individual treatment effect, true lift, net lift, incremental modeling

RU: аплифт моделирование, Uplift модель

ZH: uplift增量建模, 因果推断, 因果效应, 因果关系, 个体干预因果效应, 真实增量, 净增量, 增量建模