Causal Inference: Basics

In a perfect world, we would measure the difference between a person’s reaction after receiving a communication and that same person’s reaction without it. But there is a problem: we cannot both send a communication (e.g. an e-mail) and not send it to the same person at the same time.

[Image: joke about Schrödinger's cat]

Let \(Y_i^1\) denote person \(i\)’s outcome when they receive the treatment (the communication is sent) and \(Y_i^0\) person \(i\)’s outcome when they receive no treatment (control, no communication). The causal effect \(\tau_i\) of the treatment versus no treatment is then given by:

\[\tau_i = Y_i^1 - Y_i^0\]
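To make the definition concrete, here is a minimal simulation sketch. It is only an illustration: the response probabilities (0.10 and 0.15) and the helper names `simulate_person` and `observe` are hypothetical and do not come from any library. In a simulation we can generate both potential outcomes and compute \(\tau_i\) directly; in real data only one of the two is ever available per person.

```python
import random

random.seed(0)

def simulate_person():
    """Return both potential outcomes (y0, y1) for one hypothetical person."""
    y0 = 1 if random.random() < 0.10 else 0  # responds without a communication
    y1 = 1 if random.random() < 0.15 else 0  # responds with a communication
    return y0, y1

def observe(person, sent_email):
    """In reality we only ever see the outcome matching what was actually done."""
    y0, y1 = person
    return y1 if sent_email else y0

people = [simulate_person() for _ in range(5)]

# Individual causal effects tau_i = Y_i^1 - Y_i^0.
# Computable here only because the simulation knows both potential outcomes.
individual_effects = [y1 - y0 for y0, y1 in people]

for (y0, y1), tau in zip(people, individual_effects):
    print(f"Y0={y0} Y1={y1} tau={tau}")
```

For a binary outcome each \(\tau_i\) is -1, 0 or 1, and `observe` returns only one of the two values that define it.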

Researchers are typically interested in estimating the Conditional Average Treatment Effect (CATE), that is, the expected causal effect of the treatment for a subgroup in the population:

\[CATE = E[Y_i^1 \vert X_i] - E[Y_i^0 \vert X_i]\]

where \(X_i\) is the feature vector describing the \(i\)-th person.

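We can observe neither the causal effect nor the CATE for the \(i\)-th person, because only one of the two potential outcomes is ever realized, and, accordingly, we can’t optimize them directly. But we can estimate the CATE, also called the uplift, of a person: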

\[\textbf{uplift} = \widehat{CATE} = E[Y_i \vert X_i = x, W_i = 1] - E[Y_i \vert X_i = x, W_i = 0]\]


where:

  • \(W_i \in \{0, 1\}\) is a binary variable: 1 if person \(i\) receives the treatment (treatment group), and 0 if person \(i\) receives no treatment (control group);
  • \(Y_i\) is person \(i\)’s observed outcome, which equals:
\[\begin{split}Y_i = W_i \cdot Y_i^1 + (1 - W_i) \cdot Y_i^0 = \begin{cases} Y_i^1, & \text{if } W_i = 1 \\ Y_i^0, & \text{if } W_i = 0 \end{cases}\end{split}\]
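The uplift estimate can be sketched as a difference of group means within one value of \(X_i\). The example below is a minimal sketch under stated assumptions: the data are synthetic, the single feature is a hypothetical customer segment, the response rates in `RESPONSE_RATES` are invented for illustration, and treatment is assigned by a fair coin flip. The helpers `simulate` and `uplift` are hypothetical names, not part of any library.

```python
import random

random.seed(42)

# Hypothetical response probabilities per segment: (P(Y^0 = 1), P(Y^1 = 1)).
RESPONSE_RATES = {"new": (0.05, 0.15), "loyal": (0.30, 0.32)}

def simulate(n):
    """Generate (x, w, y) rows with randomized treatment assignment."""
    rows = []
    for _ in range(n):
        x = random.choice(list(RESPONSE_RATES))
        w = random.randint(0, 1)          # assumption: fair coin-flip assignment
        p0, p1 = RESPONSE_RATES[x]
        y0 = int(random.random() < p0)    # potential outcome without treatment
        y1 = int(random.random() < p1)    # potential outcome with treatment
        y = w * y1 + (1 - w) * y0         # observed outcome: only one is revealed
        rows.append((x, w, y))
    return rows

def uplift(rows, x):
    """Estimate E[Y | X = x, W = 1] - E[Y | X = x, W = 0] by sample means."""
    treated = [y for xi, w, y in rows if xi == x and w == 1]
    control = [y for xi, w, y in rows if xi == x and w == 0]
    return sum(treated) / len(treated) - sum(control) / len(control)

rows = simulate(20000)
uplift_new = uplift(rows, "new")
uplift_loyal = uplift(rows, "loyal")
print(f"uplift(new)   = {uplift_new:.3f}")
print(f"uplift(loyal) = {uplift_loyal:.3f}")
```

Under these invented rates the true uplift is 0.10 for the `new` segment and 0.02 for `loyal`, so the estimates should land near those values, up to sampling noise.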

The difference in observed conditional means above identifies the CATE only if one is willing to assume that \(W_i\) is independent of \(Y_i^1\) and \(Y_i^0\) conditional on \(X_i\). This assumption is the so-called Unconfoundedness Assumption or the Conditional Independence Assumption (CIA) found in the social sciences and medical literature. It holds when treatment assignment is random conditional on \(X_i\). In short, this can be written as:

\[CIA : \{Y_i^0, Y_i^1\} \perp \!\!\! \perp W_i \vert X_i\]
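As a short worked step, here is why the CIA is enough to link the observable quantities to the CATE. The first equality uses the definition of the observed outcome \(Y_i\) (it equals \(Y_i^1\) when \(W_i = 1\) and \(Y_i^0\) when \(W_i = 0\)); the second uses the CIA to drop the conditioning on \(W_i\). The derivation also implicitly assumes that both treated and control persons exist at \(X_i = x\), so that both conditional expectations are defined.

\[\begin{split}E[Y_i \vert X_i = x, W_i = 1] - E[Y_i \vert X_i = x, W_i = 0] &= E[Y_i^1 \vert X_i = x, W_i = 1] - E[Y_i^0 \vert X_i = x, W_i = 0] \\ &= E[Y_i^1 \vert X_i = x] - E[Y_i^0 \vert X_i = x]\end{split}\]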

We also introduce some additional useful notation. Define the propensity score \(p(X_i) = P(W_i = 1 \vert X_i)\), i.e. the probability of receiving the treatment given \(X_i\).
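For a single discrete feature, the propensity score can be estimated as the share of treated persons within each feature value. The sketch below illustrates this on synthetic data; the assignment policy in `TRUE_PROPENSITY` and the helper names are hypothetical. With many or continuous features, one would typically fit a classifier (for example, logistic regression) that predicts \(W_i\) from \(X_i\) instead.

```python
import random
from collections import defaultdict

random.seed(7)

# Hypothetical assignment policy: probability of treatment per segment.
TRUE_PROPENSITY = {"new": 0.8, "loyal": 0.3}

def simulate_assignments(n):
    """Generate (x, w) pairs where treatment probability depends on x."""
    data = []
    for _ in range(n):
        x = random.choice(["new", "loyal"])
        w = int(random.random() < TRUE_PROPENSITY[x])
        data.append((x, w))
    return data

def estimate_propensity(data):
    """Estimate p(x) = P(W = 1 | X = x) as the treated share within each x."""
    treated = defaultdict(int)
    total = defaultdict(int)
    for x, w in data:
        treated[x] += w
        total[x] += 1
    return {x: treated[x] / total[x] for x in total}

propensity = estimate_propensity(simulate_assignments(10000))
for segment, p in sorted(propensity.items()):
    print(f"p({segment}) = {p:.3f}")
```

The estimates should come out close to the invented values 0.8 and 0.3, up to sampling noise.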


1️⃣ Gutierrez, P., & Gérardy, J. Y. (2017). Causal Inference and Uplift Modelling: A Review of the Literature. In International Conference on Predictive Applications and APIs (pp. 1-13).