Practical: Structural Causal Models and Meta-learners

Identification and Hierarchy of Questions

Author

Wouter van Amsterdam

In this practical you will learn more about identification and counterfactuals using the structural causal model (SCM) approach, and about meta-learners.

Identification

Remember the definition of identification in the lecture on SCMs:

Let \(Q(M)\) be any computable quantity of a model \(M\). We say that \(Q\) is identifiable in a class \(\mathbb{M}\) of models if, for any pair of models \(M_1\) and \(M_2\) from \(\mathbb{M}\), \(Q(M_1) = Q(M_2)\) whenever \(P_{M_1} (y) = P_{M_2} (y)\). If our observations are limited and permit only a partial set \(F_M\) of features (of \(P_M(y)\)) to be estimated, we define \(Q\) to be identifiable from \(F_M\) if \(Q(M_1) = Q(M_2)\) whenever \(F_{M_1} = F_{M_2}\).

We have two datasets that we know were generated from the following DAG:

Figure 1: DAG U
Code
required_pkgs <- c("marginaleffects", "ggplot2", "data.table")
cran_repo <- "https://mirror.lyrahosting.com/CRAN/" # a CRAN mirror in the Netherlands; pick another from https://cran.r-project.org/mirrors.html

# install any missing packages
for (pkg in required_pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = cran_repo)
  }
}

suppressPackageStartupMessages({
  # Load packages
  library(marginaleffects)
  library(ggplot2)
  library(data.table)
})

# generate the two datasets used in this practical
source(here::here("practicals", "22_scms", "_makedatas.R"))
datas <- make_datas()

data1 <- datas[["data1"]]
data2 <- datas[["data2"]]

The datasets can be downloaded here:

data1.csv

data2.csv
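If you work from the downloaded files instead of the generator script, you can load them with read.csv (a minimal sketch, assuming the CSVs are in your working directory):

Code
# alternative to sourcing _makedatas.R: read the downloaded files
data1 <- read.csv("data1.csv")
data2 <- read.csv("data2.csv")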

answer: \[\text{ATE} = E[Y|\text{do}(X=1)] - E[Y|\text{do}(X=0)] \tag{1}\]

answer: no, there is an open back-door path through \(U\), which we cannot block because \(U\) was not observed.

data1 and data2 come from the same DAG but from different SCMs

answer: the endogenous variables have the same parents, so the DAG is the same. The structural equations differ.

We can estimate four features of the observed distribution: \(P(Y=1|X=0)\), \(P(Y=1|X=1)\), \(P(Y=1)\), \(P(X=1)\). Observe that for data1 and data2, these are approximately the same (up to sampling variation).

Code
writeLines(paste0(
  "data1",
  "\nP(Y=1|X=0) = ", mean(data1[data1$x == 0, "y"]),
  "\nP(Y=1|X=1) = ", mean(data1[data1$x == 1, "y"]),
  "\nP(Y=1) = ", mean(data1[, "y"]),
  "\nP(X=1) = ", mean(data1[, "x"])
))
data1
P(Y=1|X=0) = 0.338174273858921
P(Y=1|X=1) = 0.694980694980695
P(Y=1) = 0.523
P(X=1) = 0.518
Code
writeLines(paste0(
  "data2",
  "\nP(Y=1|X=0) = ", mean(data2[data2$x == 0, "y"]),
  "\nP(Y=1|X=1) = ", mean(data2[data2$x == 1, "y"]),
  "\nP(Y=1) = ", mean(data2[, "y"]),
  "\nP(X=1) = ", mean(data2[, "x"])
))
data2
P(Y=1|X=0) = 0.294466403162055
P(Y=1|X=1) = 0.686234817813765
P(Y=1) = 0.488
P(X=1) = 0.494
Code
# adjust for the confounder u (unobserved in practice, but available in the
# simulated data) and allow effect modification by u
fit1 <- glm(y ~ x * u, data = data1, family = binomial)
fit2 <- glm(y ~ x * u, data = data2, family = binomial)
avg_comparisons(fit1, variables = "x") # ATE of x, standardized over u

 Estimate Std. Error      z Pr(>|z|)   S  2.5 % 97.5 %
 -0.00897     0.0675 -0.133    0.894 0.2 -0.141  0.123

Term: x
Type: response
Comparison: 1 - 0
Code
avg_comparisons(fit2, variables="x")

 Estimate Std. Error    z Pr(>|z|)     S 2.5 % 97.5 %
    0.393      0.029 13.5   <0.001 136.3 0.336   0.45

Term: x
Type: response
Comparison: 1 - 0
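Under the hood, avg_comparisons performs model-based standardization (g-computation): it predicts the outcome for every row with \(x\) set to 1 and to 0 and averages the difference. A minimal manual version for fit2 (assuming \(x\) is coded 0/1 as in these data) should reproduce the estimate above up to rounding:

Code
# g-computation by hand: predict for everyone under x = 1 and x = 0, then average
p1 <- predict(fit2, newdata = transform(data2, x = 1), type = "response")
p0 <- predict(fit2, newdata = transform(data2, x = 0), type = "response")
mean(p1 - p0)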

answer: there were two datasets with two different underlying models. Both yielded the same distribution of the observed variables \(X, Y\), but when we used the unobserved variable \(U\), we saw that the two models gave different answers to our query. By the definition above, the ATE is therefore not identifiable from the observed data alone.

Counterfactual computations

Use the following information on patient John:

  • age: 60
  • hypertension: true
  • diabetes: true
  • intervention: weight-loss program
  • survival-time: 10

In addition, use the following structural equation, where \(u\) denotes an (unobserved) exogenous noise variable with \(E[u] = 0\) (i.e. its mean is 0):

\[\text{survival-time} = 120 - \text{age} - 10 \cdot \text{hypertension} - 15 \cdot \text{diabetes} + 5 \cdot \text{weight-loss-program} + u\]

answer:

\[\begin{align} E[\text{survival-time}|\text{do}(\text{program}=0),...] &= E[120 - 60 - 10 - 15 + u] \\ &= 120 - 60 - 10 - 15 + E[u] \\ &= 35 + E[u] \\ &= 35 + 0 \\ &= 35 \\ \end{align}\]

\[\begin{align} E[\text{survival-time}|\text{do}(\text{program}=1),...] &= E[120 - 60 - 10 - 15 + 5 + u] \\ &= 120 - 60 - 10 - 15 + 5 + E[u] \\ &= 40 + E[u] \\ &= 40 + 0 \\ &= 40 \\ \end{align}\]
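As a check, we can code the structural equation as an R function and evaluate both interventional expectations (a sketch; because \(u\) enters additively with \(E[u] = 0\), plugging in \(u = 0\) yields the expectation):

Code
# structural equation for survival time; u is exogenous noise with E[u] = 0
survival_time <- function(age, hypertension, diabetes, program, u = 0) {
  120 - age - 10 * hypertension - 15 * diabetes + 5 * program + u
}

survival_time(age = 60, hypertension = 1, diabetes = 1, program = 0) # 35
survival_time(age = 60, hypertension = 1, diabetes = 1, program = 1) # 40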

answer:

step 1. abduction: infer John’s u

John’s expected survival time with the program (which he had) was 40 years. He actually lived for 10 years, so we can infer that his \(u = 10 - 40 = -30\).

step 2. action: modify the treatment

We set his treatment status to ‘no weight-loss program’; the relevant formula is now the first answer to the previous question, with \(\text{do}(\text{program}=0)\).

step 3. predict:

Given John’s \(u=-30\) and his other observed values, we can now calculate that his expected survival time would have been \(35 + (-30) = 5\) years had he not taken the weight-loss program.
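Using the survival_time function defined above, the three steps in code:

Code
# step 1, abduction: John survived 10 years under program = 1, so his u is
u_john <- 10 - survival_time(age = 60, hypertension = 1, diabetes = 1, program = 1)
u_john # -30

# steps 2 and 3, action and prediction: set program = 0, keep John's u fixed
survival_time(age = 60, hypertension = 1, diabetes = 1, program = 0, u = u_john) # 5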

note:

In this simple linear case, the counterfactual could have been calculated directly, but in general this is not the case.

Meta-learners

Remember the definition of the conditional average treatment effect (CATE) from lecture 4:

\(\text{CATE}(w) = E[y|\text{do}(t=1),w] - E[y|\text{do}(t=0),w]\)

Remember the definitions of the T-learner and the S-learner from lecture 4 (a code sketch follows the list):
  • denote \(\tau(w) = E[y|\text{do}(t=1),w] - E[y|\text{do}(t=0),w]\)
  • T-learner: model \(T=0\) and \(T=1\) separately (e.g. fit a separate regression for treated and untreated): \[\begin{align} \mu_0(w) &= E[Y|\text{do}(T=0),W=w] \\ \mu_1(w) &= E[Y|\text{do}(T=1),W=w] \\ \tau(w) &= \mu_1(w) - \mu_0(w) \end{align}\]
  • S-learner: use \(T\) as just another feature \[\begin{align} \mu(t,w) &= E[Y|T=t,W=w] \\ \tau(w) &= \mu(1,w) - \mu(0,w) \end{align}\]
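A minimal sketch of both learners with linear base models; the data frame d with outcome y, binary treatment t (coded 0/1), and single feature w are hypothetical placeholders to adapt to your data:

Code
# hypothetical data frame d with outcome y, binary treatment t (0/1), feature w

# S-learner: one model with t as just another feature (no interaction term)
s_fit <- lm(y ~ t + w, data = d)
tau_s <- predict(s_fit, newdata = transform(d, t = 1)) -
  predict(s_fit, newdata = transform(d, t = 0))

# T-learner: fit treated and untreated separately, then take the difference
fit_t1 <- lm(y ~ w, data = d[d$t == 1, ])
fit_t0 <- lm(y ~ w, data = d[d$t == 0, ])
tau_t <- predict(fit_t1, newdata = d) - predict(fit_t0, newdata = d)

Note that with a linear base model and no interaction term, the S-learner’s \(\tau(w)\) is constant (the coefficient of t), which may or may not be adequate; that trade-off is exactly what the exercise below explores.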

With the following datasets:

Figure 2: Three datasets with the same DAG
  1. S-learner with a simple base model and no interaction (e.g. linear regression)
  2. S-learner with a non-linear base model and no interaction term (e.g. splines / boosting / …)
  3. T-learner

NOTE: in practice we typically have multi-dimensional features and/or confounders, so a plot like the one above that would let us pick the right meta-learning approach is almost never available.