Let \(Q(M)\) be any computable quantity of a model \(M\). We say that \(Q\) is identifiable in a class \(\mathbb{M}\) of models if, for any pair of models \(M_1\) and \(M_2\) from \(\mathbb{M}\), \(Q(M_1) = Q(M_2)\) whenever \(P_{M_1} (y) = P_{M_2} (y)\). If our observations are limited and permit only a partial set \(F_M\) of features (of \(P_M(y)\)) to be estimated, we define \(Q\) to be identifiable from \(F_M\) if \(Q(M_1) = Q(M_2)\) whenever \(F_{M_1} = F_{M_2}\).
We have two different datasets that we know were generated according to the following DAG:
Figure 1: DAG U
Code
required_pkgs <- c("marginaleffects", "ggplot2", "data.table")
cran_repo <- "https://mirror.lyrahosting.com/CRAN/" # <- a CRAN mirror in the Netherlands, can select another one from here https://cran.r-project.org/mirrors.html
for (pkg in required_pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos = cran_repo)
  }
}
suppressPackageStartupMessages({
  # Load packages
  library(marginaleffects)
  library(ggplot2)
  library(data.table)
})
source(here::here("practicals", "22_scms", "_makedatas.R"))
datas <- make_datas()
data1 <- datas[["data1"]]
data2 <- datas[["data2"]]
We did not measure \(U\). Can we estimate this target query based on the DAG, using the observed data?
answer: no. There is an open back-door path through \(U\), which we cannot block because we did not observe that variable.
data1 and data2 come from the same DAG but from different SCMs.
How can this be? What does this mean?
answer: the endogenous variables have the same parents in both models, so the DAG is the same. The structural equations, i.e. the functions that map each variable's parents and noise term to its value, are different.
We can estimate four features of the observed distribution: \(P(Y=1|X=0)\), \(P(Y=1|X=1)\), \(P(Y=1)\) and \(P(X=1)\). Observe that for data1 and data2, these are approximately the same (up to sampling variation).
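As a minimal sketch of how these four features could be computed, assume the exposure and outcome are stored as 0/1 columns named `x` and `y`. The column names and the helper `observed_features()` are assumptions for illustration, and the toy data frame is made up; it is not data1 or data2.

```r
# Hypothetical helper: estimate the four observable features from a data frame
# with binary (0/1) columns `x` (exposure) and `y` (outcome).
observed_features <- function(dat) {
  c(
    p_y1_given_x0 = mean(dat$y[dat$x == 0]), # P(Y = 1 | X = 0)
    p_y1_given_x1 = mean(dat$y[dat$x == 1]), # P(Y = 1 | X = 1)
    p_y1          = mean(dat$y),             # P(Y = 1)
    p_x1          = mean(dat$x)              # P(X = 1)
  )
}

# Made-up toy data, only to show the call
toy <- data.frame(x = c(0, 0, 1, 1, 1), y = c(0, 1, 1, 1, 0))
feats <- observed_features(toy)
print(feats)
```

Applying the same helper to both datasets and comparing the two result vectors is one way to check that the observed features agree up to sampling variation.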
Estimate  Std. Error  z       Pr(>|z|)  S    2.5 %   97.5 %
-0.00897  0.0675      -0.133  0.894     0.2  -0.141  0.123
Term: x
Type: response
Comparison: 1 - 0
Code
avg_comparisons(fit2, variables="x")
Estimate  Std. Error  z     Pr(>|z|)  S      2.5 %  97.5 %
0.393     0.029       13.5  <0.001    136.3  0.336  0.45
Term: x
Type: response
Comparison: 1 - 0
Explain how this proves (up to statistical error) that our target query was not identified.
answer: there were two datasets with two different underlying models. Both yielded the same distribution over the observed variables \(X,Y\), but once we used the unobserved variable \(U\), we could see that the two models give different answers to our query. By the definition of identifiability, a query that differs between two models with the same observed distribution is not identified.
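The argument can also be checked exactly, without sampling error. The sketch below is a hypothetical illustration: two binary SCMs on the DAG \(U \to X\), \(U \to Y\), \(X \to Y\), with made-up probabilities chosen so that the four observed features coincide. These are not the SCMs behind data1 and data2, and the helper `scm_summary()` is an assumption for this example.

```r
# Exact observed features and interventional contrast for a binary SCM
# with DAG U -> X, U -> Y, X -> Y. All probabilities are made up.
scm_summary <- function(p_u1, p_x1_given_u, p_y1_given_xu) {
  # p_x1_given_u:  c(P(X=1 | U=0), P(X=1 | U=1))
  # p_y1_given_xu: 2 x 2 matrix of P(Y=1 | X=x, U=u); rows x = 0,1; columns u = 0,1
  p_u  <- c(1 - p_u1, p_u1)
  # Joint P(X = x, U = u): rows x = 0,1; columns u = 0,1
  p_xu <- rbind((1 - p_x1_given_u) * p_u, p_x1_given_u * p_u)
  p_x  <- rowSums(p_xu)
  # Observational: P(Y=1 | X=x) = sum_u P(Y=1 | x, u) P(u | x)
  p_y1_given_x <- rowSums(p_y1_given_xu * p_xu) / p_x
  # Interventional (back-door adjustment on U): P(Y=1 | do(X=x)) = sum_u P(Y=1 | x, u) P(u)
  p_y1_do_x <- as.vector(p_y1_given_xu %*% p_u)
  list(
    observed = c(py_x0 = p_y1_given_x[1], py_x1 = p_y1_given_x[2],
                 py = sum(p_y1_given_x * p_x), px = p_x[2]),
    ace = p_y1_do_x[2] - p_y1_do_x[1]
  )
}

# SCM A: X and Y ignore U; X has a real effect on Y
scm_a <- scm_summary(
  p_u1 = 0.5,
  p_x1_given_u = c(0.5, 0.5),
  p_y1_given_xu = matrix(c(0.3, 0.3,
                           0.7, 0.7), nrow = 2, byrow = TRUE)
)

# SCM B: U drives both X and Y; X has no effect on Y
scm_b <- scm_summary(
  p_u1 = 0.5,
  p_x1_given_u = c(0.1, 0.9),
  p_y1_given_xu = matrix(c(0.25, 0.75,
                           0.25, 0.75), nrow = 2, byrow = TRUE)
)

print(rbind(A = scm_a$observed, B = scm_b$observed)) # same observed features
print(c(A = scm_a$ace, B = scm_b$ace))               # different causal contrasts
```

Both models share the same DAG and the same four observed features, yet the contrast \(P(Y=1|\text{do}(X=1)) - P(Y=1|\text{do}(X=0))\) differs between them, which is exactly what non-identifiability from these features means.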
Counterfactual computations
Use the following information on patient John:
age: 60
hypertension: true
diabetes: true
intervention: weight-loss program
survival-time: 10
In addition, use the following structural equation, where \(u\) denotes an (unobserved) exogenous noise variable such that \(E[u] = 0\) (i.e. its mean is 0):
Calculate what John’s survival time would have been if he had not taken the weight-loss program, given that he did take the program and survived 10 years.
answer:
step 1. abduction: infer John’s u
John’s expected survival time with the program (which he took) was 40 years, but he lived for 10 years. We can infer that his \(u = 10 - 40 = -30\).
step 2. action: modify the treatment
We set his treatment status to ‘no weight-loss program’; the structural equation then reduces to the second answer to the previous question.
step 3. predict:
Given John’s \(u=-30\) and his other observed values, we can now calculate that his survival time would have been \(5\) years had he not taken the weight-loss program.
note:
in this simple linear case, the counterfactual could have been calculated directly, but in general this is not the case
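The three steps above can be sketched in a few lines of R, assuming the noise term enters additively. The function `counterfactual_additive()` and its argument names are hypothetical. The expected survival of 35 years without the program is not stated explicitly in this section; it is the value implied by the stated answer of 5 years together with \(u=-30\).

```r
# Hypothetical sketch of abduction-action-prediction for an additive-noise
# structural equation: y = mu(treatment, covariates) + u.
counterfactual_additive <- function(y_obs, mu_factual, mu_counterfactual) {
  # Step 1, abduction: recover the individual's noise term from what was observed
  u <- y_obs - mu_factual
  # Steps 2 and 3, action and prediction: evaluate the equation under the
  # alternative treatment while keeping the same noise term
  y_cf <- mu_counterfactual + u
  list(u = u, y_cf = y_cf)
}

# John: observed 10 years, expected 40 years with the program,
# expected 35 years without it (implied value, see the note above)
res <- counterfactual_additive(y_obs = 10, mu_factual = 40, mu_counterfactual = 35)
print(res)
```

The key design point is that `u` is held fixed between the factual and counterfactual worlds: it carries everything about John that the structural equation does not model.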
Meta-learners
Remember the definition of the conditional average treatment effect (CATE) from lecture 4
T-learner: model \(T=0\) and \(T=1\) separately (e.g. a separate regression for the treated and the untreated): \[\begin{align}
\mu_0(w) &= E[Y|\text{do}(T=0),W=w] \\
\mu_1(w) &= E[Y|\text{do}(T=1),W=w] \\
\tau(w) &= \mu_1(w) - \mu_0(w)
\end{align}\]
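A minimal T-learner sketch with linear regression as the base model. The simulated data, the variable names (`treat`, `w`, `y`) and the helper `cate_t()` are all assumptions for illustration. Treatment is randomised in the simulation, so the conditional means in each arm coincide with the do-expressions above.

```r
# Simulated data with a treatment effect that depends on w (made-up numbers)
set.seed(1)
n     <- 1000
w     <- runif(n)
treat <- rbinom(n, 1, 0.5)                       # randomised treatment
y     <- 1 + 2 * w + treat * (0.5 + w) + rnorm(n, sd = 0.1)
dat   <- data.frame(y = y, treat = treat, w = w)

# T-learner: fit one outcome model per treatment arm
fit0 <- lm(y ~ w, data = dat[dat$treat == 0, ])  # estimates mu_0(w)
fit1 <- lm(y ~ w, data = dat[dat$treat == 1, ])  # estimates mu_1(w)

# CATE estimate: difference between the two arm-specific predictions
cate_t <- function(newdata) {
  predict(fit1, newdata = newdata) - predict(fit0, newdata = newdata)
}

grid    <- data.frame(w = c(0, 0.5, 1))
tau_hat <- cate_t(grid)
print(cbind(grid, tau_hat = tau_hat, tau_true = 0.5 + grid$w))
```

Because each arm gets its own model, the estimated effect is free to vary with `w` even though neither model contains an interaction term.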
S-learner: use \(T\) as just another feature \[\begin{align}
\mu(t,w) &= E[Y|T=t,W=w] \\
\tau(w) &= \mu(1,w) - \mu(0,w)
\end{align}\]
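A matching S-learner sketch, again with linear regression and made-up simulated data (variable names and the helper `cate_s()` are assumptions). Here the true effect is constant, which is the situation where a single model without an interaction term is adequate.

```r
# Simulated data with a constant treatment effect of 0.7 (made-up numbers)
set.seed(2)
n     <- 1000
w     <- runif(n)
treat <- rbinom(n, 1, 0.5)                      # randomised treatment
y     <- 1 + 2 * w + 0.7 * treat + rnorm(n, sd = 0.1)
dat   <- data.frame(y = y, treat = treat, w = w)

# S-learner: one model, with treatment as just another feature
fit_s <- lm(y ~ treat + w, data = dat)          # estimates mu(t, w)

# CATE estimate: predict for everyone under treat = 1 and under treat = 0
cate_s <- function(newdata) {
  predict(fit_s, newdata = transform(newdata, treat = 1)) -
    predict(fit_s, newdata = transform(newdata, treat = 0))
}

grid  <- data.frame(w = c(0, 0.5, 1))
tau_s <- cate_s(grid)
print(cbind(grid, tau_s = tau_s))
```

With this base model the estimated CATE is the coefficient of `treat` and therefore the same for every `w`; letting it vary would require an interaction term or a more flexible base model.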
With the following datasets:
Figure 2: Three datasets with the same DAG
Which learning approach would you recommend for estimating the CATE?
S-learner with simple base model and no interaction term (e.g. linear regression)
S-learner with non-linear base model and no interaction term (e.g. splines / boosting / …)
T-learner
NOTE: we typically have data with multi-dimensional features and/or confounders, so a plot like the one above is almost never available for choosing the right meta-learning approach.