In this practical, we’ll get to practice with some methods for causal discovery or causal structure learning.
Note: The exercises for different methods can be done in any order (e.g., ICP, Constraint-Based, or Constraint-Based, Restricted SCMs). We also provide you with many more exercises than can be done in this afternoons session. We recommend you pick the causal discovery method you found most interesting, and start with the exercises related to that method in class. The exercises for the first method are longer and more involved, while the exercises for ICP are very short. The rest you can study after class if desired!
The below code will install all necessary packages to do all of the exercises; these include many that were not specified in the original setup document, so please run this in the background while you read the exercises!
Code
# Install necessary packages if not already installedrequired_pkgs <-c("pcalg","qgraph","InvariantCausalPrediction", "ppcor","dHSIC","mgcv")cran_repo <-"https://mirror.lyrahosting.com/CRAN/"for (pkg in required_pkgs) {if (!requireNamespace(pkg, quietly =TRUE)) {install.packages(pkg, repos=cran_repo) }}# to do the pc algorithm exercises you also need to# install graph from bioconductorif (!require("BiocManager", quietly =TRUE)){install.packages("BiocManager")}BiocManager::install("graph")BiocManager::install("RBGL")BiocManager::install("Rgraphviz")suppressPackageStartupMessages({# Load packageslibrary(pcalg)library(qgraph)library(InvariantCausalPrediction)library(ppcor)library(graph)})
Constraint-Based Causal Discovery
In these exercises we will get to understand how causal discovery methods based on testing for conditional independence in a single observational dataset work. You’ll be asked to program your own version of the PC algorithm from the ground up. This involves quite a few steps which will help you understand what’s really going on in the PC algorithm. We’ll then apply the PC algorithm directly using the pcalg package.
In this exercise you will get some practice with searching for causal structure using conditional independence testing. We have provided you with a dataset with four random variables, available from here
Code
data <-readRDS("data_cd_ex1.RDS")
In the lecture we discussed that the most basic approach to doing this is to test all possible (marginal and) conditional independence relations present in the data, and then try to draw all corresponding DAGs.
Exercise 1
Write down all possible conditional and marginal independence relations it is possible to test in this dataset. We have four variables in total, so that means we need to test: a) all marginal dependencies, b) all conditional dependencies where we condition on one other variable, and c) all conditional dependencies where we condition on two other variables. There are 24 in total!
Code
# You don't need R to do this, but for convenience I do the following:# We have four variables, which means there are 6 marginal relationships to testp <-4names <-paste0("X",1:p)marginal_string <-t(combn(names,2))colnames(marginal_string) <-c("DV1","DV2")# We next list all of the conditional relationships. # First, all possible conditional relationships# given a single conditioning variable# There are 6 x 2 = 12 of these # (each bivariate relationship conditioned on each of the other remaining two variables)cond1_string <-matrix("NA",12,3)colnames(cond1_string) <-c("DV1", "DV2","Conditional On")cond1_string[seq(1,11,2),c(1,2)] <- cond1_string[seq(2,13,2),c(1,2)] <- marginal_stringfor(i inseq(1,11,2)){ cv <- names[!names %in% cond1_string[i,c(1,2)]] cond1_string[i,3] <- cv[1] ; cond1_string[i+1,3] <- cv[2]}# We also have to consider bivariate relationships conditional on both remaining variables# There are six of thesecond2_string <-cbind(marginal_string,"NA", "NA")colnames(cond2_string)[3:4] <-c("Conditional on", "And")for(i in1:6){ cv <- names[!names %in% cond2_string[i,c(1,2)]] cond2_string[i,3] <- cv[1] ; cond2_string[i,4] <- cv[2]}
Assume Gaussian noise and linear relationships. This means that we can use correlations to test for marginal independence and partial correlations to test for conditional independence. Partial correlations are similar to regression coefficients (in that they express conditional relationships), but a) don’t have a direction, and b) like correlations they take on a value between \(-1\) and \(1\).
Test each of these relationships using partial correlations estimated from the data provided (with an alpha level of \(.05\)). You can use the ppcor package for this, for example. The cor.test function can be used for calculating correlations and accompanying p-values; the pcor.test function is useful when you want to condition on one variable, while pcor tests independence between pairs of variables, given all other variables in the dataset.
List all of the independencies that you find. That is, what variables are marginally or conditionally independent of one another and under what conditions?
Code
# Remember that the null hypothesis for each test is that the two variables are independent.# Use an alpha of .05 for each test.alpha <- .05# First let's test those marginal correlationsmarg_p <-apply(marginal_string,1,function(r){cor.test(data[,r[1]], data[,r[2]])$p.value})# Now test the first set of conditional dependenciesc1_p <-apply(cond1_string,1,function(r){pcor.test(data[,r[1]], data[,r[2]], data[,r[3]])$p.value})# and the second set of conditional dependenciesc2_pmat <-pcor(data)$p.value # matrix of p-values for each pair given all other variablesc2_p <- c2_pmat[lower.tri(c2_pmat)]# Remember that the null hypothesis for each test is that the two variables are independent# So, if p < alpha, we reject the null hypothesis (and infer dependence)# if p > alpha we fail to reject the null (and infer independence)# Again I make a table to show this - not necessary for you to be to do!marg <-cbind(marginal_string,ifelse(marg_p < alpha, "Dependent", "Independent"))c1 <-cbind(cond1_string,ifelse(c1_p < alpha, "Dependent", "Independent"))c2 <-cbind(cond2_string,ifelse(c2_p < alpha, "Dependent", "Independent"))
paste0(c1[7,1], " is Independent of ", c1[7,2], " given ", c1[7,3])
[1] "X2 is Independent of X3 given X1"
Code
paste0(c2[3,1], " is Independent of ", c2[3,2], " given {", c2[3,3], " , ", c2[3,4], " }")
[1] "X1 is Independent of X4 given {X2 , X3 }"
Now we want to use the independence relations we found above to infer the underlying DAG structure, assuming sufficiency and faithfulness. You could of course take a ``brute force’’ approach to this by drawing all of the four-variable DAGs that are possible, and ruling out one-by-one those that don’t imply those independence relations. But it turns out there is an easier and more efficient way to do this. This method uses two principles. Here’s the first:
Principle 1: Two variables \(A\) and \(B\) are directly connected in the DAG (either \(A \rightarrow B\) OR \(B \rightarrow A\)) if and only if they are dependent conditional on every possible subset of the other variables
Exercise 3
Use this first principle to draw the skeleton of the DAG. Start by drawing an undirected graph where every variable is connected to every other variable. Then, remove edges between variables if they are either marginally or conditionally independent in any of the tests in the previous exercise. Tip: Use qgraph to make your undirected graph. Undirected graphs have a symmetric adjacency matrix, but you can also use the directed = FALSE option.
Code
library(qgraph)# Adjacency matrix for a ``full'' undirected graphadj_full <-matrix(1,4,4)diag(adj_full) <-0# make the layout custom (optional)layout =matrix(c(0,1,-1,0,1,0,0,-1),4,2,byrow = T)# make the adjacency matrix after deleting desired arrowsadj <- adj_fulladj[2,3] <- adj[3,2] <-0adj[1,4] <- adj[4,1] <-0# make the layout custom (optional)layout =matrix(c(0,1,-1,0,1,0,0,-1),4,2,byrow = T)# qgraph(adj_full, labels = names, layout = layout, directed = FALSE, title = "Full Undirected Graph", title.cex = 1.25, vsize = 15)
Having obtained a skeleton, we can now try to give a direction to as many edges as possible. Recall from the lecture that our ability to orient the arrows in a DAG using only conditional independence information is reliant on the presence of collider structures. This leads us to our second principle for inferring DAG structures:
Principle 2: If our skeleton contains a triplet \(A - B - C\), we can orientate the arrows as \(A \rightarrow B \leftarrow C\) if and only if \(A\) and \(C\) are dependent conditional on every set of variables containing\(B\)
This is a slightly trickier principle to wrap your head around, but it essentially is just a re-statement of the d-seperation rules for colliders.
Exercise 4
Use this second principle to give a direction to as many arrows as possible in the skeleton. What is the resulting CPDAG? Tip: With qgraph, use bidirectional = TRUE or see the help for the directed argument.
Answers
There are four triplets you must consider: A) X2 - X1 - X3 B) X1 - X3 - X4 C) X1 - X2 - X4 D) X2 - X4 - X3
We rule out A) because we found above that X2 and X3 are independent given X1 We rule out C) and B) because we found that X1 is independent of X4 given (X2 , X3) But X2 and X3 are always dependent (given either X1, X4 or (X1, X4)). That means X2 -> X4 <- X3
Code
cpdag <- adjcpdag[4,2] <-0# we know that this arrow goes X2 -> X4cpdag[4,3] <-0# we know the direction is X3 -> X4# extra touch - making a mix of directed and undirected edgescptf <-matrix(FALSE, 4,4)cptf[2,4] <- cptf[3,4] <-TRUEpar(mfrow =c(1,1))qgraph(cpdag, labels = names, layout = layout, directed = cptf, title ="Estimated CPDAG", title.cex =1.25, asize =8, vsize =15)
Exercise 5
Draw all of the DAGs that make up the estimated Markov Equivalence Class. You might find it helpful to do this with pen and paper instead of in qgraph!
Notice that not all arrow orientations are allowed! You cannot create any new collider structures that aren’t already in the CPDAG. We already ruled those out in the last step!
The true DAG structure used for data generation is given below. Check that this is included in your estimated Markov Equivalence Class!
Notice that each of the graphs in the Markov Equivalence class imply the same set of statistically dependencies we would expect to see in observational data. This is one way in which we might say that different causal models are statistically equivalent, or to be more precise, markov equivalent. However, these DAGs are not causally equivalent.
Exercise 6
Imagine that we are interested in the effect of the intervention \(do(X_1 = 1)\) on the expected value of \(X_4\). Estimate the effect of this intervention from the observational data, using each of the DAGs in the Markov Equivalence class in turn to derive how this should be done. What do you notice?
According to the first DAG, we should adjust for X2.
Code
E_1
(Intercept)
2.713414
According to the second DAG, we should adjust for X3.
Code
E_2
(Intercept)
-1.063516
According to the third (true) DAG, we should adjust for nothing
Code
E_3
(Intercept)
1.664763
Even though each DAG is compatible with the same conditional (in)dependence relations, they each imply a different effect of the same intervention. We could find out which DAG is the right one by performing that intervention and comparing our estimates. But, we won’t do that in this practical! Because we simulated this data, we know the true effect of this intervention on the expected value of \(X4\) is \(1.65\).
The PC algorithm
The approach we took in the previous exercise worked, but was actually quite inefficient. Consider again the two principles we used to create first a skeleton and second a CPDAG from conditional independence tests in the previous exercise.
In order to omit the edge \(X2 - X3\) from the skeleton, all we needed to know was that they were independent given \(X1\). Remember: If two variables are directly causally dependent \(X2 \rightarrow X3\) or \(X2 \leftarrow X3\), then they will never be statistically independent, no matter what we condition on! So, once we knew that \(X2\) and \(X3\) were independent given \(X1\), there was actually no need to also test whether they were independent given \(X4\) or given \({X1, X4}\). Since independence only tells us that we should remove an edge from the skeleton, and we already removed the edge \(X2 - X3\), the information given by those last two tests wasn’t used to make any decisions about the skeleton, so we never needed to do those tests in the first place.
Once we have the skeleton, we need to look for potential collider structures by looking at triplets\(A - C - B\) where there is no direct edge between \(A\) and \(B\). If we have such a structure, we then need to test whether \(A\) and \(C\) are dependent given \(C\).
Rather than test all possible conditional independence relations, we could design an algorithm which uses these two insights in order to more efficiently estimate a CPDAG. Luckily, Spirtes, Glymour & Scheines (2000) already had this insight: This is the exact logic of their PC algorithm.
A full description of how the pcalg package works can be found here.
The function pcalg::pc() estimates the Markov Equivalence Class (CPDAG) using conditional independence tests as described above (assuming sufficiency and faithfulness). In order to do this, we need to tell the function what conditional independence test should be used thought the indepTest argument. pcalg comes with three pre-defined independence tests: gaussCItest for Gaussian variables, based on partial correlations, as well as discCItest and binCItest for discrete and binary variables, respectively. We also need to define an appropriate alpha level to be used for these independence tests. Here, let’s use the alpha = .01.
Finally, for reasons of computational efficiency, pc() doesn’t work with raw data but instead with a list containing sufficient statistics (suffStat): A summary of the relevant information from which the conditional independence tests can be calculated. Although this may seem strange in the current context, this helps speed things up when we have very large datasets. For Gaussian variables, the sufficient statistics are a) the correlation matrix and b) the sample size. See the examples under the pc() help file (?pc) for examples with binary and discrete data.
Code
suffStat <-list(C =cor(data), n =nrow(data))
Exercise 7
Let’s put all of this information together and use the PC algorithm with the data we gave you in the last exercise. Use the pc() function to estimate a Markov Equivalence Class and plot it.
This should be the same graph you estimated in the previous exercise. The pc function provides you with the CPDAG directly (though notice that, in the background, first the skeleton is estimated using skeleton and then the collider structures are found as described earlier).
Code
# This is the default plotting method for pcalg - uses Rgraphviz# You can also extract the adjacency matrix and plot using qgraph# Note that you have to transpose the matrix; pcalg writes matrices from column to row# cpdag_mat <- as(pc_fit1,"matrix")# qgraph(t(cpdag_mat), labels = names, layout = layout, directed = cptf, title = "Estimated CPDAG", title.cex = 1.25, asize = 8)plot(pc_fit1, main ="Inferred CPDAG using pcalg")
Loading required namespace: Rgraphviz
We can extract all of the separate DAGs in the equivalence class using the following code
Code
# Extract the adjacency matrix of the cpdag from pc_fit1cpdag_mat <-as(pc_fit1,"matrix")# Each row is a DAG adjacency matrix in vector form (by rows)res1 <-pdag2allDags(cpdag_mat)# We can get the adjacency matrix of an individual DAG usingres1_dags <-list()for(i in1:nrow(res1$dags)){ res1_dags[[i]] <-t(matrix(res1$dags[i,],4,4,byrow =TRUE))}# Notice we have to transpose the adjacency matrix here for qgraph!# We can plot each of these just as we did abovepar(mfrow =c(1,3))for(i in1:3){qgraph(res1_dags[[i]], labels = names, layout = layout, directed =TRUE, asize =8, vsize =15)}
We can use the ida() function to estimate the effect of an intervention according to each of the DAGs in the Markov Equivalence set. To do this we need to provide the output of the pc function and the covariance matrix of the data. So, to find the average causal effect of \(X_1\) on \(X_4\) (that is, the contrast between \(E[Y | do(X = 1)]\) and \(E[Y | do(X = 0)]\)) we would use the code
This output gives us three numbers, each corresponding to an estimate of the effect of this intervention in one of the three DAGs in the Markov Equivalence set. By specifying verbose = TRUE we can see what regressions each number comes from. Compare this to the results you obtained from manually calculating this in the previous exercise - you should see you get approximately the same results. Differences come from the fact that we use raw data above, and only summary statistics here. (Note that the order of the effects does not necessarily correspond to the order in which the DAGs are plotted above!)
Now that you’ve seen how the pcalg package works, let’s try it out on a new example. Suppose that you know the true DAG is given by the following graph
What do you expect the PC algorithm to be able to recover? What is the CPDAG for this DAG?
Code
# You can work this out by hand, or ``cheat'' by using the pcalg packageadjmat2 <-matrix(c(0,0,1,0,0,0,1,0,0,0,0,0,0,1,1,0),4,4,byrow =TRUE)colnames(adjmat2) <-rownames(adjmat2) <- namesg2 <-as(adjmat2,"graphNEL") # convert to graphNEL object for pcalg# This function converts a DAG to a CPDAGcpdag2 <-dag2cpdag(g2)
Answers
Code
plot(cpdag2)
Suppose that we have the following data, generated according to the above DAG
Code
set.seed(1234)n <-3000A <-rnorm(n)D <-rnorm(n)B <-0.50* D +rnorm(n)C <--0.75* A + B + D +rnorm(n)data2 <-cbind(A, B, C, D)
Exercise 9
Use the pcalg package to estimate the CPDAG. Do you obtain the correct CPDAG?
Estimate the effect on \(C\) of an intervention to increase \(A\) by one unit. Then, estimate the effect on \(C\) of an intervention to increase \(D\) by one unit. Use the ida() function as we demonstrated above
Answers
Code
# Part 1: Effect of do(A) on Cida(1,3,cov(data2), pc_fit2@graph, verbose =TRUE)
Just as we saw in the previous answer, we are typically uncertain about the effect of interventions,because we are uncertain about the causal structure. Here however, we see the power of identifying collider structures. In this case we have identified that A -> C, so we only get a single estimate for the effect of an intervention on A. However, we are not certain whether D has only a direct effect or also an indirect effect on A. Because we don’t know whether B -> D or D -> B. So, we get two different estimates of the intervention effect.
Restricted SCMs
In the lecture we learned about a second strategy for causal discovery which involves restricting the causal model in some way. The basic principle is that, if we assume either a) non-Gaussian additive noise or b) non-linear relationships with additive noise, then we can discover the causal direction of arrows in the DAG. When we have two variables \(X\) and \(Y\) the basic approach is to fit a regression model in both ``directions’’ and then test for independence of the error and the predictor of the model.
Below we will get some practice with applying these techniques to bivariate systems, so called `cause-effect pairs’. We will also see that causal discovery based on linear but non-Gaussian models can be easily scaled up to the multivariate case using the so called LiNGAM (Linear Non-Gaussian Acyclic Models) algorithm.
Non-Gaussian Cause-Effect Pairs
Let’s start by reproducing the example given in the lecture slides, where the noise distribution of \(Y\) is given by a uniform distribution \(U(lb, ub)\). In a uniform distribution, all values between the lower bound \(lb\) and upper-bound \(ub\) are equally likely. For example, a uniform distribution \(U(-1,1)\) looks like this (with 1000 samples):
Generate data from the SCM: \(Y\)\(:=\)\(X\)\(+\)\(\epsilon\) where \(X\) and \(\epsilon\) are drawn from a uniform distribution, as given above.
Code
set.seed(1234)n <-5000noise <-runif(n,-1,1)X <-runif(n, -1,1)Y = X + noise
Exercise 12
Fit a linear regression model in the correct causal direction (\(Y\) predicted by \(X\)) and the incorrect causal direction (\(X\) predicted by \(Y\)). Save the residuals for each model. Recreate the figures from the lecture. That is, for each model, plot a) the predictor against the outcome variable and b) the predictor variable against the residuals. What do you notice?
Code
library(dHSIC)myx <-lm(Y~X)mxy <-lm(X~Y)# save residualsmyx_r <- myx$residualsmxy_r <- mxy$residuals
Answers
Code
# First, for the "correct" causal model Y = X + epsilonpar(mfrow=c(1,2))plot(X, Y, col ="blue", xlab ="X", ylab ="Y", ylim =c(-2,2), xlim =c(-2,2), lwd =2)abline(myx, col ="red")abline(h =0)abline(v =0)plot(X,myx$residuals, col ="red", ylab =expression(epsilon[Y]))abline(h =0)abline(v =0)
Code
# Second, for the incorrect causal model X = Y + epsilonpar(mfrow=c(1,2))plot(Y, X, col ="blue", xlab ="Y", ylab ="X", ylim =c(-2,2), xlim =c(-2,2), lwd =2)abline(mxy, col ="red")abline(h =0)abline(v =0)plot(Y,mxy$residuals, col ="red", ylab =expression(epsilon[X]))abline(h =0)abline(v =0)
Exercise 13
We now want to try and learn the causal direction from our simulated data. We can this by testing, for each model, whether the predictor variable is independent of the residuals. For this you can use the dhsic.test function from the package dHSIC that we used in last weeks lab. Do you recover the correct causal direction?
We fail to reject the null hypothesis X ind \(\epsilon_y\). We reject the null hypothesis Y ind \(\epsilon_x\). Hence, the model where the predictor is independent of the error is Y = X + \(\epsilon\). So, we recover the correct causal direction
Exercise 14
Repeat the above process, but now choosing your own distribution for the error. Try first generating data from a distribution and plotting it to understand the distribution. You can try, for example, chi-squared rchisq, exponential rexp or gamma distributions rgamma amongst many others. See the information on Distributions in the stats package for detail, or even start with a normal distribution, and apply some transformations (such as abs()) to make it non-normal. What do you find?
Non-Linear Cause-Effect Pairs
Now let’s get some practice working with non-linear SCMs. Let’s try and recreate the example shown in the lecture slides. The process you’ll take is very similar to what we did in the previous question. This time, let’s use Gaussian noise, but have Y be a non-linear function of X. In the lecture we used the causal model
\[
\begin{align*}
X &\sim \mathcal{N}(0,1.5^2) \\
Y &:= sin(X) + \epsilon_Y \quad \quad \epsilon_Y \sim {N}(0,0.2^2)
\end{align*}
\]
In the previous exercise we tried to determine causal direction using linear regression, but that’s obviously not suitable here. We want to put ourselves into the shoes of the naive researcher: Let’s imagine we know that either \(X \rightarrow Y\) or \(X \leftarrow Y\), and we know, whichever is true, the causal relationship will be non-linear. But, we don’t know exactly what that non-linear function is. So, we want to use a regression method that allows for any kind of non-linear relationship to be modelled.
Generalized Additive Models (GAMs) are the perfect tool for this purpose: They allow the outcome variable \(Y\) to be any smooth function of the predictor variable(s) \(X\). For details see chapter 7.7 of James, Witten, Hastie & Tibshirani (2018) Introduction to Statistical Learning. For our purposes here it suffices to know that we can fit a GAM to our data using gam(Y ∼ s(X)) from the package mgcv; the residuals can be accessed by gam(Y ∼ s(X))$residuals.
Now, fit a GAM regression model in both the correct causal direction (\(Y\) predicted by \(X\)) and the incorrect causal direction (\(X\) predicted by \(Y\)). Save the residuals
Code
library(mgcv)
Loading required package: nlme
This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.
Recreate the figures from the lecture. That is, for each model, plot a) the predictor against the outcome variable and b) the predictor variable against the residuals. What do you notice? Tip: First, plot the gam() model object itself, which will display the estimated non-linear function. Then plot the data and/or residuals using points(). What do you notice?
par(mfrow=c(1,2))plot(gyx, rug = F, xlab ="X", ylab ="Y", se = F, ylim =c(-2,2), xlim =c(-4.5,4.5), col ="red", lwd =2)points(X, Y, col ="blue")abline(h =0)abline(v =0)plot(X,gyx_r , col ="red", ylab =expression(epsilon[Y]))abline(h =0)abline(v =0)
Code
par(mfrow =c(1,2))plot(gxy, ylab ="X", xlab ="Y", rug = F, se = F, xlim =c(-2,2), ylim =c(-4.5,4.5), col ="red", lwd =2)points(Y, X, col ="blue")abline(h =0)abline(v =0)plot(Y,gxy_r, col ="red", ylab =expression(epsilon[X]))abline(h =0)abline(v =0)
Exercise 17
We now want to try and learn the causal direction from our simulated data. We can this by testing, for each model, whether the predictor variable is independent of the residuals. For this you can use the dhsic.test function from the package dHSIC. Do you recover the correct causal direction?
We fail to reject the null hypothesis X ind \(\epsilon_y\). We reject the null hypothesis Y ind \(\epsilon_x\). Hence, the model where predictor ind error is Y = X + \(\epsilon\). So, we recover the correct causal direction.
Try repeating the above process, but now choosing your own distribution for the error. For example, try using \(X^3\) or \(e^X\). Play around with making functions that are “closer” to linear than others!
Bonus I: LiNGAM - Multivariate Linear Non-Gaussian Models (Bonus)
The LiNGAM algorithm generalizes discovery of causal direction in linear Non-Gaussian models to the multivariate case. It turns out that, if our causal system is linear, non-Gaussian and acyclic, then there is typically only one DAG that implies independence between all error terms. The LiNGAM method is implemented in the pcalg package, function lingam().
Here we provide you with a simulated dataset which shares the same DAG as the simulated data in Exercise 2.1. In this case, however, the noise distributions are all uniform.
Code
data2 <-readRDS("data_cd_ex2.RDS")
Exercise
Use the lingam() function to estimate the DAG structure. What do you notice?
Code
lin_fit <- pcalg::lingam(data2)adjlin <-as(lin_fit, "amat")# very hacky way of transforming amat to a numeric matrixadjmat2 <-apply(adjlin,c(1,2),isTRUE) +0
Answers
LiNGAM succeeds in finding the data-generating DAG! No equivalence class needed
Code
qgraph(adjmat2, layout = layout)
Bonus II: Empirical Cause-Effect Pairs
So far we worked only with simulated cause-effect pairs. But researchers at the Unversity of Tuebingen have actually collected a little over 100 empirical datasets of real-life cause-effect pairs. In each case the true causal model is (thought to be) known. Some example pairs are (first cause, then effect):
Altitude and Temperature
Age and Height
\(CO_2\)- Emissions and energy use
Employment and Population
The following code downloads the above four datasets from the database
Try to find the causal direction in each of the four example datasets. In each case, try out using the linear non-Gaussian approach (you can use the lingam() function or the approach in earlier exercises) as well as the non-linear causal model approach which uses gam(). Are there situations where one works better than the other? You can download more examples and check the answers on the database website.
In this part of the practical you will get some hands on experience using Invariant Causal Prediction (ICP). ICP is different from the techniques we discussed previously in the lab in several ways.
First, ICP needs data taken from different environments. We will focus on the situation where we have both observational data and data from a setting where the causal system is undergoing an intervention or interventions of some kind. Second, ICP aims to recover only part of the causal DAG. We decide on a target variable \(Y\), and then try to learn what variables are direct causes of \(Y\). ICP works by looking for those conditional dependencies (i.e. predictive relationships) that stay the same across environments. There are extensions that can deal with learning the whole graph, but we will focus on this simpler case for now.
There are some important things to keep in mind when using ICP. It’s important to remember that, although we have often discussed interventions that set a variable to a constant value (i.e. \(do(X = 1)\)), it is possible to define many other types of interventions. For example, we can imagine that instead of forcing everyone to take a single aspirin tablet, we might randomly assign people to take some amount of aspirin according to a certain distribution. We can imagine other interventions which increase the mean of a variable, or how it reacts to other variables in the system (e.g. encouraging but not forcing everyone to take an aspirin, or to take an aspirin as soon as they feel any pain at all!). The neat thing about ICP is that we don’t need to know what variables were intervened on, or what those interventions are. The key thing is, however, that these interventions shouldn’t act directly on the target variable \(Y\). For example, if we randomly assign \(Y\) values, then, in the intervention environment, nothing will predict \(Y\) - we will have broken the direct causal links by random assignment.
With this refresher in place, let’s take a look at an example. We provide you with data (data_cd_ex3.RDS) simulated from a four-variable SCM under two conditions here.
Code
icpdata <-readRDS("data_cd_ex3.RDS")
The first 400 rows are observational data. The last 500 rows come from a setting some interventions are applied to the system. The environment (observational or interventional) is denoted by the variable in the fifth column ExpInd. The DAG of the SCM is shown below
Take it that we are interested in discovering which variables are direct causes of the variable \(Y\).
Exercise 18
Use the ICP() function from the package InvariantCausalPrediction to try and recover the direct causes of \(Y\). You’ll need to supply a matrix of possible predictor variables \(X\), the outcome variable of interest \(Y\) and an indicator variable \(ExpInd\). Otherwise, use the default settings. Check what variables show a significant causal effect using the summary() of the ICP output. Otherwise, look at the acceptedSets and pick out the variable (if any) that appears in all sets of predictors.
Code
library(InvariantCausalPrediction)icp_out <-ICP(X = icpdata[,c("X1","X2","X3")], Y = icpdata[,"Y"], ExpInd = icpdata[,"ExpInd"])
accepted set of variables 1
*** 25% complete: tested 2 of 8 sets of variables
accepted set of variables 1,2
accepted set of variables 1,2,3
Answers
There are three regression models whose parameters stay the same across environments. The model with X1 predictor, X1 and X2 as predictors, and X1 X2 and X3 as predictors. From the slides we take the so-called intersection - this just means we look at what variable is present in all of these models. That’s X1!
Code
icp_out$acceptedSets
[[1]]
[1] 1
[[2]]
[1] 1 2
[[3]]
[1] 1 2 3
Code
summary(icp_out)
Invariant Linear Causal Regression at level 0.01 (including multiplicity correction for the number of variables)
Variable X1 shows a significant causal effect
LOWER BOUND UPPER BOUND MAXIMIN EFFECT P-VALUE
X1 0.07 0.62 0.07 0.0075 **
X2 -0.12 0.44 0.00 1.0000
X3 -2.05 0.00 0.00 1.0000
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The summary is just another way of looking at the same output - we think X1 is the cause of Y. We also get an estimate of the strength of that causal relation. We also get an estimate of the strength of that causal relation. We correctly recover that X1 is the only direct cause of Y.
Exercise 19
We can perform a kind of ICP method ourselves using linear regression. Let’s not look at all possible regression models like the ICP method does, but instead just look at two: A) the regression of \(Y\) on its direct cause \(X1\) and B) the regression of \(Y\) on its non-cause \(X2\). Perform each regression model seperately in each environment, and inspect how the estimated parameters of each change or do not change across environments
Code
icpdata <-as.data.frame(icpdata)a1 <-lm(Y ~ X1, data = icpdata, subset = (icpdata[,"ExpInd"]==1))a2 <-lm(Y ~ X1, data = icpdata, subset = (icpdata[,"ExpInd"]==2))b1 <-lm(Y ~ X2, data = icpdata, subset = (icpdata[,"ExpInd"]==1))b2 <-lm(Y ~ X2, data = icpdata, subset = (icpdata[,"ExpInd"]==2))
Answers
There are three regression models whose parameters stay the same across environments. The model with X1 predictor, X1 and X2 as predictors, and X1 X2 and X3 as predictors. From the slides we take the so-called intersection - this just means we look at what variable is present in all of these models. That’s X1!
Code
print(a1); print(a2)
Call:
lm(formula = Y ~ X1, data = icpdata, subset = (icpdata[, "ExpInd"] ==
1))
Coefficients:
(Intercept) X1
0.01962 0.50835
Call:
lm(formula = Y ~ X1, data = icpdata, subset = (icpdata[, "ExpInd"] ==
2))
Coefficients:
(Intercept) X1
0.02994 0.50756
For the direct cause variable, the parameters are very similar across these two models
Code
print(b1); print(b2)
Call:
lm(formula = Y ~ X2, data = icpdata, subset = (icpdata[, "ExpInd"] ==
1))
Coefficients:
(Intercept) X2
0.0249 0.2950
Call:
lm(formula = Y ~ X2, data = icpdata, subset = (icpdata[, "ExpInd"] ==
2))
Coefficients:
(Intercept) X2
0.54860 -0.05284
For the non-causal variable X2, the parameters change quite a bit. This is a somewhat extreme example, but this is the basic principle!
Below I give you the code used to generate the observational and intervention data. Notice what those interventions are - an intervention to change the mean of \(X1\) and an intervention to randomly assign values of \(X2\). Try to play around with your own interventions and test out what ICP can give you. If one method doesn’t work, try another - some of the tests for invariance are more suitable for some types of systems and interventions than others (see the section test under ?ICP)
Code
n1 <-400n2 <-500ExpInd <-c(rep(1,n1), rep(2,n2))set.seed(123)# In the observational setting, X1 has a mean of zeroX1o <-rnorm(n1, 0, 1)# In the intervention setting, X1 has a mean of 1X1i <-1+rnorm(n2,0,1)X1 <-c(X1o, X1i)# Y is caused by X1Y <-0.5* X1 +rnorm(n1 + n2, 0, 1)# Observational: X2 is caused by X1X2o <-1.5* X1[1:n1] +rnorm(n1, 0,.5)# Intervention: X2 is randomly assignedX2i <-rnorm(n2,0,.5)X2 <-c(X2o, X2i)# X3 is caused by Y and X2X3 <--.4* Y + .2* X2 +rnorm(n1 + n2,0,.2)X <-cbind(X1, X2, X3)icpdata <-cbind(X1,X2,X3,Y,ExpInd)# ICP(X = icpdata[,c("X1","X2","X3")], Y = icpdata[,"Y"], ExpInd = icpdata[,"ExpInd"])
Additional Readings
Glymour, C., Zhang, K., & Spirtes, P. (2019). Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10: 524. https://doi.org/10.3389/fgene.2019.00524
This paper covers two of the topics we considered in the lecture: 1) conditional-independence based methods (called constraint-based and score-based methods in the paper) and 2) restricted SCMS (called Functional Causal Models in the paper). It does not cover Invariant Causal Prediction (ICP) methods.
Invariant Causal Prediction is a relatively new technique, and so, tutorial-style papers are somewhat lacking. The paper introducing this technique is given below. Below we also provide a youtube link to a talk by Jonas Peters (the developer of ICP) which may be easier to follow.
Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 947-1012.