Day 4 Practical: Causal Structure Learning

Author

Oisin Ryan

In this practical, we’ll get to practice with some methods for causal discovery or causal structure learning.

Note: The exercises for different methods can be done in any order (e.g., ICP, Constraint-Based, or Constraint-Based, Restricted SCMs). We also provide you with many more exercises than can be done in this afternoons session. We recommend you pick the causal discovery method you found most interesting, and start with the exercises related to that method in class. The exercises for the first method are longer and more involved, while the exercises for ICP are very short. The rest you can study after class if desired!

The below code will install all necessary packages to do all of the exercises; these include many that were not specified in the original setup document, so please run this in the background while you read the exercises!

Code

# Install necessary packages if not already installed
required_pkgs <- c("pcalg","qgraph","InvariantCausalPrediction", "ppcor",
                   "dHSIC","mgcv")
cran_repo <- "https://mirror.lyrahosting.com/CRAN/" 

for (pkg in required_pkgs) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg, repos=cran_repo)
  }
}

# to do the pc algorithm exercises you also need to
# install graph from bioconductor
if (!require("BiocManager", quietly = TRUE)){
    install.packages("BiocManager")}

BiocManager::install("graph")
BiocManager::install("RBGL")
BiocManager::install("Rgraphviz")

suppressPackageStartupMessages({
  # Load packages
  library(pcalg)
  library(qgraph)
  library(InvariantCausalPrediction)
  library(ppcor)
  library(graph)
})

Constraint-Based Causal Discovery

In these exercises we will get to understand how causal discovery methods based on testing for conditional independence in a single observational dataset work. You’ll be asked to program your own version of the PC algorithm from the ground up. This involves quite a few steps which will help you understand what’s really going on in the PC algorithm. We’ll then apply the PC algorithm directly using the pcalg package.

Do-it-yourself constraint-based structure learning

In this exercise you will get some practice with searching for causal structure using conditional independence testing. We have provided you with a dataset with four random variables, available from here

Code

data <- readRDS("data_cd_ex1.RDS")

In the lecture we discussed that the most basic approach to doing this is to test all possible (marginal and) conditional independence relations present in the data, and then try to draw all corresponding DAGs.

Exercise 1

Write down all possible conditional and marginal independence relations it is possible to test in this dataset. We have four variables in total, so that means we need to test: a) all marginal dependencies, b) all conditional dependencies where we condition on one other variable, and c) all conditional dependencies where we condition on two other variables. There are 24 in total!

Code

# You don't need R to do this, but for convenience I do the following:

# We have four variables, which means there are 6 marginal relationships to test
p <- 4
names <- paste0("X",1:p)
marginal_string <- t(combn(names,2))
colnames(marginal_string) <- c("DV1","DV2")

# We next list all of the conditional relationships. 
# First, all possible conditional relationships
# given a single conditioning variable
# There are 6 x 2 = 12 of these 
# (each bivariate relationship conditioned on each of the other remaining two variables)

cond1_string <- matrix("NA",12,3)
colnames(cond1_string) <- c("DV1", "DV2","Conditional On")
cond1_string[seq(1,11,2),c(1,2)] <- cond1_string[seq(2,13,2),c(1,2)]  <- marginal_string
for(i in seq(1,11,2)){
  cv <- names[!names %in% cond1_string[i,c(1,2)]]
  cond1_string[i,3] <- cv[1] ; cond1_string[i+1,3] <- cv[2]
}

# We also have to consider bivariate relationships conditional on both remaining variables
# There are six of these
cond2_string <- cbind(marginal_string,"NA", "NA")
colnames(cond2_string)[3:4] <- c("Conditional on", "And")
for(i in 1:6){
  cv <- names[!names %in% cond2_string[i,c(1,2)]]
  cond2_string[i,3] <- cv[1] ; cond2_string[i,4] <- cv[2]
}

Answers

Code

print(marginal_string)

     DV1  DV2 
[1,] "X1" "X2"
[2,] "X1" "X3"
[3,] "X1" "X4"
[4,] "X2" "X3"
[5,] "X2" "X4"
[6,] "X3" "X4"

Code

print(cond1_string)

      DV1  DV2  Conditional On
 [1,] "X1" "X2" "X3"          
 [2,] "X1" "X2" "X4"          
 [3,] "X1" "X3" "X2"          
 [4,] "X1" "X3" "X4"          
 [5,] "X1" "X4" "X2"          
 [6,] "X1" "X4" "X3"          
 [7,] "X2" "X3" "X1"          
 [8,] "X2" "X3" "X4"          
 [9,] "X2" "X4" "X1"          
[10,] "X2" "X4" "X3"          
[11,] "X3" "X4" "X1"          
[12,] "X3" "X4" "X2"

Code

print(cond2_string)

     DV1  DV2  Conditional on And 
[1,] "X1" "X2" "X3"           "X4"
[2,] "X1" "X3" "X2"           "X4"
[3,] "X1" "X4" "X2"           "X3"
[4,] "X2" "X3" "X1"           "X4"
[5,] "X2" "X4" "X1"           "X3"
[6,] "X3" "X4" "X1"           "X2"

Exercise 2

Assume Gaussian noise and linear relationships. This means that we can use correlations to test for marginal independence and partial correlations to test for conditional independence. Partial correlations are similar to regression coefficients (in that they express conditional relationships), but a) don’t have a direction, and b) like correlations they take on a value between $-1$ and $1$.

Test each of these relationships using partial correlations estimated from the data provided (with an alpha level of $.05$). You can use the ppcor package for this, for example. The cor.test function can be used for calculating correlations and accompanying p-values; the pcor.test function is useful when you want to condition on one variable, while pcor tests independence between pairs of variables, given all other variables in the dataset.

List all of the independencies that you find. That is, what variables are marginally or conditionally independent of one another and under what conditions?

Code

# Remember that the null hypothesis for each test is that the two variables are independent.
# Use an alpha of .05 for each test.
alpha <- .05
# First let's test those marginal correlations
marg_p <- apply(marginal_string,1,function(r){
  cor.test(data[,r[1]], data[,r[2]])$p.value
})

# Now test the first set of conditional dependencies
c1_p <- apply(cond1_string,1,function(r){
  pcor.test(data[,r[1]], data[,r[2]], data[,r[3]])$p.value
})

# and the second set of conditional dependencies
c2_pmat <- pcor(data)$p.value # matrix of p-values for each pair given all other variables
c2_p <- c2_pmat[lower.tri(c2_pmat)]

# Remember that the null hypothesis for each test is that the two variables are independent
# So, if p < alpha, we reject the null hypothesis (and infer dependence)
# if p > alpha we fail to reject the null (and infer independence)
# Again I make a table to show this - not necessary for you to be to do!

marg <- cbind(marginal_string,ifelse(marg_p < alpha, "Dependent", "Independent"))
c1 <- cbind(cond1_string,ifelse(c1_p < alpha, "Dependent", "Independent"))
c2 <- cbind(cond2_string,ifelse(c2_p < alpha, "Dependent", "Independent"))

Answers

Code

print(marg)

     DV1  DV2             
[1,] "X1" "X2" "Dependent"
[2,] "X1" "X3" "Dependent"
[3,] "X1" "X4" "Dependent"
[4,] "X2" "X3" "Dependent"
[5,] "X2" "X4" "Dependent"
[6,] "X3" "X4" "Dependent"

Code

print(c1)

      DV1  DV2  Conditional On              
 [1,] "X1" "X2" "X3"           "Dependent"  
 [2,] "X1" "X2" "X4"           "Dependent"  
 [3,] "X1" "X3" "X2"           "Dependent"  
 [4,] "X1" "X3" "X4"           "Dependent"  
 [5,] "X1" "X4" "X2"           "Dependent"  
 [6,] "X1" "X4" "X3"           "Dependent"  
 [7,] "X2" "X3" "X1"           "Independent"
 [8,] "X2" "X3" "X4"           "Dependent"  
 [9,] "X2" "X4" "X1"           "Dependent"  
[10,] "X2" "X4" "X3"           "Dependent"  
[11,] "X3" "X4" "X1"           "Dependent"  
[12,] "X3" "X4" "X2"           "Dependent"

Code

print(c2)

     DV1  DV2  Conditional on And               
[1,] "X1" "X2" "X3"           "X4" "Dependent"  
[2,] "X1" "X3" "X2"           "X4" "Dependent"  
[3,] "X1" "X4" "X2"           "X3" "Independent"
[4,] "X2" "X3" "X1"           "X4" "Dependent"  
[5,] "X2" "X4" "X1"           "X3" "Dependent"  
[6,] "X3" "X4" "X1"           "X2" "Dependent"

Code

paste0(c1[7,1], " is Independent of ", c1[7,2], " given ", c1[7,3])

[1] "X2 is Independent of X3 given X1"

Code

paste0(c2[3,1], " is Independent of ", c2[3,2], " given {", c2[3,3], " , ", c2[3,4], " }")

[1] "X1 is Independent of X4 given {X2 , X3 }"

Now we want to use the independence relations we found above to infer the underlying DAG structure, assuming sufficiency and faithfulness. You could of course take a ``brute force’’ approach to this by drawing all of the four-variable DAGs that are possible, and ruling out one-by-one those that don’t imply those independence relations. But it turns out there is an easier and more efficient way to do this. This method uses two principles. Here’s the first:

Principle 1: Two variables $A$ and $B$ are directly connected in the DAG (either $A \rightarrow B$ OR $B \rightarrow A$) if and only if they are dependent conditional on every possible subset of the other variables

Exercise 3

Use this first principle to draw the skeleton of the DAG. Start by drawing an undirected graph where every variable is connected to every other variable. Then, remove edges between variables if they are either marginally or conditionally independent in any of the tests in the previous exercise. Tip: Use qgraph to make your undirected graph. Undirected graphs have a symmetric adjacency matrix, but you can also use the directed = FALSE option.

Code

library(qgraph)

# Adjacency matrix for a ``full'' undirected graph
adj_full <- matrix(1,4,4)
diag(adj_full) <- 0

# make the layout custom (optional)
layout = matrix(c(0,1,-1,0,1,0,0,-1),4,2,byrow = T)

# make the adjacency matrix after deleting desired arrows
adj <- adj_full
adj[2,3] <- adj[3,2] <- 0
adj[1,4] <- adj[4,1] <- 0

# make the layout custom (optional)
layout = matrix(c(0,1,-1,0,1,0,0,-1),4,2,byrow = T)

# qgraph(adj_full, labels = names, layout = layout, directed = FALSE, title = "Full Undirected Graph", title.cex = 1.25, vsize = 15)

Answers

Code

par(mfrow=c(1,2))
qgraph(adj_full, labels = names, layout = layout, directed = FALSE, title = "Full Undirected Graph", title.cex = 1.25, vsize = 15)
qgraph(adj, labels = names, layout = layout, directed = FALSE, title = "Estimated Skeleton", title.cex = 1.25, vsize = 15)

Having obtained a skeleton, we can now try to give a direction to as many edges as possible. Recall from the lecture that our ability to orient the arrows in a DAG using only conditional independence information is reliant on the presence of collider structures. This leads us to our second principle for inferring DAG structures:

Principle 2: If our skeleton contains a triplet $A - B - C$, we can orientate the arrows as $A \rightarrow B \leftarrow C$ if and only if $A$ and $C$ are dependent conditional on every set of variables containing $B$

This is a slightly trickier principle to wrap your head around, but it essentially is just a re-statement of the d-seperation rules for colliders.

Exercise 4

Use this second principle to give a direction to as many arrows as possible in the skeleton. What is the resulting CPDAG? Tip: With qgraph, use bidirectional = TRUE or see the help for the directed argument.

Answers

There are four triplets you must consider: A) X2 - X1 - X3 B) X1 - X3 - X4 C) X1 - X2 - X4 D) X2 - X4 - X3

We rule out A) because we found above that X2 and X3 are independent given X1 We rule out C) and B) because we found that X1 is independent of X4 given (X2 , X3) But X2 and X3 are always dependent (given either X1, X4 or (X1, X4)). That means X2 -> X4 <- X3

Code

cpdag <- adj
cpdag[4,2] <- 0 # we know that this arrow goes X2 -> X4
cpdag[4,3] <- 0 # we know the direction is X3 -> X4

# extra touch - making a mix of directed and undirected edges
cptf <- matrix(FALSE, 4,4)
cptf[2,4] <- cptf[3,4] <- TRUE

par(mfrow = c(1,1))
qgraph(cpdag, labels = names, layout = layout, directed = cptf, title = "Estimated CPDAG", title.cex = 1.25, asize = 8, vsize = 15)

Exercise 5

Draw all of the DAGs that make up the estimated Markov Equivalence Class. You might find it helpful to do this with pen and paper instead of in qgraph!

Answers

Code

dag1 <- matrix(c(
   0  ,  0  ,  1  ,  0,
   1  ,  0  ,  0  ,  1,
   0  ,  0  ,  0  ,  1,
   0  ,  0  ,  0  ,  0
), 4, 4, byrow = T)

dag2 <- matrix(c(
   0  ,  1  ,  0  ,  0,
   0  ,  0  ,  0  ,  1,
   1  ,  0  ,  0  ,  1,
   0  ,  0  ,  0  ,  0
), 4, 4, byrow = T)

dag3 <- matrix(c(
   0  ,  1  ,  1  ,  0,
   0  ,  0  ,  0  ,  1,
   0  ,  0  ,  0  ,  1,
   0  ,  0  ,  0  ,  0
), 4, 4, byrow = T)

par(mfrow = c(1,3))
qgraph(dag1, labels = names, layout = layout, directed = TRUE, asize = 8, vsize = 15)
qgraph(dag2, labels = names, layout = layout, directed = TRUE, title = "Estimated Markov Equiv. Class", title.cex = 1.25, asize = 8, vsize = 15)
qgraph(dag3, labels = names, layout = layout, directed = TRUE, asize = 8, vsize = 15)

Notice that not all arrow orientations are allowed! You cannot create any new collider structures that aren’t already in the CPDAG. We already ruled those out in the last step!

The true DAG structure used for data generation is given below. Check that this is included in your estimated Markov Equivalence Class!

Code

qgraph(dag3, labels = names, layout = layout, directed = TRUE, asize = 8, vsize = 15,
       title = "True DAG")

Notice that each of the graphs in the Markov Equivalence class imply the same set of statistically dependencies we would expect to see in observational data. This is one way in which we might say that different causal models are statistically equivalent, or to be more precise, markov equivalent. However, these DAGs are not causally equivalent.

Exercise 6

Imagine that we are interested in the effect of the intervention $do(X_1 = 1)$ on the expected value of $X_4$. Estimate the effect of this intervention from the observational data, using each of the DAGs in the Markov Equivalence class in turn to derive how this should be done. What do you notice?

Code

m1 <- lm(X4 ~ X1 + X2, data = as.data.frame(data))$coefficients
E_1 <- m1[1] + m1[2]*1 + m1[3]*mean(data[,"X2"])

m2 <- lm(X4 ~ X1 + X3, data = as.data.frame(data))$coefficients
E_2 <- m2[1] + m2[2]*1 + m2[3]*mean(data[,"X3"])

m3 <- lm(X4 ~ X1, data = as.data.frame(data))$coefficients
E_3 <- m3[1] + m3[2]*1

Answers

According to the first DAG, we should adjust for X2.

Code

E_1

(Intercept) 
   2.713414

According to the second DAG, we should adjust for X3.

Code

E_2

(Intercept) 
  -1.063516

According to the third (true) DAG, we should adjust for nothing

Code

E_3

(Intercept) 
   1.664763

Even though each DAG is compatible with the same conditional (in)dependence relations, they each imply a different effect of the same intervention. We could find out which DAG is the right one by performing that intervention and comparing our estimates. But, we won’t do that in this practical! Because we simulated this data, we know the true effect of this intervention on the expected value of $X4$ is $1.65$.

The PC algorithm

The approach we took in the previous exercise worked, but was actually quite inefficient. Consider again the two principles we used to create first a skeleton and second a CPDAG from conditional independence tests in the previous exercise.

In order to omit the edge $X2 - X3$ from the skeleton, all we needed to know was that they were independent given $X1$. Remember: If two variables are directly causally dependent $X2 \rightarrow X3$ or $X2 \leftarrow X3$, then they will never be statistically independent, no matter what we condition on! So, once we knew that $X2$ and $X3$ were independent given $X1$, there was actually no need to also test whether they were independent given $X4$ or given ${X1, X4}$. Since independence only tells us that we should remove an edge from the skeleton, and we already removed the edge $X2 - X3$, the information given by those last two tests wasn’t used to make any decisions about the skeleton, so we never needed to do those tests in the first place.
Once we have the skeleton, we need to look for potential collider structures by looking at triplets $A - C - B$ where there is no direct edge between $A$ and $B$. If we have such a structure, we then need to test whether $A$ and $C$ are dependent given $C$.

Rather than test all possible conditional independence relations, we could design an algorithm which uses these two insights in order to more efficiently estimate a CPDAG. Luckily, Spirtes, Glymour & Scheines (2000) already had this insight: This is the exact logic of their PC algorithm.

A full description of how the pcalg package works can be found here.

The function pcalg::pc() estimates the Markov Equivalence Class (CPDAG) using conditional independence tests as described above (assuming sufficiency and faithfulness). In order to do this, we need to tell the function what conditional independence test should be used thought the indepTest argument. pcalg comes with three pre-defined independence tests: gaussCItest for Gaussian variables, based on partial correlations, as well as discCItest and binCItest for discrete and binary variables, respectively. We also need to define an appropriate alpha level to be used for these independence tests. Here, let’s use the alpha = .01.

Finally, for reasons of computational efficiency, pc() doesn’t work with raw data but instead with a list containing sufficient statistics (suffStat): A summary of the relevant information from which the conditional independence tests can be calculated. Although this may seem strange in the current context, this helps speed things up when we have very large datasets. For Gaussian variables, the sufficient statistics are a) the correlation matrix and b) the sample size. See the examples under the pc() help file (?pc) for examples with binary and discrete data.

Code

suffStat <- list(C = cor(data), n = nrow(data))

Exercise 7

Let’s put all of this information together and use the PC algorithm with the data we gave you in the last exercise. Use the pc() function to estimate a Markov Equivalence Class and plot it.

Code

pc_fit1 <- pcalg::pc(suffStat = suffStat, indepTest = gaussCItest,
                      p = ncol(data), alpha = 0.01)

Answers

This should be the same graph you estimated in the previous exercise. The pc function provides you with the CPDAG directly (though notice that, in the background, first the skeleton is estimated using skeleton and then the collider structures are found as described earlier).

Code

# This is the default plotting method for pcalg - uses Rgraphviz
# You can also extract the adjacency matrix and plot using qgraph
# Note that you have to transpose the matrix; pcalg writes matrices from column to row
 # cpdag_mat <- as(pc_fit1,"matrix")
 # qgraph(t(cpdag_mat), labels = names, layout = layout, directed = cptf, title = "Estimated CPDAG", title.cex = 1.25, asize = 8)
plot(pc_fit1, main = "Inferred CPDAG using pcalg")

Loading required namespace: Rgraphviz

We can extract all of the separate DAGs in the equivalence class using the following code

Code

# Extract the adjacency matrix of the cpdag from pc_fit1
cpdag_mat <- as(pc_fit1,"matrix")

# Each row is a DAG adjacency matrix in vector form (by rows)
res1 <- pdag2allDags(cpdag_mat)

# We can get the adjacency matrix of an individual DAG using
res1_dags <- list()
for(i in 1:nrow(res1$dags)){
  res1_dags[[i]] <- t(matrix(res1$dags[i,],4,4,byrow = TRUE))
}
# Notice we have to transpose the adjacency matrix here for qgraph!

# We can plot each of these just as we did above
par(mfrow = c(1,3))
for(i in 1:3){
  qgraph(res1_dags[[i]], labels = names, layout = layout, directed = TRUE, asize = 8, vsize = 15)
}

We can use the ida() function to estimate the effect of an intervention according to each of the DAGs in the Markov Equivalence set. To do this we need to provide the output of the pc function and the covariance matrix of the data. So, to find the average causal effect of $X_1$ on $X_4$ (that is, the contrast between $E[Y | do(X = 1)]$ and $E[Y | do(X = 0)]$) we would use the code

Code

ida(1,4,cov(data), pc_fit1@graph, verbose = TRUE)



x= 1 y= 4 
pa1=  
pa2= 2 3 
Fit - y: 4 x: 1 |b.hat= 1.59166 
Fit - y: 4 x: 1 2 |b.hat= 2.587346 
Fit - y: 4 x: 1 3 |b.hat= -0.9988204

[1]  1.5916598  2.5873460 -0.9988204

This output gives us three numbers, each corresponding to an estimate of the effect of this intervention in one of the three DAGs in the Markov Equivalence set. By specifying verbose = TRUE we can see what regressions each number comes from. Compare this to the results you obtained from manually calculating this in the previous exercise - you should see you get approximately the same results. Differences come from the fact that we use raw data above, and only summary statistics here. (Note that the order of the effects does not necessarily correspond to the order in which the DAGs are plotted above!)

Now that you’ve seen how the pcalg package works, let’s try it out on a new example. Suppose that you know the true DAG is given by the following graph

Code

names <- LETTERS[1:4]
adjmat2 <- matrix(c(0,0,1,0,
                   0,0,1,0,
                   0,0,0,0,
                   0,1,1,0),4,4,byrow = TRUE)
lay2 <- matrix(c(-1,1,
                 .5,1,
                  0,0,
                  1,0),4,2,byrow = TRUE)
qgraph(adjmat2, labels = names, layout = lay2, directed = TRUE, asize = 8, vsize = 15)

Exercise

What do you expect the PC algorithm to be able to recover? What is the CPDAG for this DAG?

Code

# You can work this out by hand, or ``cheat'' by using the pcalg package
adjmat2 <- matrix(c(0,0,1,0,
                   0,0,1,0,
                   0,0,0,0,
                   0,1,1,0),4,4,byrow = TRUE)
colnames(adjmat2) <- rownames(adjmat2) <- names
g2 <- as(adjmat2,"graphNEL") # convert to graphNEL object for pcalg
# This function converts a DAG to a CPDAG
cpdag2 <- dag2cpdag(g2)

Answers

Code

plot(cpdag2)

Suppose that we have the following data, generated according to the above DAG

Code

set.seed(1234)
n <- 3000

A <- rnorm(n)
D <- rnorm(n)
B <- 0.50 * D + rnorm(n)
C <- -0.75 * A + B + D +  rnorm(n)

data2 <- cbind(A, B, C, D)

Exercise 9

Use the pcalg package to estimate the CPDAG. Do you obtain the correct CPDAG?

Code

suffStat2 <- list(C = cor(data2), n = nrow(data2))
pc_fit2 <- pc(suffStat = suffStat2, indepTest = gaussCItest,
p = ncol(data), alpha = 0.01)

Answers

Code

plot(pc_fit2, main = "Inferred CPDAG using pcalg")

Exercise 10

Estimate the effect on $C$ of an intervention to increase $A$ by one unit. Then, estimate the effect on $C$ of an intervention to increase $D$ by one unit. Use the ida() function as we demonstrated above

Answers

Code

 # Part 1: Effect of do(A) on C
ida(1,3,cov(data2), pc_fit2@graph, verbose = TRUE)



x= 1 y= 3 
pa1=  
pa2=  
Fit - y: 3 x: 1 |b.hat= -0.7318286

[1] -0.7318286

Code

# Part 2: Effect of do(D) on C
ida(4,3,cov(data2), pc_fit2@graph, verbose = TRUE)



x= 4 y= 3 
pa1=  
pa2= 2 
Fit - y: 3 x: 4 |b.hat= 1.480331 
Fit - y: 3 x: 4 2 |b.hat= 1.011716

[1] 1.480331 1.011716

Just as we saw in the previous answer, we are typically uncertain about the effect of interventions,because we are uncertain about the causal structure. Here however, we see the power of identifying collider structures. In this case we have identified that A -> C, so we only get a single estimate for the effect of an intervention on A. However, we are not certain whether D has only a direct effect or also an indirect effect on A. Because we don’t know whether B -> D or D -> B. So, we get two different estimates of the intervention effect.

Restricted SCMs

In the lecture we learned about a second strategy for causal discovery which involves restricting the causal model in some way. The basic principle is that, if we assume either a) non-Gaussian additive noise or b) non-linear relationships with additive noise, then we can discover the causal direction of arrows in the DAG. When we have two variables $X$ and $Y$ the basic approach is to fit a regression model in both ``directions’’ and then test for independence of the error and the predictor of the model.

Below we will get some practice with applying these techniques to bivariate systems, so called `cause-effect pairs’. We will also see that causal discovery based on linear but non-Gaussian models can be easily scaled up to the multivariate case using the so called LiNGAM (Linear Non-Gaussian Acyclic Models) algorithm.

Non-Gaussian Cause-Effect Pairs

Let’s start by reproducing the example given in the lecture slides, where the noise distribution of $Y$ is given by a uniform distribution $U(lb, ub)$. In a uniform distribution, all values between the lower bound $lb$ and upper-bound $ub$ are equally likely. For example, a uniform distribution $U(-1,1)$ looks like this (with 1000 samples):

Code

set.seed(1234)
n <- 5000
noise <- runif(n,-1,1)
hist(noise, prob = T)

Exercise 11

Generate data from the SCM: $Y$ $:=$ $X$ $+$ $\epsilon$ where $X$ and $\epsilon$ are drawn from a uniform distribution, as given above.

Code

set.seed(1234)
n <- 5000
noise <- runif(n,-1,1)
X <- runif(n, -1,1)
Y = X + noise

Exercise 12

Fit a linear regression model in the correct causal direction ($Y$ predicted by $X$) and the incorrect causal direction ($X$ predicted by $Y$). Save the residuals for each model. Recreate the figures from the lecture. That is, for each model, plot a) the predictor against the outcome variable and b) the predictor variable against the residuals. What do you notice?

Code

library(dHSIC)
myx <- lm(Y~X)
mxy <-  lm(X~Y)

# save residuals
myx_r <- myx$residuals
mxy_r <- mxy$residuals

Answers

Code

# First, for the "correct" causal model Y = X + epsilon
par(mfrow=c(1,2))
plot(X, Y, col = "blue", xlab = "X", ylab = "Y", ylim = c(-2,2), xlim = c(-2,2), lwd = 2)
abline(myx, col = "red")
abline(h = 0)
abline(v = 0)

plot(X,myx$residuals, col = "red", ylab = expression(epsilon[Y]))
abline(h = 0)
abline(v = 0)

Code

# Second, for the incorrect causal model X = Y + epsilon
par(mfrow=c(1,2))
plot(Y, X, col = "blue", xlab = "Y", ylab = "X", ylim = c(-2,2), xlim = c(-2,2), lwd = 2)
abline(mxy, col = "red")
abline(h = 0)
abline(v = 0)

plot(Y,mxy$residuals, col = "red", ylab = expression(epsilon[X]))
abline(h = 0)
abline(v = 0)

Exercise 13

We now want to try and learn the causal direction from our simulated data. We can this by testing, for each model, whether the predictor variable is independent of the residuals. For this you can use the dhsic.test function from the package dHSIC that we used in last weeks lab. Do you recover the correct causal direction?

Answers

Code

dhsic.test(X,myx_r,method="gamma")

$statistic
[1] 0.2065139

$crit.value
[1] 0.6593727

$p.value
[1] 0.7236867

$time
GramMat   dHSIC CritVal 
  0.528   0.528   0.244 

$bandwidth
[1] 0.4080292 0.4062690

Code

dhsic.test(Y,mxy_r,method="gamma")

$statistic
[1] 35.63961

$crit.value
[1] 0.5956935

$p.value
[1] 1.336477e-218

$time
GramMat   dHSIC CritVal 
  0.341   0.341   0.344 

$bandwidth
[1] 0.5716613 0.2854028

We fail to reject the null hypothesis X ind $\epsilon_y$. We reject the null hypothesis Y ind $\epsilon_x$. Hence, the model where the predictor is independent of the error is Y = X + $\epsilon$. So, we recover the correct causal direction

Exercise 14

Repeat the above process, but now choosing your own distribution for the error. Try first generating data from a distribution and plotting it to understand the distribution. You can try, for example, chi-squared rchisq, exponential rexp or gamma distributions rgamma amongst many others. See the information on Distributions in the stats package for detail, or even start with a normal distribution, and apply some transformations (such as abs()) to make it non-normal. What do you find?

Non-Linear Cause-Effect Pairs

Now let’s get some practice working with non-linear SCMs. Let’s try and recreate the example shown in the lecture slides. The process you’ll take is very similar to what we did in the previous question. This time, let’s use Gaussian noise, but have Y be a non-linear function of X. In the lecture we used the causal model

\[ \begin{align*} X &\sim \mathcal{N}(0,1.5^2) \\ Y &:= sin(X) + \epsilon_Y \quad \quad \epsilon_Y \sim {N}(0,0.2^2) \end{align*} \]

In the previous exercise we tried to determine causal direction using linear regression, but that’s obviously not suitable here. We want to put ourselves into the shoes of the naive researcher: Let’s imagine we know that either $X \rightarrow Y$ or $X \leftarrow Y$, and we know, whichever is true, the causal relationship will be non-linear. But, we don’t know exactly what that non-linear function is. So, we want to use a regression method that allows for any kind of non-linear relationship to be modelled.

Generalized Additive Models (GAMs) are the perfect tool for this purpose: They allow the outcome variable $Y$ to be any smooth function of the predictor variable(s) $X$. For details see chapter 7.7 of James, Witten, Hastie & Tibshirani (2018) Introduction to Statistical Learning. For our purposes here it suffices to know that we can fit a GAM to our data using gam(Y ∼ s(X)) from the package mgcv; the residuals can be accessed by gam(Y ∼ s(X))$residuals.

Exercise 15

First, generate data from the SCM described above

Code

set.seed(1234)
n <- 1000

X <- rnorm(n,0,1.5)
Y <- sin(X) + rnorm(1000,0,.2)

Exercise 16

Now, fit a GAM regression model in both the correct causal direction ($Y$ predicted by $X$) and the incorrect causal direction ($X$ predicted by $Y$). Save the residuals

Code

library(mgcv)

Loading required package: nlme

This is mgcv 1.9-1. For overview type 'help("mgcv-package")'.

Code

gyx <- mgcv::gam(Y ~ s(X))
gxy <- mgcv::gam(X ~ s(Y))

# save residuals
gyx_r <- gyx$residuals
gxy_r <- gxy$residuals

Exercise 16

Recreate the figures from the lecture. That is, for each model, plot a) the predictor against the outcome variable and b) the predictor variable against the residuals. What do you notice? Tip: First, plot the gam() model object itself, which will display the estimated non-linear function. Then plot the data and/or residuals using points(). What do you notice?

Code

library(mgcv)

gyx <- mgcv::gam(Y ~ s(X))
gxy <- mgcv::gam(X ~ s(Y))

# save residuals
gyx_r <- gyx$residuals
gxy_r <- gxy$residuals

Answers

Code

par(mfrow=c(1,2))

plot(gyx, rug = F, xlab = "X", ylab = "Y",  se = F, ylim = c(-2,2), xlim = c(-4.5,4.5), col = "red", lwd = 2)
points(X, Y, col = "blue")
abline(h = 0)
abline(v = 0)

plot(X,gyx_r , col = "red", ylab = expression(epsilon[Y]))
abline(h = 0)
abline(v = 0)

Code

par(mfrow = c(1,2))
plot(gxy, ylab = "X", xlab = "Y", rug = F, se = F, xlim = c(-2,2), ylim = c(-4.5,4.5), col = "red", lwd = 2)
points(Y, X, col = "blue")
abline(h = 0)
abline(v = 0)

plot(Y,gxy_r, col = "red", ylab = expression(epsilon[X]))
abline(h = 0)
abline(v = 0)

Exercise 17

Answers

Code

dhsic.test(X,gyx$residuals,method="gamma")

$statistic
[1] 0.194487

$crit.value
[1] 0.5780122

$p.value
[1] 0.8308409

$time
GramMat   dHSIC CritVal 
  0.015   0.009   0.008 

$bandwidth
[1] 0.9920256 0.1329000

Code

dhsic.test(Y,gxy$residuals,method="gamma")

$statistic
[1] 7.734531

$crit.value
[1] 0.5782414

$p.value
[1] 3.494625e-50

$time
GramMat   dHSIC CritVal 
  0.015   0.008   0.010 

$bandwidth
[1] 0.5275154 0.3873251

We fail to reject the null hypothesis X ind $\epsilon_y$. We reject the null hypothesis Y ind $\epsilon_x$. Hence, the model where predictor ind error is Y = X + $\epsilon$. So, we recover the correct causal direction.

Try repeating the above process, but now choosing your own distribution for the error. For example, try using $X^3$ or $e^X$. Play around with making functions that are “closer” to linear than others!

Bonus I: LiNGAM - Multivariate Linear Non-Gaussian Models (Bonus)

The LiNGAM algorithm generalizes discovery of causal direction in linear Non-Gaussian models to the multivariate case. It turns out that, if our causal system is linear, non-Gaussian and acyclic, then there is typically only one DAG that implies independence between all error terms. The LiNGAM method is implemented in the pcalg package, function lingam().

Here we provide you with a simulated dataset which shares the same DAG as the simulated data in Exercise 2.1. In this case, however, the noise distributions are all uniform.

Code

data2 <- readRDS("data_cd_ex2.RDS")

Exercise

Use the lingam() function to estimate the DAG structure. What do you notice?

Code

lin_fit <- pcalg::lingam(data2)
adjlin <- as(lin_fit, "amat")

# very hacky way of transforming amat to a numeric matrix
adjmat2 <- apply(adjlin,c(1,2),isTRUE) + 0

Answers

LiNGAM succeeds in finding the data-generating DAG! No equivalence class needed

Code

qgraph(adjmat2, layout = layout)

Bonus II: Empirical Cause-Effect Pairs

So far we worked only with simulated cause-effect pairs. But researchers at the Unversity of Tuebingen have actually collected a little over 100 empirical datasets of real-life cause-effect pairs. In each case the true causal model is (thought to be) known. Some example pairs are (first cause, then effect):

Altitude and Temperature
Age and Height
$CO_2$- Emissions and energy use
Employment and Population

The following code downloads the above four datasets from the database

Code

url <- 'https://webdav.tuebingen.mpg.de/cause-effect/'

temp <- read.table(paste0(url, 'pair0001.txt'), col.names = c('altitude', 'temperature'))
age <- read.table(paste0(url, 'pair0022.txt'), col.names = c('age', 'height'))
co2 <- read.table(paste0(url, 'pair0073.txt'), col.names = c('co2', 'energy'))
pop <- read.table(paste0(url, 'pair0084.txt'), col.names = c('employment', 'population'))

Try to find the causal direction in each of the four example datasets. In each case, try out using the linear non-Gaussian approach (you can use the lingam() function or the approach in earlier exercises) as well as the non-linear causal model approach which uses gam(). Are there situations where one works better than the other? You can download more examples and check the answers on the database website.

Invariant Causal Prediction (Multiple Environments)

In this part of the practical you will get some hands on experience using Invariant Causal Prediction (ICP). ICP is different from the techniques we discussed previously in the lab in several ways.

First, ICP needs data taken from different environments. We will focus on the situation where we have both observational data and data from a setting where the causal system is undergoing an intervention or interventions of some kind. Second, ICP aims to recover only part of the causal DAG. We decide on a target variable $Y$, and then try to learn what variables are direct causes of $Y$. ICP works by looking for those conditional dependencies (i.e. predictive relationships) that stay the same across environments. There are extensions that can deal with learning the whole graph, but we will focus on this simpler case for now.

There are some important things to keep in mind when using ICP. It’s important to remember that, although we have often discussed interventions that set a variable to a constant value (i.e. $do(X = 1)$), it is possible to define many other types of interventions. For example, we can imagine that instead of forcing everyone to take a single aspirin tablet, we might randomly assign people to take some amount of aspirin according to a certain distribution. We can imagine other interventions which increase the mean of a variable, or how it reacts to other variables in the system (e.g. encouraging but not forcing everyone to take an aspirin, or to take an aspirin as soon as they feel any pain at all!). The neat thing about ICP is that we don’t need to know what variables were intervened on, or what those interventions are. The key thing is, however, that these interventions shouldn’t act directly on the target variable $Y$. For example, if we randomly assign $Y$ values, then, in the intervention environment, nothing will predict $Y$ - we will have broken the direct causal links by random assignment.

With this refresher in place, let’s take a look at an example. We provide you with data (data_cd_ex3.RDS) simulated from a four-variable SCM under two conditions here.

Code

icpdata <-  readRDS("data_cd_ex3.RDS")

The first 400 rows are observational data. The last 500 rows come from a setting some interventions are applied to the system. The environment (observational or interventional) is denoted by the variable in the fifth column ExpInd. The DAG of the SCM is shown below

Code

adj <- matrix(
  c(0,0,0,0,
    1,0,0,0,
    1,0,0,0,
    0,1,1,0), 4,4, byrow = T
)
names <- c("X1","X2","Y","X3")

qgraph(t(adj), layout = 
         matrix(c(0,1,
                  -1,0,
                  1,0,
                  0,-1),4,2,byrow = T),
       labels = names)

Take it that we are interested in discovering which variables are direct causes of the variable $Y$.

Exercise 18

Use the ICP() function from the package InvariantCausalPrediction to try and recover the direct causes of $Y$. You’ll need to supply a matrix of possible predictor variables $X$, the outcome variable of interest $Y$ and an indicator variable $ExpInd$. Otherwise, use the default settings. Check what variables show a significant causal effect using the summary() of the ICP output. Otherwise, look at the acceptedSets and pick out the variable (if any) that appears in all sets of predictors.

Code

library(InvariantCausalPrediction)
icp_out <- ICP(X = icpdata[,c("X1","X2","X3")], Y = icpdata[,"Y"], ExpInd = icpdata[,"ExpInd"])


 accepted set of variables 1
 *** 25% complete: tested 2 of 8 sets of variables 
 accepted set of variables 1,2
 accepted set of variables 1,2,3

Answers

There are three regression models whose parameters stay the same across environments. The model with X1 predictor, X1 and X2 as predictors, and X1 X2 and X3 as predictors. From the slides we take the so-called intersection - this just means we look at what variable is present in all of these models. That’s X1!

Code

icp_out$acceptedSets

[[1]]
[1] 1

[[2]]
[1] 1 2

[[3]]
[1] 1 2 3

Code

summary(icp_out)


 Invariant Linear Causal Regression at level 0.01 (including multiplicity correction for the number of variables)
 Variable X1 shows a significant causal effect
 
     LOWER BOUND  UPPER BOUND  MAXIMIN EFFECT  P-VALUE   
X1         0.07         0.62            0.07   0.0075 **
X2        -0.12         0.44            0.00   1.0000   
X3        -2.05         0.00            0.00   1.0000   
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The summary is just another way of looking at the same output - we think X1 is the cause of Y. We also get an estimate of the strength of that causal relation. We also get an estimate of the strength of that causal relation. We correctly recover that X1 is the only direct cause of Y.

Exercise 19

We can perform a kind of ICP method ourselves using linear regression. Let’s not look at all possible regression models like the ICP method does, but instead just look at two: A) the regression of $Y$ on its direct cause $X1$ and B) the regression of $Y$ on its non-cause $X2$. Perform each regression model seperately in each environment, and inspect how the estimated parameters of each change or do not change across environments

Code

icpdata <- as.data.frame(icpdata)
a1 <- lm(Y ~ X1, data = icpdata, subset = (icpdata[,"ExpInd"]== 1))
a2 <- lm(Y ~ X1, data = icpdata, subset = (icpdata[,"ExpInd"]== 2))

b1 <- lm(Y ~ X2, data = icpdata, subset = (icpdata[,"ExpInd"]== 1))
b2 <- lm(Y ~ X2, data = icpdata, subset = (icpdata[,"ExpInd"]== 2))

Answers

Code

print(a1); print(a2)


Call:
lm(formula = Y ~ X1, data = icpdata, subset = (icpdata[, "ExpInd"] == 
    1))

Coefficients:
(Intercept)           X1  
    0.01962      0.50835


Call:
lm(formula = Y ~ X1, data = icpdata, subset = (icpdata[, "ExpInd"] == 
    2))

Coefficients:
(Intercept)           X1  
    0.02994      0.50756

For the direct cause variable, the parameters are very similar across these two models

Code

print(b1); print(b2)


Call:
lm(formula = Y ~ X2, data = icpdata, subset = (icpdata[, "ExpInd"] == 
    1))

Coefficients:
(Intercept)           X2  
     0.0249       0.2950


Call:
lm(formula = Y ~ X2, data = icpdata, subset = (icpdata[, "ExpInd"] == 
    2))

Coefficients:
(Intercept)           X2  
    0.54860     -0.05284

For the non-causal variable X2, the parameters change quite a bit. This is a somewhat extreme example, but this is the basic principle!

Below I give you the code used to generate the observational and intervention data. Notice what those interventions are - an intervention to change the mean of $X1$ and an intervention to randomly assign values of $X2$. Try to play around with your own interventions and test out what ICP can give you. If one method doesn’t work, try another - some of the tests for invariance are more suitable for some types of systems and interventions than others (see the section test under ?ICP)

Code

n1 <- 400
n2 <- 500
ExpInd <- c(rep(1,n1), rep(2,n2))

set.seed(123)
# In the observational setting, X1 has a mean of zero
X1o <- rnorm(n1, 0, 1)
# In the intervention setting, X1 has a mean of 1
X1i <- 1 + rnorm(n2,0,1)

X1 <- c(X1o, X1i)

# Y is caused by X1
Y <- 0.5* X1 +  rnorm(n1 + n2, 0, 1)

# Observational: X2 is caused by X1
X2o <- 1.5 * X1[1:n1] + rnorm(n1, 0,.5)

# Intervention: X2 is randomly assigned
X2i <- rnorm(n2,0,.5)

X2 <- c(X2o, X2i)

# X3 is caused by Y and X2
X3 <- -.4 * Y + .2* X2 + rnorm(n1 + n2,0,.2)

X <- cbind(X1, X2, X3)

icpdata <- cbind(X1,X2,X3,Y,ExpInd)

# ICP(X = icpdata[,c("X1","X2","X3")], Y = icpdata[,"Y"], ExpInd = icpdata[,"ExpInd"])

Additional Readings

Glymour, C., Zhang, K., & Spirtes, P. (2019). Review of causal discovery methods based on graphical models. Frontiers in Genetics, 10: 524. https://doi.org/10.3389/fgene.2019.00524

This paper covers two of the topics we considered in the lecture: 1) conditional-independence based methods (called constraint-based and score-based methods in the paper) and 2) restricted SCMS (called Functional Causal Models in the paper). It does not cover Invariant Causal Prediction (ICP) methods.

Invariant Causal Prediction is a relatively new technique, and so, tutorial-style papers are somewhat lacking. The paper introducing this technique is given below. Below we also provide a youtube link to a talk by Jonas Peters (the developer of ICP) which may be easier to follow.

Peters, J., Bühlmann, P., & Meinshausen, N. (2016). Causal inference by using invariant prediction: identification and confidence intervals. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 947-1012.

www.youtube.com/watch?v=ijB2odErYLI&ab_channel=CenterforCausalDiscovery