Day 1 Practical: Potential Outcomes

Part 1: Potential Outcomes and Identifying Causal Effects

Author

Oisin Ryan

In this practical, we’ll get to grips with the basics of causal identifiability assumptions.

Since this is the first practical, make sure you have had a quick look at our setup guide.

As for all practicals in this course, we encourage you to work in groups and discuss amongst yourselves. This first part will be a little abstract; we will get closer to real-data applications as we go on.

Tip

The answers to each exercise are available as a collapsed code block. Try to work out the answer yourself before looking at this code block!

Potential Outcomes and Identifying Causal Effects

In the lecture we discussed how the potential outcomes framework defines causal effects, and causal inference, in terms of potential versions of some outcome variable of interest. In the real world, we never have access to all of these potential versions of the outcome variable, and so we must understand in which situations we can still make inferences about, for instance, the differences between those potential outcomes.

Let’s imagine that we are a team of researchers interested in the effectiveness of different weight-loss interventions. Download the dataset called potential_outcomes_d1.rds from here and load it into your R environment.

Code
# load the simulated dataset and look at the three potential-outcome columns
po_data <- readRDS("potential_outcomes_d1.rds")
head(po_data[,c("Y_diet","Y_surgery","Y_control")])

For now, let’s leave the real world behind! This dataset consists of hypothetical (i.e. simulated-by-us) values of three potential outcomes for a hypothetical population of 10,000 adults. Note that we are imagining here that we can somehow directly observe all potential outcomes for all participants (this can never happen in the real world).

In each case the variable Y represents the amount of weight change (in kg) three months after beginning a particular intervention. Negative values reflect weight loss.

  • Y_diet represents the potential outcome, for each individual, of a diet intervention, where individuals are given a daily meal plan to reduce their calorie intake.
  • Y_surgery represents the potential outcome, for each individual, under a gastric band surgery intervention.
  • Y_control represents the potential outcome under the control (i.e. no) intervention.

We also supply an id identifier column and some other information that we will use in later exercises. For now, let’s compute some causal effects!
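(If you want a quick overview of everything in the dataset, including id and the extra background information, you can inspect its structure.)

Code
# overview of all columns in the dataset and their types
str(po_data)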

Causal Effects

Exercise 1

Use the potential outcomes provided to calculate the individual causal effect (ICE) of receiving a diet intervention in comparison to receiving the control intervention (that is, for each individual). Then compute the average causal effect. What effect does dieting have on weight-loss in comparison to no intervention?

Code
# individual causal effect: the difference between the two potential outcomes, per person
ICE_diet_control <- po_data$Y_diet - po_data$Y_control
# average causal effect: the mean of the individual causal effects
ACE_diet_control <- mean(ICE_diet_control)
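The exercise only asks for the average, but it can also be instructive to look at how much the individual causal effects vary across people (optional, not required for the exercise):

Code
# inspect the spread of the individual causal effects
summary(ICE_diet_control)
hist(ICE_diet_control,
     main = "Individual causal effects: diet vs control",
     xlab = "Difference in weight change (kg)")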
Exercise 2

Use the same approach to calculate the individual causal effect of receiving the surgery intervention in comparison to the control intervention. Do these interventions have different ACE values?

Code
ICE_surgery_control <- po_data$Y_surgery - po_data$Y_control
ACE_surgery_control <- mean(ICE_surgery_control)

Yes: relative to the control intervention, the diet intervention produces an average weight loss of about 3 kg, while the surgery intervention produces an average weight gain of about 1 kg.

Code
message("The average causal effect of diet vs control is: ", ACE_diet_control)
The average causal effect of diet vs control is: -2.9985072
Code
message("The average causal effect of surgery vs control is: ", ACE_surgery_control)
The average causal effect of surgery vs control is: 1.0016405

Identifiability Assumptions 1

For now, let’s focus on the diet and control interventions. Based on the previous exercises we now know the true causal effect. But of course, in real life, we don’t have access to both versions of the potential outcome for every individual, and so we must try to infer or estimate the causal effect of interest based on information we do have.

Let’s re-connect to the real world. Imagine that we can run a study in which participants choose whether to engage in the dieting intervention or the control (i.e. no) intervention. We are able to enroll our entire population and observe their weight change three months after they choose their intervention.

In the dataset we have provided to you, this information is recorded in two columns:

  • Treatment_Received: records whether each participant engaged in the diet or the control intervention
  • Y_observed: records the weight-change value we observe after three months of following the intervention plan.
Exercise 3

Try to estimate the causal effect of the intervention based on the observed data. Use a naive estimate: the difference in observed weight-change values between treated (diet) and untreated (control) individuals. Does this match what you calculated in Exercise 1?

Code
# there are many ways that you could try to estimate the causal effect
# here we simply calculate the group difference (you may also use a t-test, etc.)
mean_Yobs_diet <- mean(po_data$Y_observed[po_data$Treatment_Received == "diet"]) 
mean_Yobs_control <- mean(po_data$Y_observed[po_data$Treatment_Received == "control"])
est <- mean_Yobs_diet - mean_Yobs_control
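As the comment above notes, a t-test or a simple regression gives an equivalent naive comparison; for example (optional):

Code
# equivalent naive comparisons of the observed outcome between the two groups
t.test(Y_observed ~ Treatment_Received, data = po_data)
# the coefficient for the treatment term below is the same group difference (diet minus control)
lm(Y_observed ~ Treatment_Received, data = po_data)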

You’ll see that our estimate of the effect is quite different from the true causal effect we computed earlier. Of course, estimates will rarely match the true value exactly because of sampling variability, but in this situation we know that the observed group difference is a biased estimator of the true causal effect.

Let’s investigate where this bias is coming from by checking whether our causal identifiability assumptions hold or not. To do this, we will again cheat by using the true potential outcomes. We do this so that you gain an understanding of what those identifiability assumptions really mean. In the real world, we often cannot test or verify whether these assumptions hold (as in all scientific endeavours, assumptions are ever present and ever important!).

Exercise 4

The assumption of exchangeability states that the potential outcomes should be independent of the actual treatment received. So, for instance, the distribution of Y_diet (the potential outcome variable recording each person’s hypothetical response to the diet) should not systematically differ between individuals who actually ended up in the diet condition and those who actually ended up in the control condition. Investigate whether this assumption holds, using the actual treatment received and the relevant potential outcomes.

Code
# you can investigate the distributions in many ways, here we will compute means
# first, we compute the means of the dieting potential outcome, in both the "treatment received" groups
mean_po_diet_treated <- mean(po_data$Y_diet[po_data$Treatment_Received == "diet"]) 
mean_po_diet_control <- mean(po_data$Y_diet[po_data$Treatment_Received == "control"])

# here we do the same, but this time checking the control potential outcome 
mean_po_control_treated <- mean(po_data$Y_control[po_data$Treatment_Received == "diet"]) 
mean_po_control_control <- mean(po_data$Y_control[po_data$Treatment_Received == "control"])

Notice here again that we are using the potential outcomes Y_diet and Y_control, and not the observed outcome, which is all we would actually have in practice. We are testing for systematic differences between the Treatment_Received groups on either of those two variables.

By creating a simple 2-by-2 table we see that there are systematic differences. On average, the potential outcome under the control condition is about the same for individuals who did and did not actually receive the diet treatment. However, this is not the case for the potential outcome under the diet treatment: the mean of Y_diet is systematically lower (i.e. more weight loss) for those who chose the diet treatment than for the controls. In other words, the diet treatment would have been considerably more effective, on average, for those who actually chose to diet than it would have been for those in the control group. This indicates that the groups are not exchangeable.

Code
data.frame("Mean PO Y_Diet" = c(mean_po_diet_treated, mean_po_diet_control),
           "Mean PO Y_Control" = c(mean_po_control_treated, mean_po_control_control),
           row.names = c("Received Diet", "Received Control"),
           check.names = FALSE
)
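The answer above compares group means; you could equally compare the distributions in other ways, for example with a t-test for each potential outcome (optional; any systematic difference points to the same conclusion):

Code
# test whether each potential outcome differs between the treatment-received groups
t.test(Y_diet ~ Treatment_Received, data = po_data)
t.test(Y_control ~ Treatment_Received, data = po_data)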

In our hypothetical study, we also collected background information on the participants. The research team thinks that whether the participants have been diagnosed with Diabetes or not is an important piece of background information; having diabetes may make participants more or less likely to choose the diet intervention, and it likely also has an effect on the effectiveness of any weight loss intervention.

Exercise 5

Investigate whether conditional exchangeability holds, that is, whether the potential outcomes are independent of actual treatment received conditional on diabetes. You can do this by repeating the same procedure as above, but doing this separately for individuals with (Diabetes == 1) and without (Diabetes == 0) diabetes.

Code
# we will compute means again, focusing on Y_diet
  # first for those without diabetes
mean_po_diet_treated_d0 <- mean(po_data$Y_diet[po_data$Treatment_Received == "diet" & po_data$Diabetes == 0]) 
mean_po_diet_control_d0 <- mean(po_data$Y_diet[po_data$Treatment_Received == "control" & po_data$Diabetes == 0])

  # then for those with diabetes
mean_po_diet_treated_d1 <- mean(po_data$Y_diet[po_data$Treatment_Received == "diet" & po_data$Diabetes == 1]) 
mean_po_diet_control_d1 <- mean(po_data$Y_diet[po_data$Treatment_Received == "control" & po_data$Diabetes == 1])

# put this in a new 2x2 table
dtab <- data.frame("Mean_Y_diet_NoDiabetes" = c(mean_po_diet_treated_d0,mean_po_diet_control_d0),
           "Mean_Y_diet_Diabetes" =  c(mean_po_diet_treated_d1,mean_po_diet_control_d1),
           row.names = c("Treated","Control")
)

This time we focus on the potential outcome Y_diet only, so the columns of this table represent those without and with diabetes, respectively. As we can see, within each stratum defined by Diabetes, the average potential outcome is (approximately) equal across the two treatment groups. In other words, conditional on diabetes, the potential outcome does not depend on the actual treatment received: we have conditional exchangeability.

Code
print(dtab)
        Mean_Y_diet_NoDiabetes Mean_Y_diet_Diabetes
Treated              -7.008220            -2.990204
Control              -6.997077            -3.009453
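An equivalent, more compact way to compute the same comparison (in long format) is with aggregate(); this is just a convenience, not a different method:

Code
# mean of Y_diet per combination of treatment received and diabetes status
aggregate(Y_diet ~ Treatment_Received + Diabetes, data = po_data, FUN = mean)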

When we have conditional exchangeability given a covariate (and when the other identifiability assumptions hold), we can obtain an unbiased estimate of the causal effect by comparing the treatment groups within levels of that covariate and then averaging the results. This is known as stratification. We’ll return to this in the later exercise of today, which deals with Estimation.
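In symbols, the stratified estimate we will compute in the next two exercises can be written as follows (a sketch, writing T for the treatment actually received and X for the covariate we stratify on):

$$\widehat{ATE} = \sum_{x} \hat{P}(X = x)\,\Big(\hat{E}[\,Y \mid T = \text{diet},\, X = x\,] - \hat{E}[\,Y \mid T = \text{control},\, X = x\,]\Big)$$

Under conditional exchangeability given X (and the other identifiability assumptions), this weighted average of within-stratum differences equals the average causal effect of diet versus control.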

Let’s return again to the real world and try estimating the causal effect of interest under the assumption of conditional exchangeability.

Exercise 6

Estimate the average treatment effect among those with Diabetes. Repeat this for those without Diabetes. You now have two estimates, which represent conditional ATEs (conditional average treatment effects).

Code
# compute the conditional means (note: here you could also use regression, ancova,
# or split the sample and compute t-tests etc.)
mean_treated_d0 <- mean(po_data$Y_observed[po_data$Treatment_Received == "diet" & po_data$Diabetes == 0]) 
mean_control_d0 <- mean(po_data$Y_observed[po_data$Treatment_Received == "control" & po_data$Diabetes == 0]) 

mean_treated_d1 <- mean(po_data$Y_observed[po_data$Treatment_Received == "diet" & po_data$Diabetes == 1]) 
mean_control_d1 <- mean(po_data$Y_observed[po_data$Treatment_Received == "control" & po_data$Diabetes == 1]) 
# compute the conditional ATEs (the treatment-control difference within each stratum)
CATE_d0 <- mean_treated_d0 - mean_control_d0
CATE_d1 <- mean_treated_d1 - mean_control_d1
Exercise 7

To obtain an estimate of the Average Treatment Effect, we need to know what proportion of individuals have and do not have diabetes in our population. Compute this, and then take a weighted average of the conditional ATEs from above. For example, if the split is 50/50, then a regular (non-weighted) average should suffice! Compare your estimate of the ATE with the true ATE you computed earlier.

Code
# we can use "table" to get the counts per Diabetes group, then divide by the total
# (rounded to two decimals; the split is close to 50/50, so this rounding is harmless)
props <- round(
  table(po_data$Diabetes)/nrow(po_data),
  2)

# weighted average of the conditional ATEs, weighted by the group proportions
ATE_estimate <- CATE_d0 * props[1] + CATE_d1 * props[2]

The proportions are about 50/50 in the data. The conditional ATEs differ from one another, but when we take the appropriate weighted average, we again get an estimate of around -3 kg for the ATE, matching the true ATE we computed in an earlier exercise.

Code
ATE_estimate
       0 
-2.99858 
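For reference, the same stratified estimate can also be written more compactly (a sketch of an equivalent computation, using the exact rather than the rounded proportions):

Code
# stratum-specific diet-vs-control differences, then a weighted average over strata
strata_means <- tapply(po_data$Y_observed,
                       list(po_data$Treatment_Received, po_data$Diabetes),
                       mean)
cates <- strata_means["diet", ] - strata_means["control", ]
weights <- table(po_data$Diabetes) / nrow(po_data)
weighted.mean(cates, weights)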

Identifiability Assumptions 2

From the last exercises, we can see the importance of the (conditional) exchangeability assumption. Conditional exchangeability is sometimes colloquially referred to as the “no (unobserved) confounding” assumption. While this assumption is obviously important, it is far from the only important causal identifiability assumption. Dealing with confounding is necessary for causal inference, but it is far from sufficient.

Let’s revisit the idea of performing a hypothetical study on weight loss interventions. This time, let’s imagine that we have access to the same population of individuals, but instead of performing a (non-randomized) dieting intervention, we are only capable of observing different behaviours that people engage in (for instance, by asking them questions, or by accessing medical records on those individuals).

In addition, let’s tweak the research question we used in the previous exercises. There we focused specifically on a dieting intervention, but this time our research team is a little less specific. They want to answer the question: what is the effect of engaging in weight-loss behaviours, versus not engaging in any weight-loss behaviours, on body weight over a three-month interval? We again collect information on Diabetes, which we think is the most important determinant of the likelihood of engaging in weight-loss behaviours, and which we believe also determines the effectiveness of any such intervention.

Exercise 8

Before we look at any data, revisit the slides on causal identifiability assumptions, and your answers from the previous exercises. What assumptions should we be concerned about here, if any?

Download the dataset called potential_outcomes_d2.rds from here and load it into your R environment.

This represents a dataset that our study team could potentially collect from our population of interest. In the Treatment_Received column, individuals are tagged as "treated" if they engaged in any weight-loss behaviour or intervention, and as "control" if they did not.

Exercise 9

Try to estimate the causal effect the researchers are interested in. Compare this to the causal effects we estimated in earlier exercises. What do you notice?

Code
data_obs <- readRDS("potential_outcomes_d2.rds")

mean_treated_d0 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "treated" & data_obs$Diabetes == 0]) 
mean_control_d0 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "control" & data_obs$Diabetes == 0]) 

mean_treated_d1 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "treated" & data_obs$Diabetes == 1]) 
mean_control_d1 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "control" & data_obs$Diabetes == 1]) 
# compute the conditional ATEs (the treatment-control difference within each stratum)
CATE_d0 <- mean_treated_d0 - mean_control_d0
CATE_d1 <- mean_treated_d1 - mean_control_d1

# weighted average, reusing the diabetes proportions computed earlier (same population)
ATE_estimate <- CATE_d0 * props[1] + CATE_d1 * props[2]
Exercise 10

Recall that we are studying the same population of individuals all along. Take a sample of individuals who are in the treatment group in this dataset. Compare their Y_observed values to the different potential outcome values that are present in the first dataset we gave you. What do you notice?

Code
# identify which participants received "treatment" in this dataset
treated_ids <- data_obs[data_obs$Treatment_Received == "treated", "id"]

# grab the potential outcomes of those individuals from po_data and match them
# to their observed outcomes by id (safer than cbind, which assumes identical row order)
combined_data <- merge(
  po_data[po_data$id %in% treated_ids, c("id", "Y_diet", "Y_surgery")],
  data_obs[data_obs$id %in% treated_ids, c("id", "Y_observed", "Treatment_Received")],
  by = "id"
)
Exercise 11

Based on your answer to the previous question, what causal identifiability assumption is violated here?

In this situation, we can easily argue that the consistency assumption is violated: the treatment of interest is not well defined. This can sometimes seem like a vague assumption, so here we try to make it concrete. In this example, for some treated individuals we observe the potential outcome Y_surgery, while for others we observe the potential outcome Y_diet. So, we are not consistently observing either potential outcome of interest. The effect estimate is an uninterpretable blend of contrasts between entirely different potential outcomes, and hence of entirely different treatment effects.

Code
head(combined_data)

Imagine that we have two different research teams, each with the same vague research question. One team observes a population where dieting happens to be the only intervention anyone engages in; the other observes a population where surgery is the popular option. The two teams would come to entirely different and contradictory effect estimates and conclusions, and neither would yield practical or useful insights that could guide real-life decision making.

In fact, what we have presented here is an optimistic scenario; in reality, there are thousands and thousands of distinct weight-loss interventions that individuals could engage in (and so, we would observe a mix of all of those potential outcomes in the best-case scenario, not just two!).

Exercise 12 (Discussion)

Think about the other assumption that makes up SUTVA: no interference. Can you think of a situation where “no interference” might be violated in your own research area? Discuss with your partner(s)!