Day 1 Practical: Potential Outcomes

Part 2: Estimating Causal Effects

Author

Oisin Ryan

In this practical, we’ll get to grips with the basics of estimating causal effects using stratification and matching. We will focus on the example datasets from the first practical, where we deal only with the case of a single confounding variable. The bonus exercises of today work through a larger multivariate example. Please take the practicals at your own pace - if you feel like you already know what stratification and matching are, feel free to skip to the bonus exercise.

Tip

The answers to each exercise are available as a collapsed code block. Try to work out the answer yourself before looking at this code block!

Estimating Causal Effects

Let’s get some practice with estimating causal effects using the same practice datasets from earlier, based on a hypothetical study of different weight-loss interventions. Load potential_outcomes_d1.rds (which you downloaded earlier from here) into your R environment.

Recall we discussed running a study in which participants can choose whether to engage in the dieting intervention or the control (i.e., no intervention). We are able to enroll our entire population, and observe their weight change 3 months after they choose their intervention. For this exercise we need the information in the following columns:

  • Treatment_Received: records whether participants engaged in diet or control
  • Y_observed: records the weight-change value we observe after three months of following the chosen intervention plan.
  • Diabetes: records whether the participants have a prior diagnosis of diabetes or not
Code
po_data <- readRDS("potential_outcomes_d1.rds")
head(po_data[,c("Treatment_Received","Y_observed","Diabetes")])

Recall that in the earlier exercise we established that, while exchangeability was violated, the researchers suspected that this was due to the variable Diabetes: having diabetes may make participants more or less likely to choose the diet intervention, and it likely also has an effect on the effectiveness of any weight loss intervention. In other words, the research team may be happy to assume that the treatment groups are exchangeable conditional on diabetes status.

Checking covariate balance

Using the data, we can perform a quick descriptive analysis to determine whether, in fact, the distribution of Diabetes is different between treated and control units. If the researchers are correct and those with Diabetes are less likely to choose one of the interventions than the other, we should be able to see this in the data. This is also known as checking covariate balance between groups.

Exercise 1

Check whether the diet and control groups appear to differ in the proportion of those with or without diabetes. You can do this with a table, by computing mean differences in each group, or by visualizing!

We can check this in a number of ways: for instance, by plotting histograms of the observed variable across the dieting groups (as below, where blue is the diet group and red is control), or by simply computing a cross-table. Alternatively, you can compute standardized mean differences (SMDs) between the two groups on any covariate(s) of interest. We will see this in the bonus exercises, for those who are interested - for now let’s stick to plots and cross-tables.

Code
p1 <- hist(po_data$Diabetes[po_data$Treatment_Received == "diet"], plot = F)
p2 <- hist(po_data$Diabetes[po_data$Treatment_Received == "control"], plot = F)
plot( p1, col=rgb(0,0,1,1/4), xlim=c(0,1), main = "Covariate distributions across groups",
      xlab = "Diabetes")  # first histogram
plot( p2, col=rgb(1,0,0,1/4), xlim=c(0,1), add=T)  # second

Code
# cross-table (table()) converted to proportions
  # setting margin = 1 scales by the "rows" of the table,
  # i.e, gives the proportions in each treatment group
prop.table(table(po_data$Treatment_Received, po_data$Diabetes),1)
         
                  0         1
  control 0.7993198 0.2006802
  diet    0.2018994 0.7981006

Here we can see that in the control condition, around 20 percent of individuals have diabetes, while in the diet condition, around 80 percent have diabetes. Those with diabetes are much more likely to select the diet treatment than the control treatment, and vice versa!
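
Although we leave SMDs for the bonus exercises, here is a minimal sketch of how such a measure could be computed by hand for the Diabetes covariate (dedicated balance packages, such as cobalt, will do this for you):

Code
# standardized mean difference (SMD) for Diabetes between the two groups
x_diet    <- po_data$Diabetes[po_data$Treatment_Received == "diet"]
x_control <- po_data$Diabetes[po_data$Treatment_Received == "control"]
pooled_sd <- sqrt((var(x_diet) + var(x_control)) / 2)
(mean(x_diet) - mean(x_control)) / pooled_sd  # values near 0 indicate good balance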

When we have conditional exchangeability given a covariate (and when the other identifiability assumptions hold), then we can obtain an unbiased estimate of the causal effect in a number of different ways.
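
Formally, writing Y^a for the potential outcome under intervention a, conditional exchangeability given Diabetes states that, within levels of Diabetes, the potential outcomes are independent of which treatment was actually chosen:

\[ Y^{a} \perp A \mid Diabetes \quad \text{for } a \in \{diet, control\} \]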

Stratification

One way to estimate the average causal effect in this situation is to compare groups who are equal in value on that covariate. This is known as stratification: We split the population up into strata or subpopulations (that is, we condition on the covariate taking on a certain value), estimate group differences within those strata, and then combine those estimates again to get an estimate of the (average) causal effect in the population as a whole.
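
In general, if L denotes the stratifying covariate and A the treatment, this amounts to a weighted average of the stratum-specific mean differences (an approach also known as standardization):

\[ ATE = \sum_{l} \big( E[Y \mid A = diet, L = l] - E[Y \mid A = control, L = l] \big) \, p(L = l) \]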

Let’s see this in action.

Exercise 2

First, estimate the average treatment effect among those individuals with Diabetes. This is sometimes known as a Conditional Average Treatment Effect or CATE; the ACE/ATE conditional on a variable taking on a pre-specified value. To do this, select members of the control and treatment group with Diabetes == 1, and compute the difference in means. Repeat this process for those without diabetes (Diabetes == 0). You should end up with two different CATE estimates, one for each group.

Code
# compute the conditional means (note: here you could also use regression, ancova,
# or split the sample and compute t-tests etc.)
mean_treated_d0 <- mean(po_data$Y_observed[po_data$Treatment_Received == "diet" & po_data$Diabetes == 0]) 
mean_control_d0 <- mean(po_data$Y_observed[po_data$Treatment_Received == "control" & po_data$Diabetes == 0]) 

mean_treated_d1 <- mean(po_data$Y_observed[po_data$Treatment_Received == "diet" & po_data$Diabetes == 1]) 
mean_control_d1 <- mean(po_data$Y_observed[po_data$Treatment_Received == "control" & po_data$Diabetes == 1]) 
# compute mean in both
CATE_d0 <- mean_treated_d0 - mean_control_d0
CATE_d1 <- mean_treated_d1 - mean_control_d1
Exercise 3

To obtain an estimate of the Average Causal Effect (that is the average over the population as a whole), we need to know what proportion of individuals have vs do not have diabetes in our population, in order to combine our CATE estimates from above. Calculate this using the Diabetes variable. Then, take a weighted average of the conditional ATEs from above:

\[ ATE = CATE(Diabetes = 1) * p(Diabetes = 1) + CATE(Diabetes = 0) * p(Diabetes = 0) \]

Code
# we can use "table" to get cell counts, then divide by totals (and round for display)
props <- round(
  table(po_data$Diabetes)/nrow(po_data),
  2)

ATE_estimate <- CATE_d0 * props[1] + CATE_d1 * props[2]

After you’ve done this, compare your estimate of the average causal effect to the true causal effect from the previous practical session (parameter ACE_diet_control).

The proportions are about 50/50 in the data. The conditional ATEs differ from one another, but when we take the appropriate weighted sum, we again get an estimate of around -3 kg for the ATE, which matches the true ATE from the earlier exercise.

Code
ATE_estimate
        0 
-2.998951 

Let’s return to the example of the second research team from earlier, using the dataset potential_outcomes_d2.rds from here. Recall that this relates to a study team who wants to answer the question: What is the effect of engaging in weight-loss behaviours vs not engaging in any weight loss behaviours, on body weight, over a three month interval? In the Treatment_Received column individuals are tagged as being in the treatment group if they engaged in any weight-loss behaviour or intervention, and control if they did not.

Exercise 4

Try to estimate the causal effect the researchers are interested in. Compare this to the causal effects we estimated in the earlier practical. What do you notice? Is there an estimation problem here, or an identification problem?

Code
data_obs <- readRDS("potential_outcomes_d2.rds")

mean_treated_d0 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "treated" & data_obs$Diabetes == 0]) 
mean_control_d0 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "control" & data_obs$Diabetes == 0]) 

mean_treated_d1 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "treated" & data_obs$Diabetes == 1]) 
mean_control_d1 <- mean(data_obs$Y_observed[data_obs$Treatment_Received == "control" & data_obs$Diabetes == 1]) 
# compute the stratum-specific mean differences (CATEs)
CATE_d0_b <- mean_treated_d0 - mean_control_d0
CATE_d1_b <- mean_treated_d1 - mean_control_d1

# combine using the proportions of diabetes in this dataset
props_d2 <- table(data_obs$Diabetes)/nrow(data_obs)
ATE_estimate_d2 <- CATE_d0_b * props_d2[1] + CATE_d1_b * props_d2[2]

Here, the problem is not with the estimation procedure itself, but rather that what we are trying to estimate, the estimand, is not clearly defined. Confounder adjustment cannot help us out of a situation where the causal inference task is ill-defined or where other assumptions are violated.

Matching

Matching as a method of covariate adjustment is conceptually similar to stratification. The basic approach is to search for people in both treatment groups that have the same score on the covariate or covariates you wish to adjust for (i.e., that have the same profile of covariate values). You then create a new dataset consisting of these matched pairs. After matching, the matched treated and control groups are guaranteed to be balanced on their covariate values: Every unit in the treated/control groups with diabetes is matched to (i.e. part of a pair with) a unit from the other group that also has diabetes, and vice versa for those without diabetes. Note that the same units may be matched multiple times; this allows us to account for the fact that, say, control units with diabetes are relatively more rare than treated units with diabetes.

There are many ways to perform matching, and many options to choose between. In our case, we are performing exact matching on a single variable, so we supply you here with a simple function which will do the matching for you. There’s no need to understand the function in any detail. The key detail is that you must specify a direction for the matching, that is, whether you wish to find a control unit to match to every treated unit, or a treated unit to match to every control unit.

Code
# This function finds either a control unit for every treated unit,
  # or a treated unit for every control unit (the matching "direction").
  # It works by looping over the values of Diabetes; within each value, it samples
  # rows (with replacement) from the opposite group that have the same Diabetes value,
  # and combines these with the original rows to give matched pairs.
fast_match <- function(mydata, direction = c("treated_to_control", "control_to_treated")) {
  direction <- match.arg(direction)
  
  treated <- mydata[mydata$Treatment_Received == "diet", ]
  control <- mydata[mydata$Treatment_Received == "control", ]
  
  matched_list <- list()
  
  for (d in unique(mydata$Diabetes)) {
    if (direction == "treated_to_control") {
      from_group <- treated[treated$Diabetes == d, ]
      to_group <- control[control$Diabetes == d, ]
    } else if (direction == "control_to_treated") {
      from_group <- control[control$Diabetes == d, ]
      to_group <- treated[treated$Diabetes == d, ]
    }
    
    if (nrow(from_group) == 0 || nrow(to_group) == 0) next
    
    sampled_indices <- sample(1:nrow(to_group), size = nrow(from_group), replace = TRUE)
    matched_to <- to_group[sampled_indices, ]
    
    if (direction == "treated_to_control") {
      matched <- data.frame(
        treated_id = from_group$id,
        control_id = matched_to$id,
        diabetes = d,
        Y_treated = from_group$Y_observed,
        Y_control = matched_to$Y_observed
      )
    } else {
      matched <- data.frame(
        treated_id = matched_to$id,
        control_id = from_group$id,
        diabetes = d,
        Y_treated = matched_to$Y_observed,
        Y_control = from_group$Y_observed
      )
    }
    
    matched_list[[as.character(d)]] <- matched
  }
  
  do.call(rbind, matched_list)
}
Exercise 5

Use the function above to a) find a control unit match for every treated unit, AND b) find a treated unit match for every control unit (use the direction argument). Combine these datasets of matched pairs together (e.g., using rbind). This dataset now consists of every person who was in the diet condition, matched to a control unit with the same diabetes value, AND every person who was in the control condition, matched to a treated unit with the same diabetes value. What do you notice about the distribution of diabetes in the dataset as a whole?

Code
# Treated -> Control matches
matched_tc <- fast_match(po_data, direction = "treated_to_control")

# Control -> Treated matches
matched_ct <- fast_match(po_data, direction = "control_to_treated")

# Combine both
matched_all <- rbind(matched_tc, matched_ct)

props_allmatched_diabetes <- prop.table(table(matched_all$diabetes))
Exercise 6

Estimate the ATE with this new dataset. Compare it to your estimate of the ATE from the stratification approach above.

Code
ATE_matched <- mean(matched_all$Y_treated - matched_all$Y_control)

Here, the data is given in the structure of matched pairs: each row represents a pair of treated and control individuals, rather than the previous data structure, where each row was a separate individual. In a way, we can think about matching as creating a new, synthetic dataset, with twice as many individuals in the treatment and control groups, and where Diabetes is perfectly balanced between these groups. You can see this by checking the proportions with table(); half of the matched pairs have diabetes, and half of them don’t.
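
A quick look at the first few rows of matched_all makes this paired structure visible:

Code
head(matched_all)  # each row is one matched treated-control pair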

Code
props_allmatched_diabetes

      0       1 
0.50052 0.49948 

When we take the difference in means between all matched treated and control units, we get an estimate which is very close to the one from stratification. Since the matching involves random sampling, running the procedure again with a different seed will give a slightly different result, but one that is still very close to the stratification estimate of the ATE.

Code
ATE_matched
[1] -2.996805
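
If you want your matched dataset, and hence this estimate, to be exactly reproducible, you can fix the random seed before matching. A minimal sketch (the seed value itself is arbitrary):

Code
set.seed(123)  # any fixed seed gives reproducible matches; 123 is arbitrary
matched_tc  <- fast_match(po_data, direction = "treated_to_control")
matched_ct  <- fast_match(po_data, direction = "control_to_treated")
matched_all <- rbind(matched_tc, matched_ct)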

You can also explore, for instance, how often different individuals are matched to one another. This tells you something about the stability of your estimates; generally speaking, we would like to avoid a single individual with some rare attributes being matched very many times, as they would then have a lot of influence on the causal effect estimate. In this case, with a single binary matching variable, we wouldn’t expect anything like this to happen.

Code
# Combine treated and control counts for a quick histogram
all_match_counts <- c(as.integer(table(matched_all$treated_id)),
                      as.integer(table(matched_all$control_id)))

hist(all_match_counts, breaks = 20,
     main = "Histogram of Times Units Are Matched",
     xlab = "Number of Times Matched",
     ylab = "Frequency",
     col = "skyblue")

Behind the scenes, what is happening here is actually very similar to stratification or weighting. When searching for matches for the treated units, those relatively rare control individuals with diabetes are selected as a match many times, giving them a higher weight in the analysis, while the opposite happens when finding matches for the control units. This balances out the levels of diabetes in the treated and control units. We have somewhat simplified things here - for instance, we don’t deal with how to calculate standard errors or p-values in matched datasets. But now you have seen the basic approach in action.
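
To make this implicit weighting concrete, here is a small sketch using the po_data and matched_tc objects from above: in the treated-to-control matches, each control unit with diabetes is reused on average about (number of treated with diabetes) / (number of controls with diabetes) times.

Code
# how many treated vs control units have diabetes?
n_treated_diab <- sum(po_data$Treatment_Received == "diet"    & po_data$Diabetes == 1)
n_control_diab <- sum(po_data$Treatment_Received == "control" & po_data$Diabetes == 1)

# expected number of times each control unit with diabetes is reused
# when matching treated -> control (i.e., its implicit weight)
n_treated_diab / n_control_diab

# distribution of the actual reuse counts among matched control units with diabetes
table(table(matched_tc$control_id[matched_tc$diabetes == 1]))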

These exercises demonstrate that matching and stratification are different methods for covariate adjustment, but that both lead to valid answers. When matching on multiple variables simultaneously, we might consider other types of matching, such as nearest-neighbour matching, and use packages like MatchIt to do that for us. That’s covered in the bonus exercises for today.
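
As a small preview of those bonus exercises, here is a minimal sketch of what exact matching on Diabetes might look like with the MatchIt package (assuming it is installed; the treat indicator below is created purely for illustration):

Code
library(MatchIt)

# MatchIt expects a binary (0/1) treatment indicator
po_data$treat <- as.numeric(po_data$Treatment_Received == "diet")

# exact matching on Diabetes; with a single binary covariate this
# mirrors what we did by hand above
m_out <- matchit(treat ~ Diabetes, data = po_data, method = "exact")
summary(m_out)                      # covariate balance before and after matching
matched_data <- match.data(m_out)   # matched dataset, including matching weights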

Matching and Causal Estimands

Now you have covered the basics of matching and stratification, two adjustment methods which you can apply when attempting to estimate the Average Treatment Effect. Before wrapping up, let’s explore what happens if we change how we apply our matching procedure. This exercise will give you a preview of causal estimands other than the ATE.

Exercise 7

Explore what happens if, instead of matching treated to controls and controls to treated, you only match treated to control units. That is, we take the population of treated units, and find for each a matched control. Estimate the causal effect on this part of the matched dataset, and check what the distribution of diabetes is like in these matched pairs. What do you notice?

Code
# Treated -> Control matches
matched_tc <- fast_match(po_data, direction = "treated_to_control")

# Estimate the causal effect using only these matched pairs
ATT_matched <- mean(matched_tc$Y_treated - matched_tc$Y_control)

# explore distribution of diabetes
diabetes_matched_treated_table <- prop.table(table(matched_tc$diabetes))

The first thing we might notice is that the causal effect estimate is quite different than in the previous exercises.

Code
ATT_matched
[1] -1.807852

Since we know the true ATE, this is obviously an inaccurate estimate of the ATE. But what is happening here? Our first clue comes from checking the distribution of diabetes among the matched pairs.

Code
diabetes_matched_treated_table

        0         1 
0.2018994 0.7981006 

Notice that when we did matching in both directions, we ended up with a dataset where diabetes had a fifty/fifty distribution. Now, however, we end up with a dataset where 80 percent have diabetes and 20 percent do not. In fact, what we have done here is create a dataset where the control units have the same distribution of diabetes as the treated units.

When we match in this way, we do not get a valid estimate of the ATE, as the ATE summarizes the average treatment effect in the population as a whole. However, it turns out we can still interpret our estimate as a causal effect estimate, just for a different estimand. Specifically, we are here estimating the average causal effect, not in the population as a whole, but amongst the types of people who happened to be treated. This is known as the Average Treatment Effect amongst the Treated, abbreviated as the ATT. This is sometimes interpreted as the causal effect of being treated for those who actually got treated.
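
In potential outcomes notation, the two estimands differ only in the population over which the individual-level effects are averaged:

\[ ATE = E[Y^{diet} - Y^{control}], \qquad ATT = E[Y^{diet} - Y^{control} \mid A = diet] \]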

In real-world research scenarios, there may be reasons to target either the ATE or the ATT; these are simply different causal estimands. We did not cover these in class, but this concept will return in future sessions. One may also attempt to estimate the ATU, the average treatment effect amongst the untreated, but we will not do this in this practical session.

Summary

These exercises demonstrate the techniques of stratification and matching in a simple univariate setting. In this setting, the two approaches are essentially equivalent, and it is not obvious that each has its own advantages and disadvantages. This becomes clearer when we move to the multivariate case, that is, when we have many (potential) confounding variables we would like to control for. This is also the setting where the concept of a propensity score becomes important, as a way of summarising information from multiple covariates. It is not necessary that you have practiced with multivariate adjustment methods by the end of today, as we will get some practice with this in the two days to come. But if you have time, please check out the bonus exercises, where we work through an example of multivariate confounder adjustment.