Part 2: Estimating Causal Effects
In this practical, we’ll get to grips with the basics of estimating causal effects using stratification and matching. We will focus on the example datasets from the first practical, where we deal only with the case of a single confounding variable. Today’s bonus exercises work through a larger multivariate example. Please take the practicals at your own pace: if you feel you already know what stratification and matching are, feel free to skip to the bonus exercise.
The answers to each exercise are available as a collapsed code block. Try to work out the answer yourself before looking at this code block!
Let’s get some practice with estimating causal effects using the same practice datasets from earlier, based on a hypothetical study of different weight-loss interventions. Load potential_outcomes_d1.rds (which you downloaded earlier from here) into your R environment.
Recall that we discussed running a study in which participants can choose whether to engage in the dieting intervention or the control condition (i.e. no intervention). We are able to enroll our entire population and observe their weight change 3 months after they choose their intervention. For this exercise we need the information in the following columns:
Treatment_Received: records whether participants engaged in diet or control
Y_observed: records the weight-change value we observe after three months of following the chosen intervention plan
Diabetes: records whether the participants have a prior diagnosis of diabetes or not
po_data <- readRDS("potential_outcomes_d1.rds")
head(po_data[, c("Treatment_Received", "Y_observed", "Diabetes")])
Recall that in the earlier exercise we established that, while exchangeability was violated, the researchers suspected that this was due to the variable Diabetes: having diabetes may make participants more or less likely to choose the diet intervention, and it likely also has an effect on the effectiveness of any weight loss intervention. In other words, the research team may be happy to assume that the treatment groups are exchangeable conditional on diabetes status.
Using the data, we can perform a quick descriptive analysis to determine whether the distribution of Diabetes does in fact differ between treated and control units. If the researchers are correct and those with diabetes are less likely to choose one of the interventions than the other, we should be able to see this in the data. This is also known as checking covariate balance between groups.
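A balance check of this kind can be sketched with a simple cross-tabulation. The example below uses a small toy data frame with hypothetical values (not the practical's dataset, whose coding of Diabetes may differ); with the real data you would use po_data in place of toy.

```r
# Toy data (hypothetical values, NOT the practical's dataset)
toy <- data.frame(
  Treatment_Received = rep(c("diet", "control"), each = 8),
  Diabetes = c(rep("yes", 2), rep("no", 6),   # diet group: 2 of 8 have diabetes
               rep("yes", 6), rep("no", 2))   # control group: 6 of 8 have diabetes
)

# Cross-tabulate treatment group by diabetes status
counts <- table(toy$Treatment_Received, toy$Diabetes)
counts

# Row proportions: the share of each treatment group with and without diabetes
balance <- prop.table(counts, margin = 1)
balance
```

If the rows of the proportion table differ clearly, as they do in this toy example, the covariate is imbalanced between the treatment groups.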
When we have conditional exchangeability given a covariate (and when the other identifiability assumptions hold), then we can obtain an unbiased estimate of the causal effect in a number of different ways.
One way to estimate the average causal effect in this situation is to compare groups who are equal in value on that covariate. This is known as stratification: We split the population up into strata or subpopulations (that is, we condition on the covariate taking on a certain value), estimate group differences within those strata, and then combine those estimates again to get an estimate of the (average) causal effect in the population as a whole.
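Written out, the split, compare and recombine steps amount to a weighted average of stratum-specific differences. The notation here is our own shorthand rather than the practical's: \bar{Y}_{a,d} stands for the mean of Y_observed among units with treatment a and diabetes status d, and \hat{P}(\text{Diabetes} = d) for the observed proportion of the population with diabetes status d.

```latex
\widehat{\mathrm{ATE}}
  = \sum_{d} \left( \bar{Y}_{\text{diet},\,d} - \bar{Y}_{\text{control},\,d} \right)
    \hat{P}(\text{Diabetes} = d)
```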
Let’s see this in action.
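As a minimal sketch of the procedure, assuming a toy data frame with hypothetical values rather than the practical's data (the collapsed solutions may take a different route), stratification on a single binary covariate could look like this:

```r
# Toy data (hypothetical values, NOT the practical's dataset)
toy <- data.frame(
  Treatment_Received = rep(c("diet", "control"), each = 8),
  Diabetes = c(rep("yes", 2), rep("no", 6), rep("yes", 6), rep("no", 2)),
  Y_observed = c(rep(-2, 2), rep(-6, 6), rep(0, 6), rep(-1, 2))
)

strata <- unique(toy$Diabetes)

# Step 1: difference in mean outcome (diet minus control) within each stratum
stratum_effect <- sapply(strata, function(d) {
  in_stratum <- toy[toy$Diabetes == d, ]
  mean(in_stratum$Y_observed[in_stratum$Treatment_Received == "diet"]) -
    mean(in_stratum$Y_observed[in_stratum$Treatment_Received == "control"])
})

# Step 2: weight of each stratum = its share of the whole population
stratum_weight <- sapply(strata, function(d) mean(toy$Diabetes == d))

# Step 3: combine the stratum estimates into one population-level estimate
ate_hat <- sum(stratum_effect * stratum_weight)

# Unadjusted comparison of group means, for contrast
naive <- mean(toy$Y_observed[toy$Treatment_Received == "diet"]) -
  mean(toy$Y_observed[toy$Treatment_Received == "control"])

stratum_effect
ate_hat
naive
```

In this toy example the unadjusted difference and the stratified estimate disagree, because diabetes is unevenly distributed across the two treatment groups.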
Let’s return to the example of the second research team from earlier, using the dataset potential_outcomes_d2.rds from here. Recall that this relates to a study team that wants to answer the question: What is the effect of engaging in weight-loss behaviours vs not engaging in any weight-loss behaviours, on body weight, over a three-month interval? In the Treatment_Received column, individuals are tagged as being in the treatment group if they engaged in any weight-loss behaviour or intervention, and control if they did not.
Matching as a method of covariate adjustment is conceptually similar to stratification. The basic approach is to search for people in both treatment groups who have the same score on the covariate or covariates you wish to adjust for (i.e., who have the same profile of covariate values). You then create a new dataset consisting of these matched pairs. With exact matching, as used here, the matched treated and control groups are guaranteed to be balanced on their covariate values: every unit with diabetes is matched to (i.e. part of a pair with) a unit from the other treatment group that also has diabetes, and likewise for those without diabetes. Note that the same units may be matched multiple times; this allows us to account for the fact that, say, control units with diabetes are rarer than treated units with diabetes.
There are many ways to perform matching, and many options which you can choose. In our case, we are performing exact matching on a single variable, so we supply you here with a simple function which will do the matching for you. There’s no need to understand the function in any detail. The key detail is that you must specify a direction for the matching, that is, whether you wish to find a control unit to match to every treated unit, or a treated unit to match to every control unit.
# this function works by either finding a control unit for every treated unit,
# or finding a treated unit for every control unit
# the mechanism of matching is to loop over values of diabetes
# if finding a control unit for every treated, we then sample with replacement
# rows of the control dataset who have the target value of diabetes
# combining the treated unit data with the sampled control unit data gives us matched pairs
fast_match <- function(mydata, direction = c("treated_to_control", "control_to_treated")) {
  direction <- match.arg(direction)

  treated <- mydata[mydata$Treatment_Received == "diet", ]
  control <- mydata[mydata$Treatment_Received == "control", ]

  matched_list <- list()

  for (d in unique(mydata$Diabetes)) {
    if (direction == "treated_to_control") {
      from_group <- treated[treated$Diabetes == d, ]
      to_group <- control[control$Diabetes == d, ]
    } else if (direction == "control_to_treated") {
      from_group <- control[control$Diabetes == d, ]
      to_group <- treated[treated$Diabetes == d, ]
    }

    if (nrow(from_group) == 0 || nrow(to_group) == 0) next

    sampled_indices <- sample(1:nrow(to_group), size = nrow(from_group), replace = TRUE)
    matched_to <- to_group[sampled_indices, ]

    if (direction == "treated_to_control") {
      matched <- data.frame(
        treated_id = from_group$id,
        control_id = matched_to$id,
        diabetes = d,
        Y_treated = from_group$Y_observed,
        Y_control = matched_to$Y_observed
      )
    } else {
      matched <- data.frame(
        treated_id = matched_to$id,
        control_id = from_group$id,
        diabetes = d,
        Y_treated = matched_to$Y_observed,
        Y_control = from_group$Y_observed
      )
    }

    matched_list[[as.character(d)]] <- matched
  }

  do.call(rbind, matched_list)
}
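Once you have a matched dataset, one natural estimate of the effect is the average within-pair difference in outcomes. The sketch below uses a hand-made data frame with the same column names that fast_match() returns; the values are hypothetical, and the practical's own solution may proceed differently.

```r
# Hypothetical matched pairs, laid out like the output of fast_match()
matched <- data.frame(
  treated_id = c(1, 2, 3, 4),
  control_id = c(9, 9, 15, 16),   # control 9 appears twice: sampling is with replacement
  diabetes   = c("yes", "yes", "no", "no"),
  Y_treated  = c(-2, -2, -6, -6),
  Y_control  = c(0, 0, -1, -1)
)

# Difference in outcome within each matched pair
pair_diff <- matched$Y_treated - matched$Y_control

# Average over all pairs
effect_hat <- mean(pair_diff)
effect_hat

# Each pair shares a single diabetes value, so the matched treated and
# control groups have the same diabetes distribution by construction
table(matched$diabetes)
```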
These exercises demonstrate that matching and stratification are different methods for covariate adjustment, but that both lead to valid answers. When matching on multiple variables simultaneously, we might consider other types of matching, such as nearest-neighbour matching, and use packages like MatchIt to do that for us. That’s covered in the bonus exercises for today.
You have now covered the basics of matching and stratification as two adjustment methods that you can apply when estimating the Average Treatment Effect (ATE). Before wrapping up, let’s explore what happens if we change how we apply our matching procedure. This exercise will give you a preview of causal estimands other than the ATE.
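To preview why the direction of matching matters, here is a minimal sketch on toy data with hypothetical values, chosen so that the effect of diet differs by diabetes status. The helper partner_outcome() is illustrative only and is not part of the practical. Finding a control for every treated unit averages the pair differences over the treated group, which targets the average treatment effect in the treated (ATT); finding a treated unit for every control averages over the control group, which targets the average treatment effect in the controls (ATC).

```r
set.seed(1)  # partners are drawn at random, so fix the seed for reproducibility

# Toy data (hypothetical values, NOT the practical's dataset)
toy <- data.frame(
  id = 1:16,
  Treatment_Received = rep(c("diet", "control"), each = 8),
  Diabetes = c(rep("yes", 2), rep("no", 6), rep("yes", 6), rep("no", 2)),
  Y_observed = c(rep(-2, 2), rep(-6, 6), rep(0, 6), rep(-1, 2))
)
treated <- toy[toy$Treatment_Received == "diet", ]
control <- toy[toy$Treatment_Received == "control", ]

# Illustrative helper: for every row of `from`, draw (with replacement) one row
# of `to` with the same Diabetes value and return that partner's outcome.
# Assumes every Diabetes value in `from` also occurs in `to`.
partner_outcome <- function(from, to) {
  sapply(seq_len(nrow(from)), function(i) {
    candidates <- which(to$Diabetes == from$Diabetes[i])
    pick <- candidates[sample.int(length(candidates), 1)]
    to$Y_observed[pick]
  })
}

# A control for every treated unit: average over the treated group (ATT)
att_hat <- mean(treated$Y_observed - partner_outcome(treated, control))

# A treated unit for every control unit: average over the control group (ATC)
atc_hat <- mean(partner_outcome(control, treated) - control$Y_observed)

att_hat
atc_hat
```

The two estimates differ here because the stratum-specific effects differ and the treated and control groups contain different proportions of people with diabetes. If the effect were the same in both strata, the direction of matching would not change the answer.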
These exercises demonstrate the techniques of stratification and matching in a simple univariate setting. In this setting, both approaches are equivalent, and it is not obvious that each has its own advantages and disadvantages. This becomes clearer when we move to the multivariate case; that is, when we have many (potential) confounding variables we would like to control for. This is also the setting where the concept of a propensity score becomes important, as a way of summarising information from multiple covariates. It is not necessary to have practised multivariate adjustment methods by the end of today, as we will get some practice with this in the two days to come. But if you have time, please check out the bonus exercises, where we work through an example of multivariate confounder adjustment.