library(here)
library(tidyverse)
library(marginaleffects)
library(AIPW)
library(tmle)
library(ranger)
library(bartMachine)
library(boot)
library(clarify)
library(cobalt)
library(estimatr)
library(WeightIt)
nhefs <- read_csv("nhefs.csv")
# remove participants missing outcome data
nhefs <-
nhefs |>
filter(!is.na(wt82_71))
# set type for factor variables
nhefs <-
nhefs |>
mutate(across(c(education,exercise,active), as.factor))
# fit outcome model
outcome_mod <-
lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2) + qsmk:smokeintensity,
data = nhefs)
This is the first post in a series looking at g-methods and their doubly robust extensions for estimating causal effects in epidemiological research. This post focuses on time-fixed treatments (or exposures); the second will cover the case of time-varying treatments.
Thank you to Noah Greifer and Cameron Patrick for some very helpful notes on a previous version.
Potential outcomes
The goal of causally motivated epidemiology research is often to estimate the average treatment effect (ATE). For a binary treatment or exposure \(A\) the ATE on the difference scale is defined as: \[ E[Y_i^{a=1} - Y_i^{a=0}]\]
Where \(Y_i^{a=1}\) denotes the value of the outcome variable when treatment is given for individual \(i\) and \(Y_i^{a=0}\) denotes the value of the outcome variable when treatment is not given for individual \(i\) (from here on, the \(i\) subscript will be implicit). As individuals in a study only receive one of the two treatments on offer (they are either treated or untreated), only one of these outcomes is actually observed for a given individual. For this reason, the outcomes \(Y^{a=1}\) and \(Y^{a=0}\) are often referred to as potential or counterfactual outcomes, as they represent the value of the outcome variable in an individual had they, potentially counter-to-fact, received the treatment level \(a\).
The ATE quantifies the average difference between the two potential outcomes. Put another way, it quantifies the average difference between the level of \(Y\) in a hypothetical world in which everyone had been treated (\(a=1\)) and the level of \(Y\) in a hypothetical world in which no one had been treated (\(a=0\)), all else being equal. It is the estimated effect of moving the entire population from untreated to treated.
Given that the ATE is defined in terms of potential, rather than observed, outcomes, it raises the question: how can we estimate the ATE with real data? To do so, a few (mostly untestable) assumptions must be made.
Identifying conditions
The conditions that allow us to make the leap from observed data to contrasts of potential outcomes are referred to as the ‘identifying conditions’. A causal effect is said to be identified if it can be computed as a function of the observed data.
The three core conditions are as follows:
Consistency
Consistency is the assumption that the potential outcome for an individual under treatment \(a\) is equal to their observed outcome under treatment \(a\). Formally, \(Y^a = Y\) for individuals with \(A = a\). In practice, what this assumption means is that the intervention under study must be sufficiently well-defined for potential outcomes to be well-defined.
This might not seem like a high bar to clear, but the implications of the consistency assumption are subtle and far-reaching. The paper by Hernán and Taubman (2008) draws out some of the issues.
(Conditional) exchangeability
Conditional exchangeability implies that, within levels of the measured confounding variables \(L\), treated and untreated groups are exchangeable. In other words, treatment is effectively randomly assigned within levels of the measured confounding variables. This is written formally as \(Y^a \perp A|L\). I.e., potential outcomes are independent of the treatment received, conditional on confounders \(L\).
Positivity
Positivity implies that, for all individuals, there is some (positive) chance of receiving each of the treatment levels. Formally, \(Pr[A=a|L=l] > 0\) for all \(l\) with \(Pr[L=l] > 0\). Positivity violations can be random or structural. Structural nonpositivity occurs when a subgroup cannot possibly be treated (or untreated). Random nonpositivity occurs when, by chance alone, some strata of \(L\) are exclusively treated or untreated.
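As a quick empirical check (no substitute for subject-matter knowledge about how treatment is assigned), we can look for confounder strata that are exclusively treated or untreated. A minimal sketch using the NHEFS data that will be introduced later in this post:
# quick positivity checks: empty cells in a cross-tabulation hint at
# strata that are never treated (or never untreated)
with(nhefs, table(qsmk, education))
# for continuous confounders, compare the treated and untreated ranges
nhefs |>
  group_by(qsmk) |>
  summarise(min_age = min(age), max_age = max(age),
            min_wt71 = min(wt71), max_wt71 = max(wt71))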
When these three essential conditions are met (or approximately met) for a study, we can estimate causal effects with the data we have at hand. One family of methods for doing so is the g-methods (the g stands for ‘generalised’), which estimate treatment effects for both time-fixed and time-varying treatments.
The g-methods comprise two main strategies: the g-formula and inverse probability of treatment weighting (IPTW). (There is a third, g-estimation, which is not particularly well supported by popular software so won’t be discussed here. I apologise for the terminology, I’m not responsible!) Each method will now be introduced and applied to an example of a time-fixed treatment.
The g-formula
Assuming exchangeability, positivity and consistency hold, the g-formula for a time fixed treatment is:
\[ E[Y^a] = \sum_l E[Y|A=a,L=l]Pr[L=l] \]
Here \(E[Y^a]\) is the counterfactual mean of \(Y\) under treatment \(a\) and \(L\) is a vector of confounding variables. The sum indicates that we are taking a weighted average of the conditional mean of \(Y\) where weights are the prevalence of each value \(l\) in the population.
The above equation only works when the variables within \(L\) are discrete; otherwise, the sum becomes an integral:
\[ E[Y^a] = \int E[Y|A=a,L=l] dF_L[L=l] \] Where \(F_L\) is the joint CDF of the random variables in \(L\). This is going to be painful to deal with, especially when multiple confounders are present.
Thankfully, we do not need to obtain \(Pr[L=l]\) or \(F_L[L=l]\). Instead, we can estimate \(E[Y|A=a,L=l]\) at the particular values of \(l\) observed in the sample and then average over them: \[ \hat{E}[Y^a] = \frac{1}{n}\sum_{i=1}^{n}\hat{E}[Y|A=a, L_i] \]
The process of averaging over the distribution of confounders in the sample is known as standardisation.
The computational means for applying the g-formula are as follows:
- Fit a model (e.g., linear or logistic regression) for the conditional mean of the outcome given treatment and confounders
- Create two copies of the dataset. In the first, set treatment to 0 (untreated) and in the second, set treatment to 1 (treated)
- Compute predicted values from the fitted model for each of the two artificial datasets
- Average each set of predicted values, giving \(E[Y^{a=1}]\) and \(E[Y^{a=0}]\)
- Contrast the predicted values. E.g., \(E[Y^{a=1}] - E[Y^{a=0}]\)
When a parametric model is used to estimate the conditional mean of the outcome given treatment and the confounders, this method is referred to as the parametric g-formula.
The g-formula ‘by hand’
The NHEFS dataset available here, used by Hernán and Robins in their book Causal Inference, will be used as an example. The goal is to estimate the effect of smoking cessation between baseline (1971) and follow-up (1982) on weight gain over the same interval. First, we load the data and fit a linear regression model (step 1). We fit the same model as Hernán and Robins, including the confounding variables sex, race, age, education, smoking intensity, smoking years, exercise, activeness, and baseline weight. Departures from linearity are allowed for the continuous variables by including quadratic terms via the poly() function.
For steps 2 and 3, we compute predictions for two artificial copies of the dataset: one in which treatment (qsmk) is set to 0 and one in which it is set to 1. Then we average each set of predictions and compute the ATE.
fitted.0 <- predict(outcome_mod,
newdata = mutate(nhefs, qsmk = 0))
fitted.1 <- predict(outcome_mod,
newdata = mutate(nhefs, qsmk = 1))
E_1 <- mean(fitted.1)
E_0 <- mean(fitted.0)
treatment_effect <- E_1 - E_0
print(paste("E[Y(1)] - E[Y(0)] = ",round(treatment_effect,2)))
[1] "E[Y(1)] - E[Y(0)] = 3.52"
The estimated effect of smoking cessation, relative to no cessation, is a weight gain of 3.52 kg.
Confidence intervals can be obtained by bootstrapping. Note that less computationally costly options are also available, as demonstrated below.
# statistic function for boot(): refit the outcome model and recompute
# the standardised effect in each bootstrap resample
parametric_g <- function(data, indices){
df <- data[indices,]
outcome_mod <-
lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2) + qsmk:smokeintensity,
data = df)
fitted.0 <- predict(outcome_mod, newdata = df |> mutate(qsmk = 0))
fitted.1 <- predict(outcome_mod, newdata = df |> mutate(qsmk = 1))
E_1 <- mean(fitted.1)
E_0 <- mean(fitted.0)
treatment_effect <- E_1 - E_0
return(treatment_effect)
}
boot_out <- boot(data = nhefs,
statistic = parametric_g,
R = 500)
boot.ci(boot_out, conf = 0.95, type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = boot_out, conf = 0.95, type = "perc")
Intervals :
Level Percentile
95% ( 2.486, 4.387 )
Calculations and Intervals on Original Scale
R packages for the parametric g-formula
marginaleffects
The R package marginaleffects computes the ATE by standardisation, with standard errors obtained by the delta method:
avg_comparisons(outcome_mod, variables = "qsmk")
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
3.52 0.44 7.99 <0.001 49.4 2.65 4.38
Term: qsmk
Type: response
Comparison: mean(1) - mean(0)
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
clarify
The package clarify computes the ATE via standardisation and obtains confidence intervals by simulation. A useful quality of clarify’s simulation-based inference is that it does not assume normality of the sampling distribution of the estimated quantity (unlike the delta method). This is especially useful for parameters with natural bounds, like probabilities.
sim_coefs <- sim(outcome_mod)
sim_est <- sim_ame(sim_coefs, var = "qsmk", contrast = "diff", verbose = FALSE)
summary(sim_est)
Estimate 2.5 % 97.5 %
E[Y(0)] 1.76 1.34 2.19
E[Y(1)] 5.27 4.60 6.03
Diff 3.52 2.71 4.38
The confidence intervals produced by all three methods are similar.
Beyond the core identification conditions described above, there is another essential assumption required for the parametric g-formula to be unbiased: the conditional mean outcome model must be correctly specified. Even if you have measured all important confounders, there may still be bias if the model is misspecified (e.g., if important interactions or non-linearities are missed). The method described in the next section relies on a different modelling assumption, namely that the model for treatment is correctly specified.
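To make this concrete, here is a toy simulation (an entirely made-up data-generating process, purely illustrative) in which the only confounder is measured but the outcome model misses a non-linearity, biasing the standardised estimate:
# toy simulation: one measured confounder L, treatment more likely at
# extreme values of L, and an outcome that depends on L quadratically
set.seed(42)
n <- 1e5
L <- rnorm(n)
A <- rbinom(n, 1, plogis(2 * L^2 - 1))
Y <- A + L^2 + rnorm(n)                  # true ATE = 1
mis_mod  <- lm(Y ~ A + L)                # misspecified: linear in L
good_mod <- lm(Y ~ A + I(L^2))           # correctly specified
# g-formula (standardisation) estimates from each model
mean(predict(mis_mod,  data.frame(A = 1, L = L)) -
     predict(mis_mod,  data.frame(A = 0, L = L)))   # biased away from 1
mean(predict(good_mod, data.frame(A = 1, L = L)) -
     predict(good_mod, data.frame(A = 0, L = L)))   # close to 1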
IPTW
The aim of inverse probability of treatment weighting (IPTW) is to create a pseudo-population in which treatment \(A\) and confounders \(L\) are statistically independent. Provided identification conditions are satisfied, \(E[Y^a]\) in the actual population is equal to \(E_{ps}[Y|A=a]\) in the pseudo-population. The pseudo-population is created by weighting each individual by the inverse of the probability of receiving the treatment that they received.
For a binary treatment, the weights can be defined in terms of the propensity score (PS), which is the conditional probability of treatment given confounders \(Pr[A=1|L=l]\). Weights \(W^A\) are equal to \(1/PS\) for those who received treatment and \(1/(1-PS)\) for the untreated. The reason the denominator is \(1-PS\) for the untreated is that \(1-PS\) is the conditional probability of being untreated, and we want to weight individuals by the inverse of the probability of receiving the treatment they actually received.
The steps for computing the ATE by IPTW are as follows:
- Fit a model to estimate the PS (often a logistic regression model for a binary treatment)
- Compute \(W^A\) as \(1/PS\) and \(1/(1-PS)\) for those who received treatment and those who were untreated, respectively
- Estimate the ATE by fitting a weighted regression model for the outcome
Provided identifying conditions are satisfied, the model fitted to the pseudo-population created in step 3 has the form: \[E[Y^a]=\beta_0 + \beta_1 a\] The parameter \(\beta_1\) estimates the ATE. This is referred to as a marginal structural model. It is marginal because we are predicting a marginal potential outcome \(E[Y^a]\) and structural as it is a model for a counterfactual, rather than fully observed, outcome.
IPT weighting ‘by hand’
The code below performs steps 1 and 2:
# fit model for treatment given confounders and compute weights
treatment_mod <-
glm(qsmk ~ sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2),
family = binomial(), data = nhefs)
ps <- predict(treatment_mod, type = "response") # propensity score
ip_weights <- ifelse(nhefs$qsmk == 1,
1 / ps,
1 / (1 - ps))
Now we can fit a weighted regression model to estimate the ATE. Because the IP weighting must be taken into account for standard errors to be correct, we use a method which provides robust ‘sandwich’ standard errors (bootstrapping is another option).
# fit weighted model (step 3)
msm <- lm_robust(wt82_71 ~ qsmk,
data = nhefs,
weights = ip_weights)
summary(msm)
Call:
lm_robust(formula = wt82_71 ~ qsmk, data = nhefs, weights = ip_weights)
Weighted, Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 1.780 0.2248 7.917 4.582e-15 1.339 2.221 1564
qsmk 3.441 0.5265 6.535 8.574e-11 2.408 4.473 1564
Multiple R-squared: 0.0435 , Adjusted R-squared: 0.04289
F-statistic: 42.71 on 1 and 1564 DF, p-value: 8.574e-11
The IPTW estimate for smoking cessation is a 3.44 kg weight gain, with a 95% CI ranging from 2.4 to 4.5, very similar to that obtained by the parametric g-formula. This is a reassuring sign, as the two methods rely on different modelling assumptions.
It is worth noting that the IPTW ATE can be computed by an estimator which is very similar (though not quite identical) to the weighted regression above. This is the Horvitz-Thompson estimator:
\[ \hat{\tau}_{ATE}^{IPTW} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{A_iY_i}{\hat{PS}_i} - \frac{(1-A_i)Y_i}{1-\hat{PS}_i}\right] \] This is noted now as it will be relevant when discussing augmented inverse probability weighting (AIPW) later on.
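For completeness, here is this estimator computed directly from the propensity scores estimated in the ‘by hand’ section above (a minimal sketch; as noted, it should land close to, but not exactly on, the weighted-regression estimate):
# Horvitz-Thompson style IPTW estimate of the ATE,
# using the propensity scores (ps) computed above
A <- nhefs$qsmk
Y <- nhefs$wt82_71
mean(A * Y / ps - (1 - A) * Y / (1 - ps))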
R packages for IPTW
A particularly useful package for estimating the ATE via IPTW is the WeightIt package.
Though several methods are available for estimating weights, including machine learning algorithms and several other balancing methods which I do not presently understand, we’ll use logistic regression for consistency with the “by hand” approach:
ip_weights <-
weightit(qsmk ~ sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2),
method = "glm", data = nhefs,
estimand = "ATE")As discussed by Austin & Stuart (2015), it is best practise to assess that the estimated weights are working as they should and that measured confounders are balanced across levels of treatment in the pseudopopulation produced by the weights. This can be easily checked using The bal.tab function of the cobalt package.
We’ll use cobalt to assess balance in terms of standardised mean differences for continuous variables and proportion differences for binary variables. We’ll also compute the Kolmogorov–Smirnov statistic, as Austin and Stuart recommend.
bal.tab(ip_weights,
stats = c("m", "ks"))Balance Measures
Type Diff.Adj KS.Adj
prop.score Distance 0.0098 0.0425
sex Binary -0.0014 0.0014
race Binary 0.0023 0.0023
poly(age, 2)1 Contin. 0.0058 0.0293
poly(age, 2)2 Contin. 0.0034 0.0340
education_1 Binary -0.0102 0.0102
education_2 Binary 0.0010 0.0010
education_3 Binary 0.0020 0.0020
education_4 Binary 0.0070 0.0070
education_5 Binary 0.0003 0.0003
poly(smokeintensity, 2)1 Contin. -0.0241 0.0310
poly(smokeintensity, 2)2 Contin. -0.0127 0.0307
poly(smokeyrs, 2)1 Contin. -0.0035 0.0428
poly(smokeyrs, 2)2 Contin. 0.0109 0.0468
exercise_0 Binary -0.0046 0.0046
exercise_1 Binary 0.0182 0.0182
exercise_2 Binary -0.0136 0.0136
active_0 Binary -0.0089 0.0089
active_1 Binary 0.0141 0.0141
active_2 Binary -0.0051 0.0051
poly(wt71, 2)1 Contin. -0.0090 0.0358
poly(wt71, 2)2 Contin. 0.0039 0.0515
Effective sample sizes
Control Treated
Unadjusted 1163. 403.
Adjusted 1128.61 325.97
All the confounders are well balanced.
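Balance can also be inspected visually with cobalt’s love.plot() function, which plots the balance statistics before and after weighting (a minimal sketch; the 0.1 threshold is just a common rule of thumb, not a requirement):
# Love plot of covariate balance before vs. after weighting
love.plot(ip_weights,
          stats = "m",        # (standardised) mean differences
          abs = TRUE,         # plot absolute differences
          thresholds = 0.1)   # rule-of-thumb balance cut-off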
Now that balance has been checked, a weighted regression model can be fitted in the same manner as before.
msm <- lm_robust(wt82_71 ~ qsmk,
data = nhefs,
weights = ip_weights$weights)
summary(msm)
Call:
lm_robust(formula = wt82_71 ~ qsmk, data = nhefs, weights = ip_weights$weights)
Weighted, Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 1.780 0.2248 7.917 4.582e-15 1.339 2.221 1564
qsmk 3.441 0.5265 6.535 8.574e-11 2.408 4.473 1564
Multiple R-squared: 0.0435 , Adjusted R-squared: 0.04289
F-statistic: 42.71 on 1 and 1564 DF, p-value: 8.574e-11
Doubly robust estimation
Key assumptions of the parametric g-formula and IPTW estimators are that the models for the outcome and treatment, respectively, are correctly specified. Doubly robust (DR) estimators, on the other hand, combine models for both the treatment and the outcome and, by doing so, give the analyst two chances to get the model right. That is, a DR estimator will be consistent (in the statistical sense of converging to the true parameter as \(n\to\infty\)) so long as at least one of the outcome and treatment models is correctly specified.
While several DR estimators have been proposed, we will focus on augmented inverse probability weighting (AIPW) and targeted maximum likelihood estimation (TMLE), prominent examples which are well supported by current software.
AIPW
The AIPW estimator is a doubly robust estimator of the ATE which, as the name suggests, “augments” the IPTW estimator with additional terms that improve efficiency and provide double robustness. Before covering AIPW in more detail, it is helpful to return to the previously mentioned IPTW estimator of the ATE: \[ \hat{\tau}_{ATE}^{IPTW} = \frac{1}{N}\sum_{i=1}^{N}\left[\frac{A_iY_i}{\hat{PS}_i} - \frac{(1-A_i)Y_i}{1-\hat{PS}_i}\right] \]
Where \(PS\) is the estimated propensity score. The AIPW estimator augments this IPTW estimator by additionally incorporating a model for the outcome in its estimation. The estimator is written as:
\[ \hat{\tau}_{ATE}^{AIPW} = \frac{1}{n}\sum_{i=1}^{n}\Bigg[\Bigg(\frac{A_iY_i}{\hat{PS_i}} - \frac{A_i - \hat{PS_i}}{\hat{PS_i}}\hat{Y_1}_i\Bigg) - \Bigg(\frac{(1-A_i)Y_i}{1-\hat{PS_i}} + \frac{A_i - \hat{PS_i}}{1-\hat{PS_i}}\hat{Y_0}_i\Bigg)\Bigg] \] \(\hat{Y_1}_i\) and \(\hat{Y_0}_i\) denote the predicted conditional mean of the outcome when treatment is given and not given, respectively (these correspond to ‘fitted.1’ and ‘fitted.0’ computed in the g-formula section).
The left-hand parts of each set of large brackets are identical to the IPTW estimator above. The parts on the right are the augmentation components, and these provide the double robustness. That is, so long as either the model for treatment or the model for the outcome is correctly specified, the estimator will be consistent (a proof of this property is given in Funk et al. (2011)).
AIPW ‘by hand’
All of the components required for AIPW, the propensity score and the conditional mean of the outcome under treatment and no treatment, were estimated in previous steps. Nonetheless, we’ll estimate them again for the sake of completeness.
First, fit the treatment and outcome models and compute \(\hat{Y_1}_i\), \(\hat{Y_0}_i\) and \(\hat{PS}_i\):
outcome_mod <-
lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2) + qsmk:smokeintensity,
data = nhefs)
fitted.0 <- predict(outcome_mod,
newdata = nhefs |> mutate(qsmk = 0))
fitted.1 <- predict(outcome_mod,
newdata = nhefs |> mutate(qsmk = 1))
treatment_mod <-
glm(qsmk ~ sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2),
family = binomial(), data = nhefs)
ps <- predict(treatment_mod, type = "response") # propensity score
Second, compute the AIPW ATE:
aipw <- function(A, Y, PS, Y_1, Y_0){
  out <- mean((((A*Y)/PS) - ((A - PS)/PS) * Y_1) -
              ((((1-A)*Y)/(1-PS)) + ((A - PS)/(1-PS)) * Y_0))
  out
}
aipw_out <- aipw(nhefs$qsmk, nhefs$wt82_71,
ps, fitted.1, fitted.0)
paste("AIPW ATE =", round(aipw_out,2))[1] "AIPW ATE = 3.43"
Confidence intervals can be obtained by bootstrapping or based on the efficient influence curve.
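As a rough sketch of the latter, one can plug the estimated components into the estimated influence curve and use its variance, treating the fitted nuisance models as fixed (the usual approximation):
# influence-curve-based standard error for the by-hand AIPW estimate
A  <- nhefs$qsmk
Y  <- nhefs$wt82_71
ic <- (A * Y / ps - (A - ps) / ps * fitted.1) -
      ((1 - A) * Y / (1 - ps) + (A - ps) / (1 - ps) * fitted.0) -
      aipw_out
se <- sqrt(var(ic) / nrow(nhefs))
aipw_out + c(-1.96, 1.96) * se   # approximate 95% CI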
AIPW using SuperLearner
So far, we have estimated causal effects using standard parametric models: linear and logistic regression (this was by no means a requirement; the g-formula and IPTW are compatible with other choices, but these were chosen for ease of demonstration). A problem with these simple parametric models is that they impose restrictions on how covariates may relate to the treatment or outcome. For instance, they typically assume few or no interactions and that continuous variables are linearly related to the outcome/treatment. Consequently, some amount of model misspecification is very likely in real settings. By contrast, highly flexible machine learning models impose few such restrictions and may be less prone to misspecification-related bias.
To venture into the world of machine learning, we will apply the AIPW estimator using the R package AIPW. AIPW uses the SuperLearner algorithm to estimate \(\hat{Y_1}_i\), \(\hat{Y_0}_i\), and \(\hat{PS_i}\) in the above expression. SuperLearner is a stacking algorithm which takes several user-specified machine learning algorithms, estimates their performance using cross-validation, and then computes an optimally weighted average of those models’ predictions (an “ensemble”).
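To give a flavour of what SuperLearner itself does, a stand-alone fit for the propensity score might look something like the following (a minimal sketch with an arbitrary pair of learners; the AIPW package handles these calls internally):
library(SuperLearner)
# confounders as a data frame of predictors
confs <- select(nhefs, sex, race, age, education,
                smokeintensity, smokeyrs, exercise, active, wt71)
# ensemble model for Pr(qsmk = 1 | L)
sl_ps <- SuperLearner(Y = nhefs$qsmk,
                      X = confs,
                      family = binomial(),
                      SL.library = c("SL.glm", "SL.ranger"))
sl_ps$coef                  # cross-validated weight given to each learner
ps_sl <- sl_ps$SL.predict   # ensemble propensity scores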
To estimate the effect of smoking cessation, we will use AIPW with generalised linear models, penalised maximum likelihood (glmnet), random forests, and Bayesian additive regression trees (BART) included in the SuperLearner algorithm. Standard errors are computed based on the efficient influence curve.
#| cache: true
confounds <-
select(nhefs, sex, race, age, education,
smokeintensity, smokeyrs, exercise, active, wt71)
SL.libs <- c("SL.glm","SL.glmnet",
"SL.ranger","SL.bartMachine")
aipw_sl <- AIPW$new(Y=nhefs$wt82_71,
A=nhefs$qsmk,
W.Q=confounds,
W.g=confounds,
Q.SL.library=SL.libs,
g.SL.library=SL.libs,
k_split=5,
verbose=FALSE)
aipw_sl$fit()
aipw_sl$summary()
aipw_sl$result
Estimate SE 95% LCL 95% UCL N
Risk of exposure 4.907530 0.6708074 3.592747 6.222312 403
Risk of control 1.751597 0.2178300 1.324650 2.178543 1163
Risk Difference 3.155933 0.6989296 1.786031 4.525835 1566
The results are very similar to those obtained by the previous methods.
TMLE
TMLE is a doubly robust procedure for the estimation of causal effects which is similar to AIPW. Indeed, the two are asymptotically equivalent. In practical settings, however, TMLE can sometimes outperform AIPW as it is less sensitive to the extreme weights which can occur under random nonpositivity.
Applying the TMLE estimator proceeds like so:
- Fit a model for the outcome, including treatment and confounders, to obtain \(\hat{Y_1}_i\) and \(\hat{Y_0}_i\): the predicted values of the outcome under treatment given and treatment not given, respectively
- Fit a model for treatment and estimate the propensity score \(\hat{PS}_i\)
- Compute the ‘clever covariates’ \(H1_i\) and \(H0_i\) which are equal to \(A_i/\hat{PS}_i\) and \((1-A_i)/(1-\hat{PS}_i)\), respectively
- Fit a new model for the residuals of \(Y\) from the outcome model including clever covariates \(H1\) and \(H0\)
- ‘Update’ the estimates \(\hat{Y_1}_i\) and \(\hat{Y_0}_i\) using an expression which incorporates the estimated coefficients of the clever covariates \(H1\) and \(H0\). That is, \(\hat{Y_1}^{new}_i = \hat{Y_1}^{old}_i + \frac{\hat{\epsilon}_1}{\hat{PS_i}}\) and \(\hat{Y_0}^{new}_i = \hat{Y_0}^{old}_i + \frac{\hat{\epsilon}_2}{1-\hat{PS_i}}\), where \(\hat{\epsilon}_1\) and \(\hat{\epsilon}_2\) are the coefficients on \(H1\) and \(H0\), respectively
- Compute the ATE based on the updated estimates \(\hat{Y_1}^{new}_i\), and \(\hat{Y_0}^{new}_i\)
The reason that \(H1_i\) and \(H0_i\) are referred to as ‘clever covariates’ is that they provide double robustness and allow for inference based on the efficient influence curve.
TMLE ‘by hand’
First, we fit a model for the outcome and compute the desired predictions:
outcome_mod <-
lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2) + qsmk:smokeintensity,
data = nhefs)
fitted.0 <- predict(outcome_mod,
newdata = mutate(nhefs, qsmk = 0))
fitted.1 <- predict(outcome_mod,
newdata = mutate(nhefs, qsmk = 1))
Then we fit a model for treatment, obtain the propensity score, and compute the clever covariates:
treatment_mod <-
glm(qsmk ~ sex + race + poly(age,2) + education +
poly(smokeintensity,2) + poly(smokeyrs,2) +
exercise + active + poly(wt71,2),
family = binomial(), data = nhefs)
ps <- predict(treatment_mod, type = "response") # propensity score
H0 <- (1-nhefs$qsmk)/(1-ps)
H1 <- nhefs$qsmk / ps
Then, fit a model for the residuals of Y, including the clever covariates:
resid <- residuals(outcome_mod)
resid_mod <- lm(resid ~ H0 + H1)
Finally, estimate \(\hat{Y_1}^{new}\) and \(\hat{Y_0}^{new}\) and compute the ATE:
fitted.1.new <- fitted.1 + (coef(resid_mod)["H1"]/ps)
fitted.0.new <- fitted.0 + (coef(resid_mod)["H0"]/(1-ps))
ATE_out <- mean(fitted.1.new - fitted.0.new)
paste("TMLE ATE =", round(ATE_out, 2))[1] "TMLE ATE = 3.46"
As with AIPW, confidence intervals can be obtained from the efficient influence curve or by bootstrapping. See Luque-Fernandez et al. (2018) for a demonstration of the former.
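Using the by-hand quantities above, a rough sketch of the influence-curve approach (again treating the fitted nuisance models as fixed) looks like this:
# efficient influence curve for the by-hand TMLE estimate
A  <- nhefs$qsmk
Y  <- nhefs$wt82_71
ic <- A / ps * (Y - fitted.1.new) -
      (1 - A) / (1 - ps) * (Y - fitted.0.new) +
      (fitted.1.new - fitted.0.new) - ATE_out
se <- sqrt(var(ic) / nrow(nhefs))
ATE_out + c(-1.96, 1.96) * se   # approximate 95% CI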
TMLE using SuperLearner
For a final demonstration, we will estimate the TMLE ATE using the R package tmle. Again, SuperLearner is used as the prediction algorithm, with the same set of libraries as for AIPW.
#| cache: true
SL.libs <- c("SL.glm","SL.glmnet",
"SL.ranger","SL.bartMachine")
tmle.out <- tmle(Y = nhefs$wt82_71,
A = nhefs$qsmk,
W = confounds,
family = "gaussian",
V.Q = 5,
V.g = 5,
Q.SL.library = SL.libs,
g.SL.library = SL.libs)
tmle.out
Marginal mean under treatment (EY1)
Parameter Estimate: 5.1703
Estimated Variance: 0.17413
p-value: <2e-16
95% Conf Interval: (4.3524, 5.9882)
Marginal mean under comparator (EY0)
Parameter Estimate: 1.7488
Estimated Variance: 0.046878
p-value: 6.6289e-16
95% Conf Interval: (1.3244, 2.1732)
Additive Effect
Parameter Estimate: 3.4215
Estimated Variance: 0.21337
p-value: 1.2912e-13
95% Conf Interval: (2.5161, 4.3268)
Additive Effect among the Treated
Parameter Estimate: 3.4633
Estimated Variance: 0.21727
p-value: 1.0855e-13
95% Conf Interval: (2.5497, 4.3769)
Additive Effect among the Controls
Parameter Estimate: 3.3705
Estimated Variance: 0.21767
p-value: 5.0397e-13
95% Conf Interval: (2.4561, 4.2849)
Again, results are very similar.
Conclusion
In summary, here are what I think are a few key takeaways:
- Even when conditions for causal identification are met, unbiased estimation of causal effects requires correct model specification
- The parametric g-formula and IPTW can be straightforwardly applied using tools most researchers are already familiar with
- Doubly robust estimators are straightforward extensions of the g-formula and IPTW and, as they readily allow for the incorporation of machine learning libraries, should generally be preferred over singly robust alternatives
In part 2, I will cover the case of time-varying treatments and demonstrate how g-methods and their doubly robust extensions become essential in that setting.
References
Hernán, M. A., & Taubman, S. L. (2008). Does obesity shorten life? The importance of well-defined interventions to answer causal questions. International journal of obesity, 32(3), S8-S14.
Hernán, M. A., & Robins, J. M. (2010). Causal inference.
Arel-Bundock V (2023). marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. R package version 0.9.0, https://vincentarelbundock.github.io/marginaleffects/.
Greifer N, Worthington S, Iacus S, King G (2023). clarify: Simulation-Based Inference for Regression Models. https://github.com/iqss/clarify, https://iqss.github.io/clarify/.
Austin, P. C., & Stuart, E. A. (2015). Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in medicine, 34(28), 3661-3679.
Greifer N (2023). WeightIt: Weighting for Covariate Balance in Observational Studies. https://ngreifer.github.io/WeightIt/, https://github.com/ngreifer/WeightIt.
Greifer N (2023). cobalt: Covariate Balance Tables and Plots. https://ngreifer.github.io/cobalt/, https://github.com/ngreifer/cobalt.
Funk, M. J., et al. (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7), 761-767. https://doi.org/10.1093/aje/kwq439
Zhong, Y., et al. (2021). AIPW: An R package for augmented inverse probability-weighted estimation of average causal effects. American Journal of Epidemiology, 190(12), 2690-2699. https://doi.org/10.1093/aje/kwab207
Van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).
Luque‐Fernandez, M. A., Schomaker, M., Rachet, B., & Schnitzer, M. E. (2018). Targeted maximum likelihood estimation for a binary treatment: A tutorial. Statistics in medicine, 37(16), 2530-2546.