library(here)
library(tidyverse)
library(marginaleffects)
library(AIPW)
library(tmle)
library(ranger)
library(bartMachine)
library(boot)
library(clarify)
library(cobalt)
library(estimatr)
library(WeightIt)
nhefs <- read_csv("nhefs.csv")
# remove participants missing outcome data
nhefs <-
  nhefs |>
  filter(!is.na(wt82_71))
# set type for factor variables
nhefs <-
  nhefs |>
  mutate(across(c(education, exercise, active), as.factor))
# fit outcome model
outcome_mod <-
  lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
       poly(smokeintensity,2) + poly(smokeyrs,2) +
       exercise + active + poly(wt71,2) + qsmk:smokeintensity,
     data = nhefs)
This is the first post in a series looking at g-methods and their doubly robust extensions for estimating causal effects in epidemiological research. This one focuses on time-fixed treatments (or exposures); the second will cover the case of time-varying treatments.
Thank you to Noah Greifer and Cameron Patrick for some very helpful notes on a previous version.
Potential outcomes
The goal of causally motivated epidemiology research is often to estimate the average treatment effect (ATE). For a binary treatment or exposure A, the ATE is defined as

ATE = E[Y(1)] - E[Y(0)]

Where Y(1) and Y(0) are the potential outcomes: the values of the outcome an individual would experience if they received the treatment and if they did not, respectively.
The ATE quantifies the average difference between the two potential outcomes. Put another way, it quantifies the average difference in the level of the outcome we would observe if everyone in the population were treated versus if no one were treated.
Given that the ATE is defined in terms of potential, rather than observed, outcomes, it raises the question: how can we estimate the ATE with real data? To do so, a few (mostly untestable) assumptions must be made.
Identifying conditions
The conditions that allow us to make the leap from observed data to contrasts of potential outcomes are referred to as the ‘identifying conditions’. A causal effect is said to be identified if it can be computed as a function of the observed data.
The three core conditions are as follows:
Consistency
Consistency is the assumption that the potential outcome for an individual under treatment a is equal to the outcome we actually observe for that individual when they receive treatment a. Formally, if A = a then Y(a) = Y.
This might not seem like a high bar to clear, but the implications of the consistency assumption are subtle and far-reaching. This paper by Miguel Hernán draws out some of the issues.
(Conditional) exchangeability
Conditional exchangeability implies that, within levels of the measured confounding variables L, the treatment actually received is independent of the potential outcomes. Formally, Y(a) ⊥ A | L for all levels a of treatment.
Positivity
Positivity implies that for all individuals, there is some (positive) chance of receiving each of the treatment levels. Formally, Pr(A = a | L = l) > 0 for every treatment level a and every combination of confounder values l that occurs in the population.
When these three essential conditions are met (or approximately met) for a study, we may then estimate causal effects with the data we have at hand. One such set of methods are the g-methods (the g stands for ‘generalised’), which can estimate treatment effects for both time-fixed and time-varying treatments.
The g-methods consist of two main strategies for estimating treatment effects: the g-formula and inverse probability of treatment weighting (IPTW). (There is a third method, g-estimation, which is not particularly well supported by popular software, so it won’t be discussed here. I apologise for the terminology; I’m not responsible!). Each method will now be introduced and applied with an example of a time-fixed treatment.
The g-formula
Assuming exchangeability, positivity and consistency hold, the g-formula for a time-fixed treatment is:

E[Y(a)] = Σ_l E[Y | A = a, L = l] Pr(L = l)

Here, E[Y | A = a, L = l] is the mean outcome among individuals who received treatment level a and have confounder values l, and Pr(L = l) is the probability of those confounder values in the population.
The above equation only works as written when the variables within L are discrete and every stratum of L contains enough individuals to estimate the conditional mean. With many confounders, some of them continuous, estimating E[Y | A = a, L = l] nonparametrically within every stratum quickly becomes infeasible.
Thankfully, we do not need to obtain E[Y | A = a, L = l] separately for each stratum. Instead, we can fit a regression model for the outcome given treatment and confounders and average its predictions over the distribution of the confounders observed in the sample.
The process of averaging over the distribution of confounders in the sample is known as standardisation.
The computational means for applying the g-formula are as follows:
- Fit a model (e.g., linear or logistic regression) for the conditional mean of the outcome given treatment and confounders
- Create two copies of the dataset. In the first, set treatment to 0 (untreated) and in the second, set treatment to 1 (treated)
- Compute predicted values from the fitted model for each of the two artificial datasets
- Average each set of predicted values, giving estimates of E[Y(0)] and E[Y(1)]
- Contrast the predicted values. E.g., estimate the ATE as the difference E[Y(1)] - E[Y(0)]
When a parametric model is used to estimate the conditional mean of the outcome given treatment and the confounders, this method is referred to as the parametric g-formula.
The g-formula ‘by hand’
The NHEFS dataset available here, used by Hernán and Robins in their book Causal Inference, will be used as an example. The goal is to estimate the effect of smoking cessation between baseline (1971) and follow-up (1982) on weight gain during the same interval. First, we’ll load the data and fit a linear regression model (step 1). We fit the same model as Hernán and Robins, including the confounding variables sex, race, age, education, smoking intensity, smoking years, exercise, activeness, and baseline weight. Departures from linearity are allowed for the continuous variables by including quadratic terms with the poly() function.
For steps 2 and 3, we compute predictions for two artificial copies of the dataset, one in which treatment (qsmk) is set to 0 and one in which treatment is set to 1, respectively. Then, we take the average of each set of predictions and compute the ATE (steps 4 and 5).
fitted.0 <- predict(outcome_mod,
                    newdata = mutate(nhefs, qsmk = 0))

fitted.1 <- predict(outcome_mod,
                    newdata = mutate(nhefs, qsmk = 1))
E_1 <- mean(fitted.1)
E_0 <- mean(fitted.0)

treatment_effect <- E_1 - E_0

print(paste("E[Y(1)] - E[Y(0)] = ",round(treatment_effect,2)))
[1] "E[Y(1)] - E[Y(0)] = 3.52"
The estimated effect of smoking cessation, relative to no cessation, is a weight gain of 3.52 kg.
Confidence intervals can be obtained by bootstrapping. Note that less computationally costly options are also available, as demonstrated below.
parametric_g <- function(data, indices){

  df <- data[indices, ] # resampled data passed in by boot()

  outcome_mod <-
    lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
         poly(smokeintensity,2) + poly(smokeyrs,2) +
         exercise + active + poly(wt71,2) + qsmk:smokeintensity,
       data = df)

  fitted.0 <- predict(outcome_mod, newdata = df |> mutate(qsmk = 0))

  fitted.1 <- predict(outcome_mod, newdata = df |> mutate(qsmk = 1))

  E_1 <- mean(fitted.1)
  E_0 <- mean(fitted.0)

  treatment_effect <- E_1 - E_0

  return(treatment_effect)
}
boot_out <- boot(data = nhefs,
                 statistic = parametric_g,
                 R = 500)

boot.ci(boot_out, conf = 0.95, type = "perc")
BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 500 bootstrap replicates
CALL :
boot.ci(boot.out = boot_out, conf = 0.95, type = "perc")
Intervals :
Level Percentile
95% ( 2.486, 4.387 )
Calculations and Intervals on Original Scale
R packages for the parametric g-formula
marginaleffects
The R package marginaleffects computes the ATE by standardisation, with standard errors obtained by the delta method:
avg_comparisons(outcome_mod, variables = "qsmk")
Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
3.52 0.44 7.99 <0.001 49.4 2.65 4.38
Term: qsmk
Type: response
Comparison: mean(1) - mean(0)
Columns: term, contrast, estimate, std.error, statistic, p.value, s.value, conf.low, conf.high, predicted_lo, predicted_hi, predicted
clarify
The package clarify computes the ATE via standardisation and obtains confidence intervals by simulation. A useful quality of clarify’s simulation based inference is that it does not assume normality of the sampling distribution of the estimated quantity (unlike the delta method). This is useful for parameters with natural bounds like probabilities.
sim_coefs <- sim(outcome_mod)

sim_est <- sim_ame(sim_coefs, var = "qsmk", contrast = "diff", verbose = FALSE)

summary(sim_est)
Estimate 2.5 % 97.5 %
E[Y(0)] 1.76 1.34 2.19
E[Y(1)] 5.27 4.60 6.03
Diff 3.52 2.71 4.38
The confidence intervals produced by all three methods are similar.
Beyond the core identification conditions described above, there is another essential assumption for the parametric g-formula to be unbiased: that the conditional mean outcome model is correctly specified. Even if you have measured all important confounders, there may still be bias if you have misspecified the model (e.g., missed important interactions or non-linearities). The method described in the next section relies on a different modelling assumption, namely that the model for treatment is correctly specified.
IPTW
The aim of inverse probability of treatment weighting (IPTW) is to create a pseudo-population in which treatment A is independent of the measured confounders L.
For a binary treatment, the weights can be defined in terms of the propensity score (PS), which is the conditional probability of treatment given confounders, PS = Pr(A = 1 | L). Treated individuals receive a weight of 1/PS and untreated individuals receive a weight of 1/(1 - PS).
The steps for computing the ATE by IPTW are as follows:
- Fit a model to estimate the PS (often a logistic regression model for a binary treatment)
- Compute the weights as 1/PS for those who received treatment and 1/(1 - PS) for those who were untreated
- Estimate the ATE by fitting a weighted regression model for the outcome
Provided the identifying conditions are satisfied, the model fitted to the pseudo-population created in step 3 (a marginal structural model, hence the object name msm in the code below) has the form:

E[Y(a)] = β0 + β1*a

where the coefficient β1 corresponds to the ATE.
IPT weighting ‘by hand’
The code below performs steps 1 and 2:
# fit model for treatment given confounders and compute weights
treatment_mod <-
  glm(qsmk ~ sex + race + poly(age,2) + education +
        poly(smokeintensity,2) + poly(smokeyrs,2) +
        exercise + active + poly(wt71,2),
      family = binomial(), data = nhefs)

ps <- predict(treatment_mod, type = "response") # propensity score

ip_weights <- ifelse(nhefs$qsmk == 1,
                     1 / ps,
                     1 / (1 - ps))
Now we can fit a weighted regression model to estimate the ATE. Because the IP weighting must be taken into account for standard errors to be correct, we use a method which provides robust ‘sandwich’ type standard errors (bootstrapping is another option).
# fit weighted model (step 3)
msm <- lm_robust(wt82_71 ~ qsmk,
                 data = nhefs,
                 weights = ip_weights)

summary(msm)
Call:
lm_robust(formula = wt82_71 ~ qsmk, data = nhefs, weights = ip_weights)
Weighted, Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 1.780 0.2248 7.917 4.582e-15 1.339 2.221 1564
qsmk 3.441 0.5265 6.535 8.574e-11 2.408 4.473 1564
Multiple R-squared: 0.0435 , Adjusted R-squared: 0.04289
F-statistic: 42.71 on 1 and 1564 DF, p-value: 8.574e-11
The IPTW estimate for smoking cessation is a 3.44 kg weight gain, with a 95% CI ranging from 2.4 to 4.5. This is very similar to the estimate obtained by the parametric g-formula, which is a reassuring sign given that the two methods rely on different modelling assumptions.
It is worth noting that the IPTW ATE can also be computed with an estimator which is very similar (though not quite identical) to the weighted regression above. This is the Horvitz-Thompson estimator:

E[Y(1)] - E[Y(0)] = (1/n) Σ_i A_i*Y_i / PS_i - (1/n) Σ_i (1 - A_i)*Y_i / (1 - PS_i)

(The weighted regression instead normalises by the sum of the weights within each treatment group rather than by n, which is why the two are not quite identical.)
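For completeness, here is a minimal sketch of computing this estimator directly from the propensity scores estimated above (it reuses the ps object from the ‘by hand’ weights; no new quantities are assumed):

E1_ht <- mean((nhefs$qsmk * nhefs$wt82_71) / ps)
E0_ht <- mean(((1 - nhefs$qsmk) * nhefs$wt82_71) / (1 - ps))
E1_ht - E0_ht # Horvitz-Thompson IPTW estimate of the ATE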
R packages for IPTW
A particularly useful package for estimating the ATE via IPTW is the WeightIt package.
Though several methods are available for estimating weights, including machine learning algorithms and several other balancing methods which I do not presently understand, we’ll use logistic regression for consistency with the “by hand” approach:
ip_weights <-
  weightit(qsmk ~ sex + race + poly(age,2) + education +
             poly(smokeintensity,2) + poly(smokeyrs,2) +
             exercise + active + poly(wt71,2),
           method = "glm", data = nhefs,
           estimand = "ATE")
As discussed by Austin & Stuart (2015), it is best practice to check that the estimated weights are working as they should and that measured confounders are balanced across levels of treatment in the pseudo-population produced by the weights. This can be easily checked using the bal.tab function of the cobalt package.
We’ll use cobalt to assess balance in terms of standardised mean differences for continuous variables and proportion differences for binary variables. We’ll also compute the Kolmogorov–Smirnov statistic, as Austin and Stuart recommend.
bal.tab(ip_weights,
stats = c("m", "ks"))
Balance Measures
Type Diff.Adj KS.Adj
prop.score Distance 0.0098 0.0425
sex Binary -0.0014 0.0014
race Binary 0.0023 0.0023
poly(age, 2)1 Contin. 0.0058 0.0293
poly(age, 2)2 Contin. 0.0034 0.0340
education_1 Binary -0.0102 0.0102
education_2 Binary 0.0010 0.0010
education_3 Binary 0.0020 0.0020
education_4 Binary 0.0070 0.0070
education_5 Binary 0.0003 0.0003
poly(smokeintensity, 2)1 Contin. -0.0241 0.0310
poly(smokeintensity, 2)2 Contin. -0.0127 0.0307
poly(smokeyrs, 2)1 Contin. -0.0035 0.0428
poly(smokeyrs, 2)2 Contin. 0.0109 0.0468
exercise_0 Binary -0.0046 0.0046
exercise_1 Binary 0.0182 0.0182
exercise_2 Binary -0.0136 0.0136
active_0 Binary -0.0089 0.0089
active_1 Binary 0.0141 0.0141
active_2 Binary -0.0051 0.0051
poly(wt71, 2)1 Contin. -0.0090 0.0358
poly(wt71, 2)2 Contin. 0.0039 0.0515
Effective sample sizes
Control Treated
Unadjusted 1163. 403.
Adjusted 1128.61 325.97
All the confounders are well balanced.
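Balance can also be inspected visually. As a minimal sketch (assuming the ip_weights object returned by weightit above), cobalt’s love.plot function displays the same balance statistics graphically:

love.plot(ip_weights, stats = "m", abs = TRUE, thresholds = c(m = .1)) # standardised mean differences before/after weighting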
Now that balance has been checked, a weighted regression model can be fitted in the same manner as before.
msm <- lm_robust(wt82_71 ~ qsmk,
                 data = nhefs,
                 weights = ip_weights$weights)

summary(msm)
Call:
lm_robust(formula = wt82_71 ~ qsmk, data = nhefs, weights = ip_weights$weights)
Weighted, Standard error type: HC2
Coefficients:
Estimate Std. Error t value Pr(>|t|) CI Lower CI Upper DF
(Intercept) 1.780 0.2248 7.917 4.582e-15 1.339 2.221 1564
qsmk 3.441 0.5265 6.535 8.574e-11 2.408 4.473 1564
Multiple R-squared: 0.0435 , Adjusted R-squared: 0.04289
F-statistic: 42.71 on 1 and 1564 DF, p-value: 8.574e-11
Doubly robust estimation
Key assumptions of the parametric g-formula and IPTW estimators are that the models for the outcome and treatment, respectively, are correctly specified. Doubly robust (DR) estimators, on the other hand, combine models for both treatment and the outcome and, by doing so, give the analyst two chances to get the model right. That is, a DR estimator will be consistent (in the statistical sense of converging to the true parameter as the sample size grows) provided at least one of the two models, though not necessarily both, is correctly specified.
While several DR estimators have been proposed, we will focus on augmented inverse probability weighting (AIPW) and targeted maximum likelihood estimation (TMLE), prominent examples which are well supported by current software.
AIPW
The AIPW estimator is a doubly robust estimator of the ATE which, as the name suggests, “augments” the IPTW estimator with additional terms which improve efficiency and provide double robustness. Before covering AIPW in more detail, it is helpful to return to the previously mentioned IPTW (Horvitz-Thompson) estimator of the ATE:

E[Y(1)] - E[Y(0)] = (1/n) Σ_i A_i*Y_i / PS_i - (1/n) Σ_i (1 - A_i)*Y_i / (1 - PS_i)

Where PS_i is the estimated propensity score for individual i. The AIPW estimator augments each of these two terms with predictions from an outcome model, Y_1i = E[Y | A = 1, L_i] and Y_0i = E[Y | A = 0, L_i]:

E[Y(1)] - E[Y(0)] = (1/n) Σ_i [ A_i*Y_i / PS_i - ((A_i - PS_i) / PS_i) * Y_1i ] - (1/n) Σ_i [ (1 - A_i)*Y_i / (1 - PS_i) + ((A_i - PS_i) / (1 - PS_i)) * Y_0i ]
The parts on the left side of each set of big brackets are identical to the IPTW estimator above. The parts on the right are the augmentation components. These components provide the double robustness. That is, so long as either the model for treatment or the outcome are correctly specified, the estimator will be consistent (a proof of this property is given in Funk et al., (2011)).
AIPW ‘by hand’
All of the required components for applying AIPW, the propensity score and the conditional mean of the outcome under treatment and no treatment, were estimated in previous steps. Nonetheless, we’ll estimate them again for the sake of completeness.
First, fit the treatment and outcome models and compute the propensity score and the predicted outcomes under each treatment level:
outcome_mod <-
  lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
       poly(smokeintensity,2) + poly(smokeyrs,2) +
       exercise + active + poly(wt71,2) + qsmk:smokeintensity,
     data = nhefs)

fitted.0 <- predict(outcome_mod,
                    newdata = nhefs |> mutate(qsmk = 0))

fitted.1 <- predict(outcome_mod,
                    newdata = nhefs |> mutate(qsmk = 1))

treatment_mod <-
  glm(qsmk ~ sex + race + poly(age,2) + education +
        poly(smokeintensity,2) + poly(smokeyrs,2) +
        exercise + active + poly(wt71,2),
      family = binomial(), data = nhefs)

ps <- predict(treatment_mod, type = "response") # propensity score
Secondly, compute the AIPW ATE:
aipw <- function(A, Y, PS, Y_1, Y_0){

  # AIPW estimator of the ATE: IPTW terms for E[Y(1)] and E[Y(0)],
  # each augmented with the outcome-model predictions (Funk et al., 2011)
  out <- mean((((A*Y)/PS) - ((A - PS)/PS) * Y_1) -
                ((((1-A)*Y)/(1-PS)) + ((A - PS)/(1-PS)) * Y_0))

  out
}
aipw_out <- aipw(nhefs$qsmk, nhefs$wt82_71,
                 ps, fitted.1, fitted.0)

paste("AIPW ATE =", round(aipw_out,2))
[1] "AIPW ATE = 3.43"
Confidence intervals can be obtained by bootstrapping or based on the efficient influence curve.
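As a minimal sketch of the influence-curve approach (reusing ps, fitted.1, fitted.0 and aipw_out from above), a standard error and 95% CI for the ‘by hand’ AIPW estimate could be computed like so:

A <- nhefs$qsmk
Y <- nhefs$wt82_71
# estimated influence curve for each individual
ic <- ((A * Y / ps) - ((A - ps) / ps) * fitted.1) -
  (((1 - A) * Y / (1 - ps)) + ((A - ps) / (1 - ps)) * fitted.0) -
  aipw_out
se_aipw <- sqrt(var(ic) / length(ic))
aipw_out + c(-1.96, 1.96) * se_aipw # 95% CI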
AIPW using SuperLearner
Previously, we estimated causal effects using standard parametric models (linear and logistic regression). This was by no means a requirement; the g-formula and IPTW are compatible with other choices, but these were chosen for ease of demonstration. A problem with these simple parametric models is that they tend to impose restrictions on the manner in which covariates are related to the treatment or outcome. For instance, they often assume there are no or few interactions and that continuous variables are linearly related to the outcome/treatment. Consequently, some amount of model misspecification is very likely in real settings. By contrast, highly flexible machine learning models impose few such restrictions and may be less prone to misspecification-related bias.
To venture into the world of machine learning, we will apply the AIPW estimator using the R package AIPW. AIPW makes use of the SuperLearner algorithm to estimate the propensity score and the conditional mean of the outcome. SuperLearner is an ensemble method which uses cross-validation to form a weighted combination of predictions from a user-specified library of candidate models (van der Laan et al., 2007).
To estimate the effect of smoking cessation, we will use AIPW with generalised linear models, penalised maximum likelihood (glmnet), random forests, and Bayesian adaptive regression trees (BART) included in the SuperLearner algorithm. Standard errors are computed based on the efficient influence curve.
#| cache: true
confounds <-
  select(nhefs, sex, race, age, education,
         smokeintensity, smokeyrs, exercise, active, wt71)

SL.libs <- c("SL.glm", "SL.glmnet",
             "SL.ranger", "SL.bartMachine")

aipw_sl <- AIPW$new(Y = nhefs$wt82_71,
                    A = nhefs$qsmk,
                    W.Q = confounds,
                    W.g = confounds,
                    Q.SL.library = SL.libs,
                    g.SL.library = SL.libs,
                    k_split = 5,
                    verbose = FALSE)

aipw_sl$fit()

aipw_sl$summary()

aipw_sl$result
Estimate SE 95% LCL 95% UCL N
Risk of exposure 4.907530 0.6708074 3.592747 6.222312 403
Risk of control 1.751597 0.2178300 1.324650 2.178543 1163
Risk Difference 3.155933 0.6989296 1.786031 4.525835 1566
The results are very similar to those obtained by the previous methods.
TMLE
TMLE is a doubly robust procedure for the estimation of causal effects which is similar to AIPW. Indeed, the two are asymptotically equivalent. In practical settings, however, TMLE can sometimes outperform AIPW as it is less sensitive to the extreme weights which can occur under random nonpositivity.
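As an aside, one way to gauge whether extreme weights are a concern in a given analysis is simply to inspect the distribution of the estimated IP weights. A minimal sketch, recomputing the raw weights from the propensity scores estimated earlier:

raw_w <- ifelse(nhefs$qsmk == 1, 1 / ps, 1 / (1 - ps))
summary(raw_w) # very large maximum weights would suggest near-violations of positivity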
Applying the TMLE estimator proceeds like so:
- Fit a model for the outcome, including treatment and confounders, to obtain the predicted values of the outcome under treatment given and treatment not given, respectively (fitted.1 and fitted.0 in the code below)
- Fit a model for treatment and estimate the propensity score, PS
- Compute the ‘clever covariates’ H1 and H0, which are equal to A/PS and (1 - A)/(1 - PS), respectively
- Fit a new model for the residuals of the outcome from the outcome model, including the clever covariates H0 and H1
- ‘Update’ the initial predictions using an expression which incorporates the estimated coefficients for the clever covariates H1 and H0. That is, fitted.1.new = fitted.1 + coef(H1)/PS and fitted.0.new = fitted.0 + coef(H0)/(1 - PS)
- Compute the ATE based on the updated estimates: ATE = mean(fitted.1.new - fitted.0.new)
The reason that H1 and H0 are referred to as ‘clever’ covariates is that updating the initial outcome predictions with them ensures the resulting estimator solves the efficient influence curve estimating equation, which is what gives TMLE its double robustness and efficiency properties.
TMLE ‘by hand’
First, we fit a model for the outcome and compute the desired predictions:
outcome_mod <-
  lm(wt82_71 ~ qsmk + sex + race + poly(age,2) + education +
       poly(smokeintensity,2) + poly(smokeyrs,2) +
       exercise + active + poly(wt71,2) + qsmk:smokeintensity,
     data = nhefs)

fitted.0 <- predict(outcome_mod,
                    newdata = mutate(nhefs, qsmk = 0))

fitted.1 <- predict(outcome_mod,
                    newdata = mutate(nhefs, qsmk = 1))
Then we fit a model for treatment, obtain the propensity score and compute the clever covariates.
treatment_mod <-
  glm(qsmk ~ sex + race + poly(age,2) + education +
        poly(smokeintensity,2) + poly(smokeyrs,2) +
        exercise + active + poly(wt71,2),
      family = binomial(), data = nhefs)

ps <- predict(treatment_mod, type = "response") # propensity score

H0 <- (1 - nhefs$qsmk) / (1 - ps)
H1 <- nhefs$qsmk / ps
Then, fit the model to the residuals of Y including clever covariates:
resid <- residuals(outcome_mod)

resid_mod <- lm(resid ~ H0 + H1)
Finally, update the initial predictions using the clever covariate coefficients and estimate the ATE:
fitted.1.new <- fitted.1 + (coef(resid_mod)["H1"] / ps)
fitted.0.new <- fitted.0 + (coef(resid_mod)["H0"] / (1 - ps))

ATE_out <- mean(fitted.1.new - fitted.0.new)

paste("TMLE ATE =", round(ATE_out, 2))
[1] "TMLE ATE = 3.46"
Like AIPW, confidence intervals could be obtained based on the efficient influence curve or by bootstrapping. See Luque-Fernandez et al. (2018) for a demonstration of the former.
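As a rough sketch of the efficient-influence-curve approach (reusing ps, fitted.1.new, fitted.0.new and ATE_out from the ‘by hand’ steps above):

A <- nhefs$qsmk
Y <- nhefs$wt82_71
# estimated efficient influence curve for the TMLE ATE
ic <- (A / ps) * (Y - fitted.1.new) -
  ((1 - A) / (1 - ps)) * (Y - fitted.0.new) +
  (fitted.1.new - fitted.0.new) - ATE_out
se_tmle <- sqrt(var(ic) / length(ic))
ATE_out + c(-1.96, 1.96) * se_tmle # 95% CI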
TMLE using SuperLearner
For a final demonstration, we will estimate the TMLE ATE using the R package tmle. Again, we will use SuperLearner as the prediction algorithm, with the same set of libraries as for AIPW.
#| cache: true
<- c("SL.glm","SL.glmnet",
SL.libs "SL.ranger","SL.bartMachine")
<- tmle(Y = nhefs$wt82_71,
tmle.out A = nhefs$qsmk,
W = confounds,
family = "gaussian",
V.Q = 5,
V.g = 5,
Q.SL.library = SL.libs,
g.SL.library = SL.libs)
tmle.out
Marginal mean under treatment (EY1)
Parameter Estimate: 5.1703
Estimated Variance: 0.17413
p-value: <2e-16
95% Conf Interval: (4.3524, 5.9882)
Marginal mean under comparator (EY0)
Parameter Estimate: 1.7488
Estimated Variance: 0.046878
p-value: 6.6289e-16
95% Conf Interval: (1.3244, 2.1732)
Additive Effect
Parameter Estimate: 3.4215
Estimated Variance: 0.21337
p-value: 1.2912e-13
95% Conf Interval: (2.5161, 4.3268)
Additive Effect among the Treated
Parameter Estimate: 3.4633
Estimated Variance: 0.21727
p-value: 1.0855e-13
95% Conf Interval: (2.5497, 4.3769)
Additive Effect among the Controls
Parameter Estimate: 3.3705
Estimated Variance: 0.21767
p-value: 5.0397e-13
95% Conf Interval: (2.4561, 4.2849)
Again, results are very similar.
Conclusion
In summary, here are a few key takeaways:
- Even when conditions for causal identification are met, unbiased estimation of causal effects requires correct model specification
- The parametric g-formula and IPTW can be straightforwardly applied using tools most researchers are already familiar with
- Doubly robust estimators are straightforward extensions of the g-formula and IPTW and, as they readily allow for the incorporation of machine learning libraries, should generally be preferred over singly robust alternatives
In part 2, I will cover the case of time varying treatments and demonstrate how g-methods and their doubly robust extensions become essential in this setting.
References
Hernán, M. A., & Taubman, S. L. (2008). Does obesity shorten life? The importance of well-defined interventions to answer causal questions. International journal of obesity, 32(3), S8-S14.
Hernán, M. A., & Robins, J. M. (2010). Causal inference.
Arel-Bundock V (2023). marginaleffects: Predictions, Comparisons, Slopes, Marginal Means, and Hypothesis Tests. R package version 0.9.0, https://vincentarelbundock.github.io/marginaleffects/.
Greifer N, Worthington S, Iacus S, King G (2023). clarify: Simulation-Based Inference for Regression Models. https://github.com/iqss/clarify, https://iqss.github.io/clarify/.
Austin, P. C., & Stuart, E. A. (2015). Moving towards best practice when using inverse probability of treatment weighting (IPTW) using the propensity score to estimate causal treatment effects in observational studies. Statistics in medicine, 34(28), 3661-3679.
Greifer N (2023). WeightIt: Weighting for Covariate Balance in Observational Studies. https://ngreifer.github.io/WeightIt/, https://github.com/ngreifer/WeightIt.
Greifer N (2023). cobalt: Covariate Balance Tables and Plots. https://ngreifer.github.io/cobalt/, https://github.com/ngreifer/cobalt.
Funk, M. J., et al. (2011). Doubly robust estimation of causal effects. American Journal of Epidemiology, 173(7), 761-767. https://doi.org/10.1093/aje/kwq439
Zhong, Y., et al. (2021). AIPW: An R package for augmented inverse probability-weighted estimation of average causal effects. American Journal of Epidemiology, 190(12), 2690-2699. https://doi.org/10.1093/aje/kwab207
Van der Laan, M. J., Polley, E. C., & Hubbard, A. E. (2007). Super learner. Statistical applications in genetics and molecular biology, 6(1).
Luque‐Fernandez, M. A., Schomaker, M., Rachet, B., & Schnitzer, M. E. (2018). Targeted maximum likelihood estimation for a binary treatment: A tutorial. Statistics in medicine, 37(16), 2530-2546.