```r
# load
library(faux)
library(lme4)
library(tidyverse)
library(furrr)
library(purrr)
library(ggdist)
library(marginaleffects)
library(glmmTMB)

# adjust the plotting theme
theme_set(
  theme_linedraw() +
    theme(panel.grid = element_blank(),
          strip.background = element_rect(fill = "grey92", color = "grey92"),
          strip.text = element_text(color = "black", size = 10)))
```
## Intro
This is a response to, and extension of, the recent blog post by Solomon Kurz (hereafter, SK), which contrasted a set of four modeling options for pre-post RCTs. In short, the four models in R syntax were:
- classic change score model: `Y2 - Y1 ~ treatment`
- classic ANCOVA model: `Y2 ~ Y1 + treatment`
- linear mixed model (LMM): `Y ~ treatment*time + (1 | ID)`
- LMM ANCOVA: `Y ~ time + treatment:time + (1 | ID)`
The four models were compared in a simulation study for bias and efficiency. Here, I add a minor extension to the simulation study to see how things change when baseline has a truncated distribution, which is often the case in realistic RCTs due to strict exclusion criteria.
My suspicion is that the classic glm() ANCOVA model will emerge as the best option in this context. The reason is that the LMMs must include the baseline score as part of the outcome vector, so that vector effectively contains two different distributions: a truncated one at baseline and an untruncated one at follow-up. This data generating process is a challenge for LMMs to represent.
Packages are loaded and the plotting theme is set in the code at the top of this post.
## Simulate fake BDI-II data
The BDI-II (Beck Depression Inventory-II) is a widely used tool for measuring the severity of depression symptoms. In depression RCTs, a person usually has to meet a minimum level of severity to be eligible for inclusion. For instance, a BDI score > 18, indicating moderate to severe depression, is often used as a cut-off. Note that such cut-offs are not limited to psychiatry; for example, patients may need to have sufficiently high blood pressure to be eligible for a trial of an antihypertensive.
We presume that the population expressing interest in the RCT has a mean BDI score of 20 and a standard deviation of 6. For this first example, we will apply a threshold of 20 for inclusion, meaning that on average half of the individuals in the applicant pool will fall below the threshold for participation.
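As a quick check on that claim (my own verification snippet, separate from the simulation code below), the normal CDF gives the expected exclusion fraction:

```r
# expected proportion of applicants excluded by the threshold,
# given BDI ~ Normal(20, 6) in the interested population
pnorm(20, mean = 20, sd = 6)  # 0.5, i.e. half the pool is excluded
```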
```r
sim_data <- function(seed = seed, n = n, tau = tau, rho = rho,
                     threshold = threshold) {

  # population values
  m <- 20
  s <- 6

  # simulate and save
  set.seed(seed)

  d <- rnorm_multi(
    n = n,
    mu = c(m, m),
    sd = c(s, s),
    r = rho,
    varnames = list("pre", "post")
  ) %>%
    mutate(tx = rep(0:1, each = n / 2)) %>%
    mutate(post = ifelse(tx == 1, post + tau, post))

  # apply exclusion criteria
  d %>% filter(pre > threshold)

}
```
An example dataset is created. We set the treatment effect to a reduction in BDI score of 6 points.
```r
# generate data
dw <- sim_data(seed = 1, n = 300, tau = -6, rho = .5,
               threshold = 20)

# long format
dl <- dw %>%
  mutate(id = 1:n()) %>%
  pivot_longer(pre:post,
               names_to = "wave",
               values_to = "y") %>%
  mutate(time = ifelse(wave == "pre", 0, 1))
```
A histogram of the scores at each wave displays the baseline truncation:
```r
# histograms
dl$wave <- factor(dl$wave, c("pre", "post"))

dl %>%
  ggplot(aes(y = y)) +
  geom_histogram(color = "#000000", fill = "#0099F8", bins = 30) +
  facet_wrap(~ wave) +
  coord_flip()
```
The next plot shows change in symptoms over time. Participants in the control group tend (slightly) toward symptom improvement, despite receiving no treatment. This is typical of depression RCTs. Only individuals who were quite unwell at the time of recruitment could participate, after which they may tend to drop back down toward normal levels (i.e., regression to the mean). Of course, other factors, such as placebo effects, can also cause control participants to improve across a trial, but regression to the mean is the only factor at play in this simulation.
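How large should this regression-to-the-mean effect be? A back-of-envelope calculation (my own, under the bivariate normal data generating process above) uses the truncated-normal mean: with truncation at the population mean, the expected baseline elevation is \(\sigma \phi(z) / (1 - \Phi(z))\), and the expected control-group change is \((\rho - 1)\) times that elevation.

```r
# expected control-group change due to regression to the mean alone
m <- 20; s <- 6; rho <- .5; threshold <- 20
z     <- (threshold - m) / s
mills <- dnorm(z) / (1 - pnorm(z))   # inverse Mills ratio
e_pre <- m + s * mills               # E[pre | pre > threshold], ~24.8
(rho - 1) * (e_pre - m)              # expected change, ~ -2.4 points
```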
```r
# example trends
dl %>%
  ggplot(aes(x = time, y = y)) +
  geom_line(aes(group = id),
            linewidth = 1/4, alpha = 3/4) +
  stat_smooth(method = "lm", se = F, linewidth = 3, formula = y ~ x) +
  scale_x_continuous(breaks = 0:1, labels = c("0 (pre)", "1 (post)"), expand = c(0.1, 0.1)) +
  scale_y_continuous(sec.axis = dup_axis(name = NULL)) +
  ggtitle(expression("Simulated pre/post trajectories for which "*tau==-6)) +
  facet_wrap(~ tx, labeller = label_both)
```
## Simulation study
We now replicate SK's simulation study. As in the original, we use two values for the pre-post correlation, though we use 0.7 in place of 0.8. In addition, we apply two minimum severity thresholds, designed to cover a range of realistic scenarios. In the first, interested prospective participants must have a BDI score > 16, which excludes about a quarter of applicants from being randomised. In the second, a BDI score > 20 is required, excluding half of the pool of potential participants.
The sample size of study applicants is inflated to keep the final, randomised sample at a realistic size (roughly 120-150 participants, depending on the condition).
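A quick calculation (my own check, not in SK's original) shows the expected number randomised under each applicant pool size:

```r
# expected randomised sample size = applicants x probability of passing screening
160 * (1 - pnorm(16, mean = 20, sd = 6))  # threshold 16: ~120 randomised
300 * (1 - pnorm(20, mean = 20, sd = 6))  # threshold 20: 150 randomised
```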
```r
sim_fit <- function(seed = seed, n = n, tau = -6, rho = rho,
                    threshold = threshold) {

  # population values
  m <- 20
  s <- 6

  # simulate wide
  set.seed(seed)

  dw <-
    rnorm_multi(
      n = n,
      mu = c(m, m),
      sd = c(s, s),
      r = rho,
      varnames = list("pre", "post")
    ) %>%
    mutate(tx = rep(0:1, each = n / 2)) %>%
    mutate(post = ifelse(tx == 1, post + tau, post))

  # apply exclusion criteria
  dw <-
    dw %>% filter(pre > threshold)

  # make long
  dl <- dw %>%
    mutate(id = 1:n()) %>%
    pivot_longer(pre:post,
                 names_to = "wave",
                 values_to = "y") %>%
    mutate(time = ifelse(wave == "pre", 0, 1))

  # fit the models
  w1 <- glm(
    data = dw,
    family = gaussian,
    (post - pre) ~ 1 + tx)

  w2 <- glm(
    data = dw,
    family = gaussian,
    post ~ 1 + pre + tx)

  l1 <- lmer(
    data = dl,
    y ~ 1 + tx + time + tx:time + (1 | id))

  l2 <- lmer(
    data = dl,
    y ~ 1 + time + tx:time + (1 | id))

  # summarize
  bind_rows(
    broom::tidy(w1, conf.int = T)[2, c(2:4, 6:7)],
    broom::tidy(w2, conf.int = T)[3, c(2:4, 6:7)],
    broom.mixed::tidy(l1, conf.int = T)[4, c(4:6, 7:8)],
    broom.mixed::tidy(l2, conf.int = T)[3, c(4:6, 7:8)]) %>%
    mutate(method = rep(c("glm()", "lmer()"), each = 2),
           model = rep(c("change", "ANCOVA"), times = 2))

}
```
Now, 1000 unique datasets are simulated for each combination of correlation (rho) and threshold, and the treatment effect estimate is extracted from each. The simulations are run in parallel across 4 workers using the furrr package.
```r
plan(multisession, workers = 4)

# rho = 0.4
sim.4.16 <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .4,
                           threshold = 16, n = 160)) %>%
  unnest(tidy)

plan(multisession, workers = 4)

sim.4.20 <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .4,
                           threshold = 20, n = 300)) %>%
  unnest(tidy)

# rho = .7
plan(multisession, workers = 4)

sim.7.16 <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .7,
                           threshold = 16, n = 160)) %>%
  unnest(tidy)

plan(multisession, workers = 4)

sim.7.20 <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .7,
                           threshold = 20, n = 300)) %>%
  unnest(tidy)
```
## Results
The bias in the estimation of the treatment effect is displayed for each of the combinations of method, correlation, and inclusion threshold.
```r
bind_rows(
  sim.4.16 %>% mutate(rho = .4, threshold = 16),
  sim.4.20 %>% mutate(rho = .4, threshold = 20),
  sim.7.16 %>% mutate(rho = .7, threshold = 16),
  sim.7.20 %>% mutate(rho = .7, threshold = 20)) %>%
  mutate(type = str_c(model, ", ", method),
         rho = str_c("rho==", rho),
         threshold = str_c("threshold ==", threshold)) %>%
  ggplot(aes(x = estimate, y = type, slab_color = after_stat(x), slab_fill = after_stat(x))) +
  geom_vline(xintercept = -6, color = "grey67") +
  stat_dotsinterval(.width = .5, slab_shape = 22) +
  labs(title = "Parameter bias by model, pre/post correlation, and threshold",
       x = expression(hat(tau)*" (causal effect point estimate)"),
       y = NULL) +
  scale_slab_fill_continuous(limits = c(0, NA)) +
  scale_slab_color_continuous(limits = c(0, NA)) +
  coord_cartesian(ylim = c(1.4, NA)) +
  theme(axis.text.y = element_text(hjust = 0),
        legend.position = "none") +
  facet_wrap(vars(rho, threshold), labeller = label_parsed)
```
All of the methods are unbiased: the distributions of point estimates centre on the true value of -6.
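To put a number on this claim, we can average the estimates within one condition (a summary of the simulation objects above; the other conditions behave similarly):

```r
# mean bias (estimate minus the true effect of -6), strongest-truncation condition
sim.4.20 %>%
  group_by(method, model) %>%
  summarise(bias = mean(estimate) - (-6), .groups = "drop")
```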
### Efficiency

The efficiency (standard error of the treatment effect) is displayed for each of the combinations of method, correlation, and inclusion threshold.
```r
bind_rows(
  sim.4.16 %>% mutate(rho = .4, threshold = 16),
  sim.4.20 %>% mutate(rho = .4, threshold = 20),
  sim.7.16 %>% mutate(rho = .7, threshold = 16),
  sim.7.20 %>% mutate(rho = .7, threshold = 20)) %>%
  mutate(type = str_c(model, ", ", method),
         rho = str_c("rho==", rho),
         threshold = str_c("threshold ==", threshold)) %>%
  ggplot(aes(x = std.error, y = type, slab_color = after_stat(x), slab_fill = after_stat(x))) +
  stat_dotsinterval(.width = .5, slab_shape = 22) +
  labs(title = "Parameter efficiency by model, pre/post correlation, and threshold",
       x = expression(tau[s.e.]*" (causal effect standard error)"),
       y = NULL) +
  scale_slab_fill_continuous(limits = c(0, NA)) +
  scale_slab_color_continuous(limits = c(0, NA)) +
  coord_cartesian(ylim = c(1.4, NA)) +
  theme(axis.text.y = element_text(hjust = 0),
        legend.position = "none") +
  facet_wrap(vars(rho, threshold), labeller = label_parsed)
```
One result clearly echoes SK's original simulation: the ANCOVA and LMM ANCOVA models beat the change models in terms of efficiency, especially when the pre-post correlation is relatively low. There is also a major difference: the LMM ANCOVA method stands out as the most efficient, particularly when baseline truncation is strong.
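The same impression can be checked numerically, for example in the rho = .4, threshold = 20 condition:

```r
# median standard error by model type (rho = .4, threshold = 20)
sim.4.20 %>%
  group_by(method, model) %>%
  summarise(med_se = median(std.error), .groups = "drop") %>%
  arrange(med_se)
```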
Lastly, the percentage coverage of the 95% CI for each model type is compared.
```r
bind_rows(
  sim.4.16 %>% mutate(rho = .4, threshold = 16),
  sim.4.20 %>% mutate(rho = .4, threshold = 20),
  sim.7.16 %>% mutate(rho = .7, threshold = 16),
  sim.7.20 %>% mutate(rho = .7, threshold = 20)) %>%
  mutate(type = str_c(model, ", ", method),
         rho = str_c("rho==", rho),
         threshold = str_c("threshold = ", threshold)) %>%
  mutate(covered = if_else(conf.low < -6 & conf.high > -6, 1, 0)) %>%
  group_by(type, rho, threshold, covered) %>%
  tally() %>%
  mutate(coverage = n / sum(n) * 100) %>%
  filter(covered == 1) %>%
  ggplot(aes(x = type, y = coverage, colour = threshold)) +
  geom_hline(aes(yintercept = 94), colour = "grey") +
  geom_hline(aes(yintercept = 96), colour = "grey") +
  geom_point(position = position_dodge(0.25)) +
  facet_wrap(~ rho, labeller = label_parsed) +
  theme(axis.text.x = element_text(angle = 90)) +
  ylim(c(85, 100)) +
  labs(y = "Coverage %", x = "")
```
Under most conditions, each of the methods produces intervals with nominal coverage. Notably, however, the LMM ANCOVA method tends to produce confidence intervals which are somewhat too narrow, and this worsens with lower pre-post correlation and a greater degree of baseline truncation.
Why truncation produces overly confident inferences from the LMM ANCOVA in particular is, to me, somewhat of a mystery. One possibility is that the error variance at baseline is considerably lower than at post-test, and this heteroscedasticity causes problems for the LMM ANCOVA in particular.
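The example dataset is consistent with this suspicion: because of the truncation, the spread of scores at baseline is much smaller than at follow-up.

```r
# compare the spread of scores at baseline vs follow-up in the example data;
# under truncation the baseline SD should sit well below the follow-up SD
dw %>% summarise(sd_pre = sd(pre), sd_post = sd(post))
```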
## Distributional model with glmmTMB
To investigate this possibility, using the example dataset created above, we fit a distributional model in which the error variance is allowed to vary by time in the LMM ANCOVA.
The model, in statistical notation, is

\[
\begin{align*}
y_{it} & \sim \mathcal N(\mu_{it}, \sigma_{e_{it}}) \\
\mu_{it} & = \beta_0 + \beta_1 \text{time}_{it} + \beta_2 \text{tx}_{it}\text{time}_{it} + u_{0i} \\
u_{0i} & \sim \mathcal N(0, \sigma_0) \\
\log(\sigma_{e_{it}}) & = \delta_0 + \delta_1 \text{time}_{it}
\end{align*}
\]
The model is fit using R package glmmTMB:
```r
dm1 <-
  glmmTMB(
    data = dl,
    y ~ 1 + time + tx:time + (1 | id),
    dispformula = ~ time, REML = T)

summary(dm1)
```
```
 Family: gaussian  ( identity )
Formula:          y ~ 1 + time + tx:time + (1 | id)
Dispersion:       ~time
Data: dl

     AIC      BIC   logLik deviance df.resid 
  1728.4   1750.6   -858.2   1716.4      290 

Random effects:

Conditional model:
 Groups   Name        Variance Std.Dev.
 id       (Intercept) 6.632    2.575   
 Residual                  NA       NA 
Number of obs: 296, groups:  id, 148

Conditional model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  24.8620     0.3117   79.76  < 2e-16 ***
time         -3.0291     0.6276   -4.83 1.39e-06 ***
time:tx      -5.2394     0.8498   -6.17 7.02e-10 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Dispersion model:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)   1.0236     0.1195   8.564  < 2e-16 ***
time          0.5465     0.1394   3.921 8.82e-05 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```
There is strong evidence that the residual variation, sigma, differs between baseline and follow-up: the estimate \(\hat\delta_1 = 0.55\) (modelled on the log scale) implies that the residual standard deviation at follow-up is about exp(0.55) ≈ 1.7 times that at baseline, i.e. roughly a 3-fold (exp(2 × 0.55) ≈ 3) difference in residual variance.
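For the record, this ratio can be computed directly from the fitted model (assuming, as in the notation above, that the dispersion coefficients sit on the log standard deviation scale):

```r
# residual-variation ratio implied by the dispersion model
delta1 <- fixef(dm1)$disp[["time"]]
exp(delta1)      # ~1.7-fold larger residual SD at follow-up
exp(2 * delta1)  # ~3-fold larger residual variance at follow-up
```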
Comparing the classic LMM with the distributional model by AIC:
```r
l1 <- lmer(
  data = dl,
  y ~ 1 + time + tx:time + (1 | id))

AIC(l1, dm1)
```

```
    df      AIC
l1   5 1747.420
dm1  6 1728.437
```
There is evidence that the distributional model provides a better fit to the data (lower AIC).
To contrast this distributional model with the previously described methods, we extend the simulation study described above to include the models fit via glmmTMB.
```r
sim_fit <- function(seed = seed, n = n, tau = -6, rho = rho,
                    threshold = threshold) {

  # population values
  m <- 20
  s <- 6

  # simulate wide
  set.seed(seed)

  dw <-
    rnorm_multi(
      n = n,
      mu = c(m, m),
      sd = c(s, s),
      r = rho,
      varnames = list("pre", "post")
    ) %>%
    mutate(tx = rep(0:1, each = n / 2)) %>%
    mutate(post = ifelse(tx == 1, post + tau, post))

  # apply exclusion criteria
  dw <-
    dw %>% filter(pre > threshold)

  # make long
  dl <- dw %>%
    mutate(id = 1:n()) %>%
    pivot_longer(pre:post,
                 names_to = "wave",
                 values_to = "y") %>%
    mutate(time = ifelse(wave == "pre", 0, 1))

  # fit the models
  dm1 <- glmmTMB(
    data = dl,
    y ~ 1 + tx + time + tx:time + (1 | id),
    dispformula = ~ time, REML = T)

  dm2 <- glmmTMB(
    data = dl,
    y ~ 1 + time + tx:time + (1 | id),
    dispformula = ~ time, REML = T)

  # summarize
  bind_rows(
    broom.mixed::tidy(dm1, conf.int = T)[4, c(5:7, 9:10)],
    broom.mixed::tidy(dm2, conf.int = T)[3, c(5:7, 9:10)]) %>%
    mutate(method = rep("glmmTMB", each = 2),
           model = c("change", "ANCOVA"))

}
```
Fit the distributional model to 1000 simulated datasets:
```r
plan(multisession, workers = 4)

# rho = 0.4
sim.4.16.d <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .4,
                           threshold = 16, n = 160)) %>%
  unnest(tidy)

plan(multisession, workers = 4)

sim.4.20.d <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .4,
                           threshold = 20, n = 300)) %>%
  unnest(tidy)

# rho = .7
plan(multisession, workers = 4)

sim.7.16.d <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .7,
                           threshold = 16, n = 160)) %>%
  unnest(tidy)

plan(multisession, workers = 4)

sim.7.20.d <- tibble(seed = 1:1000) %>%
  mutate(tidy = future_map(seed, sim_fit, rho = .7,
                           threshold = 20, n = 300)) %>%
  unnest(tidy)
```
## Results part 2
The bias in the estimation of the treatment effect is displayed for each of the combinations of method, correlation, and inclusion threshold.
```r
bind_rows(
  sim.4.16 %>% mutate(rho = .4, threshold = 16),
  sim.4.20 %>% mutate(rho = .4, threshold = 20),
  sim.7.16 %>% mutate(rho = .7, threshold = 16),
  sim.7.20 %>% mutate(rho = .7, threshold = 20),
  sim.4.16.d %>% mutate(rho = .4, threshold = 16),
  sim.4.20.d %>% mutate(rho = .4, threshold = 20),
  sim.7.16.d %>% mutate(rho = .7, threshold = 16),
  sim.7.20.d %>% mutate(rho = .7, threshold = 20)) %>%
  mutate(type = str_c(model, ", ", method),
         rho = str_c("rho==", rho),
         threshold = str_c("threshold ==", threshold)) %>%
  ggplot(aes(x = estimate, y = type, slab_color = after_stat(x), slab_fill = after_stat(x))) +
  geom_vline(xintercept = -6, color = "grey67") +
  stat_dotsinterval(.width = .5, slab_shape = 22) +
  labs(title = "Parameter bias by model, pre/post correlation, and threshold",
       x = expression(hat(tau)*" (causal effect point estimate)"),
       y = NULL) +
  scale_slab_fill_continuous(limits = c(0, NA)) +
  scale_slab_color_continuous(limits = c(0, NA)) +
  coord_cartesian(ylim = c(1.4, NA)) +
  theme(axis.text.y = element_text(hjust = 0),
        legend.position = "none") +
  facet_wrap(vars(rho, threshold), labeller = label_parsed)
```
Again, there is no evidence of bias associated with any of the methods.
### Efficiency

The efficiency (standard error of the treatment effect) is again displayed for each of the methods.
```r
bind_rows(
  sim.4.16 %>% mutate(rho = .4, threshold = 16),
  sim.4.20 %>% mutate(rho = .4, threshold = 20),
  sim.7.16 %>% mutate(rho = .7, threshold = 16),
  sim.7.20 %>% mutate(rho = .7, threshold = 20),
  sim.4.16.d %>% mutate(rho = .4, threshold = 16),
  sim.4.20.d %>% mutate(rho = .4, threshold = 20),
  sim.7.16.d %>% mutate(rho = .7, threshold = 16),
  sim.7.20.d %>% mutate(rho = .7, threshold = 20)) %>%
  mutate(type = str_c(model, ", ", method),
         rho = str_c("rho==", rho),
         threshold = str_c("threshold ==", threshold)) %>%
  ggplot(aes(x = std.error, y = type, slab_color = after_stat(x), slab_fill = after_stat(x))) +
  stat_dotsinterval(.width = .5, slab_shape = 22) +
  labs(title = "Parameter efficiency by model, pre/post correlation, and threshold",
       x = expression(tau[s.e.]*" (causal effect standard error)"),
       y = NULL) +
  scale_slab_fill_continuous(limits = c(0, NA)) +
  scale_slab_color_continuous(limits = c(0, NA)) +
  coord_cartesian(ylim = c(1.4, NA)) +
  theme(axis.text.y = element_text(hjust = 0),
        legend.position = "none") +
  facet_wrap(vars(rho, threshold), labeller = label_parsed)
```
It seems that the classic change LMM, the distributional change LMM, and the glm change model are all equivalent in terms of efficiency. They are clearly outperformed by their ANCOVA counterparts in all cases.
But here is the interesting result: when the LMM ANCOVA is extended to include a model for sigma (the residual variance), its efficiency becomes equivalent to that of the ANCOVA glm.
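Placing the median ANCOVA standard errors side by side makes the point (again summarising the simulation objects above, for one condition):

```r
# median ANCOVA standard errors by fitting method (rho = .4, threshold = 20)
bind_rows(sim.4.20, sim.4.20.d) %>%
  filter(model == "ANCOVA") %>%
  group_by(method) %>%
  summarise(med_se = median(std.error), .groups = "drop")
```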
Here we compare coverage of the 95% intervals for each method:
```r
bind_rows(
  sim.4.16 %>% mutate(rho = .4, threshold = 16),
  sim.4.20 %>% mutate(rho = .4, threshold = 20),
  sim.7.16 %>% mutate(rho = .7, threshold = 16),
  sim.7.20 %>% mutate(rho = .7, threshold = 20),
  sim.4.16.d %>% mutate(rho = .4, threshold = 16),
  sim.4.20.d %>% mutate(rho = .4, threshold = 20),
  sim.7.16.d %>% mutate(rho = .7, threshold = 16),
  sim.7.20.d %>% mutate(rho = .7, threshold = 20)) %>%
  mutate(type = str_c(model, ", ", method),
         rho = str_c("rho==", rho),
         threshold = str_c("threshold = ", threshold)) %>%
  mutate(covered = if_else(conf.low < -6 & conf.high > -6, 1, 0)) %>%
  group_by(type, rho, threshold, covered) %>%
  tally() %>%
  mutate(coverage = n / sum(n) * 100) %>%
  filter(covered == 1) %>%
  ggplot(aes(x = type, y = coverage, colour = threshold)) +
  geom_hline(aes(yintercept = 94), colour = "grey") +
  geom_hline(aes(yintercept = 96), colour = "grey") +
  geom_point(position = position_dodge(0.25)) +
  facet_wrap(~ rho, labeller = label_parsed) +
  theme(axis.text.x = element_text(angle = 90)) +
  ylim(c(85, 100)) +
  labs(y = "Coverage %", x = "")
```
Again, all methods except the classic LMM ANCOVA produce nominal coverage. Adding a model for sigma to the LMM ANCOVA, thereby accounting for the unequal variance between time points, appears to have solved its undercoverage problem.
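A direct numeric comparison of the two LMM ANCOVA variants, in one condition, illustrates the repair:

```r
# 95% CI coverage: classic lmer() ANCOVA vs distributional glmmTMB ANCOVA
bind_rows(sim.4.20, sim.4.20.d) %>%
  filter(model == "ANCOVA", method != "glm()") %>%
  mutate(covered = if_else(conf.low < -6 & conf.high > -6, 1, 0)) %>%
  group_by(method) %>%
  summarise(coverage = mean(covered) * 100, .groups = "drop")
```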
## Conclusion
Under these specific conditions, a few clear themes emerge:
- The typical glm ANCOVA is a safe bet, producing good efficiency and nominal coverage in all situations considered.
- The LMM ANCOVA is anticonservative in the context of baseline truncation, and this worsens with stronger baseline truncation and lower pre-post correlation.
- The anticonservatism of the LMM ANCOVA under baseline truncation can be rescued by allowing the residual variance to vary by time, as in a glmmTMB distributional model.