Meta-Analysis

A meta-analysis is the “statistical analysis of a large collection of analysis results from individual studies for the purpose of integrating the findings” (Glass, 1976). Meta-analysis is a standard tool for producing summaries of research findings in medicine and other fields.


Meta-analysis can be useful when studies yield potentially conflicting results, when sample sizes in individual studies are modest, when events are rare, and more generally to summarize a body of literature.

TB Studies

Hierarchical models are often used as part of meta-analysis. As an example, consider a forest plot illustrating the results of 13 studies evaluating the efficacy of a vaccine (BCG) for preventing tuberculosis. Such a figure shows the relative risk of tuberculosis infection in the treated versus control groups in each of the 13 studies (along with CIs from each study), as well as a summary estimate and 95% CI based on a random effects meta-analysis.

The BCG vaccine is not currently given in the US due to low TB prevalence.

Random Effects Meta-Analysis

A random effects meta-analysis typically assumes the model \[y_i=\theta_i+e_i, \qquad \theta_i=\mu+b_i, \qquad b_i \sim N(0,\tau^2),\] where \(y_i\) is the effect estimate (possibly transformed) from study \(i\), \(e_i \sim N(0,v_i)\) is the sampling error from study \(i\) (the sampling variance \(v_i\) estimated from each study is assumed known), \(\mu\) is the average “true” effect, and \(\tau^2\) is the heterogeneity among true effects. A small simulated example appears after the notes below.

In this framework, we may think of individual studies as:

  • replicates
  • results from a variety of completely different studies of the same topic
  • exchangeable yet not completely identical or unrelated

Note the following.

  • \(\mu\) is typically the primary quantity of interest as a summary measure across studies
  • the error variance \(v_i\) of the individual study estimate \(y_i\) varies across studies and is usually treated as known, set equal to the squared standard error estimate from study \(i\)
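To make the notation concrete, here is a minimal simulation sketch: we generate true study effects around \(\mu\) with heterogeneity \(\tau\), add sampling error with known variances, and recover \(\mu\). The true values and the choice of the metafor package are ours for illustration; the spanking example below fits the same model with brms.

library(metafor)

set.seed(42)
k   <- 13                     # number of studies
mu  <- 0.5                    # true average effect
tau <- 0.2                    # between-study SD of the true effects
vi  <- runif(k, 0.01, 0.1)    # "known" sampling variances

theta <- rnorm(k, mu, tau)          # study-specific true effects
yi    <- rnorm(k, theta, sqrt(vi))  # observed effect estimates

# random effects meta-analysis: REML estimates of mu and tau^2
rma(yi = yi, vi = vi, method = "REML")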

Example: Spanking Data

Kurz considers data on corporal punishment of children. UNICEF (2014) reports that 80% of children worldwide are spanked or physically punished by their parents. Spanking is one of the most studied (and controversial) aspects of parenting, and hundreds of studies are available on the topic. The data in spank.xlsx contain 111 summary measures of a variety of child behavioral, emotional, cognitive, and physical outcomes drawn from these studies.

library(readxl)
# read in the 111 effect sizes
spank <- read_excel("data/spank.xlsx")
head(spank)
## # A tibble: 6 x 8
##   study                year outcome        between within     d    ll    ul
##   <chr>               <dbl> <chr>            <dbl>  <dbl> <dbl> <dbl> <dbl>
## 1 Bean and Roberts (…  1981 Immediate def…       1      0 -0.74 -1.76  0.28
## 2 Day and Roberts (1…  1983 Immediate def…       1      0  0.36 -1.04  1.77
## 3 Minton, Kagan, and…  1971 Immediate def…       0      1  0.34 -0.09  0.76
## 4 Roberts (1988)       1988 Immediate def…       1      0 -0.08 -1.01  0.84
## 5 Roberts and Powers…  1990 Immediate def…       1      0  0.1  -0.82  1.03
## 6 Burton, Maccoby, a…  1961 Low moral int…       0      1  0.63  0.16  1.1

The effect size of interest in the meta-analysis is the standardized difference in mean outcomes given by \[d=\frac{\mu_{\text{spanked}}-\mu_{\text{not spanked}}}{\sigma_{\text{pooled}}},\] where \[\sigma_{\text{pooled}}=\sqrt{\frac{(n_1-1)\sigma_1^2+(n_2-1)\sigma_2^2}{n_1+n_2-2}}.\]
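As a quick illustration with made-up group summaries (these numbers are not from the spanking data), \(d\) can be computed directly from the group means, SDs, and sample sizes:

# standardized mean difference from (hypothetical) group summaries
cohens_d <- function(m1, m2, s1, s2, n1, n2) {
  s_pooled <- sqrt(((n1 - 1) * s1^2 + (n2 - 1) * s2^2) / (n1 + n2 - 2))
  (m1 - m2) / s_pooled
}

cohens_d(m1 = 0.4, m2 = 0.1, s1 = 1.0, s2 = 1.1, n1 = 50, n2 = 60)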

This effect size is just a mean difference converted to standard deviation units. Most effect sizes will be fairly small – for example, an effect size of 1 would correspond to a 1 SD difference in the outcome between the spanking groups. Let’s peek at the full data in a forest plot.

library(forestplot)
# forest plot of all 111 effect sizes (study labels suppressed)
forestplot(labeltext = rep(NA, length(spank$study)),
           mean = spank$d, lower = spank$ll, upper = spank$ul,
           col = fpColors(lines = "#990000", box = "#660000", zero = "darkblue"))

Note that the data above do not provide us with standard errors for the effect sizes \(d\); however, we can calculate them from the CIs as

\[SE=\frac{\text{upper limit}-\text{lower limit}}{2\times 1.96}.\]

library(tidyverse)
spank <-
  spank %>% 
  # SE from the 95% CI width: (ul - ll) / (2 * 1.96) = (ul - ll) / 3.92
  mutate(se = (ul - ll) / 3.92)
glimpse(spank)
## Observations: 111
## Variables: 9
## $ study   <chr> "Bean and Roberts (1981)", "Day and Roberts (1983)", "Mi…
## $ year    <dbl> 1981, 1983, 1971, 1988, 1990, 1961, 1962, 1990, 2002, 20…
## $ outcome <chr> "Immediate defiance", "Immediate defiance", "Immediate d…
## $ between <dbl> 1, 1, 0, 1, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0,…
## $ within  <dbl> 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 1,…
## $ d       <dbl> -0.74, 0.36, 0.34, -0.08, 0.10, 0.63, 0.19, 0.47, 0.14, …
## $ ll      <dbl> -1.76, -1.04, -0.09, -1.01, -0.82, 0.16, -0.14, 0.20, -0…
## $ ul      <dbl> 0.28, 1.77, 0.76, 0.84, 1.03, 1.10, 0.53, 0.74, 0.70, 0.…
## $ se      <dbl> 0.52040816, 0.71683673, 0.21683673, 0.47193878, 0.471938…

Model for Spanking Data

\[y_i=\theta_i+e_i, \qquad \theta_i=\mu+b_i, \qquad b_i \sim N(0,\tau^2),\] where \(y_i\) is the effect estimate (possibly transformed) from study \(i\), and \(e_i \sim N(0,v_i)\) is the sampling error from study \(i\) (the sampling variance \(v_i\) estimated from each study is assumed known). Let’s put a half-Cauchy(0,1) prior on \(\tau\) and a N(0,1) prior on \(\mu\): a very large summary \(d\) would be rare on the effect size scale. That is probably a safe assumption for spanking, though perhaps not if we were measuring more severe physical abuse.

library(brms)
m.spank <- 
  brm(data = spank, family = gaussian,
      # se(se) supplies the known standard error of each effect size
      d | se(se) ~ 1 + (1 | study),
      prior = c(prior(normal(0, 1), class = Intercept),
                prior(cauchy(0, 1), class = sd)),
      iter = 4000, warmup = 1000, cores = 4, chains = 4,
      seed = 123)

print(m.spank)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: d | se(se) ~ 1 + (1 | study) 
##    Data: spank (Number of observations: 111) 
## Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
##          total post-warmup samples = 12000
## 
## Group-Level Effects: 
## ~study (Number of levels: 76) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.26      0.03     0.21     0.33 1.00     2645     3966
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.38      0.04     0.31     0.45 1.00     1768     3239
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

Our summary estimate of \(d\) is 0.38, with 95% CrI (0.31, 0.45): kids who were spanked had scores that were, on average, 0.38 SD higher than kids who were not spanked. (Note: presumably these studies are not randomized, so this association does not imply causation.)
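The same population-level summary can be extracted programmatically:

# posterior mean, SD, and 95% credible interval for mu
fixef(m.spank)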

What are the outcomes?

spank %>% 
  distinct(outcome) %>% 
  knitr::kable()
outcome
Immediate defiance
Low moral internalization
Child aggression
Child antisocial behavior
Child externalizing behavior problems
Child internalizing behavior problems
Child mental health problems
Child alcohol or substance abuse
Negative parent–child relationship
Impaired cognitive ability
Low self-esteem
Low self-regulation
Victim of physical abuse
Adult antisocial behavior
Adult mental health problems
Adult alcohol or substance abuse
Adult support for physical punishment

These outcomes were coded by authors in the same direction, so that larger values of \(d\) imply more negative outcomes among kids who were spanked.

One interesting aspect of the data is that while we have 111 outcome effect sizes, these come from only 76 separate studies – some studies measured multiple outcomes.

spank %>% 
  distinct(study) %>% 
  count()
## # A tibble: 1 x 1
##       n
##   <int>
## 1    76

We may wish to shrink outcomes of similar types together – so let’s fit a cross-classified random effects model by adding a random effect for outcome type.

m.spank.outcome <- 
  brm(data = spank, family = gaussian,
      # crossed random effects: one for study, one for outcome type
      d | se(se) ~ 1 + (1 | study) + (1 | outcome),
      prior = c(prior(normal(0, 1), class = Intercept),
                prior(cauchy(0, 1), class = sd)),
      iter = 4000, warmup = 1000, cores = 4, chains = 4,
      seed = 123)

print(m.spank.outcome)
##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: d | se(se) ~ 1 + (1 | study) + (1 | outcome) 
##    Data: spank (Number of observations: 111) 
## Samples: 4 chains, each with iter = 4000; warmup = 1000; thin = 1;
##          total post-warmup samples = 12000
## 
## Group-Level Effects: 
## ~outcome (Number of levels: 17) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.08      0.03     0.04     0.14 1.00     2909     4776
## 
## ~study (Number of levels: 76) 
##               Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## sd(Intercept)     0.25      0.03     0.20     0.32 1.00     2461     4797
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Rhat Bulk_ESS Tail_ESS
## Intercept     0.36      0.04     0.27     0.44 1.00     1915     3586
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).

The estimate of the overall effect \(\mu\) is quite similar to our previous one. Looking at the variance components, the study-to-study heterogeneity is larger than the heterogeneity across outcomes. We can explore further in a figure.

# we'll want this to label the plot
label <-
  tibble(tau   = c(.12, .3),
         y     = c(15, 10),
         label = c("sigma['outcome']", "sigma['study']"))

# wrangle
posterior_samples(m.spank.outcome) %>% 
  select(starts_with("sd")) %>% 
  gather(key, tau) %>% 
  mutate(key = str_remove(key, "sd_") %>% str_remove(., "__Intercept")) %>% 
  
  # plot
  ggplot(aes(x = tau)) +
  geom_density(aes(fill = key),
               color = "transparent") +
  geom_text(data = label,
            aes(y = y, label = label, color = label),
            parse = T, size = 5) +
  scale_fill_viridis_d(NULL, option = "B", begin = .5) +
  scale_color_viridis_d(NULL, option = "B", begin = .5) +
  scale_y_continuous(NULL, breaks = NULL) +
  xlab(expression(tau)) +
  theme(panel.grid = element_blank())

What I’m really curious about is whether spanking has similar effects on all the different outcomes – let’s examine those more closely.

library(tidybayes)
m.spank.outcome %>% 
  spread_draws(b_Intercept, r_outcome[outcome,]) %>%
  # add the grand mean to the group-specific deviations
  mutate(mu = b_Intercept + r_outcome) %>%
  ungroup() %>%
  mutate(outcome = str_replace_all(outcome, "[.]", " ")) %>% 

  # plot
  ggplot(aes(x = mu, y = reorder(outcome, mu), fill = reorder(outcome, mu))) +
  geom_vline(xintercept = fixef(m.spank.outcome)[1, 1], color = "grey33", size = 1) +
  geom_vline(xintercept = fixef(m.spank.outcome)[1, 3:4], color = "grey33", linetype = 2) +
  geom_halfeyeh(.width = .95, size = 2/3, color = "white") +
  scale_fill_viridis_d(option = "B", begin = .2) +
  labs(x = expression(italic("Cohen's d")),
       y = NULL) +
  theme(panel.grid   = element_blank(),
        axis.ticks.y = element_blank(),
        axis.text.y  = element_text(hjust = 0))

We see evidence that spanking may be particularly linked with child externalizing behavior problems (though again, this is a chicken-and-egg situation – we cannot infer causation).
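As a rough follow-up, we can compute the posterior probability that this outcome’s effect exceeds the overall mean. The level name below is an assumption based on brms’s convention of replacing spaces in grouping-factor levels with periods.

m.spank.outcome %>% 
  spread_draws(b_Intercept, r_outcome[outcome,]) %>% 
  # r_outcome is the deviation from the grand mean, so P(effect > mean)
  # is simply P(r_outcome > 0)
  filter(outcome == "Child.externalizing.behavior.problems") %>% 
  summarise(p_above_mean = mean(r_outcome > 0))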