Priors for Group-Level Variance

Reading

This set of notes is based on Gelman (2006), “Prior distributions for variance parameters in hierarchical models,” Bayesian Analysis, 1(3).

Example: Exercise Study I

As part of a study of peer influences on health, a researcher recruited a large number of mutually exclusive friend/peer groups of university students. A physical activity questionnaire was used to obtain the number of minutes per week each student participated in moderate to vigorous physical activity.

Write the equation for the data model for a random effects ANOVA model for exercise minutes/week across groups. Include a specification of prior distributions for all parameters of interest.

Example: Exercise Study II

In a different study, children at a local elementary school participated in a school-wide initiative to increase physical activity. At the beginning of the study, the child brought home a physical activity questionnaire for each family member (including the child), and this questionnaire was used to obtain the number of minutes/week each individual participated in moderate to vigorous physical activity. Over the next month, children were given different exercises or activities to bring home to the family, and the questionnaires were re-administered at the end of the month.

The response of interest is the change in physical activity for each individual from baseline to follow-up. Write the equation for the data model for a random effects ANOVA model for the response of interest.
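
One possible sketch (the exercise leaves the details to the reader; here the family plays the role of the group): letting \(d_{ij}\) denote the change in exercise minutes/week from baseline to follow-up for member \(i\) of family \(j\),

\[\begin{eqnarray*} d_{ij}&=&\mu+\alpha_j+\varepsilon_{ij}, \qquad \alpha_j \overset{iid}{\sim} N\left(0,\tau^2\right) \perp \varepsilon_{ij} \overset{iid}{\sim} N\left(0,\sigma^2\right). \end{eqnarray*}\]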

Hierarchical Normal Model

Recall our data model:

\[\begin{eqnarray*} y_{ij}&=&\mu_j+\varepsilon_{ij}=\mu + \alpha_j + \varepsilon_{ij} \end{eqnarray*}\]

where \(\mu_j=\mu +\alpha_j\), and \(\alpha_j \overset{iid}{\sim} N\left(0,\tau^2\right) \perp \varepsilon_{ij} \overset{iid}{\sim} N\left(0,\sigma^2\right)\), with \(\perp\) denoting independence.

We now need to specify a prior distribution for \((\mu,\tau^2,\sigma^2)\), denoted \(p(\theta)=p(\mu,\tau^2,\sigma^2)\).
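
To make the data model concrete, here is a minimal R sketch that simulates from it (the number of groups, group sizes, and parameter values below are assumptions for illustration only):

set.seed(123)
J <- 8; n_j <- 20                 # assumed: 8 groups of 20 observations each
mu <- 30; tau <- 5; sigma <- 10   # assumed "true" values
alpha <- rnorm(J, 0, tau)         # group-level effects alpha_j
group <- rep(1:J, each = n_j)
y <- mu + alpha[group] + rnorm(J * n_j, 0, sigma)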

Bayesian Specification of the Model

We start with a “default” prior specification given by \[p(\mu,\tau^2,\sigma^2)=p(\mu)p(\tau^2)p(\sigma^2),\] where

  • \(p(\mu)\) is \(N\left(\mu_0,\gamma_0^2\right)\)
  • \(p(\tau^2)\) is inverse-gamma\(\left(\frac{\eta_0}{2},\frac{\eta_0\tau_0^2}{2}\right)\)
  • \(p(\sigma^2)\) is inverse-gamma\(\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right)\)
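
Drawing from this default specification is straightforward; here is a minimal sketch (the hyperparameter values are illustrative assumptions), using the fact that if \(1/X\) is gamma then \(X\) is inverse-gamma:

set.seed(1)
mu0 <- 0; gamma0 <- 10    # assumed hyperparameters for mu
eta0 <- 1; tau0 <- 1      # assumed hyperparameters for tau^2
nu0 <- 1; sigma0 <- 1     # assumed hyperparameters for sigma^2
mu     <- rnorm(1e4, mu0, gamma0)
tau2   <- 1 / rgamma(1e4, eta0/2, eta0 * tau0^2 / 2)   # inverse-gamma draws
sigma2 <- 1 / rgamma(1e4, nu0/2, nu0 * sigma0^2 / 2)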

Relatively noninformative priors

It is often of interest to specify relatively noninformative priors for parameters. We generally have enough data that any reasonable prior distribution for \(\mu\) and \(\sigma\) yields essentially the same inferences; for example, \(p(\mu, \sigma) \propto 1\).

Picking a relatively flat prior for \(\tau^2\) is trickier, because the data often contain little information about this parameter – e.g., we may have relatively few groups, as in our earlier bike data example.

This parameter is also problematic in frequentist models – in particular, the classical unbiased (method-of-moments) estimator of \(\tau^2\) can take negative values.
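
A quick simulation sketch of this (group counts and sizes are assumed values): in a balanced one-way ANOVA the method-of-moments estimator is \(\hat{\tau}^2 = (MSB - MSW)/n\), which can come out negative when the true \(\tau^2\) is near zero.

set.seed(7)
J <- 5; n_j <- 10                    # assumed: 5 groups of 10
y <- rnorm(J * n_j)                  # data generated with tau^2 = 0
g <- factor(rep(1:J, each = n_j))
ms <- anova(lm(y ~ g))[["Mean Sq"]]  # between- and within-group mean squares
(tau2_hat <- (ms[1] - ms[2]) / n_j)  # often negative in this setting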

High Variance Priors for \(\tau\)

One basic idea is to start from a proper prior density and inflate its variance until the density is essentially flat over the plausible range of the parameter.

Two commonly considered priors for \(\tau\) include the uniform(0,A) as \(A \rightarrow \infty\) and inverse-gamma(\(\varepsilon,\varepsilon\)) as \(\varepsilon \rightarrow 0\).
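
The behavior of the inverse-gamma(\(\varepsilon,\varepsilon\)) prior as \(\varepsilon\) shrinks can be plotted directly; a sketch (the density is obtained from the gamma density by change of variables):

curve(dgamma(1/x, 1, 1) / x^2, 0.01, 5, ylim = c(0, 2),
      xlab = expression(tau^2), ylab = "density", lty = 1)  # eps = 1
curve(dgamma(1/x, 0.1, 0.1) / x^2, add = TRUE, lty = 2)     # eps = 0.1
curve(dgamma(1/x, 0.01, 0.01) / x^2, add = TRUE, lty = 3)   # eps = 0.01
legend("topright", c("eps = 1", "eps = 0.1", "eps = 0.01"), lty = 1:3)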

Limits of Prior Distributions

Gelman shows

  • the uniform(0,A) prior on \(\tau\) yields a limiting proper posterior density as \(A \rightarrow \infty\) as long as we have at least 3 groups. The implication is that if we pick A big enough, our resulting inferences are not sensitive to the choice of A.

  • the inverse-gamma(\(\varepsilon,\varepsilon\)) for \(\tau^2\) does not have a proper limiting posterior, so that posterior inferences are sensitive to the value of \(\varepsilon\) chosen, particularly when small values of \(\tau^2\) are reasonable. Unfortunately, these ‘small \(\varepsilon\)’ priors became quite popular due to being used in many illustrative examples in the WinBUGS manuals, though they are no longer recommended.

Parameter Expansion

Gelman (2006) proposes a parameter expansion that facilitates a family of weakly-informative prior distributions.

\[\begin{eqnarray*} y_{ij} &\sim& N(\mu + \xi \eta_j, \sigma^2) \\ \eta_j &\sim& N(0,\sigma_\eta^2) \end{eqnarray*}\]

The parameters \(\alpha_j\) of the original model correspond to \(\xi \eta_j\) here, and \(\tau\) corresponds to \(|\xi|\sigma_\eta\).

For simplicity, consider independent priors on \(\xi\) and \(\sigma_\eta\). A conditionally-conjugate choice for \(\xi\) is normal – and because the scale of \(\xi\) cannot be identified separately from that of the \(\eta_j\), a N(0,1) prior is convenient.

A conditionally-conjugate prior family for \(\sigma_\eta^2\) is inverse-gamma; combined with the N(0,1) prior on \(\xi\), this implies a half-t prior for \(\tau\) – the distribution of the absolute value of a Student-t random variable centered at zero.
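
We can check this claim by simulation; a sketch under one parameterization that yields it (values of \(\nu\) and \(A\) assumed for illustration): if \(\xi \sim N(0,1)\) and \(\sigma_\eta^2 \sim\) inverse-gamma\((\nu/2, \nu A^2/2)\), then \(\tau = |\xi|\sigma_\eta\) should match \(A|t_\nu|\).

set.seed(42)
nu <- 3; A <- 2; n <- 1e5                             # assumed values
xi        <- rnorm(n)                                 # xi ~ N(0, 1)
sigma_eta <- sqrt(1 / rgamma(n, nu/2, nu * A^2 / 2))  # sigma_eta^2 ~ IG(nu/2, nu*A^2/2)
tau       <- abs(xi) * sigma_eta
qqplot(tau, A * abs(rt(n, df = nu)))                  # points should follow y = x
abline(0, 1)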

Half-t

The half-t prior has a scale parameter \(A\) and degrees of freedom \(\nu\) and can be written \[p(\tau)\propto \left(1+\frac{1}{\nu}\left(\frac{\tau}{A}\right)^2\right)^{-(\nu+1)/2}.\] Gelman proposes the half-Cauchy (the special case \(\nu=1\), with a large scale parameter \(A\)) as a weakly-informative choice that is desirable at times, particularly when the number of groups is small.
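
To compare members of this family visually, here is a small sketch that folds the Student-t density onto \([0,\infty)\) (the helper half_t below is our own, not from a package; \(A=1\) is assumed):

half_t <- function(tau, A = 1, nu = 1) 2 * dt(tau / A, df = nu) / A
curve(half_t(x, nu = 1), 0, 10, xlab = expression(tau), ylab = "density")  # half-Cauchy
curve(half_t(x, nu = 3), add = TRUE, lty = 2)
curve(half_t(x, nu = 30), add = TRUE, lty = 3)
legend("topright", c("nu = 1 (half-Cauchy)", "nu = 3", "nu = 30"), lty = 1:3)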

Example: Hospital Income

We consider data from the Centers for Medicare and Medicaid Services on hospital costs and profit for the 2014 fiscal year. Our interest is in examining variability in hospital net income across states.

load("data/hc2014.RData")
library(rstan)
rstan_options(auto_write = TRUE)
options(mc.cores = parallel::detectCores())

We’ll fit the model \[y_{ij}=\mu+\alpha_j+\varepsilon_{ij},\] where \(y_{ij}\) represents the net income of hospital \(i\) in state \(j\), and \(\alpha_j\) is a random effect for state.

library(brms)
m1 <- brm(netincome ~ (1 | state), data = hc2014,
          control = list(adapt_delta = .99),  # smaller HMC steps to avoid divergences
          prior = c(
            prior(normal(0, 10), class = Intercept),
            prior(student_t(3, 0, 1), class = sd),    # brms truncates at 0: half-t(3)
            prior(student_t(3, 0, 1), class = sigma)  # likewise half-t(3)
          ))

##  Family: gaussian 
##   Links: mu = identity; sigma = identity 
## Formula: netincome ~ (1 | state) 
##    Data: hc2014 (Number of observations: 6170) 
## Samples: 4 chains, each with iter = 2000; warmup = 1000; thin = 1;
##          total post-warmup samples = 4000
## 
## Group-Level Effects: 
## ~state (Number of levels: 55) 
##               Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## sd(Intercept)     1.10      1.25     0.04     4.34       4914 1.00
## 
## Population-Level Effects: 
##           Estimate Est.Error l-95% CI u-95% CI Eff.Sample Rhat
## Intercept    -0.04      9.83   -19.11    19.66       8726 1.00
## 
## Family Specific Parameters: 
##          Estimate Est.Error    l-95% CI    u-95% CI Eff.Sample Rhat
## sigma 94827457.64 851471.41 93179530.13 96515781.88       8548 1.00
## 
## Samples were drawn using sampling(NUTS). For each parameter, Eff.Sample 
## is a crude measure of effective sample size, and Rhat is the potential 
## scale reduction factor on split chains (at convergence, Rhat = 1).
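
Note that the half-t priors above have scale 1 even though net income is measured in dollars, so confirming exactly which priors were applied is a useful sanity check; brms reports them with:

prior_summary(m1)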

We can plot the state-specific means with 95% credible intervals:

library(tidyverse)
library(tidybayes)
m1 %>%
  spread_draws(b_Intercept, r_state[state,]) %>%
  median_qi(state_mean = b_Intercept + r_state) %>%
  ggplot(aes(y = state, x = state_mean, xmin = .lower, xmax = .upper)) +
  geom_pointintervalh()

What is the posterior probability that mean net hospital income is higher in North Carolina (NC) than in Minnesota (MN)? Than in Georgia (GA)?

mat1=m1 %>%
  spread_draws(r_state[state,]) %>%
  compare_levels(r_state, by = state)
mat2=filter(mat1,state == "NC - MN")
mean(mat2$r_state>0)
## [1] 0.98475
mat2 <- filter(mat1, state == "NC - GA")
mean(mat2$r_state > 0)                 # P(alpha_NC > alpha_GA | data)
## [1] 0.7745

Getting a feel for prior options!

Using the method of your choice, simulate draws from inverse-gamma and half-t distributions. Think about centering these priors near \(\tau^2 \approx 1\).

Note: several contributed R packages provide relevant functions; search for invgamma, HalfT, halfnormal, HalfCauchy, and dist.HalfCauchy.

Code (courtesy of Graham, Yiwei, and Zixi)

library(tidyverse)
n <- 10000

samples <- data.frame(halft3 = abs(rt(n=n,df=3)),
                      halfnorm = abs(rnorm(n,0,1)),
                      normsq = rnorm(n,0,1)^2,
                      invgam = 1/rgamma(n,3,3)) 

samples %>% 
  reshape2::melt() %>% 
  ggplot(mapping = aes(x=value,fill = variable)) + xlim(0,10) + 
  geom_density(alpha = .5)

samples %>% 
  reshape2::melt() %>% 
  ggplot(mapping = aes(y=value,fill = variable)) + ylim(0,10) + 
  geom_boxplot()

samples %>% 
  reshape2::melt() %>% 
  group_by(variable) %>% 
  summarise(mean= mean(value),
            sd = sd(value),
            q75 = quantile(value,.75),
            q25 = quantile(value,.25))

## # A tibble: 4 x 5
##   variable  mean    sd   q75    q25
##   <fct>    <dbl> <dbl> <dbl>  <dbl>
## 1 halft3   1.12  1.27   1.45 0.359 
## 2 halfnorm 0.799 0.615  1.16 0.310 
## 3 normsq   0.997 1.41   1.32 0.0955
## 4 invgam   1.48  1.52   1.71 0.756