This lecture is based on Chapter 4 of Hoff and Chapter 8 of baby Hoff (the 360/601/602 text). Note the “baby Hoff” text contains substantial code that may be useful.
Bayesian estimation is often the approach of choice for fitting hierarchical models. Two major advantages include
tractable estimation and computation, particularly in complex, highly structured, or generalized linear models
straightforward uncertainty quantification
Recall our data model:
\[\begin{eqnarray*} y_{ij}&=&\mu_j+\varepsilon_{ij}=\mu + \alpha_j + \varepsilon_{ij} \end{eqnarray*}\]where \(\mu_j=\mu +\alpha_j\), \(\alpha_j \overset{iid}{\sim} N\left(0,\tau^2\right) \perp \varepsilon_{ij} \overset{iid}{\sim} N\left(0,\sigma^2\right)\)
In addition to the data model we specified previously, we will specify a prior distribution for \((\mu,\tau^2,\sigma^2)\), denoted \(p(\theta)=p(\mu,\tau^2,\sigma^2)\).
We will start with a “default” prior specification given by \[p(\mu,\tau^2,\sigma^2)=p(\mu|\tau^2)p(\tau^2)p(\sigma^2),\] where the three components are specified below.
With this default prior specification, we have nice interpretations of the prior parameters.
For \(p(\mu | \tau^2) = N\left(\mu_0,\frac{\tau^2}{m_0}\right)\), \(\mu_0\) is a prior guess at \(\mu\), and \(m_0\) describes our certainty in this guess (you can think of it as a prior sample size)
For \(p(\tau^2) = IG\left(\frac{\eta_0}{2},\frac{\eta_0\tau_0^2}{2}\right)\), \(\tau_0^2\) is a prior guess at the across-group variance \(\tau^2\), and \(\eta_0\) describes our confidence in this guess
For \(p(\sigma^2)=IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right)\), \(\sigma_0^2\) is a prior guess at the within-group variance \(\sigma^2\), and \(\nu_0\) describes our certainty in this guess
We have now fully specified our model with the following components.
Unknown parameters \((\mu,\tau^2,\sigma^2,\alpha_1,\cdots,\alpha_J)\), or equivalently \((\mu,\tau^2,\sigma^2,\mu_1,\cdots,\mu_J)\), since \(\mu_j=\mu+\alpha_j\)
Prior distributions, specified in terms of prior guesses \((\mu_0,\tau_0^2,\sigma_0^2)\) and certainty/prior sample sizes \((m_0,\eta_0,\nu_0)\)
Don’t forget the data \(y,x\) from our groups!
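To make the setup concrete, here is a small R sketch that simulates data from this hierarchical normal model; the number of groups, sample sizes, parameter values, and object names are arbitrary illustrative choices, not part of the model specification itself.

```r
# Simulate data from y_ij = mu + alpha_j + eps_ij,
# alpha_j ~ N(0, tau^2), eps_ij ~ N(0, sigma^2)  (illustrative values)
set.seed(1)
J     <- 8                                   # number of groups
n_j   <- rep(20, J)                          # group sample sizes
mu    <- 50; tau2 <- 4; sigma2 <- 9          # true parameter values
alpha <- rnorm(J, 0, sqrt(tau2))             # group effects
group <- rep(1:J, times = n_j)               # integer group labels 1, ..., J
y     <- mu + alpha[group] + rnorm(sum(n_j), 0, sqrt(sigma2))
tapply(y, group, mean)                       # sample means vary around mu
```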
We can interrogate the posterior distribution of the parameters using Gibbs sampling, as the full conditional distributions have closed forms. We’ll next examine the full conditional distribution of each of our parameters \((\mu, \tau^2, \sigma^2, \mu_1, \cdots, \mu_J)\) in turn.
You can show that \((\mu \mid \tau^2, \mu_1, \cdots, \mu_J) \sim N(e,v)\), where \(e=\frac{J}{m_0+J}\frac{1}{J}\sum \mu_j + \frac{m_0}{J+m_0}\mu_0\) and \(v=\frac{\tau^2}{J+m_0}\). Here we can think of \(m_0\) as the number of “prior observations” and \(\mu_0\) as the sample mean of those prior observations. In practice, people take \(\mu_0\) to be a reasonable prior guess, and if they’re really unsure they take \(m_0=1\), which conveys roughly the amount of information you would get from a single sampled value from the population (so not very much).
How do you show that?
It helps if you remember how to complete the square: \[p(\mu \mid \tau^2, \mu_1, \cdots, \mu_J) \propto p(\mu \mid \tau^2)\prod_{j=1}^J p(\mu_j \mid \mu, \tau^2) \propto \exp \left\{-\frac{m_0(\mu-\mu_0)^2+\sum_j (\mu_j-\mu)^2}{2\tau^2}\right\} \propto \exp \left\{-\frac{(m_0+J)\mu^2-2\left(m_0\mu_0+\sum_j \mu_j\right)\mu}{2\tau^2}\right\}\]
Ahh, this is the kernel of a normal distribution for \(\mu\)!
We recognize this conditional posterior distribution of \(\{\mu \mid \tau^2, \mu_1, \cdots, \mu_J\}\) as the kernel of a normal distribution with mean \(\frac{m_0\mu_0+\sum \mu_j}{m_0+J}\) and variance \(\frac{\tau^2}{m_0+J}\).
To motivate interpretation of the prior parameters, express the mean as \(\frac{m_0\mu_0+\sum \mu_j}{m_0+J}=\frac{J}{m_0+J}\frac{1}{J}\sum \mu_j + \frac{m_0}{J+m_0}\mu_0\).
We can think of \(m_0\) as a prior sample size and \(\mu_0\) as the mean of those \(m_0\) “prior observations”: the conditional mean is a weighted average of the average of the group means and the prior guess \(\mu_0\), with weights proportional to \(J\) and \(m_0\), respectively.
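As a quick illustration, a single Gibbs update of \(\mu\) from this normal full conditional might be coded as follows (a sketch; the function and argument names are illustrative).

```r
# One Gibbs draw of mu given (mu_1, ..., mu_J) and tau^2
# mu_j_cur: current group means; mu0, m0: prior guess and prior sample size
draw_mu <- function(mu_j_cur, tau2, mu0, m0) {
  J <- length(mu_j_cur)
  e <- (J / (m0 + J)) * mean(mu_j_cur) + (m0 / (m0 + J)) * mu0  # conditional mean
  v <- tau2 / (m0 + J)                                          # conditional variance
  rnorm(1, e, sqrt(v))
}
```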
Using similar algebra, we can show the full conditional distribution of \(\mu_j\) is given by
\[\begin{eqnarray*} p(\mu_j \mid \mu, \tau^2,\sigma^2, Y) &\propto& p(\mu, \tau^2,\sigma^2, Y, \mu_1, \cdots, \mu_J) \\ &\propto & p(\mu \mid \tau^2) p(\tau^2)p(\sigma^2)\prod_{j=1}^J \left[p(\mu_j \mid \mu, \tau^2)\prod_{i=1}^{n_j} p(y_{ij} \mid \mu_j, \sigma^2)\right] \\ &\propto& p(\mu_j \mid \mu, \tau^2) \prod_{i=1}^{n_j} p(y_{ij} \mid \mu_j, \sigma^2) ~~~~ \propto ~~~~ \exp \left\{-\frac{(\mu_j-\mu)^2}{2\tau^2}\right\} \exp \left\{-\frac{\sum_i (y_{ij}-\mu_j)^2}{2\sigma^2}\right\} \end{eqnarray*}\]For homework you will show this is a normal distribution with variance \(v_j=\frac{1}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}}\) and mean \(v_j\left[\frac{n_j}{\sigma^2}\overline{y}_j+\frac{1}{\tau^2}\mu\right]\).
The full conditional posterior distribution of \(\mu_j\) has a nice form, and we can see that the conditional mean is a compromise between the sample mean (weighted by the data precision) and the overall mean (weighted by the prior across-group precision), while the conditional precision (inverse variance) can be thought of as the sum of data precision and prior across-group precision.
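In code, updating all of the group means in one step might look like the following sketch (vectorized over groups; names are illustrative), using the mean and variance stated above.

```r
# One Gibbs draw of each mu_j given mu, tau^2, sigma^2, and the data
# ybar_j: group sample means; n_j: group sample sizes
draw_mu_j <- function(ybar_j, n_j, mu, tau2, sigma2) {
  v_j <- 1 / (n_j / sigma2 + 1 / tau2)              # conditional variances
  e_j <- v_j * (n_j * ybar_j / sigma2 + mu / tau2)  # precision-weighted means
  rnorm(length(ybar_j), e_j, sqrt(v_j))
}
```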
The inverse gamma (IG) distribution is a popular choice for modeling variance components given its support on the positive real numbers. If the random variable \(\frac{1}{X}\) has a gamma distribution, we say \(X\) has an inverse gamma distribution, given by \[p(x \mid \alpha, \beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{-\alpha-1}e^{-\frac{\beta}{x}} \text{ for } x>0.\] This distribution has mean \(\frac{\beta}{\alpha-1}\) (for \(\alpha>1\)), mode \(\frac{\beta}{\alpha+1}\), and variance \(\frac{\beta^2}{(\alpha-1)^2(\alpha-2)}\) (for \(\alpha>2\)).
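R has no built-in inverse gamma functions, but since \(X \sim IG(\alpha,\beta)\) exactly when \(1/X\) is gamma with shape \(\alpha\) and rate \(\beta\), we can sample by inverting gamma draws. The quick check below (with arbitrary illustrative \(\alpha,\beta\)) compares Monte Carlo summaries to the formulas above.

```r
# Sample from IG(alpha, beta) by inverting gamma(shape = alpha, rate = beta) draws
set.seed(1)
alpha <- 4; beta <- 6                             # arbitrary illustrative values
x <- 1 / rgamma(1e5, shape = alpha, rate = beta)
mean(x); beta / (alpha - 1)                       # Monte Carlo vs. exact mean
var(x);  beta^2 / ((alpha - 1)^2 * (alpha - 2))   # Monte Carlo vs. exact variance
```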
Using an IG\(\left(\frac{\eta_0}{2},\frac{\eta_0\tau_0^2}{2}\right)\) distribution for \(\tau^2\), we can now see that \(\tau_0^2\) is somewhere in the “center” of the distribution (between the mode \(\frac{\eta_0\tau_0^2}{\eta_0+2}\) and the mean \(\frac{\eta_0\tau_0^2}{\eta_0-2}\)). As the “prior sample size” \(\eta_0\) increases, the difference between these quantities goes to 0.
The full conditional posterior for \(\tau^2\) in our hierarchical normal model is given by
\[\begin{eqnarray*} p(\tau^2 \mid \mu, \sigma^2, Y, \mu_1, \cdots, \mu_J) &\propto& p(\mu, \tau^2,\sigma^2, Y, \mu_1, \cdots, \mu_J) \\ &\propto& p(\tau^2) p(\mu \mid \tau^2) \prod_{j=1}^J p(\mu_j \mid \mu, \tau^2) \\ &\propto& \left(\tau^2\right)^{-\frac{\eta_0}{2}-1}e^{-\frac{\eta_0\tau_0^2}{2\tau^2}} \left(\tau^2\right)^{-\frac{1}{2}}e^{-\frac{m_0(\mu-\mu_0)^2}{2\tau^2}}\prod_{j=1}^J \left(\tau^2\right)^{-\frac{1}{2}}e^{-\frac{(\mu_j-\mu)^2}{2\tau^2}} \\ &\propto& (\tau^2)^{-\frac{\eta_0+J+1}{2}-1}e^{-\frac{\eta_0\tau_0^2+m_0(\mu-\mu_0)^2+\sum (\mu_j-\mu)^2}{2\tau^2}} \end{eqnarray*}\]so that the full conditional posterior is also inverse gamma, namely \(IG\left(\frac{\eta_0+J+1}{2},\frac{\eta_0\tau_0^2+m_0(\mu-\mu_0)^2+\sum (\mu_j-\mu)^2}{2}\right)\).
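A corresponding Gibbs update for \(\tau^2\), again sampling the inverse gamma by inverting a gamma draw (a sketch; names are illustrative):

```r
# One Gibbs draw of tau^2 given mu and (mu_1, ..., mu_J)
draw_tau2 <- function(mu_j_cur, mu, mu0, m0, eta0, tau02) {
  J     <- length(mu_j_cur)
  shape <- (eta0 + J + 1) / 2
  rate  <- (eta0 * tau02 + m0 * (mu - mu0)^2 + sum((mu_j_cur - mu)^2)) / 2
  1 / rgamma(1, shape = shape, rate = rate)
}
```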
You can also show the full conditional posterior of \(\sigma^2\) is \(IG\left(\frac{\nu_0+\sum_j n_j}{2},\frac{\nu_0\sigma_0^2+\sum_j \sum_i (y_{ij}-\mu_j)^2}{2}\right)\) under our prior specification \(\sigma^2 \sim IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right)\) (homework). Again, \(\sigma_0^2\) can be interpreted as a prior guess at \(\sigma^2\), with \(\nu_0\) the associated prior sample size (level of confidence).
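And similarly for \(\sigma^2\), using the full data vector and integer group labels (a sketch; names are illustrative):

```r
# One Gibbs draw of sigma^2 given (mu_1, ..., mu_J) and the data
# y: response vector; group: integer group labels 1, ..., J
draw_sigma2 <- function(y, group, mu_j_cur, nu0, sigma02) {
  ss    <- sum((y - mu_j_cur[group])^2)   # sum_j sum_i (y_ij - mu_j)^2
  shape <- (nu0 + length(y)) / 2
  rate  <- (nu0 * sigma02 + ss) / 2
  1 / rgamma(1, shape = shape, rate = rate)
}
```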
Now that we have the full conditional posterior distributions, we can use Gibbs sampling to make inference.
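For concreteness, here is a minimal sketch of such a Gibbs sampler, combining the updates above. It assumes the helper functions draw_mu_j, draw_mu, draw_tau2, and draw_sigma2 sketched earlier in this section; the hyperparameter defaults, starting values, and object names are illustrative choices, and the version we write in lab may differ.

```r
# Gibbs sampler for the hierarchical normal model (sketch)
# y: response vector; group: integer group labels 1, ..., J
gibbs_hnm <- function(y, group, S = 5000,
                      mu0 = mean(y), m0 = 1,
                      tau02 = var(y), eta0 = 1,
                      sigma02 = var(y), nu0 = 1) {
  J      <- length(unique(group))
  n_j    <- as.numeric(table(group))
  ybar_j <- as.numeric(tapply(y, group, mean))

  # starting values based on sample quantities
  mu_j <- ybar_j; mu <- mean(ybar_j); tau2 <- var(ybar_j); sigma2 <- var(y)

  out <- matrix(NA, S, 3 + J,
                dimnames = list(NULL, c("mu", "tau2", "sigma2", paste0("mu_", 1:J))))
  for (s in 1:S) {
    mu_j   <- draw_mu_j(ybar_j, n_j, mu, tau2, sigma2)   # group means
    mu     <- draw_mu(mu_j, tau2, mu0, m0)               # overall mean
    tau2   <- draw_tau2(mu_j, mu, mu0, m0, eta0, tau02)  # across-group variance
    sigma2 <- draw_sigma2(y, group, mu_j, nu0, sigma02)  # within-group variance
    out[s, ] <- c(mu, tau2, sigma2, mu_j)
  }
  out
}

# Example usage: posterior medians and 95% credible intervals
# fit <- gibbs_hnm(y, group)
# apply(fit, 2, quantile, c(0.025, 0.5, 0.975))
```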
While many people feel that shrinkage can “do no harm,” it can be quite detrimental when the shrinkage target is not correctly specified.
How might we specify a model to avoid these problems?
One option is to let the random effects distribution depend on additional information \(z\) (e.g., group-level covariates), as in \(\alpha_j \sim N(\mu_j(z),\tau^2_j(z))\)
Another potential challenge is that the variance of the response may not be the same for each group. This could be due to a variety of factors. One potential remedy for this issue is to allow the error variance to differ across groups, with
\(\sigma^2_j \overset{iid}{\sim} IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right), \quad j=1,\ldots,J\)
If we fix \(\nu_0\) and \(\sigma_0^2\) in advance, then \(p(\sigma_J^2 \mid \sigma_1^2, \ldots, \sigma_{J-1}^2)=p(\sigma_J^2)\). This means that information about \(\sigma_1^2, \ldots, \sigma_{J-1}^2\) is not used in the estimation of \(\sigma_J^2\), which is inefficient, especially when some groups \(j\) have low sample sizes. Ideally, we would borrow some information across groups in estimating the group-specific variances.
A remedy is to use the (conditionally conjugate) hyperprior \(\sigma^2_0 \sim \text{gamma}(a,b)\). This simple prior enables hierarchical modeling of our variances.
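To see why this is conditionally conjugate (a sketch; this step is not spelled out in the notes above): viewed as a function of \(\sigma_0^2\), the \(IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right)\) densities of \(\sigma_1^2,\ldots,\sigma_J^2\) contribute a term proportional to \((\sigma_0^2)^{J\nu_0/2}\exp\left\{-\sigma_0^2\,\frac{\nu_0}{2}\sum_j \sigma_j^{-2}\right\}\), so combining with a gamma\((a,b)\) prior (shape–rate parameterization) gives a gamma\(\left(a+\frac{J\nu_0}{2},\, b+\frac{\nu_0}{2}\sum_j \sigma_j^{-2}\right)\) full conditional, which is easy to sample:

```r
# One Gibbs draw of sigma_0^2 given nu_0 and the group variances sigma_j^2
# a, b: gamma hyperparameters (shape and rate)
draw_sigma02 <- function(sigma2_j, nu0, a, b) {
  J <- length(sigma2_j)
  rgamma(1, shape = a + J * nu0 / 2, rate = b + (nu0 / 2) * sum(1 / sigma2_j))
}
```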
There is no analogous conjugate choice for \(p(\nu_0)\); however, sampling is not too bad (see “baby Hoff”) if we restrict \(\nu_0\) to integer values with the geometric prior \(p(\nu_0) \propto e^{-\lambda \nu_0}\): we compute the unnormalized full conditional over a large grid of integer values and then sample \(\nu_0\) from the resulting discrete distribution.
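A sketch of that discrete update (names and the grid size are illustrative; we work on the log scale for numerical stability before exponentiating):

```r
# One Gibbs draw of nu_0 over an integer grid, with prior p(nu_0) ∝ exp(-lambda * nu_0)
# sigma2_j: current group variances; sigma02: current value of sigma_0^2
draw_nu0 <- function(sigma2_j, sigma02, lambda, nu_max = 5000) {
  nu_grid <- 1:nu_max
  J <- length(sigma2_j)
  # log unnormalized full conditional: geometric prior + sum of log IG densities
  lp <- -lambda * nu_grid +
        nu_grid / 2 * J * log(nu_grid * sigma02 / 2) -
        J * lgamma(nu_grid / 2) -
        (nu_grid / 2 + 1) * sum(log(sigma2_j)) -
        nu_grid * sigma02 / 2 * sum(1 / sigma2_j)
  sample(nu_grid, 1, prob = exp(lp - max(lp)))
}
```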
In the next two labs, we’ll write by hand an R function that uses Gibbs sampling (and modifications thereof) for inference in the model specified in this lecture and apply it to data on health care.