This lecture is based on Chapter 4 of Hoff and Chapter 8 of baby Hoff (the 360/601/602 text). Note the “baby Hoff” text contains substantial code that may be useful.
Bayesian estimation is often the approach of choice for fitting hierarchical models. Two major advantages include
tractable estimation and computation, particularly in complex, highly structured, or generalized linear models
straightforward uncertainty quantification
Recall our data model:
\[\begin{eqnarray*} y_{ij}&=&\mu_j+\varepsilon_{ij}=\mu + \alpha_j + \varepsilon_{ij} \end{eqnarray*}\]where \(\mu_j=\mu +\alpha_j\), \(\alpha_j \overset{iid}{\sim} N\left(0,\tau^2\right) \perp \varepsilon_{ij} \overset{iid}{\sim} N\left(0,\sigma^2\right)\)
In addition to the data model we specified previously, we will specify a prior distribution for \((\mu,\tau^2,\sigma^2)\), denoted \(p(\theta)=p(\mu,\tau^2,\sigma^2)\).
We will start with a “default” prior specification given by \[p(\mu,\tau^2,\sigma^2)=p(\mu|\tau^2)p(\tau^2)p(\sigma^2),\] where the three components are specified below.
With this default prior specification, we have nice interpretations of the prior parameters.
For \(p(\mu | \tau^2) = N\left(\mu_0,\frac{\tau^2}{m_0}\right)\), \(\mu_0\) is a prior guess at \(\mu\), and \(m_0\) describes our certainty in this guess (you can think of it as a prior sample size)
For \(p(\tau^2) = IG\left(\frac{\eta_0}{2},\frac{\eta_0\tau_0^2}{2}\right)\), \(\tau_0^2\) is a prior guess at the across-group variance \(\tau^2\), and \(\eta_0\) describes our confidence in this guess
For \(p(\sigma^2)=IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right)\), \(\sigma_0^2\) is a prior guess at the within-group variance \(\sigma^2\), and \(\nu_0\) describes our certainty in this guess
We have now fully specified our model with the following components.
Unknown parameters \((\mu,\tau^2,\sigma^2,\alpha_1,\cdots,\alpha_J)\), or equivalently \((\mu,\tau^2,\sigma^2,\mu_1,\cdots,\mu_J)\), since \(\mu_j=\mu+\alpha_j\)
Prior distributions, specified in terms of prior guesses \((\mu_0,\tau_0^2,\sigma_0^2)\) and certainty/prior sample sizes \((m_0,\eta_0,\nu_0)\)
Don’t forget the data \(y,x\) from our groups!
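To make the setup concrete, here is a small R sketch that simulates data from this hierarchical normal model; the number of groups, sample sizes, parameter values, and object names are arbitrary illustrative choices, not part of the model specification itself.

```r
# Simulate data from y_ij = mu + alpha_j + eps_ij,
# alpha_j ~ N(0, tau^2), eps_ij ~ N(0, sigma^2)  (illustrative values)
set.seed(1)
J     <- 8                                   # number of groups
n_j   <- rep(20, J)                          # group sample sizes
mu    <- 50; tau2 <- 4; sigma2 <- 9          # true parameter values
alpha <- rnorm(J, 0, sqrt(tau2))             # group effects
group <- rep(1:J, times = n_j)               # integer group labels 1, ..., J
y     <- mu + alpha[group] + rnorm(sum(n_j), 0, sqrt(sigma2))
tapply(y, group, mean)                       # sample means vary around mu
```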
We can interrogate the posterior distribution of the parameters using Gibbs sampling, as the full conditional distributions have closed forms. We’ll next examine the full conditional distribution of each of our parameters \((\mu, \tau^2, \sigma^2, \mu_1, \cdots, \mu_J)\) in turn.
You can show that \((\mu \mid \tau^2, \mu_1, \cdots, \mu_J) \sim N(e,v)\), where \(e=\frac{J}{m_0+J}\frac{1}{J}\sum \mu_j + \frac{m_0}{J+m_0}\mu_0\) and \(v=\frac{\tau^2}{J+m_0}\). Here we can think of \(m_0\) as the number of “prior observations” and \(\mu_0\) as the sample mean of those prior observations. In practice, people take \(\mu_0\) to be a reasonable prior guess, and if they’re really unsure they take \(m_0=1\), which conveys roughly the amount of information you would get from a single sampled value from the population (so not very much).
How do you show that?
It helps if you remember how to complete the square: \[p(\mu \mid \tau^2, \mu_1, \cdots, \mu_J) \propto p(\mu \mid \tau^2)\prod_{j=1}^J p(\mu_j \mid \mu, \tau^2) \propto \exp \left\{-\frac{m_0(\mu-\mu_0)^2+\sum_j (\mu_j-\mu)^2}{2\tau^2}\right\} \propto \exp \left\{-\frac{(m_0+J)\mu^2-2\left(m_0\mu_0+\sum_j \mu_j\right)\mu}{2\tau^2}\right\}\]
Ahh, this is the kernel of a normal distribution for \(\mu\)!
We recognize this conditional posterior distribution of \(\{\mu \mid \tau^2, \mu_1, \cdots, \mu_J\}\) as the kernel of a normal distribution with mean \(\frac{m_0\mu_0+\sum \mu_j}{m_0+J}\) and variance \(\frac{\tau^2}{m_0+J}\).
To motivate interpretation of the prior parameters, express the mean as \(\frac{m_0\mu_0+\sum \mu_j}{m_0+J}=\frac{J}{m_0+J}\frac{1}{J}\sum \mu_j + \frac{m_0}{J+m_0}\mu_0\).
We can think of \(m_0\) as a prior sample size and \(\mu_0\) as the mean of those \(m_0\) “prior observations”: the conditional mean is a weighted average of the average of the group means and the prior guess \(\mu_0\), with weights proportional to \(J\) and \(m_0\), respectively.
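As a quick illustration, a single Gibbs update of \(\mu\) from this normal full conditional might be coded as follows (a sketch; the function and argument names are illustrative).

```r
# One Gibbs draw of mu given (mu_1, ..., mu_J) and tau^2
# mu_j_cur: current group means; mu0, m0: prior guess and prior sample size
draw_mu <- function(mu_j_cur, tau2, mu0, m0) {
  J <- length(mu_j_cur)
  e <- (J / (m0 + J)) * mean(mu_j_cur) + (m0 / (m0 + J)) * mu0  # conditional mean
  v <- tau2 / (m0 + J)                                          # conditional variance
  rnorm(1, e, sqrt(v))
}
```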
Using similar algebra, we can show the full conditional distribution of \(\mu_j\) is given by
\[\begin{eqnarray*} p(\mu_j \mid \mu, \tau^2,\sigma^2, Y) &\propto& p(\mu, \tau^2,\sigma^2, Y, \mu_1, \cdots, \mu_J) \\ &\propto & p(\mu \mid \tau^2) p(\tau^2)p(\sigma^2)\prod_{j=1}^J \left[p(\mu_j \mid \mu, \tau^2)\prod_{i=1}^{n_j} p(y_{ij} \mid \mu_j, \sigma^2)\right] \\ &\propto& p(\mu_j \mid \mu, \tau^2) \prod_{i=1}^{n_j} p(y_{ij} \mid \mu_j, \sigma^2) ~~~~ \propto ~~~~ \exp \left\{-\frac{(\mu_j-\mu)^2}{2\tau^2}\right\} \exp \left\{-\frac{\sum_i (y_{ij}-\mu_j)^2}{2\sigma^2}\right\} \end{eqnarray*}\]For homework you will show this is a normal distribution with variance \(v_j=\frac{1}{\frac{n_j}{\sigma^2}+\frac{1}{\tau^2}}\) and mean \(v_j\left[\frac{n_j}{\sigma^2}\overline{y}_j+\frac{1}{\tau^2}\mu\right]\).
The full conditional posterior distribution of \(\mu_j\) has a nice form, and we can see that the conditional mean is a compromise between the sample mean (weighted by the data precision) and the overall mean (weighted by the prior across-group precision), while the conditional precision (inverse variance) can be thought of as the sum of data precision and prior across-group precision.
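In code, updating all of the group means in one step might look like the following sketch (vectorized over groups; names are illustrative), using the mean and variance stated above.

```r
# One Gibbs draw of each mu_j given mu, tau^2, sigma^2, and the data
# ybar_j: group sample means; n_j: group sample sizes
draw_mu_j <- function(ybar_j, n_j, mu, tau2, sigma2) {
  v_j <- 1 / (n_j / sigma2 + 1 / tau2)              # conditional variances
  e_j <- v_j * (n_j * ybar_j / sigma2 + mu / tau2)  # precision-weighted means
  rnorm(length(ybar_j), e_j, sqrt(v_j))
}
```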
The inverse gamma (IG) distribution is a popular choice for modeling variance components given its support on the positive real numbers. If the random variable \(\frac{1}{X}\) has a gamma distribution, we say \(X\) has an inverse gamma distribution, given by \[p(x \mid \alpha, \beta)=\frac{\beta^\alpha}{\Gamma(\alpha)}x^{-\alpha-1}e^{-\frac{\beta}{x}} \text{ for } x>0.\] This distribution has mean \(\frac{\beta}{\alpha-1}\) (for \(\alpha>1\)), mode \(\frac{\beta}{\alpha+1}\), and variance \(\frac{\beta^2}{(\alpha-1)^2(\alpha-2)}\) (for \(\alpha>2\)).
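R has no built-in inverse gamma functions, but since \(X \sim IG(\alpha,\beta)\) exactly when \(1/X\) is gamma with shape \(\alpha\) and rate \(\beta\), we can sample by inverting gamma draws. The quick check below (with arbitrary illustrative \(\alpha,\beta\)) compares Monte Carlo summaries to the formulas above.

```r
# Sample from IG(alpha, beta) by inverting gamma(shape = alpha, rate = beta) draws
set.seed(1)
alpha <- 4; beta <- 6                             # arbitrary illustrative values
x <- 1 / rgamma(1e5, shape = alpha, rate = beta)
mean(x); beta / (alpha - 1)                       # Monte Carlo vs. exact mean
var(x);  beta^2 / ((alpha - 1)^2 * (alpha - 2))   # Monte Carlo vs. exact variance
```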
Using an IG\(\left(\frac{\eta_0}{2},\frac{\eta_0\tau_0^2}{2}\right)\) distribution for \(\tau^2\), we can now see that \(\tau_0^2\) is somewhere in the “center” of the distribution (between the mode \(\frac{\eta_0\tau_0^2}{\eta_0+2}\) and the mean \(\frac{\eta_0\tau_0^2}{\eta_0-2}\)). As the “prior sample size” \(\eta_0\) increases, the difference between these quantities goes to 0.
The full conditional posterior for \(\tau^2\) in our hierarchical normal model is given by
\[\begin{eqnarray*} p(\tau^2 \mid \mu, \sigma^2, Y, \mu_1, \cdots, \mu_J) &\propto& p(\mu, \tau^2,\sigma^2, Y, \mu_1, \cdots, \mu_J) \\ &\propto& p(\tau^2) p(\mu \mid \tau^2) \prod_{j=1}^J p(\mu_j \mid \mu, \tau^2) \\ &\propto& \left(\tau^2\right)^{-\frac{\eta_0}{2}-1}e^{-\frac{\eta_0\tau_0^2}{2\tau^2}} \left(\tau^2\right)^{-\frac{1}{2}}e^{-\frac{m_0(\mu-\mu_0)^2}{2\tau^2}}\prod_{j=1}^J \left(\tau^2\right)^{-\frac{1}{2}}e^{-\frac{(\mu_j-\mu)^2}{2\tau^2}} \\ &\propto& (\tau^2)^{-\frac{\eta_0+J+1}{2}-1}e^{-\frac{\eta_0\tau_0^2+m_0(\mu-\mu_0)^2+\sum (\mu_j-\mu)^2}{2\tau^2}} \end{eqnarray*}\]so that the full conditional posterior is also inverse gamma, namely \(IG\left(\frac{\eta_0+J+1}{2},\frac{\eta_0\tau_0^2+m_0(\mu-\mu_0)^2+\sum (\mu_j-\mu)^2}{2}\right)\).
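A corresponding Gibbs update for \(\tau^2\), again sampling the inverse gamma by inverting a gamma draw (a sketch; names are illustrative):

```r
# One Gibbs draw of tau^2 given mu and (mu_1, ..., mu_J)
draw_tau2 <- function(mu_j_cur, mu, mu0, m0, eta0, tau02) {
  J     <- length(mu_j_cur)
  shape <- (eta0 + J + 1) / 2
  rate  <- (eta0 * tau02 + m0 * (mu - mu0)^2 + sum((mu_j_cur - mu)^2)) / 2
  1 / rgamma(1, shape = shape, rate = rate)
}
```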
You can also show the full conditional posterior of \(\sigma^2\) is \(IG\left(\frac{\nu_0+\sum_j n_j}{2},\frac{\nu_0\sigma_0^2+\sum_j \sum_i (y_{ij}-\mu_j)^2}{2}\right)\) under our prior specification \(\sigma^2 \sim IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right)\) (homework). Again, \(\sigma_0^2\) can be interpreted as a prior guess at \(\sigma^2\), with \(\nu_0\) the associated prior sample size (level of confidence).
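And similarly for \(\sigma^2\), using the full data vector and integer group labels (a sketch; names are illustrative):

```r
# One Gibbs draw of sigma^2 given (mu_1, ..., mu_J) and the data
# y: response vector; group: integer group labels 1, ..., J
draw_sigma2 <- function(y, group, mu_j_cur, nu0, sigma02) {
  ss    <- sum((y - mu_j_cur[group])^2)   # sum_j sum_i (y_ij - mu_j)^2
  shape <- (nu0 + length(y)) / 2
  rate  <- (nu0 * sigma02 + ss) / 2
  1 / rgamma(1, shape = shape, rate = rate)
}
```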
Now that we have the full conditional posterior distributions, we can use Gibbs sampling to make inference.
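For concreteness, here is a minimal sketch of such a Gibbs sampler, combining the updates above. It assumes the helper functions draw_mu_j, draw_mu, draw_tau2, and draw_sigma2 sketched earlier in this section; the hyperparameter defaults, starting values, and object names are illustrative choices, and the version we write in lab may differ.

```r
# Gibbs sampler for the hierarchical normal model (sketch)
# y: response vector; group: integer group labels 1, ..., J
gibbs_hnm <- function(y, group, S = 5000,
                      mu0 = mean(y), m0 = 1,
                      tau02 = var(y), eta0 = 1,
                      sigma02 = var(y), nu0 = 1) {
  J      <- length(unique(group))
  n_j    <- as.numeric(table(group))
  ybar_j <- as.numeric(tapply(y, group, mean))

  # starting values based on sample quantities
  mu_j <- ybar_j; mu <- mean(ybar_j); tau2 <- var(ybar_j); sigma2 <- var(y)

  out <- matrix(NA, S, 3 + J,
                dimnames = list(NULL, c("mu", "tau2", "sigma2", paste0("mu_", 1:J))))
  for (s in 1:S) {
    mu_j   <- draw_mu_j(ybar_j, n_j, mu, tau2, sigma2)   # group means
    mu     <- draw_mu(mu_j, tau2, mu0, m0)               # overall mean
    tau2   <- draw_tau2(mu_j, mu, mu0, m0, eta0, tau02)  # across-group variance
    sigma2 <- draw_sigma2(y, group, mu_j, nu0, sigma02)  # within-group variance
    out[s, ] <- c(mu, tau2, sigma2, mu_j)
  }
  out
}

# Example usage: posterior medians and 95% credible intervals
# fit <- gibbs_hnm(y, group)
# apply(fit, 2, quantile, c(0.025, 0.5, 0.975))
```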
While many people feel that shrinkage can “do no harm,” it can be quite detrimental when the shrinkage target is not correctly specified.
How might we specify a model to avoid these problems?
One option is to let the random effects distribution depend on additional information \(z\) (e.g., group-level covariates), as in \(\alpha_j \sim N(\mu_j(z),\tau^2_j(z))\)
Another potential challenge is that the variance of the response may not be the same for each group. This could be due to a variety of factors. One potential remedy for this issue is to allow the error variance to differ across groups, with
\(\sigma^2_j \overset{iid}{\sim} IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right), \quad j=1,\ldots,J\)
If we fix \(\nu_0\) and \(\sigma_0^2\) in advance, then \(p(\sigma_J^2 \mid \sigma_1^2, \ldots, \sigma_{J-1}^2)=p(\sigma_J^2)\). This means that information about \(\sigma_1^2, \ldots, \sigma_{J-1}^2\) is not used in the estimation of \(\sigma_J^2\), which is inefficient, especially when some groups \(j\) have low sample sizes. Ideally, we would borrow some information across groups in estimating the group-specific variances.
A remedy is to use the (conditionally conjugate) hyperprior \(\sigma^2_0 \sim \text{gamma}(a,b)\). This simple prior enables hierarchical modeling of our variances.
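To see why this is conditionally conjugate (a sketch; this step is not spelled out in the notes above): viewed as a function of \(\sigma_0^2\), the \(IG\left(\frac{\nu_0}{2},\frac{\nu_0\sigma_0^2}{2}\right)\) densities of \(\sigma_1^2,\ldots,\sigma_J^2\) contribute a term proportional to \((\sigma_0^2)^{J\nu_0/2}\exp\left\{-\sigma_0^2\,\frac{\nu_0}{2}\sum_j \sigma_j^{-2}\right\}\), so combining with a gamma\((a,b)\) prior (shape–rate parameterization) gives a gamma\(\left(a+\frac{J\nu_0}{2},\, b+\frac{\nu_0}{2}\sum_j \sigma_j^{-2}\right)\) full conditional, which is easy to sample:

```r
# One Gibbs draw of sigma_0^2 given nu_0 and the group variances sigma_j^2
# a, b: gamma hyperparameters (shape and rate)
draw_sigma02 <- function(sigma2_j, nu0, a, b) {
  J <- length(sigma2_j)
  rgamma(1, shape = a + J * nu0 / 2, rate = b + (nu0 / 2) * sum(1 / sigma2_j))
}
```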
There is no analogous conjugate choice for \(p(\nu_0)\); however, sampling is not too bad (see “baby Hoff”) if we restrict \(\nu_0\) to integer values with the geometric prior \(p(\nu_0) \propto e^{-\lambda \nu_0}\): we compute the unnormalized full conditional over a large grid of integer values and then sample \(\nu_0\) from the resulting discrete distribution.
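A sketch of that discrete update (names and the grid size are illustrative; we work on the log scale for numerical stability before exponentiating):

```r
# One Gibbs draw of nu_0 over an integer grid, with prior p(nu_0) ∝ exp(-lambda * nu_0)
# sigma2_j: current group variances; sigma02: current value of sigma_0^2
draw_nu0 <- function(sigma2_j, sigma02, lambda, nu_max = 5000) {
  nu_grid <- 1:nu_max
  J <- length(sigma2_j)
  # log unnormalized full conditional: geometric prior + sum of log IG densities
  lp <- -lambda * nu_grid +
        nu_grid / 2 * J * log(nu_grid * sigma02 / 2) -
        J * lgamma(nu_grid / 2) -
        (nu_grid / 2 + 1) * sum(log(sigma2_j)) -
        nu_grid * sigma02 / 2 * sum(1 / sigma2_j)
  sample(nu_grid, 1, prob = exp(lp - max(lp)))
}
```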
In the next two labs, we’ll write by hand an R function that uses Gibbs sampling (and modifications thereof) for inference in the model specified in this lecture and apply it to data on health care.