This lecture is based on Section 2.1 of Hoff.
Consider a very simple one-sample linear model, \(y_i=\mu+\varepsilon_i\) with \(\varepsilon_i \sim N(0,\sigma^2)\), written in matrix notation as \[\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}=\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \begin{pmatrix} \mu \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}\] with the vector \(\varepsilon \sim N(0_{n \times 1},\sigma^2I_{n \times n})\).
To find the MLE, solve for the parameter values that set the first derivative equal to 0 (often we work with the log-likelihood, as it is more convenient and has the same maximizer).
The log-likelihood is given by
\[\begin{eqnarray*} \ell(\mu,\sigma^2)&=& n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2 \\ &=& -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2 \end{eqnarray*}\]To find the MLE of \(\mu\), take the derivative
\[\begin{eqnarray*} \frac{\partial \ell(\mu,\sigma^2)}{\partial \mu} &=& 0 -\frac{1}{2\sigma^2} \sum_{i=1}^n 2(y_i-\mu)(-1) \\ &=& \frac{1}{\sigma^2}\left(\sum_{i=1}^n y_i - n\mu \right) \end{eqnarray*}\]Setting this equal to zero, we obtain the MLE
\[\begin{eqnarray*} n\widehat{\mu}&=&\sum_{i=1}^n y_i \\ \widehat{\mu}&=& \frac{\sum_{i=1}^n y_i}{n}=\overline{y} \end{eqnarray*}\]To find the MLE of \(\sigma^2\) take the derivative of \(-\frac{n}{2}[\log(2\pi)+\log(\sigma^2)]- \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2\)
\[\begin{eqnarray*} \frac{\partial \ell(\mu,\sigma^2)}{\partial \sigma^2} &=& -\frac{n}{2}\cdot\frac{1}{\sigma^2} - \left(-\frac{1}{2\left(\sigma^2\right)^2}\right)\sum_{i=1}^n (y_i-\mu)^2 \\ &=& -\frac{n}{2\sigma^2} + \frac{1}{2\left(\sigma^2\right)^2}\sum_{i=1}^n (y_i-\mu)^2 \end{eqnarray*}\]Setting this equal to 0 and solving, using the MLE of \(\mu\) we just found, we obtain
\(\widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (y_i-\overline{y})^2\).
Note this MLE of \(\sigma^2\) is not the sample variance \(s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\overline{y})^2\): the MLE divides by \(n\) rather than \(n-1\). We will return to this point later in the course.
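As a quick numerical check, here is a minimal sketch in R (simulated data; all names are illustrative) verifying the closed-form MLEs and the relationship between \(\widehat{\sigma}^2\) and \(s^2\):

```r
set.seed(42)
n <- 50
y <- rnorm(n, mean = 5, sd = 2)   # simulated data with mu = 5, sigma = 2

mu_hat     <- mean(y)                  # MLE of mu
sigma2_hat <- sum((y - mu_hat)^2) / n  # MLE of sigma^2 (denominator n)

sigma2_hat
var(y) * (n - 1) / n   # var() divides by n - 1, so rescaling recovers the MLE
```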
For any MLE \(\widehat{\theta}\), under standard regularity conditions and in large samples, \(\widehat{\theta}\) is approximately normally distributed around the true \(\theta\), with a variance we can estimate from the curvature of the log-likelihood at its maximum (the observed information, defined below).
For the hierarchical model, this gives us a method for getting approximate 95% confidence intervals for mean parameters (and functions of them).
The observed information matrix is the matrix of second derivatives of the negative log-likelihood function at the MLE (Hessian matrix): \[J(\widehat{\theta})=\left\{ -\frac{\partial^2 \ell(\theta \mid y)}{\partial \theta_j \partial \theta_k}\right\}|_{\theta=\widehat{\theta}}\]
The inverse of the information matrix gives us an estimate of the variance/covariance of MLE’s: \[\widehat{\text{Var}}(\widehat{\theta})\approx J^{-1}(\widehat{\theta})\]
The square roots of the diagonal elements of this matrix give approximate SE’s for the coefficients, and the MLE \(\pm\) 2 SE gives a rough 95% confidence interval for the parameters.
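As a minimal sketch of this recipe in R (simulated data from the one-sample normal model above; the function and variable names are ours), optim() can return the Hessian of the negative log-likelihood at the optimum, which is exactly \(J(\widehat{\theta})\):

```r
set.seed(1)
y <- rnorm(40, mean = 5, sd = 2)

# Negative log-likelihood for theta = (mu, log sigma^2);
# parameterizing by log sigma^2 keeps the variance positive
negll <- function(theta, y) {
  mu <- theta[1]
  s2 <- exp(theta[2])
  0.5 * length(y) * log(2 * pi * s2) + sum((y - mu)^2) / (2 * s2)
}

fit <- optim(c(0, 0), negll, y = y, hessian = TRUE)
J  <- fit$hessian   # observed information (on the (mu, log sigma^2) scale)
V  <- solve(J)      # approximate variance/covariance of the MLEs
se <- sqrt(diag(V)) # approximate standard errors

fit$par[1] + c(-2, 2) * se[1]   # rough 95% CI for mu
```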
In the cycling safety study, after we found evidence that the rider’s distance from the curb was related to passing distance (the overall F test), we wanted to learn what kind of relationship existed (pairwise comparisons). Each pairwise comparison is defined by a linear combination of the parameters in our model.
Consider the treatment means model \(y_{ij}=\mu_j+\varepsilon_{ij}\). We are interested in which pairs satisfy \(\mu_j\neq\mu_{j'}\).
Recall that in the linear model, the least squares estimate is \(\widehat{\beta}=(X'X)^{-1}X'y\).
Its covariance is given by \(\text{Cov}(\widehat{\beta})=\sigma^2(X'X)^{-1}\).
In large samples, or when our errors are exactly normal, \(\widehat{\beta} \sim N\left(\beta,\sigma^2(X'X)^{-1}\right)\).
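As an illustration (a sketch with simulated data; the names are made up), we can compute \((X'X)^{-1}X'y\) by hand and confirm that it, and the estimated covariance \(\widehat{\sigma}^2(X'X)^{-1}\), match lm():

```r
set.seed(7)
n <- 30
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)   # design matrix with an intercept column

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
fit <- lm(y ~ x)
cbind(beta_hat, coef(fit))   # identical estimates

s2 <- sum(residuals(fit)^2) / (n - 2)  # usual unbiased estimate of sigma^2
s2 * solve(t(X) %*% X)                 # matches vcov(fit)
vcov(fit)
```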
In order to test whether the means in group 1 and 2 are the same, we need to test a hypothesis about a linear combination of parameters.
The quantity \(\sum_j a_j \mu_j\) is a linear combination. It is called a contrast if \(\sum_j a_j=0\). Using matrix notation, we often express a hypothesis regarding a linear combination of regression coefficients as
\[\begin{eqnarray*} H_0: ~~~~&\theta& = C\beta = \theta_0 \\ H_a: ~~~~&\theta& = C\beta \neq \theta_0, \end{eqnarray*}\]where often \(\theta_0=0\).
For example, if we have three groups in the model \(y_{ij}=\mu_j+\varepsilon_{ij}\) and want to test differences in all pairwise comparisons, we can use \(\beta=\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}\), \(C=\begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}\), \(\theta_0=\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}\) so that our hypothesis is that \(\begin{pmatrix} \mu_1 - \mu_2 \\ \mu_1 - \mu_3 \\ \mu_2 - \mu_3 \end{pmatrix}=\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}\).
Using basic properties of the multivariate normal distribution, \[C \widehat{\beta} \sim N\left(C\beta,\sigma^2 C(X'X)^{-1}C'\right).\]
Using this result, you can derive the standard error for any linear combination of parameter estimates, which can be used in constructing confidence intervals.
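For instance, here is a minimal sketch (simulated three-group data; names illustrative) that gets the SE of each pairwise contrast from the diagonal of \(C\,\widehat{\text{Var}}(\widehat{\beta})\,C'\):

```r
set.seed(3)
group <- factor(rep(1:3, each = 10))
y <- rep(c(10, 12, 12), each = 10) + rnorm(30)

fit <- lm(y ~ group - 1)   # treatment means model: coefficients are mu_1, mu_2, mu_3

C <- rbind(c(1, -1,  0),   # mu_1 - mu_2
           c(1,  0, -1),   # mu_1 - mu_3
           c(0,  1, -1))   # mu_2 - mu_3

theta_hat <- C %*% coef(fit)                # estimated pairwise differences
se <- sqrt(diag(C %*% vcov(fit) %*% t(C)))  # SEs from C Var(beta-hat) C'
data.frame(estimate = as.vector(theta_hat), se = se)  # estimate +/- 2*se gives rough CIs
```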
You could also fit a reduced model subject to the constraint you wish to test (e.g., same mean for groups 1 and 2), and then use either a partial F test or a likelihood-ratio test (under normal errors, the partial F test is a special case of the LRT) to evaluate the hypothesis that the reduced model is adequate.
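A sketch of this approach, continuing the simulated three-group example above (names illustrative): constrain \(\mu_1=\mu_2\) by merging the first two group labels, then compare the two fits.

```r
# Reduced model: force groups 1 and 2 to share a common mean
group12 <- factor(ifelse(group == "3", "group3", "group12"))
fit_reduced <- lm(y ~ group12 - 1)

# Partial F test of H0: mu_1 = mu_2, comparing reduced and full models
anova(fit_reduced, fit)
```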
ANOVA is easily extended to accommodate any number of categorical variables. Variables may each contribute independently to a response, or they may work together to influence response values.
Interaction effects are important when the association between one independent variable and the response depends on the level of another independent variable. In terms of group means, an interaction means that the difference in mean response between levels of one variable is not the same across the levels of the other.
For example, suppose that we are interested in a two-way ANOVA model in which the response \(y\) is a measure of headache pain, and the independent variables include the type of pill taken (placebo (j=1) or ibuprofen (j=2)) and the number of pills taken (k=1 or k=2). While we may expect lower pain if multiple ibuprofen pills are taken, we would not expect the same decrease in pain if multiple placebo pills were taken.
Consider the model \(y_{ijk}=\mu + \alpha I(j=2) + \beta I(k=2) + \gamma I(j=k=2)+\varepsilon_{ijk}\).
In this model, the mean is parameterized as follows.
| Drug | # of Pills | Mean |
|---|---|---|
| Placebo | 1 | \(\mu\) |
| Placebo | 2 | \(\mu+\beta\) |
| Ibuprofen | 1 | \(\mu+\alpha\) |
| Ibuprofen | 2 | \(\mu+\alpha+\beta+\gamma\) |
What types of parameter values would we expect to see if there is an interaction in which there is a dose effect for Ibuprofen but not for placebo?
In this model, the dose effect is \(\beta\) for placebo and \(\beta+\gamma\) for ibuprofen. A dose effect for ibuprofen but not for placebo therefore corresponds to \(\beta=0\) and \(\gamma\neq 0\).
So no interaction (\(\gamma=0\)) means that the drug effect is the same regardless of the number of pills taken. For there to be no drug effect at all, we need \(\gamma=0\) and \(\alpha=0\).
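To make the parameterization concrete, here is a hedged simulation sketch (the data, effect sizes, and names are all made up for illustration): we generate data with a dose effect for ibuprofen only and read \(\mu\), \(\alpha\), \(\beta\), \(\gamma\) off the fitted coefficients.

```r
set.seed(9)
drug  <- factor(rep(c("placebo", "ibuprofen"), each = 40),
                levels = c("placebo", "ibuprofen"))
pills <- factor(rep(rep(c(1, 2), each = 20), times = 2))

# True values: no placebo dose effect (beta = 0), an ibuprofen dose effect (gamma != 0)
mu <- 8; alpha <- -2; beta <- 0; gamma <- -3
y <- mu + alpha * (drug == "ibuprofen") + beta * (pills == 2) +
  gamma * (drug == "ibuprofen" & pills == 2) + rnorm(80)

# (Intercept), drugibuprofen, pills2, and the interaction term estimate
# mu, alpha, beta, and gamma, respectively
coef(lm(y ~ drug * pills))
```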
Get ready – we are going to explore R’s most thrilling data – the famous tooth growth in Guinea pigs data!
Ahh, how cute! Our Dickensian guinea pig has a mystery to solve – which type of Vitamin C supplement is best for tooth growth!
Guinea pig dental problems are NOT fun. Our dataset (Crampton, 1947) contains as a response the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs, each of which received one of three doses of vitamin C (0.5, 1, or 2 mg/day) via one of two delivery methods (orange juice (OJ) or ascorbic acid (VC)). Researchers wanted to know if the odontoblast length could be used as a marker of Vitamin C uptake, for the purposes of providing better nutritional supplementation to members of the Canadian armed forces (alas, the first of many injustices for Oliver Twisted Teeth – the study was not done to help little Guinea piggies).
```r
library(ggplot2)

gp <- ToothGrowth
gp$dose <- as.factor(gp$dose)   # treat dose as a factor with levels 0.5, 1, 2

# Boxplots of odontoblast length by dose, colored by delivery method
p <- ggplot(gp, aes(x = dose, y = len, fill = supp)) +
  geom_boxplot()

# Add labels and a cleaner theme
p + labs(title = "Odontoblast length by dose", x = "Dose (mg/day)", y = "Length") +
  theme_classic() +
  scale_fill_manual(values = c('#999999', '#E69F00'))
```
Looking at the boxplot of the growth data, what type of ANOVA model may be most appropriate? Take a moment to write a model in mathematical notation – then we will confer with groups on possible choices.