This lecture is based on Section 2.1 of Hoff.
Consider a very simple one-sample linear model, \(y_i=\mu+\varepsilon_i\) with \(\varepsilon_i \sim N(0,\sigma^2)\), written in matrix notation as \[\begin{pmatrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{pmatrix}=\begin{pmatrix} 1 \\ 1 \\ \vdots \\ 1 \end{pmatrix} \begin{pmatrix} \mu \end{pmatrix} + \begin{pmatrix} \varepsilon_1 \\ \varepsilon_2 \\ \vdots \\ \varepsilon_n \end{pmatrix}\] with the vector \(\varepsilon \sim N(0_{n \times 1},\sigma^2I_{n \times n})\).
To find the MLE, solve for the parameter values that set the first derivative equal to 0 (often we work with the log-likelihood, as it is more convenient and has the same maximizer).
The log-likelihood is given by
\[\begin{eqnarray*} \ell(\mu,\sigma^2)&=& n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2 \\ &=& -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2 \end{eqnarray*}\]To find the MLE of \(\mu\), take the derivative
\[\begin{eqnarray*} \frac{\partial \ell(\mu,\sigma^2)}{\partial \mu} &=& 0 -\frac{1}{2\sigma^2} \sum_{i=1}^n 2(y_i-\mu)(-1) \\ &=& \frac{1}{\sigma^2}\left(\sum_{i=1}^n y_i - n\mu \right) \end{eqnarray*}\]Setting this equal to zero, we obtain the MLE
\[\begin{eqnarray*} n\widehat{\mu}&=&\sum_{i=1}^n y_i \\ \widehat{\mu}&=& \frac{\sum_{i=1}^n y_i}{n}=\overline{y} \end{eqnarray*}\]To find the MLE of \(\sigma^2\) take the derivative of \(-\frac{n}{2}[\log(2\pi)+\log(\sigma^2)]- \frac{1}{2\sigma^2}\sum_{i=1}^n (y_i-\mu)^2\)
\[\begin{eqnarray*} \frac{\partial \ell(\mu,\sigma^2)}{\partial \sigma^2} &=& -\frac{n}{2}\cdot\frac{1}{\sigma^2} - \left(-\frac{1}{2\left(\sigma^2\right)^2}\right)\sum_{i=1}^n (y_i-\mu)^2 \\ &=& -\frac{n}{2\sigma^2} + \frac{1}{2\left(\sigma^2\right)^2}\sum_{i=1}^n (y_i-\mu)^2 \end{eqnarray*}\]Setting this equal to 0 and solving, using the MLE of \(\mu\) we just found, we obtain
\(\widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^n (y_i-\overline{y})^2\).
Note this MLE of \(\sigma^2\) is not the sample variance \(s^2=\frac{1}{n-1}\sum_{i=1}^n (y_i-\overline{y})^2\): the MLE divides by \(n\) rather than \(n-1\). We will return to this point later in the course.
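As a quick numerical check, here is a minimal sketch in R (simulated data; all names are illustrative) verifying the closed-form MLEs and the relationship between \(\widehat{\sigma}^2\) and \(s^2\):

```r
set.seed(42)
n <- 50
y <- rnorm(n, mean = 5, sd = 2)   # simulated data with mu = 5, sigma = 2

mu_hat     <- mean(y)                  # MLE of mu
sigma2_hat <- sum((y - mu_hat)^2) / n  # MLE of sigma^2 (denominator n)

sigma2_hat
var(y) * (n - 1) / n   # var() divides by n - 1, so rescaling recovers the MLE
```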
For any MLE \(\widehat{\theta}\), under standard regularity conditions and in large samples, \(\widehat{\theta}\) is approximately normally distributed around the true \(\theta\), with a variance we can estimate from the curvature of the log-likelihood at its maximum (the observed information, defined below).
For the hierarchical model, this gives us a method for getting approximate 95% confidence intervals for mean parameters (and functions of them).
The observed information matrix is the matrix of second derivatives of the negative log-likelihood function at the MLE (Hessian matrix): \[J(\widehat{\theta})=\left\{ -\frac{\partial^2 \ell(\theta \mid y)}{\partial \theta_j \partial \theta_k}\right\}|_{\theta=\widehat{\theta}}\]
The inverse of the information matrix gives us an estimate of the variance/covariance of MLE’s: \[\widehat{\text{Var}}(\widehat{\theta})\approx J^{-1}(\widehat{\theta})\]
The square roots of the diagonal elements of this matrix give approximate SE’s for the coefficients, and the MLE \(\pm\) 2 SE gives a rough 95% confidence interval for the parameters.
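As a minimal sketch of this recipe in R (simulated data from the one-sample normal model above; the function and variable names are ours), optim() can return the Hessian of the negative log-likelihood at the optimum, which is exactly \(J(\widehat{\theta})\):

```r
set.seed(1)
y <- rnorm(40, mean = 5, sd = 2)

# Negative log-likelihood for theta = (mu, log sigma^2);
# parameterizing by log sigma^2 keeps the variance positive
negll <- function(theta, y) {
  mu <- theta[1]
  s2 <- exp(theta[2])
  0.5 * length(y) * log(2 * pi * s2) + sum((y - mu)^2) / (2 * s2)
}

fit <- optim(c(0, 0), negll, y = y, hessian = TRUE)
J  <- fit$hessian   # observed information (on the (mu, log sigma^2) scale)
V  <- solve(J)      # approximate variance/covariance of the MLEs
se <- sqrt(diag(V)) # approximate standard errors

fit$par[1] + c(-2, 2) * se[1]   # rough 95% CI for mu
```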
In the cycling safety study, after we found evidence that the rider’s distance from the curb was related to passing distance (the overall F test), we wanted to learn what kind of relationship existed (pairwise comparisons). Each pairwise comparison is defined by a linear combination of the parameters in our model.
Consider the treatment means model \(y_{ij}=\mu_j+\varepsilon_{ij}\). We are interested in which pairs satisfy \(\mu_j\neq\mu_{j'}\).
Recall that in the linear model, the least squares estimate is \(\widehat{\beta}=(X'X)^{-1}X'y\).
Its covariance is given by \(\text{Cov}(\widehat{\beta})=\sigma^2(X'X)^{-1}\).
In large samples, or when our errors are exactly normal, \(\widehat{\beta} \sim N\left(\beta,\sigma^2(X'X)^{-1}\right)\).
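As an illustration (a sketch with simulated data; the names are made up), we can compute \((X'X)^{-1}X'y\) by hand and confirm that it, and the estimated covariance \(\widehat{\sigma}^2(X'X)^{-1}\), match lm():

```r
set.seed(7)
n <- 30
x <- runif(n)
y <- 1 + 2 * x + rnorm(n)
X <- cbind(1, x)   # design matrix with an intercept column

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
fit <- lm(y ~ x)
cbind(beta_hat, coef(fit))   # identical estimates

s2 <- sum(residuals(fit)^2) / (n - 2)  # usual unbiased estimate of sigma^2
s2 * solve(t(X) %*% X)                 # matches vcov(fit)
vcov(fit)
```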
In order to test whether the means in group 1 and 2 are the same, we need to test a hypothesis about a linear combination of parameters.
The quantity \(\sum_j a_j \mu_j\) is a linear combination. It is called a contrast if \(\sum_j a_j=0\). Using matrix notation, we often express a hypothesis regarding a linear combination of regression coefficients as
\[\begin{eqnarray*} H_0: ~~~~&\theta& = C\beta = \theta_0 \\ H_a: ~~~~&\theta& = C\beta \neq \theta_0, \end{eqnarray*}\]where often \(\theta_0=0\).
For example, if we have three groups in the model \(y_{ij}=\mu_j+\varepsilon_{ij}\) and want to test differences in all pairwise comparisons, we can use \(\beta=\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix}\), \(C=\begin{pmatrix} 1 & -1 & 0 \\ 1 & 0 & -1 \\ 0 & 1 & -1 \end{pmatrix}\), \(\theta_0=\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}\) so that our hypothesis is that \(\begin{pmatrix} \mu_1 - \mu_2 \\ \mu_1 - \mu_3 \\ \mu_2 - \mu_3 \end{pmatrix}=\begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}\).
Using basic properties of the multivariate normal distribution, \[C \widehat{\beta} \sim N\left(C\beta,\sigma^2 C(X'X)^{-1}C'\right).\]
Using this result, you can derive the standard error for any linear combination of parameter estimates, which can be used in constructing confidence intervals.
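For instance, here is a minimal sketch (simulated three-group data; names illustrative) that gets the SE of each pairwise contrast from the diagonal of \(C\,\widehat{\text{Var}}(\widehat{\beta})\,C'\):

```r
set.seed(3)
group <- factor(rep(1:3, each = 10))
y <- rep(c(10, 12, 12), each = 10) + rnorm(30)

fit <- lm(y ~ group - 1)   # treatment means model: coefficients are mu_1, mu_2, mu_3

C <- rbind(c(1, -1,  0),   # mu_1 - mu_2
           c(1,  0, -1),   # mu_1 - mu_3
           c(0,  1, -1))   # mu_2 - mu_3

theta_hat <- C %*% coef(fit)                # estimated pairwise differences
se <- sqrt(diag(C %*% vcov(fit) %*% t(C)))  # SEs from C Var(beta-hat) C'
data.frame(estimate = as.vector(theta_hat), se = se)  # estimate +/- 2*se gives rough CIs
```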
You could also fit a reduced model subject to the constraint you wish to test (e.g., same mean for groups 1 and 2), and then use either a partial F test or a likelihood-ratio test (under normal errors, the partial F test is a special case of the LRT) to evaluate the hypothesis that the reduced model is adequate.
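A sketch of this approach, continuing the simulated three-group example above (names illustrative): constrain \(\mu_1=\mu_2\) by merging the first two group labels, then compare the two fits.

```r
# Reduced model: force groups 1 and 2 to share a common mean
group12 <- factor(ifelse(group == "3", "group3", "group12"))
fit_reduced <- lm(y ~ group12 - 1)

# Partial F test of H0: mu_1 = mu_2, comparing reduced and full models
anova(fit_reduced, fit)
```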
ANOVA is easily extended to accommodate any number of categorical variables. Variables may each contribute independently to a response, or they may work together to influence response values.
Interaction effects are important when the association between one independent variable and the response depends on the level of another independent variable. In terms of group means, an interaction means that the difference in mean response between levels of one variable is not the same across the levels of the other.
For example, suppose that we are interested in a two-way ANOVA model in which the response \(y\) is a measure of headache pain, and the independent variables include the type of pill taken (placebo (j=1) or ibuprofen (j=2)) and the number of pills taken (k=1 or k=2). While we may expect lower pain if multiple ibuprofen pills are taken, we would not expect the same decrease in pain if multiple placebo pills were taken.
Consider the model \(y_{ijk}=\mu + \alpha I(j=2) + \beta I(k=2) + \gamma I(j=k=2)+\varepsilon_{ijk}\).
In this model, the mean is parameterized as follows.
| Drug | # of Pills | Mean |
|---|---|---|
| Placebo | 1 | \(\mu\) |
| Placebo | 2 | \(\mu+\beta\) |
| Ibuprofen | 1 | \(\mu+\alpha\) |
| Ibuprofen | 2 | \(\mu+\alpha+\beta+\gamma\) |
What types of parameter values would we expect to see if there is an interaction in which there is a dose effect for Ibuprofen but not for placebo?
In this model, the dose effect is \(\beta\) for placebo and \(\beta+\gamma\) for ibuprofen. A dose effect for ibuprofen but not for placebo therefore corresponds to \(\beta=0\) and \(\gamma\neq 0\).
So no interaction (\(\gamma=0\)) means that the drug effect is the same regardless of the number of pills taken. For there to be no drug effect at all, we need \(\gamma=0\) and \(\alpha=0\).
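To make the parameterization concrete, here is a hedged simulation sketch (the data, effect sizes, and names are all made up for illustration): we generate data with a dose effect for ibuprofen only and read \(\mu\), \(\alpha\), \(\beta\), \(\gamma\) off the fitted coefficients.

```r
set.seed(9)
drug  <- factor(rep(c("placebo", "ibuprofen"), each = 40),
                levels = c("placebo", "ibuprofen"))
pills <- factor(rep(rep(c(1, 2), each = 20), times = 2))

# True values: no placebo dose effect (beta = 0), an ibuprofen dose effect (gamma != 0)
mu <- 8; alpha <- -2; beta <- 0; gamma <- -3
y <- mu + alpha * (drug == "ibuprofen") + beta * (pills == 2) +
  gamma * (drug == "ibuprofen" & pills == 2) + rnorm(80)

# (Intercept), drugibuprofen, pills2, and the interaction term estimate
# mu, alpha, beta, and gamma, respectively
coef(lm(y ~ drug * pills))
```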
Get ready – we are going to explore R’s most thrilling data – the famous tooth growth in Guinea pigs data!
Ahh, how cute! Our Dickensian guinea pig has a mystery to solve – which type of Vitamin C supplement is best for tooth growth!
Guinea pig dental problems are NOT fun. Our dataset (Crampton, 1947) contains as a response the length of odontoblasts (cells responsible for tooth growth) in 60 guinea pigs, each of which received one of three doses of vitamin C (0.5, 1, or 2 mg/day) via one of two delivery methods (orange juice (OJ) or ascorbic acid (VC)). Researchers wanted to know if the odontoblast length could be used as a marker of Vitamin C uptake, for the purposes of providing better nutritional supplementation to members of the Canadian armed forces (alas, the first of many injustices for Oliver Twisted Teeth – the study was not done to help little Guinea piggies).
```r
library(ggplot2)

gp <- ToothGrowth
gp$dose <- as.factor(gp$dose)   # treat dose as a factor with levels 0.5, 1, 2

# Boxplots of odontoblast length by dose, colored by delivery method
p <- ggplot(gp, aes(x = dose, y = len, fill = supp)) +
  geom_boxplot()

# Add labels and a cleaner theme
p + labs(title = "Odontoblast length by dose", x = "Dose (mg/day)", y = "Length") +
  theme_classic() +
  scale_fill_manual(values = c('#999999', '#E69F00'))
```
Looking at the boxplot of the growth data, what type of ANOVA model may be most appropriate? Take a moment to write a model in mathematical notation – then we will confer with groups on possible choices.