ANOVA

Reading

This lecture is based on Section 2.1 of Hoff.

Motivating Example: Cycling Safety

Dr. Ian Walker at the University of Bath carried out a project to investigate how drivers overtake bicyclists. His team discreetly modified a bicycle to carry both a video system and an accurate ultrasonic distance sensor that recorded the proximity of each passing vehicle.

The team then designed an experiment in which a cyclist (Dr. Walker) varied the distance he rode from the curb (the British spelling kerb is used in the dataset) from 0.25 m to 1.25 m in 0.25 m increments.

We will consider the outcome of passing distance \(y_{ij}\), which is the measured distance (in m) between the vehicle and the cyclist, as a function of the distance from the bike to the curb (indexed by \(j\)), as some cyclists have postulated that “the more room you take up, the more room they give you.” We’ll use these data to test this “Theory of Big.”

Our research question of interest is whether the distance from the bike to the curb is indeed related to the passing distance between the bike and a vehicle.

Exploring the data

load("/Users/dunson/Documents/COURSES/STA610/NOTES/Hierarchical-Model-Notes/PsychBikeData.RData")
summary(PsychBikeData)
##                 vehicle             colour    passing distance
##  ordinary           :1708   blue       :636   Min.   :0.394   
##  minibus            : 293   silver/grey:531   1st Qu.:1.303   
##  SUV/pickup         : 143   red        :378   Median :1.529   
##  bus                :  46   white      :333   Mean   :1.564   
##  HGV                :  82   black      :262   3rd Qu.:1.790   
##  taxi               :  49   green      :149   Max.   :3.787   
##  powered two-wheeler:  34   (Other)    : 66                   
##                         street          Time                    
##  one way, one lane         :   9   Min.   :1904-01-01 07:46:00  
##  one way, 2 lanes          :  13   1st Qu.:1904-01-01 10:14:00  
##  regular urban street      : 655   Median :1904-01-01 12:13:00  
##  regular residential street:  39   Mean   :1904-01-01 12:40:09  
##  main road, regular        :1637   3rd Qu.:1904-01-01 15:30:00  
##  rural                     :   2   Max.   :1904-01-01 17:12:00  
##                                                                 
##       hour                     helmet          kerb        Bikelane  
##  Min.   :1904-01-01 07:00:00   no :1206   Min.   :0.2500   no :2305  
##  1st Qu.:1904-01-01 10:00:00   yes:1149   1st Qu.:0.2500   yes:  50  
##  Median :1904-01-01 12:00:00              Median :0.5000             
##  Mean   :1904-01-01 12:05:38              Mean   :0.6702             
##  3rd Qu.:1904-01-01 15:00:00              3rd Qu.:1.0000             
##  Max.   :1904-01-01 17:00:00              Max.   :1.2500             
##                                                                      
##          City           date                    
##  Salisbury :1905   Min.   :2006-05-11 00:00:00  
##  Bristol   : 450   1st Qu.:2006-05-20 00:00:00  
##  Portsmouth:   0   Median :2006-05-27 00:00:00  
##                    Mean   :2006-05-27 12:08:15  
##                    3rd Qu.:2006-05-31 00:00:00  
##                    Max.   :2006-06-19 00:00:00  
## 

Research question: is distance from curb related to passing distance?

table(PsychBikeData$kerb)
## 
## 0.25  0.5 0.75    1 1.25 
##  670  545  339  469  332
tapply(PsychBikeData$`passing distance`,PsychBikeData$kerb,mean)
##     0.25      0.5     0.75        1     1.25 
## 1.698054 1.590473 1.505519 1.490584 1.412813
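
To see the pattern in the group means, a quick visual check (a sketch, not shown in the original notes) is a side-by-side boxplot of passing distance by curb distance:

boxplot(`passing distance` ~ factor(kerb), data = PsychBikeData,
        xlab = "distance from kerb (m)", ylab = "passing distance (m)")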

ANOVA Model

Consider the model

\[\begin{eqnarray*} y_{ij}&=&\mu+\alpha_j+\varepsilon_{ij} \text{ (treatment effects model)} \\ &=& \mu_j + \varepsilon_{ij} ~~~~~~~\text{ (treatment means model)} \end{eqnarray*}\] where \(\mu_{j}=\mu+\alpha_j\). These two equations are simply alternate parameterizations of the same model. In each case, we estimate a separate mean passing distance \(\mu_j=\mu+\alpha_j\) as a function of each of the 5 curb distances tested.

\[\begin{eqnarray*} y_{ij}&=&\mu+\alpha_j+\varepsilon_{ij}=\mu_j + \varepsilon_{ij} \end{eqnarray*}\]

  • \(\mu\): expected passing distance (grand mean)

  • \(\mu_j\): expected passing distance for curb distance \(j\)

  • \(\alpha_j\): deviation between overall expected passing distance and expected passing distance for curb distance \(j\)

  • \(\varepsilon_{ij}\): deviations of observed passing distance from curb-distance-specific expectations

  • In the standard ANOVA model \(\sum_j \alpha_j = 0\) is assumed so that \(\mu\) represents an overall mean across groups.

  • Another coding scheme: set one \(\alpha_j=0\) so that \(\mu\) represents the expected passing distance in that particular group, and the remaining \(\alpha_j\) measure differences from the expected passing distance in that reference group.

We also assume that \(\varepsilon_{ij} \overset{iid}{\sim} f(\varepsilon)\) with \(E(\varepsilon_{ij})=0\) within all groups \(j\).

The expected passing distance for occasion \(i\) with curb distance \(j\) is then

\[E(y_{ij} \mid \mu, \alpha_1, \cdots, \alpha_J)=E(\mu+\alpha_j+\varepsilon_{ij} \mid \mu, \alpha_1, \cdots, \alpha_J)=\mu+\alpha_j=\mu_j\]


If we assume \(f(\varepsilon)=N\left(0,\sigma^2\right)\), then our model is \(y_{ij} \sim N\left(\mu+\alpha_j,\sigma^2\right)\) or equivalently \(y_{ij} \sim N\left(\mu_j,\sigma^2\right)\).
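
Both parameterizations can be fit directly in R with lm(); this is just a sketch, not part of the original notes. R's default contrasts correspond to reference cell coding, and dropping the intercept gives the treatment (group) means coding.

# treatment effects model with reference cell coding (R's default contrasts)
fit_effects <- lm(`passing distance` ~ factor(kerb), data = PsychBikeData)
# treatment means model: drop the intercept to estimate each mu_j directly
fit_means <- lm(`passing distance` ~ factor(kerb) - 1, data = PsychBikeData)
coef(fit_effects)
coef(fit_means)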

Parameter estimation

Let \(\widehat{\mu}=\left(\widehat{\mu}_1,\cdots,\widehat{\mu}_J\right)\) be our estimates of the unknown parameters \(\mu=(\mu_1,\cdots,\mu_J)\). The residual for \(y_{ij}\) is the difference between the observed \(y_{ij}\) and our fitted value \(\widehat{y}_{ij}\) and is given by

\(\widehat{\varepsilon}_{ij}=y_{ij}-\widehat{y}_{ij}=y_{ij}-\widehat{\mu}_{j}\).

The ordinary least squares (OLS) estimate of \(\mu\), \(\widehat{\mu}_{OLS}\), is the value that minimizes the sum of squared residuals (sum of squared errors) given by

\[SSE(\mu)=\sum_j \sum_i (y_{ij}-\mu_j)^2.\]

OLS Estimates

You can show (homework 1!) that the OLS estimates are given by

  • \((\widehat{\mu}_1,\cdots,\widehat{\mu}_J)=(\overline{y}_1,\cdots,\overline{y}_J)\), where \(\overline{y}_j\) is the sample mean in group \(j\).

  • \(\widehat{\mu}=\overline{y}\), where \(\overline{y}\) is the grand mean over all observations

  • \(\widehat{\mu}=\frac{1}{J}\sum_j \widehat{\mu}_j\) when the sample sizes in each group j, \(n_j\), are equal for all groups

  • \(\widehat{\alpha}_j=\widehat{\mu}_j-\widehat{\mu}=\overline{y}_{j}-\overline{y}\)

A helpful mnemonic may be the following “decomposition” of a single data point:

\[\begin{eqnarray*} y_{ij}&=& \overline{y} + (\overline{y}_{j}-\overline{y}) + (y_{ij}-\overline{y}_{j}) \\ &=& \widehat{\mu} ~~ + ~~~~~\widehat{\alpha}_j ~~~~~ + ~~~~~ \widehat{\varepsilon}_{ij} \end{eqnarray*}\]
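
As a quick numerical check (a sketch; these are just the sample means computed earlier), the pieces of this decomposition for the bike data are:

y <- PsychBikeData$`passing distance`
ybar <- mean(y)                                  # grand mean, estimate of mu
ybar_j <- tapply(y, PsychBikeData$kerb, mean)    # group means, estimates of mu_j
alpha_hat <- ybar_j - ybar                       # estimated group effects
alpha_hat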

Sums of Squares

Based on those ideas, let’s decompose the variability of the data around the grand mean into variation within groups (error) and variation between or across groups (group effects). For simplicity, suppose we have \(J\) groups with \(n_j\) observations in group \(j\).

We break down the total variation of the data around the overall mean as follows: \[SST=SSG+SSE,\] where SST is the total sum of squared deviations around the overall mean, SSG is the portion of the total sum of squares due to group differences, and SSE is the portion of the total sum of squares due to differences between the individual observations and their own group means.

We define the sums of squares as follows:

  • SST=\(\sum_{j=1}^J \sum_{i=1}^{n_j} \left(y_{ij}-\overline{y}\right)^2\)

  • SSG=\(\sum_{j=1}^J \sum_{i=1}^{n_j} \left(\overline{y}_{\cdot j}-\overline{y}\right)^2\)

  • SSE=\(\sum_{j=1}^J \sum_{i=1}^{n_j} \left(y_{ij}-\overline{y}_{\cdot j}\right)^2\)
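
For the bike data these sums of squares can be computed directly and checked against the identity SST = SSG + SSE; the sketch below uses ave() to attach each observation's group mean.

y <- PsychBikeData$`passing distance`
ybar_j_long <- ave(y, PsychBikeData$kerb)   # group mean repeated for each observation
SST <- sum((y - mean(y))^2)
SSG <- sum((ybar_j_long - mean(y))^2)
SSE <- sum((y - ybar_j_long)^2)
c(SST = SST, SSG = SSG, SSE = SSE, SSG.plus.SSE = SSG + SSE)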

ANOVA Table

The main use of ANOVA is to evaluate the hypothesis that there are no differences across groups, e.g. \(H_0: \mu_j=\mu_{j'}\) \(\forall ~ j \neq j'\) against the alternative that at least one mean is different.

Testing in ANOVA involves comparison of the mean squares for groups and the mean squares for error (we’ll come back to why this is sensible) and can be summarized in the ANOVA table. Let \(N=\sum_j n_j\) be the overall sample size.

Source | DF | SS | MS | F | p
Groups | \(J-1\) | SSG | \(MSG=\frac{SSG}{J-1}\) | \(\frac{MSG}{MSE}\) | from \(F_{J-1,N-J}\)
Error | \(N-J\) | SSE | \(MSE=\frac{SSE}{N-J}\) | |
Total | \(N-1\) | SST | | |

Using this Shiny app you can explore the roles of within-group and between-group variance in ANOVA.

In one scenario, the within-group variance is large relative to the between-group variance (i.e., not much of a group effect relative to the variability within groups), and differences among groups are hard to detect.

In another, the group means are further apart and the between-group variance is large relative to the within-group variance, so differences among groups are more apparent.

The MSE can be written

\[\begin{eqnarray*} MSE &=& \frac{SSE}{\sum_j (n_j-1)} \\ &=& \frac{\sum_{j=1}^J \sum_{i=1}^{n_j} \left(y_{ij}-\overline{y}_{\cdot j}\right)^2}{\sum_j (n_j-1)} \\ &=& \frac{\sum_{i=1}^{n_1} \left(y_{i1}-\overline{y}_{\cdot 1}\right)^2 + \cdots + \sum_{i=1}^{n_J} \left(y_{iJ}-\overline{y}_{\cdot J}\right)^2}{(n_1-1)+ \cdots + (n_J-1)} \\ &=& \frac{(n_1-1)s_1^2 + \cdots+ (n_J-1)s_J^2}{(n_1-1)+ \cdots + (n_J-1)} \end{eqnarray*}\]

In ANOVA, we typically assume independence of observations and equal variances within all the groups. We see that \(MSE=\frac{(n_1-1)s_1^2 + \cdots+ (n_J-1)s_J^2}{(n_1-1)+ \cdots + (n_J-1)}\), where \(s_j^2\) is the sample variance within group \(j\), is a pooled estimate of the within-group variance, and you can show that \(E(MSE)=\sigma^2\) if our assumption of equal variances holds.

Using algebra, you can show that \(E(MSG)=\sigma^2 + \frac{\sum n_j (\mu_j-\mu)^2}{J-1}\). Under the null hypothesis that all the group means are the same, this expectation reduces to \(\sigma^2\).

Thus under \(H_0\), both MSG and MSE have expectation \(\sigma^2\) and we expect \(F=\frac{MSG}{MSE}\) to be close to 1, but if the group means are in fact different from each other, we expect \(E(MSG)>\sigma^2\) and \(F>1\).

Under the assumption that \(\varepsilon_{ij} \overset{iid}{\sim} N(0,\sigma^2)\), if \(H_0\) is also true, then

\[F=\frac{MSG}{MSE} \sim F_{J-1,N-J}.\]
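
Using the sums of squares from the earlier sketch, the F statistic and its p-value can be computed by hand and compared with the aov() output in the next section.

# continuing the sums-of-squares sketch above (SSG, SSE, and y already computed)
J <- length(unique(PsychBikeData$kerb))
N <- length(y)
MSG <- SSG / (J - 1)
MSE <- SSE / (N - J)
Fstat <- MSG / MSE
c(F = Fstat, p.value = pf(Fstat, df1 = J - 1, df2 = N - J, lower.tail = FALSE))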

Back to Passing Bikes

aov.res=aov(`passing distance`~factor(kerb),data=PsychBikeData)
summary(aov.res)
##                Df Sum Sq Mean Sq F value Pr(>F)    
## factor(kerb)    4   23.7   5.925   43.18 <2e-16 ***
## Residuals    2350  322.4   0.137                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This F test indicates that it is very unlikely we would see differences in passing distance as large as we did under the null hypothesis that all groups have the same mean. There is a difference in passing distance for at least one set of curb distances.

Where is the difference?

After an overall F test is rejected, the next obvious question is “Why?” It could be the case that each group is different from all other groups, or maybe only one group differs from the others. Many types of step-down tests are available with varying degrees of protection of the type I error rate.

Step-down tests

The phrasing step-down is a reminder that these tests are not conducted unless the overall F test is rejected. Some options include the following.

  • The Bonferroni comparison using the significance level \(\frac{\alpha}{\text{number of tests}}\) is generally the most conservative and is not guaranteed to find a significant pairwise comparison even if the overall F test is rejected.
  • Fisher’s LSD test is generally the least conservative, as it involves regular pairwise t-tests with no correction to \(\alpha\); its only type I error rate protection comes from not carrying out the pairwise tests unless the overall F test is rejected.
  • Many methods lie in between these extremes, with some additional protection of the type I error rate after rejection of the overall null hypothesis, including Tukey’s method, Scheffe’s method, ….
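
For example (a sketch, not run in the original notes), unadjusted and Bonferroni-adjusted pairwise comparisons are available through pairwise.t.test(), which uses a pooled standard deviation by default:

# Fisher's LSD-style comparisons: ordinary pairwise t-tests with no adjustment
pairwise.t.test(PsychBikeData$`passing distance`, factor(PsychBikeData$kerb),
                p.adjust.method = "none")
# Bonferroni-adjusted pairwise comparisons
pairwise.t.test(PsychBikeData$`passing distance`, factor(PsychBikeData$kerb),
                p.adjust.method = "bonferroni")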

TukeyHSD(aov.res)
##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = `passing distance` ~ factor(kerb), data = PsychBikeData)
## 
## $`factor(kerb)`
##                  diff         lwr         upr     p adj
## 0.5-0.25  -0.10758034 -0.16590883 -0.04925185 0.0000051
## 0.75-0.25 -0.19253456 -0.25993094 -0.12513817 0.0000000
## 1-0.25    -0.20746951 -0.26834834 -0.14659068 0.0000000
## 1.25-0.25 -0.28524048 -0.35310701 -0.21737394 0.0000000
## 0.75-0.5  -0.08495422 -0.15489916 -0.01500928 0.0082429
## 1-0.5     -0.09988917 -0.16357790 -0.03620045 0.0001874
## 1.25-0.5  -0.17766014 -0.24805821 -0.10726207 0.0000000
## 1-0.75    -0.01493495 -0.08702041  0.05715051 0.9799835
## 1.25-0.75 -0.09270592 -0.17078247 -0.01462937 0.0105846
## 1.25-1    -0.07777097 -0.15029619 -0.00524575 0.0284527

Because we’re curious, we looked for where the differences appear to be using Tukey’s HSD test, which makes all pairwise comparisons while controlling the family-wise error rate. It looks like the passing distance is similar for curb distances of 0.75 m and 1 m but differs for the other combinations. Contrary to the “Theory of Big,” the passing distance was greater when the cyclist rode closer to the curb.

Residual plot
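
A residual plot presumably appeared here in the original notes; one way to produce residuals versus fitted values from the aov fit above is sketched below.

plot(fitted(aov.res), residuals(aov.res),
     xlab = "fitted passing distance (m)", ylab = "residual")
abline(h = 0, lty = 2)   # reference line at zero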

Advantages and Drawbacks of Classical Approach

  • This approach controls the type I error rate (compared to testing group differences separately)

  • Easy implementation and reporting

  • Rejecting \(H_0\) does not mean there are no similarities across groups; in fact, \(\overline{y}_{\cdot j}\) can be a pretty inefficient estimate of \(\mu_j\)

Matrix Formulation

As we move towards more complex models, it will be a lot easier to work with the model in matrix form rather than in scalar form.

The general linear model is written in matrix form as \(y=X\beta+\varepsilon\). Consider the treatment means model \(y_{ij}=\mu_j+\varepsilon_{ij}\). We can represent this model in matrix form as follows.

\[\begin{eqnarray*} y &=& X \mu + \varepsilon \\ \begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{1n_1} \\ y_{21} \\ \vdots \\ y_{Jn_J} \end{bmatrix} &=& \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ \vdots & & & \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & & 1 \end{bmatrix}_{(\sum_j n_j) \times J} \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_J \end{bmatrix} + \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \vdots \\ \varepsilon_{1n_1} \\ \varepsilon_{21} \\ \vdots \\ \varepsilon_{Jn_J} \end{bmatrix} \end{eqnarray*}\]

Aside: Essence Matrix

In ANOVA, the \(X\) matrix has a special form, which is sometimes summarized by the essence matrix, the matrix containing the unique rows of \(X\).

  • The essence matrix of a treatment means formulation of ANOVA is the \(J \times J\) identity matrix, with each row repeated \(n_j\) times in the full design matrix: if \(J=3\) we have \(\begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}\)

  • The essence matrix in the treatment effects model is a column of ones appended to the left of the identity matrix: \(\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}\)

  • The essence matrix in reference cell coding is given by \(\begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}\) (assuming without loss of generality that the last group is the reference)

In each case, the row corresponding to group \(j\) is repeated \(n_j\) times so that the full \(X\) matrix has \(\sum_j n_j\) rows.
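
For the bike data, the essence matrix of the treatment means design can be inspected by taking the unique rows of the model matrix (a sketch):

X <- model.matrix(~ factor(kerb) - 1, data = PsychBikeData)   # treatment means coding
unique(X)   # the rows of the 5 x 5 identity matrix, one per kerb distance
dim(X)      # the full design matrix has one row per observation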

Simple Example

Consider a study in which we wish to compare pain levels among children recovering from surgery who were randomized to one of three groups: audio books (chosen by each child), music (the child could select the playlist), or control (noise-cancelling headphones). After 30 minutes of one of the three conditions, each child rated his or her pain status using a pain rating scale.

The study is described here, but we will consider similar data from a hypothetical smaller study (n=15 total). Suppose the data from the three groups are as follows.

Audio books: 5, 6, 7, 2, 6
Music: 5, 4, 4, 7, 6
Control: 4, 8, 7, 6, 10

Write each element of \(y=X\beta+\varepsilon\) in matrix or vector form using a group means coding scheme. Once you’ve finished, repeat the exercise using reference cell coding.
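
One way to check your answer (a sketch; the group labels below are just illustrative) is to build the design matrices in R with model.matrix():

pain  <- c(5, 6, 7, 2, 6,  5, 4, 4, 7, 6,  4, 8, 7, 6, 10)
group <- factor(rep(c("audio", "music", "control"), each = 5))
model.matrix(~ group - 1)    # group (treatment) means coding
model.matrix(~ group)        # reference cell coding (first factor level is the reference)
coef(lm(pain ~ group - 1))   # OLS estimates equal the three sample means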

Least Squares Estimation

The least squares estimate minimizes the sum of squared residuals given by

\[\begin{eqnarray*} SSE&=& (y-\widehat{y})'(y-\widehat{y}) \\ &=& (y-X\widehat{\mu})'(y-X\widehat{\mu}) \\ &=& y'y-2\widehat{\mu}'X'y+\widehat{\mu}'X'X\widehat{\mu} \end{eqnarray*}\]

To find the \(\mu\) that minimizes \(SSE=y'y-2\mu'X'y+\mu'X'X\mu\), take the derivative

\[\frac{\partial SSE}{\partial \mu}=0-2X'y+2X'X\mu\] and set it equal to zero:

\[0=-2X'y+2X'X\widehat{\mu}\] to get the normal equations \[(X'X)_{p \times p}\widehat{\mu}=X'y.\]

Read more here if you’re rusty on matrix differentiation.

When \(X\) has rank \(p\), we solve the normal equations

\[\begin{eqnarray*} (X'X)\widehat{\mu}&=&X'y \\ (X'X)^{-1}(X'X)\widehat{\mu}&=&(X'X)^{-1}X'y \\ \widehat{\mu}&=&(X'X)^{-1}X'y \end{eqnarray*}\]
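
As a concrete check with the bike data (a sketch), solving the normal equations directly reproduces the group means:

X <- model.matrix(~ factor(kerb) - 1, data = PsychBikeData)
y <- PsychBikeData$`passing distance`
mu_hat <- solve(t(X) %*% X, t(X) %*% y)   # solves (X'X) mu = X'y
cbind(mu_hat, tapply(y, PsychBikeData$kerb, mean))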

Our least squares estimate in this case is unique and is the best linear unbiased estimate; if our errors are Gaussian, \(\widehat{\mu}\) is also the MLE and the minimum variance unbiased estimator.

When \(X\) has rank \(<p\), we can use a generalized inverse, but \(\widehat{\mu}\) is not unique, though the predicted values and residuals are unique.