This lecture is based on Section 2.1 of Hoff.
Dr. Ian Walker at the University of Bath carried out a project to investigate how drivers overtake bicyclists. His team subtly modified a bicycle to carry both a video system and an accurate ultrasonic distance sensor that recorded the proximity of each passing vehicle.
The team then designed an experiment in which a cyclist (Dr. Walker) varied the distance he rode from the curb (the British spelling kerb is used in the dataset) from 0.25 m to 1.25 m in 0.25 m increments.
We will consider the outcome passing distance \(y_{ij}\), which is the measured distance (in m) between the vehicle and the cyclist, as a function of the distance from the bike to the curb (indexed by \(j\)), as some cyclists have postulated that “the more room you take up, the more room they give you.” We’ll use these data to test this “Theory of Big.”
Our research question of interest is whether the distance from the bike to the curb is indeed related to the passing distance between the bike and a vehicle.
load("/Users/dunson/Documents/COURSES/STA610/NOTES/Hierarchical-Model-Notes/PsychBikeData.RData")
summary(PsychBikeData)
## vehicle colour passing distance
## ordinary :1708 blue :636 Min. :0.394
## minibus : 293 silver/grey:531 1st Qu.:1.303
## SUV/pickup : 143 red :378 Median :1.529
## bus : 46 white :333 Mean :1.564
## HGV : 82 black :262 3rd Qu.:1.790
## taxi : 49 green :149 Max. :3.787
## powered two-wheeler: 34 (Other) : 66
## street Time
## one way, one lane : 9 Min. :1904-01-01 07:46:00
## one way, 2 lanes : 13 1st Qu.:1904-01-01 10:14:00
## regular urban street : 655 Median :1904-01-01 12:13:00
## regular residential street: 39 Mean :1904-01-01 12:40:09
## main road, regular :1637 3rd Qu.:1904-01-01 15:30:00
## rural : 2 Max. :1904-01-01 17:12:00
##
## hour helmet kerb Bikelane
## Min. :1904-01-01 07:00:00 no :1206 Min. :0.2500 no :2305
## 1st Qu.:1904-01-01 10:00:00 yes:1149 1st Qu.:0.2500 yes: 50
## Median :1904-01-01 12:00:00 Median :0.5000
## Mean :1904-01-01 12:05:38 Mean :0.6702
## 3rd Qu.:1904-01-01 15:00:00 3rd Qu.:1.0000
## Max. :1904-01-01 17:00:00 Max. :1.2500
##
## City date
## Salisbury :1905 Min. :2006-05-11 00:00:00
## Bristol : 450 1st Qu.:2006-05-20 00:00:00
## Portsmouth: 0 Median :2006-05-27 00:00:00
## Mean :2006-05-27 12:08:15
## 3rd Qu.:2006-05-31 00:00:00
## Max. :2006-06-19 00:00:00
##
Research question: is distance from curb related to passing distance?
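The output below shows, for each kerb distance (in m), the number of passing events recorded and then the mean passing distance (in m). The original code chunk is not shown, but a sketch along the following lines would produce it:

```r
# Number of passing events recorded at each kerb distance
table(PsychBikeData$kerb)

# Mean passing distance (m) at each kerb distance
tapply(PsychBikeData$`passing distance`, PsychBikeData$kerb, mean)
```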
##
## 0.25 0.5 0.75 1 1.25
## 670 545 339 469 332
## 0.25 0.5 0.75 1 1.25
## 1.698054 1.590473 1.505519 1.490584 1.412813
Consider the model
\[\begin{eqnarray*} y_{ij}&=&\mu+\alpha_j+\varepsilon_{ij} \text{ (treatment effects model)} \\ &=& \mu_j + \varepsilon_{ij} ~~~~~~~\text{ (treatment means model)} \end{eqnarray*}\] where \(\mu_{j}=\mu+\alpha_j\). These two equations are simply alternate parameterizations of the same model. In each case, we estimate a separate mean passing distance \(\mu_j=\mu+\alpha_j\) as a function of each of the 5 curb distances tested.
\[\begin{eqnarray*} y_{ij}&=&\mu+\alpha_j+\varepsilon_{ij}=\mu_j + \varepsilon_{ij} \end{eqnarray*}\]
- \(\mu\): expected passing distance (grand mean)
- \(\mu_j\): expected passing distance for curb distance \(j\)
- \(\alpha_j\): deviation between overall expected passing distance and expected passing distance for curb distance \(j\)
- \(\varepsilon_{ij}\): deviations of observed passing distance from curb-distance-specific expectations
Because the treatment effects model uses \(J+1\) parameters \((\mu,\alpha_1,\ldots,\alpha_J)\) to describe \(J\) group means, a constraint is needed to identify the parameters. In the standard ANOVA model, \(\sum_j \alpha_j = 0\) is assumed, so that \(\mu\) represents an overall mean across groups.

Another coding scheme sets one \(\alpha_j=0\), so that \(\mu\) represents the expected passing distance in that particular (reference) group, and the remaining \(\alpha_j\) measure differences from the expected passing distance in the reference group.
We also assume that \(\varepsilon_{ij} \overset{iid}{\sim} f(\varepsilon)\) with \(E(\varepsilon_{ij})=0\) within all groups \(j\).
The expected passing distance for occasion \(i\) with curb distance \(j\) is then
\[E(y_{ij} \mid \mu, \alpha_1, \cdots, \alpha_J)=E(\mu+\alpha_j+\varepsilon_{ij} \mid \mu, \alpha_1, \cdots, \alpha_J)=\mu+\alpha_j=\mu_j\]
If we assume \(f(\varepsilon)=N\left(0,\sigma^2\right)\), then our model is \(y_{ij} \sim N\left(\mu+\alpha_j,\sigma^2\right)\) or equivalently \(y_{ij} \sim N\left(\mu_j,\sigma^2\right)\).
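As a quick illustration of this sampling model, the sketch below simulates data from \(y_{ij} \sim N(\mu_j,\sigma^2)\); the values of \(\mu_j\), \(\sigma\), and the group sizes are hypothetical (loosely inspired by the group means above), not estimates from the bike data.

```r
set.seed(1)
mu_j  <- c(1.70, 1.59, 1.51, 1.49, 1.41)  # hypothetical group means (m)
sigma <- 0.37                             # hypothetical common within-group SD (m)
n_j   <- 50                               # hypothetical observations per group

group <- rep(1:5, each = n_j)
y_sim <- rnorm(length(group), mean = mu_j[group], sd = sigma)

tapply(y_sim, group, mean)                # sample means are close to mu_j
```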
Let \(\widehat{\mu}=\left(\widehat{\mu}_1,\cdots,\widehat{\mu}_J\right)\) be our estimates of the unknown parameters \(\mu=(\mu_1,\cdots,\mu_J)\). The residual for \(y_{ij}\) is the difference between the observed \(y_{ij}\) and our fitted value \(\widehat{y}_{ij}\) and is given by
\(\widehat{\varepsilon}_{ij}=y_{ij}-\widehat{y}_{ij}=y_{ij}-\widehat{\mu}_{j}\).
The ordinary least squares (OLS) estimate of \(\mu\), \(\widehat{\mu}_{OLS}\), is the value that minimizes the sum of squared residuals (sum of squared errors) given by
\[SSE(\mu)=\sum_j \sum_i (y_{ij}-\mu_j)^2.\]
You can show (homework 1!) that the OLS estimates are given by
- \((\widehat{\mu}_1,\cdots,\widehat{\mu}_J)=(\overline{y}_1,\cdots,\overline{y}_J)\), where \(\overline{y}_j\) is the sample mean in group \(j\)
- \(\widehat{\mu}=\overline{y}\), where \(\overline{y}\) is the grand mean over all observations; when the group sample sizes \(n_j\) are all equal, this is the same as \(\frac{1}{J}\sum_j \widehat{\mu}_j\)
- \(\widehat{\alpha}_j=\widehat{\mu}_j-\widehat{\mu}=\overline{y}_{j}-\overline{y}\)
A helpful mnemonic may be the following “decomposition” of a single data point:
\[\begin{eqnarray*} y_{ij}&=& \overline{y} + (\overline{y}_{j}-\overline{y}) + (y_{ij}-\overline{y}_{j}) \\ &=& \widehat{\mu} ~~ + ~~~~~\widehat{\alpha}_j ~~~~~ + ~~~~~ \widehat{\varepsilon}_{ij} \end{eqnarray*}\]
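These estimates and the decomposition above are easy to check numerically; a sketch (assuming the PsychBikeData object loaded earlier):

```r
y <- PsychBikeData$`passing distance`
g <- factor(PsychBikeData$kerb)

# Group means equal the OLS estimates from the treatment means model
ybar_j <- tapply(y, g, mean)
coef(lm(y ~ g - 1))                       # same values as ybar_j

# Decomposition of a single observation: y_ij = ybar + alpha_hat_j + resid_ij
ybar      <- mean(y)
alpha_hat <- ybar_j - ybar
resid_1   <- y[1] - ybar_j[as.character(g[1])]
c(y[1], ybar + alpha_hat[as.character(g[1])] + resid_1)   # the two entries agree
```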
Based on those ideas, let’s decompose the variability of the data around the grand mean into variation within groups (error) and variation between or across groups (group effects). Suppose we have \(J\) groups with \(n_j\) observations in group \(j\).
We break down the total variation of the data around the overall mean as follows: \[SST=SSG+SSE,\] where SST is the total sum of squared deviations around the overall mean, SSG is the portion of the total sum of squares due to group differences, and SSE is the portion of the total sum of squares due to differences between the individual observations and their own group means.
We define the sums of squares as follows:
- SST=\(\sum_{j=1}^J \sum_{i=1}^{n_j} \left(y_{ij}-\overline{y}\right)^2\)
- SSG=\(\sum_{j=1}^J \sum_{i=1}^{n_j} \left(\overline{y}_{\cdot j}-\overline{y}\right)^2\)
- SSE=\(\sum_{j=1}^J \sum_{i=1}^{n_j} \left(y_{ij}-\overline{y}_{\cdot j}\right)^2\)
The main use of ANOVA is to evaluate the hypothesis that there are no differences across groups, e.g. \(H_0: \mu_j=\mu_{j'}\) \(\forall ~ j \neq j'\) against the alternative that at least one mean is different.
Testing in ANOVA involves comparison of the mean squares for groups and the mean squares for error (we’ll come back to why this is sensible) and can be summarized in the ANOVA table. Let \(N=\sum_j n_j\) be the overall sample size.
Source | DF | SS | MS | F | p |
---|---|---|---|---|---|
Groups | \(J-1\) | SSG | \(MSG=\frac{SSG}{J-1}\) | \(\frac{MSG}{MSE}\) | from \(F_{J-1,N-J}\) |
Error | \(N-J\) | SSE | \(MSE=\frac{SSE}{N-J}\) | | |
Total | \(N-1\) | SST | | | |
Using this Shiny app you can explore the roles of within-group and between-group variance in ANOVA.
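The entries in this table can also be computed directly from the definitions, reusing `y` and `g` from the earlier sketch:

```r
N      <- length(y)
J      <- nlevels(g)
ybar   <- mean(y)
ybar_j <- tapply(y, g, mean)
n_j    <- table(g)

SST <- sum((y - ybar)^2)
SSE <- sum((y - ybar_j[as.character(g)])^2)
SSG <- sum(n_j * (ybar_j - ybar)^2)
c(SST = SST, SSG = SSG, SSE = SSE, SSG + SSE)   # note SST = SSG + SSE

MSG   <- SSG / (J - 1)
MSE   <- SSE / (N - J)
Fstat <- MSG / MSE
c(F = Fstat, p = pf(Fstat, J - 1, N - J, lower.tail = FALSE))
```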
The MSE can be written
\[\begin{eqnarray*} MSE &=& \frac{SSE}{\sum_j (n_j-1)} \\ &=& \frac{\sum_{j=1}^J \sum_{i=1}^{n_j} \left(y_{ij}-\overline{y}_{\cdot j}\right)^2}{\sum_j (n_j-1)} \\ &=& \frac{\sum_{i=1}^{n_1} \left(y_{i1}-\overline{y}_{\cdot 1}\right)^2 + \cdots + \sum_{i=1}^{n_J} \left(y_{iJ}-\overline{y}_{\cdot J}\right)^2}{(n_1-1)+ \cdots + (n_J-1)} \\ &=& \frac{(n_1-1)s_1^2 + \cdots+ (n_J-1)s_J^2}{(n_1-1)+ \cdots + (n_J-1)} \end{eqnarray*}\]
In ANOVA, we typically assume independence of observations and equal variances within all the groups. We see that the \(MSE=\frac{(n_1-1)s_1^2 + \cdots+ (n_J-1)s_J^2}{(n_1-1)+ \cdots + (n_J-1)}\) is a pooled estimate of the within-group sample variance, and you can show that \(E(MSE)=\sigma^2\) if our assumption of equal variances holds.
Using algebra, you can show that \(E(MSG)=\sigma^2 + \frac{\sum n_j (\mu_j-\mu)^2}{J-1}\). Under the null hypothesis that all the group means are the same, this expectation reduces to \(\sigma^2\).
Thus under \(H_0\), \(E(MSG)=E(MSE)=\sigma^2\), so we expect \(F=\frac{MSG}{MSE}\) to be close to 1; but if the group means are in fact different from each other, we expect \(E(MSG)>\sigma^2\) and \(F>1\).
Under the assumption that \(\varepsilon_{ij} \overset{iid}{\sim} N(0,\sigma^2)\), if \(H_0\) is also true, then
\[F=\frac{MSG}{MSE} \sim F_{J-1,N-J}.\]
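In R this test can be carried out with `aov`; the table below could be produced along these lines (the formula matches the fit shown in the Tukey output further down):

```r
fit_aov <- aov(`passing distance` ~ factor(kerb), data = PsychBikeData)
summary(fit_aov)
```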
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(kerb) 4 23.7 5.925 43.18 <2e-16 ***
## Residuals 2350 322.4 0.137
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
This F test indicates that it is very unlikely we would see differences in passing distance as large as we did under the null hypothesis that all groups have the same mean. There is a difference in passing distance for at least one set of curb distances.
After an overall F test is rejected, the next obvious question is “Why?” It could be the case that each group is different from all other groups, or maybe only one group differs from the others. Many types of step-down tests are available with varying degrees of protection of the type I error rate.
The phrasing step-down is a reminder that these tests are not conducted unless the overall F test is rejected. Some options include the following.
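One common choice, shown below, is Tukey’s honest significant difference (HSD) procedure; assuming the `aov` fit above was stored as `fit_aov`, the output that follows could be produced with:

```r
TukeyHSD(fit_aov, conf.level = 0.95)
```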
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = `passing distance` ~ factor(kerb), data = PsychBikeData)
##
## $`factor(kerb)`
## diff lwr upr p adj
## 0.5-0.25 -0.10758034 -0.16590883 -0.04925185 0.0000051
## 0.75-0.25 -0.19253456 -0.25993094 -0.12513817 0.0000000
## 1-0.25 -0.20746951 -0.26834834 -0.14659068 0.0000000
## 1.25-0.25 -0.28524048 -0.35310701 -0.21737394 0.0000000
## 0.75-0.5 -0.08495422 -0.15489916 -0.01500928 0.0082429
## 1-0.5 -0.09988917 -0.16357790 -0.03620045 0.0001874
## 1.25-0.5 -0.17766014 -0.24805821 -0.10726207 0.0000000
## 1-0.75 -0.01493495 -0.08702041 0.05715051 0.9799835
## 1.25-0.75 -0.09270592 -0.17078247 -0.01462937 0.0105846
## 1.25-1 -0.07777097 -0.15029619 -0.00524575 0.0284527
Because we’re curious, we’ll look for where the difference appears to be with Tukey’s HSD test, which compares all pairs of group means while controlling the family-wise error rate. Looks like the passing distance is similar for curb distances of 0.75 m and 1 m but different for the other combinations. Contrary to the hypothesis, the passing distance was greater when bikers rode closer to the curb.
- This approach controls the type I error rate (compared to testing group differences separately).
- It is easy to implement and report.
- However, rejecting \(H_0\) does not mean there are no similarities across groups; in fact, \(\overline{y}_{\cdot j}\) can be a pretty inefficient estimate of \(\mu_j\).
As we move towards more complex models, it will be a lot easier to work with the model in matrix form rather than in scalar form.
The general linear model is written in matrix form as \(y=X\beta+\varepsilon\). Consider the treatment means model \(y_{ij}=\mu_j+\varepsilon_{ij}\). We can represent this model in matrix form as follows.
\[\begin{eqnarray*} y &=& X \mu + \varepsilon \\ \begin{bmatrix} y_{11} \\ y_{12} \\ \vdots \\ y_{1n_1} \\ y_{21} \\ \vdots \\ y_{Jn_J} \end{bmatrix} &=& \begin{bmatrix} 1 & 0 & \cdots & 0 \\ 1 & 0 & \cdots & 0 \\ \vdots & & & \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & 0 & \cdots \\ \vdots & \ddots & & \vdots \\ 0 & \cdots & & 1 \end{bmatrix}_{(\sum_j n_j) \times J} \begin{bmatrix} \mu_1 \\ \vdots \\ \mu_J \end{bmatrix} + \begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \vdots \\ \varepsilon_{1n_1} \\ \varepsilon_{21} \\ \vdots \\ \varepsilon_{Jn_J} \end{bmatrix} \end{eqnarray*}\]
In ANOVA, the \(X\) matrix has a special form, which is sometimes summarized by the essence matrix, which shows the unique rows of the matrix.
The essence matrix in the treatment effects model is a column of ones appended to the left of the identity matrix (shown here for \(J=3\) groups): \(\begin{bmatrix} 1 & 1 & 0 & 0 \\ 1 & 0 & 1 & 0 \\ 1 & 0 & 0 & 1 \end{bmatrix}\)
The essence matrix in reference cell coding is given by \(\begin{bmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 1 & 0 & 0 \end{bmatrix}\) (assuming without loss of generality that the last group is the reference)
In each case, the row corresponding to group \(j\) is repeated \(n_j\) times so that the full \(X\) matrix has \(\sum_j n_j\) rows.
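In R, `model.matrix` builds these design matrices from a factor; a sketch for the kerb groups under each coding (note that R’s default reference cell coding uses the first level as the reference rather than the last):

```r
g <- factor(PsychBikeData$kerb)

# Treatment means coding, as in the matrix display above: one indicator per group
X_means <- model.matrix(~ g - 1)
unique(X_means)                      # the essence matrix: the J unique rows

# Treatment effects coding: a column of ones appended to the group indicators
X_effects <- cbind(1, X_means)
unique(X_effects)                    # J + 1 columns but only rank J

# Reference cell coding (R drops the first level by default)
X_ref <- model.matrix(~ g)
unique(X_ref)
```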
Consider a study in which we wish to compare pain levels among children recovering from surgery who were randomized to one of three groups: audio books (chosen by each child), music (the child could select the playlist), or control (noise cancelling headphones). After 30 minutes of one of the three conditions, each child rated his or her pain status using a pain rating scale.
The study is described here, but we will consider similar data from a hypothetical smaller study (n=15 total). Suppose the data from the three groups is as follows.
- Audio books: 5, 6, 7, 2, 6
- Music: 5, 4, 4, 7, 6
- Control: 4, 8, 7, 6, 10
Write each element of \(y=X\beta+\varepsilon\) in matrix or vector form using a group means coding scheme. Once you’ve finished, repeat the exercise using reference cell coding.
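Once you have written the matrices by hand, you can check them in R; a sketch (the object names and group labels here are just for illustration):

```r
# Enter the hypothetical pain data in the order listed above
pain  <- c(5, 6, 7, 2, 6,  5, 4, 4, 7, 6,  4, 8, 7, 6, 10)
group <- factor(rep(c("audio", "music", "control"), each = 5),
                levels = c("audio", "music", "control"))

model.matrix(~ group - 1)   # group means coding
model.matrix(~ group)       # reference cell coding (reference = first level, "audio")
```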
The least squares estimate minimizes the sum of squared residuals given by
\[\begin{eqnarray*} SSE&=& (y-\widehat{y})'(y-\widehat{y}) \\ &=& (y-X\widehat{\mu})'(y-X\widehat{\mu}) \\ &=& y'y-2\widehat{\mu}'X'y+\widehat{\mu}'X'X\widehat{\mu} \end{eqnarray*}\]
To find the \(\mu\) that minimizes \(SSE=y'y-2\mu'X'y+\mu'X'X\mu\), take derivatives:
\[\frac{\partial SSE}{\partial \mu}=-2X'y+2X'X\mu,\] set the derivative equal to zero,
\[0=-2X'y+2X'X\widehat{\mu},\] and rearrange to obtain the normal equations \[(X'X)_{p \times p}\widehat{\mu}=X'y.\]
When \(X\) has full rank \(p\) (the number of columns of \(X\)), we solve the normal equations
\[\begin{eqnarray*} (X'X)\widehat{\mu}&=&X'y \\ (X'X)^{-1}(X'X)\widehat{\mu}&=&(X'X)^{-1}X'y \\ \widehat{\mu}&=&(X'X)^{-1}X'y \end{eqnarray*}\]
Our least squares estimate in this case is unique, the best linear unbiased estimate, and if our errors are Gaussian, \(\widehat{\mu}\) is the MLE and minimum variance unbiased estimator.
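For a full-rank design such as the group means coding, the normal equations can be solved directly and compared with `lm`; a sketch reusing the hypothetical `pain` and `group` objects above:

```r
X <- model.matrix(~ group - 1)

# Solve (X'X) mu_hat = X'y directly
mu_hat <- solve(t(X) %*% X, t(X) %*% pain)
mu_hat

# Same estimates from lm and from the group sample means
coef(lm(pain ~ group - 1))
tapply(pain, group, mean)
```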
When \(X\) has rank \(<p\), we can use a generalized inverse, but \(\widehat{\mu}\) is not unique, though the predicted values and residuals are unique.
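As a quick illustration of the rank-deficient case, the sketch below applies `MASS::ginv` (one particular generalized inverse) to the over-parameterized treatment effects design built from the objects above: the coefficient estimates depend on which generalized inverse is used, but the fitted values match those from the full-rank fit.

```r
library(MASS)                    # for ginv()

X_full <- cbind(1, X)            # intercept plus all group indicators: J + 1 columns, rank J
b_ginv <- ginv(t(X_full) %*% X_full) %*% t(X_full) %*% pain
fitted_ginv <- X_full %*% b_ginv

# Fitted values (and hence residuals) agree with the full-rank group means fit
all.equal(as.numeric(fitted_ginv), as.numeric(X %*% mu_hat))
```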