Maximum likelihood estimates (MLEs) have excellent large-sample properties and are applicable in a wide variety of situations.
Examples of maximum likelihood estimates include the following:
- The sample average \(\bar{X}\) of a group of independent and identically normally distributed observations \(X_1, \ldots, X_n\) is a maximum likelihood estimate.
- Parameter estimates in a linear regression model fit to normally distributed data are maximum likelihood estimates.
- Parameter estimates in a logistic regression model are maximum likelihood estimates.
Let \(L({\bf Y} \mid \theta)\) denote the likelihood function for data \({\bf Y}=(Y_1,Y_2,\ldots,Y_n)\) from some population described by the parameters \(\theta=(\theta_1,\theta_2,\ldots,\theta_p)\).
The maximum likelihood estimate of \(\theta\) is the value \(\widehat{\theta}=(\widehat{\theta}_1,\widehat{\theta}_2,\ldots, \widehat{\theta}_p)\) for which \[L({\bf Y} \mid \widehat{\theta}) \ge L({\bf Y} \mid \theta^*),\] where \(\theta^*\) is any other value of \(\theta\).
Thus the maximum likelihood estimate is the value of \(\theta\) under which the observed data are most “probable” or “likely.”
Maximizing the likelihood \(L({\bf Y} \mid \theta)\) is equivalent to maximizing the natural logarithm \(\ln(L({\bf Y} \mid \theta))=\ell({\bf Y} \mid \theta)\), called the log-likelihood. The maximum likelihood estimates are typically found as the solutions of the \(p\) equations obtained by setting the \(p\) partial derivatives of \(\ell({\bf Y} \mid \theta)\) with respect to each \(\theta_j\), \(j=1,\ldots,p\), equal to zero.
Why do we set the derivatives equal to zero? The derivative gives the slope of the likelihood (or log-likelihood), and the slope is zero at a local minimum or a local maximum. (The second derivative is negative at a maximum and positive at a minimum.)
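To make the recipe concrete, here is a minimal sketch, not taken from the text, that applies it symbolically with SymPy to an assumed iid exponential sample; the symbols `lam`, `n`, and `ybar` are illustration choices rather than notation used in this section.

```python
# Sketch only: differentiate a log-likelihood, set the derivative to zero,
# and check the sign of the second derivative (assumed exponential model).
import sympy as sp

lam, n, ybar = sp.symbols("lam n ybar", positive=True)

# Log-likelihood of n iid Exponential(lam) observations with sample mean ybar:
# ell(lam) = n*log(lam) - lam*sum(y_i) = n*log(lam) - lam*n*ybar
ell = n * sp.log(lam) - lam * n * ybar

score = sp.diff(ell, lam)               # first derivative (the score)
mle = sp.solve(sp.Eq(score, 0), lam)    # solve score = 0 for lam
print(mle)                              # [1/ybar]

# The second derivative is negative at the solution, confirming a maximum.
second = sp.diff(ell, lam, 2)
print(sp.simplify(second.subs(lam, mle[0])))   # -n*ybar**2 < 0
```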
When closed-form expressions for the maximum likelihood estimates do not exist, numerical algorithms can be used to solve for the estimates.
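As a sketch of what such an algorithm looks like in practice, one can minimize the negative log-likelihood numerically; the gamma model, the simulated data, and the function names below are assumptions made only for illustration.

```python
# Hedged sketch: numerical maximum likelihood via scipy.optimize.minimize
# applied to the negative log-likelihood of an assumed gamma model.
import numpy as np
from scipy import optimize, stats

rng = np.random.default_rng(42)
y = rng.gamma(shape=2.0, scale=3.0, size=200)   # simulated data for illustration

def neg_log_lik(params):
    shape, scale = params
    if shape <= 0 or scale <= 0:                # stay inside the parameter space
        return np.inf
    return -np.sum(stats.gamma.logpdf(y, a=shape, scale=scale))

# Minimizing the negative log-likelihood is equivalent to maximizing the likelihood.
result = optimize.minimize(neg_log_lik, x0=[1.0, 1.0], method="Nelder-Mead")
print(result.x)   # numerical MLEs of (shape, scale)
```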
Example: Let \(Y_i\), \(i=1, \ldots, n\) be iid normal random variables with mean \(\mu\) and variance \(\sigma^2\), so \(Y_i \sim N(\mu,\sigma^2)\), \(i=1,\ldots,n\).
The density of \(Y_i\) is given by
\[f(Y_i \mid \mu, \sigma^2)=(2 \pi)^{-\frac{1}{2}}( \sigma^2)^{-\frac{1}{2}} \exp \left\{-\frac{1}{2 \sigma^2}(Y_i-\mu)^2 \right\}.\]
Find the maximum likelihood estimates of \(\mu\) and \(\sigma^2\).
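A sketch of the standard calculation, added here for illustration, writes the log-likelihood of the sample and sets its partial derivatives to zero:
\[\begin{aligned}
\ell(\mu,\sigma^2) &= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(Y_i-\mu)^2,\\
\frac{\partial \ell}{\partial \mu} &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(Y_i-\mu) = 0 \quad\Longrightarrow\quad \widehat{\mu}=\bar{Y},\\
\frac{\partial \ell}{\partial \sigma^2} &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(Y_i-\mu)^2 = 0 \quad\Longrightarrow\quad \widehat{\sigma}^2=\frac{1}{n}\sum_{i=1}^{n}(Y_i-\bar{Y})^2.
\end{aligned}\]
Note that \(\widehat{\sigma}^2\) divides by \(n\) rather than \(n-1\), so the maximum likelihood estimate of the variance is not the usual unbiased sample variance.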
In large samples, maximum likelihood estimates are approximately normally distributed. In particular, for a regression coefficient \(\beta_1\), the standardized estimate \(\frac{\widehat{\beta}_1-\beta_1}{\sqrt{\widehat{\text{Var}}(\widehat{\beta}_1)}}\) is approximately \(N(0,1)\) when the sample size is large.
A test of \(H_0: \beta_1=0\) versus \(H_A: \beta_1 \ne 0\) can be based on the \(Z\) statistic \(\frac{\widehat{\beta}_1}{\sqrt{\widehat{\text{Var}}(\widehat{\beta}_1)}}\), which has approximately a \(N(0,1)\) distribution under \(H_0\). This test is called a Wald test.
A \(100(1-\alpha)\%\) confidence interval for \(\beta_1\) is given by \(\widehat{\beta}_1 \pm Z_{1-\frac{\alpha}{2}} \sqrt{\widehat{\text{Var}}(\widehat{\beta}_1)}\), where \(Pr(Z > Z_{1-\frac{\alpha}{2}})=\frac{\alpha}{2}\) when \(Z \sim N(0,1)\).
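As a hedged illustration, the Wald statistic and the matching confidence interval can be computed directly from an estimate and its standard error; the numbers and variable names below are made up, not taken from the text.

```python
# Sketch only: Wald Z test of H0: beta_1 = 0 and a 100(1 - alpha)% confidence
# interval, using hypothetical values for the estimate and its standard error.
from scipy import stats

beta_hat = 0.85        # hypothetical estimate of beta_1
se_beta_hat = 0.32     # hypothetical sqrt of the estimated Var(beta_1_hat)
alpha = 0.05

z = beta_hat / se_beta_hat                      # Wald Z statistic
p_value = 2 * stats.norm.sf(abs(z))             # two-sided p-value from N(0,1)

z_crit = stats.norm.ppf(1 - alpha / 2)          # Z_{1 - alpha/2}
ci = (beta_hat - z_crit * se_beta_hat,
      beta_hat + z_crit * se_beta_hat)          # 100(1 - alpha)% confidence interval

print(z, p_value, ci)
```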
Testing \(\sigma^2\) is more complicated. In particular, the Wald test of the hypothesis \(\sigma^2=0\) is not recommended because the value \(\sigma^2=0\) lies on the boundary of the parameter space for \(\sigma^2\), violating a regularity condition required for the test to be valid.