7 Estimating the Model Parameters

How to get \(b_0\) and \(b_1\): least squares & maximum likelihood…

There are two approaches to estimating the parameters \(\beta_0\) and \(\beta_1\) in the regression model. The oldest and most traditional approach uses the idea of least squares. A more general approach uses the idea of maximum likelihood (see below). Fortunately, for simple linear regression, the estimates of \(\beta_0\) and \(\beta_1\) obtained from either method are identical. The estimates of the true parameter values \(\beta_0\) and \(\beta_1\) are typically denoted by \(b_0\) and \(b_1\), respectively, and are given by the following formulas.

| Parameter Estimate | Mathematical Formula | R Code |
|---|---|---|
| Slope | \(b_1 = \frac{\sum X_i(Y_i-\bar{Y})}{\sum(X_i-\bar{X})^2}\) | `b_1 <- sum( X*(Y - mean(Y)) ) / sum( (X - mean(X))^2 )` |
| Intercept | \(b_0 = \bar{Y} - b_1\bar{X}\) | `b_0 <- mean(Y) - b_1*mean(X)` |

It is important to note that these estimates are entirely determined from the observed data \(X\) and \(Y\). When the regression equation is written using the estimates instead of the parameters, we use the notation \(\hat{Y}\), which is the estimator of \(E\{Y\}\). Thus, we write \[\begin{equation} \hat{Y}_i = b_0 + b_1 X_i \end{equation}\] which is directly comparable to the true, but unknown values \[\begin{equation} E\{Y_i\} = \beta_0 + \beta_1 X_i. \label{exp} \end{equation}\]
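As a concrete illustration, here is a minimal R sketch that computes \(b_0\), \(b_1\), and the fitted values \(\hat{Y}_i\) from a small hypothetical data set (the `X` and `Y` values below are invented purely for illustration) and checks them against R's built-in `lm()` function.

```r
# Hypothetical data, invented purely for illustration
X <- c(1, 2, 3, 4, 5, 6, 7, 8)
Y <- c(2.1, 2.9, 4.2, 4.8, 6.1, 6.9, 8.3, 8.8)

# Estimates from the formulas in the table above
b_1 <- sum( X*(Y - mean(Y)) ) / sum( (X - mean(X))^2 )
b_0 <- mean(Y) - b_1*mean(X)

# Fitted values Y-hat from the estimated regression equation
Y_hat <- b_0 + b_1*X

# Check against R's built-in least squares fit
mylm <- lm(Y ~ X)
coef(mylm)     # should match c(b_0, b_1)
fitted(mylm)   # should match Y_hat
```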

7.1 Least Squares

To estimate the model parameters \(\beta_0\) and \(\beta_1\) using least squares, we start by defining the function \(Q\) as the sum of the squared errors, \(\epsilon_i\). \[ Q = \sum_{i=1}^n \epsilon_i^2 = \sum_{i=1}^n (Y_i - (\beta_0 + \beta_1 X_i))^2 \] We then treat \(Q\) as a function of \(\beta_0\) and \(\beta_1\), with the values of \(Y\) and \(X\) considered fixed. This makes sense because once a particular data set has been observed, those values are all known for that data set. What we don’t know are the values of \(\beta_0\) and \(\beta_1\).
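To make this concrete, here is a short R sketch (using the same hypothetical `X` and `Y` as above) that codes \(Q\) as a function of candidate parameter values, so you can see how different guesses for the slope and intercept give different sums of squared errors.

```r
# Q as a function of candidate values for beta_0 and beta_1
# (X and Y are treated as fixed, known data)
Q <- function(beta_0, beta_1, X, Y) {
  sum( (Y - (beta_0 + beta_1*X))^2 )
}

# Two different guesses give two different values of Q
Q(beta_0 = 0, beta_1 = 1,   X = X, Y = Y)
Q(beta_0 = 1, beta_1 = 0.9, X = X, Y = Y)
```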

This least squares applet is a good way to explore how various choices of the slope and intercept yield different values of the “sum of squared residuals.” It turns out that there is one “best” choice of the slope and intercept that yields the smallest value of the “sum of squared residuals.” This best choice can be found using calculus by taking the partial derivatives of \(Q\) with respect to both \(\beta_0\) and \(\beta_1\). \[ \frac{\partial Q}{\partial \beta_0} = -2\sum (Y_i - \beta_0 - \beta_1X_i) \] \[ \frac{\partial Q}{\partial \beta_1} = -2\sum X_i(Y_i-\beta_0-\beta_1X_i) \] Setting these partial derivatives equal to zero and solving the resulting system of equations gives the values of the parameters that minimize \(Q\) for a given set of data. After all the calculations are completed, we find that the estimators \(b_0\) and \(b_1\) (of \(\beta_0\) and \(\beta_1\), respectively) are as stated previously.
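You don't have to do the calculus by hand to convince yourself of this result. The sketch below (again using the hypothetical data and the `Q()` function defined above) minimizes \(Q\) numerically with R's `optim()` and compares the answer to the closed-form estimates \(b_0\) and \(b_1\).

```r
# Minimize Q numerically over (beta_0, beta_1)
fit <- optim(par = c(0, 0),
             fn  = function(b) Q(beta_0 = b[1], beta_1 = b[2], X = X, Y = Y))

fit$par       # numerical minimizer: should be very close to the closed-form values
c(b_0, b_1)   # closed-form least squares estimates
```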

7.2 Maximum Likelihood

The idea of maximum likelihood estimation is the reverse of least squares: instead of choosing the values of \(\beta_0\) and \(\beta_1\) that minimize the least squares function \(Q\), we choose the values of \(\beta_0\) and \(\beta_1\) that maximize the likelihood function. The likelihood function is created by first determining the joint distribution of the \(Y_i\) for all observations \(i=1,\ldots,n\). We can do this rather simply by using the assumption that the errors \(\epsilon_i\) are independently normally distributed. When random variables are independent, their joint density is simply the product of their individual densities. Thus, if \(f(Y_i)\) denotes the probability density function of \(Y_i\), then the joint probability density of all the \(Y_i\), \(f(Y_1,\ldots,Y_n)\), is given by \[ f(Y_1,\ldots,Y_n) = \prod_{i=1}^n f(Y_i) \] Since each \(Y_i\) is assumed to be normally distributed with mean \(\beta_0 + \beta_1 X_i\) and variance \(\sigma^2\) (see the regression model stated previously), we have \[ f(Y_i) = \frac{1}{\sqrt{2\pi}\sigma}\exp{\left[-\frac{1}{2}\left(\frac{Y_i-\beta_0-\beta_1X_i}{\sigma}\right)^2\right]} \] which gives the joint density \[ f(Y_1,\ldots,Y_n) = \prod_{i=1}^n f(Y_i) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp{\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_i)^2\right]} \]

The likelihood function \(L\) is then obtained by considering the \(Y_i\) and \(X_i\) as fixed and the parameters \(\beta_0\), \(\beta_1\), and \(\sigma^2\) as the variables of the function. \[ L(\beta_0,\beta_1,\sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}}\exp{\left[-\frac{1}{2\sigma^2}\sum_{i=1}^n(Y_i-\beta_0-\beta_1X_i)^2\right]} \] Instead of taking partial derivatives of \(L\) directly (with respect to all parameters), we take the partial derivatives of \(\log L\), which is easier to work with. In a calculation similar to, but more involved than, minimizing \(Q\), we obtain the values of \(\beta_0\), \(\beta_1\), and \(\sigma^2\) that maximize \(\log L\), and which therefore maximize \(L\). The estimates of \(\beta_0\) and \(\beta_1\) turn out to be exactly the least squares estimates \(b_0\) and \(b_1\). (This is not an obvious result, but it can be verified after some intense calculations.) The additional result that maximum likelihood estimation provides, which least squares did not give us, is the estimate \(\hat{\sigma}^2\) of \(\sigma^2\). \[ \hat{\sigma}^2 = \frac{\sum(Y_i-\hat{Y}_i)^2}{n} \]
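As a rough check of this claim, the sketch below (continuing with the hypothetical data and the quantities `b_0`, `b_1`, and `Y_hat` from the earlier sketches) maximizes the log-likelihood numerically over \(\beta_0\), \(\beta_1\), and \(\sigma\), and compares the result to the least squares estimates and to \(\hat{\sigma}^2 = \sum(Y_i-\hat{Y}_i)^2/n\).

```r
# Negative log-likelihood of (beta_0, beta_1, sigma), with sigma on the log
# scale so the optimizer can search over all real numbers
negloglik <- function(theta, X, Y) {
  beta_0 <- theta[1]
  beta_1 <- theta[2]
  sigma  <- exp(theta[3])
  -sum( dnorm(Y, mean = beta_0 + beta_1*X, sd = sigma, log = TRUE) )
}

# Maximize the log-likelihood by minimizing its negative
mle <- optim(par = c(0, 0, 0), fn = negloglik, X = X, Y = Y,
             control = list(maxit = 5000))

mle$par[1:2]                       # should be close to c(b_0, b_1)
exp(mle$par[3])^2                  # numerical MLE of sigma^2
sum( (Y - Y_hat)^2 ) / length(Y)   # closed-form MLE: SSE / n
```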

7.3 Estimating the Model Variance

Estimating \(\sigma^2\) with MSE…

As shown previously in the “Estimating the Model Parameters” section of this page, we can obtain estimates for the model parameters \(\beta_0\) and \(\beta_1\) using either least squares estimation or maximum likelihood estimation. Those estimates were given by the formulas

\[ b_1 = \frac{\sum X_i(Y_i-\bar{Y})}{\sum(X_i-\bar{X})^2} \quad \text{(Unbiased Estimate of $\beta_1$)} \]

\[ b_0 = \bar{Y} - b_1\bar{X} \quad \text{(Unbiased Estimate of $\beta_0$)} \]

It turns out that these estimates for \(\beta_0\) and \(\beta_1\) are nice in the sense that, on average, they equal the true parameter values, i.e., they are unbiased estimators. Unfortunately, this is not the case for the maximum likelihood estimate \(\widehat{\sigma}^2\) of the model variance \(\sigma^2\). That estimate turns out to be a biased estimator, meaning that on average it systematically underestimates \(\sigma^2\). If we left the estimator alone, our estimates of \(\sigma^2\) would tend to be too small. Fortunately, there is a way to fix this, and the corrected version of the estimator is what we actually use in practice to estimate \(\sigma^2\).
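A small simulation makes the bias visible. The sketch below repeatedly generates data from a model with a known \(\sigma^2\) (the sample size and true parameter values here are invented for illustration) and averages the maximum likelihood estimate \(SSE/n\) across the simulated data sets; on average it comes out below the true \(\sigma^2\), which the corrected estimator introduced next avoids.

```r
set.seed(42)                     # for reproducibility
n      <- 10                     # small sample size makes the bias easy to see
sigma2 <- 4                      # true error variance (invented for illustration)
X_sim  <- 1:n

sig2_mle <- replicate(10000, {
  # true beta_0 = 3, beta_1 = 0.5 (also invented for illustration)
  Y_sim <- 3 + 0.5*X_sim + rnorm(n, sd = sqrt(sigma2))
  SSE   <- sum( resid(lm(Y_sim ~ X_sim))^2 )
  SSE / n                        # maximum likelihood estimate of sigma^2
})

mean(sig2_mle)                   # noticeably below 4; roughly 4*(n-2)/n = 3.2
```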

Without going into all the details, to “fix” the biased estimator of \(\sigma^2\) that maximum likelihood estimation gives us, we need to correct its denominator so that it properly represents the degrees of freedom associated with the numerator, which it currently does not. To find the correct degrees of freedom, notice that the \(\hat{Y}_i\) in the numerator of \(\widehat{\sigma}^2\) is defined by \[\begin{equation} \widehat{Y}_i = b_0 + b_1X_i \label{hatY} \end{equation}\] From this equation, we see that two means, \(\bar{X}\) and \(\bar{Y}\), were estimated from the data in order to obtain \(\hat{Y}_i\). (See the formulas for \(b_0\) and \(b_1\) above, and note how they use both \(\bar{X}\) and \(\bar{Y}\) in their calculation.) Any time a mean is estimated from the data, we lose a degree of freedom. Hence, the denominator for \(\hat{\sigma}^2\) should be \(n-2\) instead of \(n\). Some incredibly long calculations will show that the “fixed” estimator \[\begin{equation} s^2 = MSE = \frac{\sum(Y_i-\hat{Y}_i)^2}{n-2} \quad \text{(Unbiased Estimator of $\sigma^2$)} \end{equation}\] is an unbiased estimator of \(\sigma^2\). Here \(MSE\) stands for mean squared error, a natural name for a quantity that squares the errors \(Y_i-\hat{Y}_i\), adds them up, and divides by their degrees of freedom. Similarly, we call the numerator \(\sum(Y_i-\hat{Y}_i)^2\) the sum of squared errors, denoted \(SSE\). It is also important to note that the errors are often denoted by \(r_i = Y_i-\hat{Y}_i\), the residuals. Putting this all together, we get the following equivalent expressions for \(MSE\). \[\begin{equation} s^2 = MSE = \frac{SSE}{n-2} = \frac{\sum(Y_i-\widehat{Y}_i)^2}{n-2} = \frac{\sum r_i^2}{n-2} \end{equation}\]

As a final note, even though \(E\{MSE\} = \sigma^2\), which shows that \(MSE\) is an unbiased estimator of \(\sigma^2\), it unfortunately is not true that \(\sqrt{MSE}\) is an unbiased estimator of \(\sigma\). This presents a few problems later on, but they are minor enough that we can overlook the issue and move forward. The quantity \(\sqrt{MSE}\) is called the residual standard error.
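Returning to the hypothetical data from the earlier sketches, the following R code computes \(SSE\), \(MSE\), and the residual standard error \(\sqrt{MSE}\) by hand and checks them against the value `lm()` reports.

```r
# Residuals, SSE, MSE, and residual standard error computed by hand
r   <- Y - Y_hat
SSE <- sum(r^2)
MSE <- SSE / (length(Y) - 2)
sqrt(MSE)                 # residual standard error

# Compare with R's built-in regression output
summary(lm(Y ~ X))$sigma  # same value as sqrt(MSE)
```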