9 Inference for the Model Parameters
We are sometimes interested in making inference about \(\beta_0\), the y-intercept. However, most inference in regression is focused on the slope, \(\beta_1\). Recall that the interpretation of \(\beta_1\) is the amount of increase (or decrease) in the expected value (average value) of \(Y\) per unit change in \(X\).
Two types of inference about \(\beta_1\), or similarly \(\beta_0\) when applicable, are of interest.
Hypotheses | Test Statistic | P-value |
---|---|---|
\(H_0: \beta_0 =\) \(\underbrace{0}_\text{a number}\) This could be any number, not just 0. However, the default summar(mylm) output in R only shows the test statistic and p-value for the test that uses 0. To test a different value, you would need to compute the test statistic and p-value by hand using the formula shown. \(H_a: \beta_0\) \(\,\neq\,\) You could use \(>\) or \(<\) instead of \(\neq\) for the alternative hypothesis. By default, the p-value from summary(mylm) in R uses \(\neq\). \(\underbrace{0}_\text{a number}\) This could be any number, not just 0. However, the default summar(mylm) output in R only shows the test statistic and p-value for the test that uses 0. To test a different value, you would need to compute the test statistic and p-value by hand using the formula shown. |
\[t = \frac{b_0 - \overbrace{0}^\text{a number}}{s_{b_0}}\] This is the formula for the test statistic. It measures how far the estimated y-intercept \(b_0\) is from the null hypothesis for \(\beta_0\) in units of “standard errors of \(b_0\)”. Thus the division by \(s_{b_0}\). Though the hypothesized value of \(\beta_0\) is typically 0, it could be any number. |
|
\(H_0: \beta_1 =\) \(\underbrace{0}_\text{a number}\) This could be any number, not just 0. However, the default summar(mylm) output in R only shows the test statistic and p-value for the test that uses 0. To test a different value, you would need to compute the test statistic and p-value by hand using the formula shown. \(H_a: \beta_1\) \(\,\neq\,\) You could use \(>\) or \(<\) instead of \(\neq\) for the alternative hypothesis. By default, the p-value from summary(mylm) in R uses \(\neq\). \(\underbrace{0}_\text{a number}\) This could be any number, not just 0. However, the default summar(mylm) output in R only shows the test statistic and p-value for the test that uses 0. To test a different value, you would need to compute the test statistic and p-value by hand using the formula shown. |
\[t = \frac{b_1 - \overbrace{0}^\text{a number}}{s_{b_1}}\] This is the formula for the test statistic. It measures how far the estimated slope \(b_1\) is from the null hypothesis for \(\beta_1\) in units of “standard errors of \(b_1\)”. Thus the division by \(s_{b_1}\). Though the hypothesized value of \(\beta_1\) is typically 0, it could be any number. |
Left-tailed p-value = |
Confidence Interval | Formula | Standard Error |
---|---|---|
\(\beta_0\) | \(b_0 \pm\) \(t^*\) This is called the “critical value” and denotes the number of standard deviations that are needed to obtain a 95% confidence interval from a t distribution with degrees of freedom \(n-p\) (sample size - number of parameters in the regression model). \(\cdot\) The critical value is multiplied by the standard error of \(b_0\). \(s_{b_0}\) The standard error of \(b_0\), denoted by \(s_{b_0}\) is provided in the regression summary output under the column header called “Std. Error” for the “(Intercept)” row of the output. It is calculated using the formula shown below. | \[s^2_{b_0} = MSE\left[\frac{1}{n} + \frac{\bar{X}^2}{\sum(X_i-\bar{X})^2}\right]\] This is called the “estimated variance of \(b_0\)”. Taking the square root of this number gives the “standard error of \(b_0\)”. |
\(\beta_1\) | \(b_1 \pm\) \(t^*\) This is called the “critical value” and denotes the number of standard deviations that are needed to obtain a 95% confidence interval from a t distribution with degrees of freedom \(n-p\) (sample size - number of parameters in the regression model). \(\cdot\) The critical value is multiplied by the standard error of \(b_1\). \(s_{b_1}\) The standard error of \(b_1\), denoted by \(s_{b_1}\) is provided in the regression summary output under the column header called “Std. Error”. It is calculated using the formula shown below. | \[s^2_{b_1} = \frac{MSE}{\sum(X_i-\bar{X})^2}\] This is called the “estimated variance of \(b_1\)”. Taking the square root of this number gives the “standard error of \(b_1\)”. |
To be more exact, the types of inference we are interested in are the following.
Determine if there is evidence of a meaningful linear relationship in the data. If \(\beta_1 = 0\), then there is no relation between \(X\) and \(E\{Y\}\). Hence we might be interested in testing the hypotheses \[ H_0: \beta_1 = 0 \] \[ H_a: \beta_1 \neq 0 \]
Determine if the slope is greater, less than, or different from some other hypothesized value. In this case, we would be interested in using hypotheses of the form \[ H_0: \beta_1 = \beta_{10} \] \[ H_a: \beta_1 \neq \beta_{10} \] where \(\beta_{10}\) is some hypothesized number.
To provide a confidence interval for the true value of \(\beta_1\).
Before we discuss how to test the hypotheses listed above or construct a confidence interval, we must understand the sampling distribution of the estimate \(b_1\) of the parameter \(\beta_1\). And, while we are at it, we may as well come to understand the sampling distribution of the estimate \(b_0\) of the parameter \(\beta_0\).
Review sampling distributions from Math 221.
Since \(b_1\) is an estimate, it will vary from sample to sample, even though the truth, \(\beta_1\), remains fixed. (The same holds for \(b_0\) and \(\beta_0\).) It turns out that the sampling distribution of \(b_1\) (where the \(X\) values remain fixed from study to study) is normal with mean and variance: \[ \mu_{b_1} = \beta_1 \] \[ \sigma^2_{b_1} = \frac{\sigma^2}{\sum(X_i-\bar{X})^2} \]
## Simulation to Show relationship between Standard Errors
##-----------------------------------------------
## Edit anything in this area...
n <- 100 #sample size
Xstart <- 30 #lower-bound for x-axis
Xstop <- 100 #upper-bound for x-axis
beta_0 <- 2 #choice of true y-intercept
beta_1 <- 3.5 #choice of true slope
sigma <- 13.8 #choice of st. deviation of error terms
## End of Editable area.
##-----------------------------------------------
# Create X, which will be used in the next R-chunk.
X <- rep(seq(Xstart,Xstop, length.out=n/2), each=2)
## After playing this chunk, play the next chunk as well.
To see that this is true, consider the regression model with values specified for each parameter as follows.
\[ Y_i = \overbrace{\beta_0}^{2} + \overbrace{\beta_1}^{3.5} X_i + \epsilon_i \quad \text{where} \ \epsilon_i \sim N(0, \overbrace{\sigma^2}^{\sigma=13.8}) \]
Using the equations above for \(\mu_{b_1}\) and \(\sigma^2_{b_1}\) we obtain that the mean of the sampling distribution of \(b_1\) will be
\(\mu_{b_1} = \beta_1 = 3.5\)
Further, we see that the variance of the sampling distribution of \(b_1\) will be
\(\sigma^2_{b_1} = \frac{\sigma^2}{\sum(X_i-\bar{X})^2} = \frac{13.8^2}{4.25\times 10^{4}}\)
Taking the square root of the variance, the standard deviation of the sampling distribution of \(b_1\) will be
\(\sigma_{b_1} = 0.067\).
That’s very nice. But to really believe it, let’s run a simulation ourselves. The “Code” below is worth studying. It runs a simulation that (1) takes a sample of data from the true regression relation, (2) fits the sampled data with an estimated regression equation (gray lines in the plot), and (3) computes the estimated values of \(b_1\) and \(b_0\) for that regression.
After doing this many, many times, the results of every single regression are plotted (in gray lines, which creates a gray shaded region because there are so many lines) in the scatterplot below. Further, each obtained estimate of \(b_0\) is plotted in the histogram on the left (below the scatterplot) and each obtained estimate of \(b_1\) is plotted in the histogram on the right. Looking at the histograms carefully, it can be seen that the mean of each histogram is very close to the true parameter value of \(\beta_0\) or \(\beta_1\), respectively. Also, the “Std. Error” of each histogram is incredibly close (if not exact to 3 decimal places) to the computed value of \(\sigma_{b_0}\) and \(\sigma_{b_1}\), respectively. Amazing!
9.1 t Tests
Using the information above about the sampling distributions of \(b_1\) and \(b_0\), an immediate choice of statistical test to test the hypotheses \[ H_0: \beta_1 = \beta_{10} \] \[ H_a: \beta_1 \neq \beta_{10} \] where \(\beta_{10}\) can be zero, or any other value, is a t test given by \[ t = \frac{b_1 - \beta_{10}}{s_{b_1}} \] where \(s^2_{b_1} = \frac{MSE}{\sum(X_i-\bar{X})^2}\). (You may want to review the section “Estimating the Model Variance” of this file to know where MSE came from.) With quite a bit of work it has been shown that \(t\) is distributed as a \(t\) distribution with \(n-2\) degrees of freedom. The nearly identical test statistic for testing \[ H_0: \beta_0 = \beta_{00} \] \[ H_a: \beta_0 \neq \beta_{00} \] is given by \[ t = \frac{b_0 - \beta_{00}}{s_{b_0}} \] where \(s^2_{b_0} = MSE\left[\frac{1}{n}+\frac{\bar{X}^2}{\sum(X_i-\bar{X})^2}\right]\). This version of \(t\) has also been shown to be distributed as a \(t\) distribution with \(n-2\) degrees of freedom.
9.2 Confidence Intervals
Creating a confidence interval for either \(\beta_1\) or \(\beta_0\) follows immediately from these results using the formulas \[ b_1 \pm t^*_{n-2}\cdot s_{b_1} \] \[ b_0 \pm t^*_{n-2}\cdot s_{b_0} \] where \(t^*_{n-2}\) is the critical value from a t distribution with \(n-2\) degrees of freedom corresponding to the chosen confidence level.
9.3 F tests
Another way to test the hypotheses \[ H_0: \beta_1 = \beta_{10} \quad\quad \text{or} \quad\quad H_0: \beta_0 = \beta_{00} \] \[ H_a: \beta_1 \neq \beta_{10} \quad\quad \ \ \quad \quad H_a: \beta_0 \neq \beta_{00} \] is with an \(F\) Test. One downside of the F test is that we cannot construct confidence intervals. Another is that we can only perform two-sided tests, we cannot use one-sided alternatives with an F test. The upside is that an \(F\) test is very general and can be used in many places that a t test cannot.
In its most general form, the \(F\) test partitions the sums of squared errors into different pieces and compares the pieces to see what is accounting for the most variation in the data. To test the hypothesis that \(H_0:\beta_1=0\) against the alternative that \(H_a: \beta_1\neq 0\), we are essentially comparing two models against each other. If \(\beta_1=0\), then the corresponding model would be \(E\{Y_i\} = \beta_0\). If \(\beta_1\neq0\), then the model remains \(E\{Y_i\}=\beta_0+\beta_1X_i\). We call the model corresponding to the null hypothesis the reduced model because it will always have fewer parameters than the model corresponding to the alternative hypothesis (which we call the full model). This is the first requirement of the \(F\) Test, that the null model (reduced model) have fewer “free” parameters than the alternative model (full model). To demonstrate what we mean by “free” parameters, consider the following example.
Say we wanted to test the hypothesis that \(H_0:\beta_1 = 2.5\) against the alternative that \(\beta_1\neq2.5\). Then the null, or reduced model, would be \(E\{Y_i\}=\beta_0+2.5X_i\). The alternative, or full model, would be \(E\{Y_i\}=\beta_0+\beta_1X_i\). Thus, the null (reduced) model contains only one “free” parameter because \(\beta_1\) has been fixed to be 2.5 and is no longer free to be estimated from the data. The alternative (full) model contains two “free” parameters, both are to be estimated from the data. The null (reduced) model must contain fewer free parameters than the alternative (full) model.
Once the null and alternative models have been specified, the General Linear Test is performed by appropriately partitioning the squared errors into pieces corresponding to each model. In the first example where we were testing \(H_0: \beta_1=0\) against \(H_a:\beta_1\neq0\) we have the partition \[ \underbrace{Y_i-\bar{Y}}_{Total} = \underbrace{\hat{Y}_i - \bar{Y}}_{Regression} + \underbrace{Y_i-\hat{Y}_i}_{Error} \] The reason we use \(\bar{Y}\) for the null model is that \(\bar{Y}\) is the unbiased estimator of \(\beta_0\) for the null model, \(E\{Y_i\} = \beta_0\). Thus we would compute the following sums of squares: \[ SSTO = \sum(Y_i-\bar{Y})^2 \] \[ SSR = \sum(\hat{Y}_i-\bar{Y})^2 \] \[ SSE = \sum(Y_i-\hat{Y}_i)^2 \] and note that \(SSTO = SSR + SSE\). Important to note is that \(SSTO\) uses the difference between the observations \(Y_i\) and the null (reduced) model. The \(SSR\) uses the diffences between the alternative (full) and null (reduced) model. The \(SSE\) uses the differences between the observations \(Y_i\) and the alternative (full) model. From these we could set up a General \(F\) table of the form
Sum Sq | Df | Mean Sq | F Value | |
---|---|---|---|---|
Model Error | \(SSR\) | \(df_R-df_F\) | \(\frac{SSR}{df_R-df_F}\) | \(\frac{SSR}{df_R-df_F}\cdot\frac{df_F}{SSE}\) |
Residual Error | \(SSE\) | \(df_F\) | \(\frac{SSE}{df_F}\) | |
Total Error | \(SSTO\) | \(df_R\) |