5 Assessing the Fit of a Regression

\(R^2\), SSTO, SSR, and SSE…

Not all regressions are created equal, as the three plots below show. Sometimes the dots cluster very tightly around the line. At other times, the dots spread out fairly dramatically from the line.

A common way to measure the fit of a regression is with correlation. While correlation can be a useful measure, the square of the correlation, called \(R^2\), offers greater insight. Before you can understand \(R^2\), you must understand three important “sums of squares”.


Sum of Squared Errors (SSE)
\(\text{SSE} = \sum_{i=1}^n \left(Y_i - \hat{Y}_i\right)^2\)
Measures how much the residuals deviate from the line. Equals SSTO - SSR.
In R: sum( (Y - mylm$fit)^2 )

Sum of Squares Regression (SSR)
\(\text{SSR} = \sum_{i=1}^n \left(\hat{Y}_i - \bar{Y}\right)^2\)
Measures how much the regression line deviates from the average y-value. Equals SSTO - SSE.
In R: sum( (mylm$fit - mean(Y))^2 )

Total Sum of Squares (SSTO)
\(\text{SSTO} = \sum_{i=1}^n \left(Y_i - \bar{Y}\right)^2\)
Measures how much the y-values deviate from the average y-value. Equals SSE + SSR.
In R: sum( (Y - mean(Y))^2 )


It is important to remember that SSE and SSR split up SSTO, so that \[ \text{SSTO} = \text{SSE} + \text{SSR} \] This implies that if SSE is large (close to SSTO), then SSR is small (close to zero), and vice versa. The following three graphics demonstrate how this works.
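Continuing the sketch above, the identity is easy to verify numerically:

# SSE + SSR reproduces SSTO (all.equal allows for rounding error).
all.equal(SSTO, SSE + SSR)   # TRUE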

The above graphs reveal that the idea of correlation is tightly linked with sums of squares. In fact, the squared correlation is equal to SSR/SSTO, and this fraction, SSR/SSTO, is called \(R^2\) (“r-squared”).
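Continuing the sketch, the squared correlation and the ratio SSR/SSTO agree:

# For a simple linear regression, cor(Y, X)^2 equals SSR/SSTO.
cor(cars$dist, cars$speed)^2
SSR / SSTO                     # same value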

R-Squared (\(R^2\)) \[ \underbrace{R^2 = \frac{SSR}{SSTO} = 1 - \frac{SSE}{SSTO}}_\text{Interpretation: Proportion of variation in Y explained by the regression.} \]
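R reports \(R^2\) through summary() on the fitted lm object; in the running sketch it matches the hand computation:

summary(mylm)$r.squared   # R-squared as reported by R
SSR / SSTO                # same value
1 - SSE / SSTO            # same value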

The smallest \(R^2\) can be is 0, and the largest it can be is 1. This is because SSE and SSR are both sums of squares, so neither can be negative; since \(\text{SSTO} = \text{SSE} + \text{SSR}\), SSR must be between 0 and SSTO, inclusive.