14.6 Optional Reading: Analysis of Variance¶
14.6.1 Partitioning variance¶
The total variation in \(Y\) is equal to:

\[SS_{TOT} = \sum_{i=1}^n (Y_i - \bar{Y})^2.\]
This is often referred to as the sum of squares of the \(Y\)’s. This represents all of the variation in \(Y\) about its overall mean value. We can think about two components of this total variation:
the predictable variation in \(Y\) (predicted by the variables included in the model) and
the unpredictable variation in \(Y\) (the remaining “noise”).
The predictable variation represents the variation of the predicted values \(\hat{Y}\) about the mean. We can measure this as:

\[SS_{REG} = \sum_{i=1}^n (\hat{Y}_i - \bar{Y})^2.\]
The unpredictable variation represents the variation of the observed values about their predicted values:

\[SS_{RES} = \sum_{i=1}^n (Y_i - \hat{Y}_i)^2.\]
Note that in the equations above, SS stands for sums of squares, REG for regression and RES for residual.
A key result for ANOVA is that the total variation in \(Y\) can be partitioned into the predictable variation, explained by the regression model, and the unpredictable (residual) variation:

\[SS_{TOT} = SS_{REG} + SS_{RES}.\]
14.6.1.1 The coefficient of determination¶
Using the sums of squares defined above, we can calculate the proportion of variance explained by the statistical model, known as the coefficient of determination.
The proportion of variation which is explained by a statistical model is denoted by \(R^2\) and is given by:

\[R^2 = \frac{SS_{REG}}{SS_{TOT}}.\]
Example¶
The coefficient of determination is given in the summary() output for a linear regression in R. In Model 1, \(R^2=0.1661\) (see output below). This means that Model 1 explains 16.6% of the total variation in \(Y\).
# The coefficient of determination
data <- read.csv('https://www.inferentialthinking.com/data/baby.csv')
model1 <- lm(Birth.Weight ~ Gestational.Days, data = data)
summary(model1)
Call:
lm(formula = Birth.Weight ~ Gestational.Days, data = data)
Residuals:
Min 1Q Median 3Q Max
-49.348 -11.065 0.218 10.101 57.704
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.75414 8.53693 -1.26 0.208
Gestational.Days 0.46656 0.03054 15.28 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.74 on 1172 degrees of freedom
Multiple R-squared: 0.1661, Adjusted R-squared: 0.1654
F-statistic: 233.4 on 1 and 1172 DF, p-value: < 2.2e-16
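As a sanity check, the sums of squares and \(R^2\) can be computed by hand. Below is a minimal sketch, assuming data and model1 as fitted above:

# Partition the total variation in Y and verify R^2 = SS_REG / SS_TOT
y <- data$Birth.Weight
y_hat <- fitted(model1)
ss_tot <- sum((y - mean(y))^2)       # total sum of squares
ss_reg <- sum((y_hat - mean(y))^2)   # regression (explained) sum of squares
ss_res <- sum((y - y_hat)^2)         # residual sum of squares
all.equal(ss_tot, ss_reg + ss_res)   # TRUE: SS_TOT = SS_REG + SS_RES
ss_reg / ss_tot                      # approximately 0.1661, matching Multiple R-squared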
While \(R^2\) is sometimes used as an overall measure of goodness-of-fit (or predictive performance), it is not used to formally compare models. This is because \(R^2\) never decreases when new covariates are added to a model (provided that the number and identity of the observations remain the same). Therefore, if we used \(R^2\) for model comparisons, we would always conclude that the more complex model fits at least as well as the simpler model, even when this is not true.
An adjusted \(R^2\) has been proposed to account for this issue. In the output above, for example, \(R^2 = 0.1661\) while the adjusted \(R^2\) is slightly smaller, at 0.1654. However, we would not recommend using either form of \(R^2\) for formal model comparison.
14.6.2 Comparing models¶
When fitting statistical models, we may wish to compare how well two models fit the data, to see which is most appropriate. Consider the following three models:
Model 1 (birthweight~length of pregnancy)
Model 2 (birthweight~mother’s smoking status)
Model 4 (birthweight~length of pregnancy+mother's height)
We might want to make the following comparisons between models:
Comparison 1: Model 1 vs Model 4
Comparison 2: Model 2 vs Model 4
In these examples, Comparison 1 is much simpler than Comparison 2, because the models in Comparison 1 are nested.
Statistical models are said to be nested when one model (the simpler model) contains a subset of the covariates in the other one (the complex model) and no other additional variables. In Comparison 2, the models are not nested because the simpler model (Model 2) contains mother’s smoking status as a variable, which is not included in Model 4.
Nested models can be compared using Analysis of Variance (ANOVA) (the comparison of non-nested models is much more complicated and is beyond the scope of this module).
The main idea of ANOVA is that if the complex model describes the data better than the simpler model, then we would expect a reasonably large amount of the residual variation left unexplained by the simpler model to be explained by the complex one. ANOVA provides a statistical framework to formally test this.
We will first consider ANOVA in the context of simple linear regression, where the simpler model assumes no association between the outcome and the independent variable (the null model). We will then consider ANOVA in the context of multivariable linear regression and we end by learning how ANOVA can be used to test for differences between groups in a categorical variable.
14.6.3 The ANOVA table¶
Sums of squares (SS): The first step is to partition the total variation into the regression (predictable) and residual (unpredictable) components. Variation is measured by sums of squares (SS), so we partition the total sum of squares (\(SS_{TOT}\)) into \(SS_{REG}\) and \(SS_{RES}\).
Degrees of freedom: Each of these sums of squares has an associated degrees of freedom (d.f.). The d.f. for the total sum of squares is \(n-1\), since the variance of \(Y\) is \(\sum_{i=1}^n (Y_i-\bar{Y})^2/(n-1)\). The d.f. for the regression sum of squares is the number of covariates in the regression model (equal to 1 for a simple linear regression model). The residual d.f. is found by subtracting the regression d.f. from the total d.f.
Mean squares (MS): The sums of squares also have associated mean squares, obtained by dividing each sum of squares by its associated degrees of freedom (note that the residual mean square is then equal to \(\hat{\sigma}^2\)).
These statistics are typically summarised in an ANOVA table. The table for simple linear regression is shown below.
| Source | d.f. | SS | Mean Square |
|---|---|---|---|
| Regression | 1 | \(SS_{REG}\) | \(MS_{REG}=\frac{SS_{REG}}{1}\) |
| Residual | \(n-2\) | \(SS_{RES}\) | \(MS_{RES}=\frac{SS_{RES}}{n-2}\) |
| Total | \(n-1\) | \(SS_{TOT}\) | \(MS_{TOT}=\frac{SS_{TOT}}{n-1}\) |
Notes
Very loosely speaking, degrees of freedom are “bits of information”. We start with \(n\) bits of information, and every time we estimate something we “use up” a bit of information (and so lose a degree of freedom). When we calculate the overall variation in \(Y\), we lose one of the \(n\) bits of information because we need to calculate the overall mean to obtain the sum of squares of \(Y\), leaving \(n-1\) bits of information overall. In a simple linear regression model we estimate two parameters, so we use two bits of information, but one of these is essentially the same bit we lost by calculating the overall mean, so we say that the regression model uses 1 degree of freedom. We started with \(n-1\) d.f. and the regression model used 1 of them, so there are \(n-2\) left for the remaining component, the residual sum of squares.
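To make the table and the notes concrete, here is a minimal sketch computing the degrees of freedom and mean squares for model1, assuming ss_reg and ss_res from the earlier sketch:

# Degrees of freedom and mean squares for the simple linear regression model1
n <- nrow(data)
df_reg <- 1                # one covariate in the model
df_res <- n - 2            # total d.f. (n - 1) minus regression d.f. (1)
ms_reg <- ss_reg / df_reg
ms_res <- ss_res / df_res
ms_res                     # equals the estimated residual variance...
sigma(model1)^2            # ...i.e. the square of the residual standard error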
14.6.3.1 Hypothesis testing using ANOVA¶
The values in the ANOVA table can be used to conduct formal hypothesis tests.
ANOVA is used to test the null hypothesis that the simpler of the two nested models better fits the data. In simple linear regression, the simpler model is the null model, in which case:
\(H_0:\) The null model is a better fit
\(H_1:\) The simple linear regression model is a better fit
To test the null hypothesis defined above, we use an \(F\) statistic, defined as:

\[F = \frac{MS_{REG}}{MS_{RES}}.\]
This ratio measures how much more variation in \(Y\) is explained by the model than would be expected by chance. If the model explains no more variation than the null model, we would expect this ratio to be close to 1. The larger the value of \(F\), the stronger the evidence that the more complex model is the better fit. To obtain a \(p\)-value for a formal hypothesis test, \(F\) is compared to the \(F_{1,(n-2)}\) distribution (where 1 and \(n-2\) are the degrees of freedom associated with the two mean squares).
# F-test using anova()
anova(model1)
|  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| Gestational.Days | 1 | 65449.51 | 65449.5131 | 233.4293 | 3.395226e-48 |
| Residuals | 1172 | 328608.34 | 280.3825 | NA | NA |
The \(F\)-statistic is equal to 233.4 with a \(p\)-value of \(3.4\times10^{-48}\). With such a small \(p\)-value there is strong evidence against the null hypothesis. Therefore we conclude that the model which includes gestational days is a better fit.
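The same \(F\)-statistic and \(p\)-value can be reproduced directly from the mean squares. A minimal sketch, assuming n, ms_reg and ms_res from the sketches above:

# F statistic and p-value computed by hand
f_stat <- ms_reg / ms_res
f_stat                                                  # approximately 233.4
pf(f_stat, df1 = 1, df2 = n - 2, lower.tail = FALSE)    # approximately 3.4e-48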
14.6.3.2 Connection between F tests and t-tests in simple linear regression¶
Above, we used an \(F\)-test to compare the model for birthweight (\(Y\)) including gestational days (\(L\)):

\[Y = \beta_0 + \beta_1 L + \epsilon\]

with a model including just a constant:

\[Y = \beta_0 + \epsilon.\]

That is, we have just used an \(F\)-test of the null hypothesis \(H_0: \beta_1 = 0\), which is exactly the same hypothesis we tested previously using a \(t\)-test.
In other words, the \(F\)-test for a simple linear regression model is the same as the \(t\)-test of the null hypothesis that the slope parameter is equal to 0. Below, we perform the \(t\)-test again.
# t-test for the slope, from summary()
summary(model1)
Call:
lm(formula = Birth.Weight ~ Gestational.Days, data = data)
Residuals:
Min 1Q Median 3Q Max
-49.348 -11.065 0.218 10.101 57.704
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -10.75414 8.53693 -1.26 0.208
Gestational.Days 0.46656 0.03054 15.28 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.74 on 1172 degrees of freedom
Multiple R-squared: 0.1661, Adjusted R-squared: 0.1654
F-statistic: 233.4 on 1 and 1172 DF, p-value: < 2.2e-16
We see that the \(t\)-statistic for the slope is \(t=15.28\), with \(p\)-value \(p<2\times10^{-16}\). Previously, we had \(F=233.4=15.28^2=t^2\). The \(p\)-value for the \(F\)-test was identical to that of the \(t\)-test.
The two tests are equivalent, with \(F=t^2\), and provide identical \(p\)-values. Consequently, it is not particularly common to use \(F\)-tests for the simple linear regression model; they are more useful for assessing more complex models with multiple covariates.
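A quick sketch confirming the equivalence \(F=t^2\) for model1:

# Compare the squared t statistic for the slope with the F statistic
t_slope <- summary(model1)$coefficients["Gestational.Days", "t value"]
t_slope^2                                        # approximately 233.4
anova(model1)["Gestational.Days", "F value"]     # the same value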
14.6.4 ANOVA for multivariable linear regression¶
In the context of multivariable linear regression, ANOVA can be used to test whether a more complex model is a better fit than the null model (the Global F test), or whether a more complex model is a better fit than a simpler model that includes a subset of the covariates in the complex model (the partial F test). Each test requires slight modifications to the ANOVA table defined above and we will discuss these in turn.
14.6.4.1 The Global F test¶
The general formulation of the ANOVA table (suitable for simple and multivariable linear regression models) is given below. \(p\) is the number of covariates in the model.
| Source | d.f. | SS | Mean Square |
|---|---|---|---|
| Regression | \(p\) | \(SS_{REG}\) | \(MS_{REG}=\frac{SS_{REG}}{p}\) |
| Residual | \(n-(p+1)\) | \(SS_{RES}\) | \(MS_{RES}=\frac{SS_{RES}}{n-p-1}\) |
| Total | \(n-1\) | \(SS_{TOT}\) | \(MS_{TOT}=\frac{SS_{TOT}}{n-1}\) |
Note that this is equivalent to the previous table (for simple linear regression) when \(p=1\).
The Global F test tests the null hypothesis (\(H_0\)) that the null model is a better fit than the more complex model against the alternative hypothesis (\(H_1\)) that the complex model is a better fit. Or, equivalently:
\(H_0:\) All slope parameters in the complex model are equal to 0.
\(H_1:\) At least one of the slope parameters in the complex model is not equal to 0.
The appropriate \(F\) statistic is the ratio

\[F = \frac{MS_{REG}}{MS_{RES}} = \frac{SS_{REG}/p}{SS_{RES}/(n-(p+1))}.\]
Under the null hypothesis, \(F\) follows an \(F_{p,(n-(p+1))}\) distribution.
Example¶
We can use summary() to conduct a global \(F\)-test for Model 4, our model relating birthweight to both length of pregnancy and maternal height.
The null and alternative hypotheses, for the global \(F\)-test, are defined as:
\(H_0\): the regression coefficients for both gestational days and mother’s height are equal to 0.
\(H_1\): the regression coefficient for either gestational days or mother’s height (or both) is not equal to 0.
# ANOVA for Model 4
data$Gestational.Days.Centered <- data$Gestational.Days - mean(data$Gestational.Days)
data$Maternal.Height.Centered <- data$Maternal.Height - mean(data$Maternal.Height)
model4 <- lm(Birth.Weight ~ Gestational.Days.Centered + Maternal.Height.Centered, data = data)
summary(model4)
Call:
lm(formula = Birth.Weight ~ Gestational.Days.Centered + Maternal.Height.Centered,
data = data)
Residuals:
Min 1Q Median 3Q Max
-53.829 -10.589 0.246 10.254 54.403
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 119.46252 0.47980 248.983 < 2e-16 ***
Gestational.Days.Centered 0.45237 0.03006 15.051 < 2e-16 ***
Maternal.Height.Centered 1.27598 0.19049 6.698 3.27e-11 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 16.44 on 1171 degrees of freedom
Multiple R-squared: 0.1969, Adjusted R-squared: 0.1955
F-statistic: 143.5 on 2 and 1171 DF, p-value: < 2.2e-16
The \(F\) statistic is 143.5 with a \(p\)-value \(<2.2 \times 10^{-16}\). Therefore, there is strong evidence against the null hypothesis and we can conclude that at least one of the regression coefficients is non-zero (i.e. Model 4 is a better fit than the null model).
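As a check, the global \(F\)-statistic for Model 4 can be computed directly from its sums of squares. A minimal sketch, assuming y and n from the earlier sketches:

# Global F statistic for model4 computed by hand
y_hat4 <- fitted(model4)
ss_reg4 <- sum((y_hat4 - mean(y))^2)                  # regression sum of squares
ss_res4 <- sum((y - y_hat4)^2)                        # residual sum of squares
p <- 2                                                # two covariates in model4
f_global <- (ss_reg4 / p) / (ss_res4 / (n - p - 1))
f_global                                              # approximately 143.5
pf(f_global, df1 = p, df2 = n - p - 1, lower.tail = FALSE)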
14.6.4.2 The partial F test¶
The global \(F\)-test is a joint test of the statistical significance of all the slope parameters in a linear regression model. In contrast, the partial \(F\)-test compares the fit of:
Model A (complex model, with \(p\) predictors)
Model B (simpler nested model with \(p-k\) predictors).
The key to the partial \(F\)-test is the construction of an Analysis of Variance table that partitions the sum of squares explained by the complex model into that explained by the simple model and the extra sum of squares only explained by the complex model. Using the notation that \(SS_{REG_A}\) denotes the sum of squares explained by the complex model, whilst \(SS_{REG_B}\) denotes the sum of squares explained by the simpler model, the ANOVA table is as shown below.
| Source | d.f. | SS | Mean Square |
|---|---|---|---|
| Explained by Model B | \(p-k\) | \(SS_{REG_B}\) | \(MS_{REG_B}=\frac{SS_{REG_B}}{p-k}\) |
| Explained by Model A | \(p\) | \(SS_{REG_A}\) | \(MS_{REG_A}=\frac{SS_{REG_A}}{p}\) |
| Extra explained by Model A | \(k\) | \(SS_{REG_A}-SS_{REG_B}\) | \(MS_{REG_X}=\frac{SS_{REG_A}-SS_{REG_B}}{k}\) |
| Residual from Model A | \(n-(p+1)\) | \(SS_{RES_A}\) | \(MS_{RES}=\frac{SS_{RES_A}}{n-(p+1)}\) |
| Total | \(n-1\) | \(SS_{TOT}\) | \(MS_{TOT}=\frac{SS_{TOT}}{n-1}\) |
The partial \(F\)-test tests the following null hypothesis:
\(H_0: \) all of the slope parameters included in Model A but omitted from Model B are equal to zero.
\(H_1: \) at least one of the additional parameters in Model A is not equal to 0.
The appropriate test statistic (\(F\)) is the ratio of the extra mean sum of squares in Model A to the mean residual sum of squares from Model A. Under the null hypothesis, this test statistic follows an \(F_{k,\,n-(p+1)}\) distribution:

\[F = \frac{MS_{REG_X}}{MS_{RES}} = \frac{(SS_{REG_A}-SS_{REG_B})/k}{SS_{RES_A}/(n-(p+1))}.\]
Example¶
We can use anova() to conduct a partial \(F\)-test to compare Models 1 and 4:
Model 1 (birthweight~length of pregnancy)
Model 4 (birthweight~length of pregnancy+mother's height)
Model 1 is nested within Model 4. Model 4 is our complex model.
In this case, the two models only differ by one variable (mother’s height) and so the hypotheses being tested within the partial \(F\)-test could be written as:
\(H_0: \beta_2=0\), where \(\beta_2\) is the regression coefficient for mother’s height.
\(H_1: \beta_2 \neq 0\).
anova(model1, model4)
| Res.Df | RSS | Df | Sum of Sq | F | Pr(>F) |
|---|---|---|---|---|---|
| 1172 | 328608.3 | NA | NA | NA | NA |
| 1171 | 316482.2 | 1 | 12126.13 | 44.86728 | 3.266475e-11 |
The \(F\)-statistic is 44.87 with a \(p\)-value of \(3.27\times 10^{-11}\). This is strong evidence against the null hypothesis. Hence the data indicate that Model 4 (the more complex model) is the better fit.
When the two models being compared only differ by one variable, the partial F test is equivalent to the t test of the null hypothesis that the regression coefficient for that variable is equal to 0. Notice in our example that the results of the partial F test are the same as the t-test for \(\beta_2=0\), with \(F=t^2\).
For this reason, partial F tests are more useful in situations where we wish to compare models that differ by more than one variable. The approach is identical to that shown above.
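A minimal sketch of the partial \(F\)-statistic computed by hand from the residual sums of squares of the two models (assuming n from the earlier sketches):

# Partial F statistic comparing model1 (simple) and model4 (complex)
rss_1 <- sum(residuals(model1)^2)
rss_4 <- sum(residuals(model4)^2)
p <- 2                                                # covariates in the complex model
k <- 1                                                # extra covariates in the complex model
f_partial <- ((rss_1 - rss_4) / k) / (rss_4 / (n - (p + 1)))
f_partial                                             # approximately 44.87
pf(f_partial, df1 = k, df2 = n - (p + 1), lower.tail = FALSE)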
14.6.5 ANOVA for models with categorical independent variables¶
Another useful application of ANOVA is to test for differences in means between categories of a categorical variable.
Suppose we are interested in the association between an outcome \(Y\) and a categorical variable \(X\) with \(K\) groups. We have already seen how to define a multivariable linear regression model using dummy variables for this situation. An alternative model, often termed the ANOVA model, is as follows:
Let \(y_{ki}\) be the value of the outcome for the \(i^{th}\) observation in the \(k^{th}\) group (\(i=1,...,n_k\) and \(k=1,...,K\)). The ANOVA model is then defined as:

\[y_{ki} = \mu_k + \epsilon_{ki}, \qquad \epsilon_{ki} \sim N(0, \sigma^2).\]
Here, \(\mu_k\) is the mean of the outcome in the \(k^{th}\) group. With this representation, the null and alternative hypothesis are:
\(H_0: \mu_k= \mu\) for all \(k\) (i.e. the means in all groups defined by the categorical variable are equal to a common value \(\mu\)).
\(H_1:\) not all \(\mu_k\) are equal (i.e. the group means are not all equal).
14.6.5.1 Sum of squares for models with categorical variables¶
For models with a single independent categorical variable the fitted values are simply the group means (\(\bar{y_k}\)). Under the null hypothesis that the group means are all equal, the fitted values are all equal to the overall mean (\(\bar{y}\)). This leads to new terminology for the residual sum of squares (\(SS_{RES}\)) and the sum of squares explained by the model (\(SS_{REG}\)):
\(SS_{RES} = \sum_{k=1}^K \sum_{i=1}^{n_k} (y_{ki} - \bar{y}_k)^2\)
\(SS_{REG} = \sum_{k=1}^K \sum_{i=1}^{n_k} (\bar{y}_k - \bar{y})^2 = \sum_{k=1}^K n_k (\bar{y}_k - \bar{y})^2 \)
In this case, the residual sum of squares is often termed the within group sum of squares \((SS_{Within})\) and the regression sum of squares is often termed the between group sum of squares \((SS_{Between})\).
14.6.5.2 The ANOVA table¶
When there are \(K\) groups, the degrees of freedom for the within groups sum of squares is \(n-K\) (because the model includes \(K\) parameters) and the degrees of freedom for the between groups sum of squares is \(K-1\) (because the null model contains a single parameter, the overall mean). Hence the ANOVA table is as follows:
| Source | d.f. | SS | Mean Square |
|---|---|---|---|
| Between groups | \(K-1\) | \(SS_{Between}\) | \(MS_{Between}=\frac{SS_{Between}}{K-1}\) |
| Within groups | \(n-K\) | \(SS_{Within}\) | \(MS_{Within}=\frac{SS_{Within}}{n-K}\) |
| Total | \(n-1\) | \(SS_{TOT}\) | \(MS_{TOT}=\frac{SS_{TOT}}{n-1}\) |
14.6.5.3 The F-test¶
To test the null hypothesis that the means in all groups are equal to a common value, the appropriate \(F\)-statistic is:

\[F = \frac{MS_{Between}}{MS_{Within}},\]

which, under the null hypothesis, follows an \(F_{K-1,\,n-K}\) distribution.
If this test gives a small \(p\)-value, then we have evidence that the group means are not all the same. However, it does not tell us which group means differ from which others. For this reason, if we do find evidence of a difference in means from an \(F\)-test, we may want to follow up with further analysis, such as pairwise comparisons of means through analyses restricted to two groups at a time.
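One common follow-up is a set of pairwise comparisons. The sketch below uses simulated data with three groups, purely for illustration (it is not part of the birthweight analysis), and R's pairwise.t.test():

# Illustrative only: overall F-test followed by pairwise comparisons
set.seed(1)
group <- factor(rep(c("A", "B", "C"), each = 30))
outcome <- rnorm(90, mean = c(0, 0, 1)[as.numeric(group)])   # group C has a higher mean
anova(lm(outcome ~ group))                                    # overall F-test across groups
pairwise.t.test(outcome, group, p.adjust.method = "holm")     # which pairs differ?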
Example¶
We conduct an \(F\)-test to compare the average birthweights between babies whose mothers smoke and whose mothers don’t smoke using the birthweight data.
Let \(\mu_1\) and \(\mu_0\) denote the mean birthweight for babies whose mothers do smoke and don’t smoke, respectively. Then, the relevant hypotheses are:
\(H_0: \mu_1= \mu_0\) (i.e. the birthweight of a baby does not depend on whether the mother smoked)
\(H_1: \mu_1\neq \mu_0\)
Recall that we previously defined Model 2, relating birthweight to mother's smoking status:

\[Y = \beta_0 + \beta_1 x + \epsilon,\]

where \(Y\) denotes the birthweight and \(x\) is a dummy variable taking the value 1 if the mother smokes and 0 otherwise.
We can rewrite this equation using the ANOVA model as follows:

\[y_{ki} = \mu_k + \epsilon_{ki},\]

where \(y_{ki}\) is the birthweight of the \(i^{th}\) baby in the \(k^{th}\) group (groups are defined by mother's smoking status), \(\mu_1 = \beta_0 + \beta_1\) and \(\mu_0=\beta_0\) (in other words, our null hypothesis can be rewritten as \(\beta_1=0\)).
We can use either anova() or summary() to conduct the test in R:
model2 <- lm(Birth.Weight ~ factor(Maternal.Smoker), data = data)
anova(model2)
summary(model2)
|  | Df | Sum Sq | Mean Sq | F value | Pr(>F) |
|---|---|---|---|---|---|
| factor(Maternal.Smoker) | 1 | 24002.06 | 24002.0638 | 76.0167 | 9.461068e-18 |
| Residuals | 1172 | 370055.79 | 315.7473 | NA | NA |
Call:
lm(formula = Birth.Weight ~ factor(Maternal.Smoker), data = data)
Residuals:
Min 1Q Median 3Q Max
-68.085 -11.085 0.915 11.181 52.915
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 123.0853 0.6645 185.221 <2e-16 ***
factor(Maternal.Smoker)True -9.2661 1.0628 -8.719 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 17.77 on 1172 degrees of freedom
Multiple R-squared: 0.06091, Adjusted R-squared: 0.06011
F-statistic: 76.02 on 1 and 1172 DF, p-value: < 2.2e-16
In the ANOVA table, the factor(Maternal.Smoker) row gives the between groups results and the Residuals row gives the within group results.
The \(F\)-statistic is equal to 76.02 with a \(p\)-value of \(9.46\times10^{-18}\). This is strong evidence against the null hypothesis, suggesting that the mean birthweight differs between the two groups defined by mother's smoking status.
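As a final check, the between- and within-group sums of squares, and hence the \(F\)-statistic, can be reproduced directly from the group means. A minimal sketch, assuming the data as loaded above:

# Between- and within-group sums of squares computed by hand
group_means <- tapply(data$Birth.Weight, data$Maternal.Smoker, mean)
group_n <- table(data$Maternal.Smoker)
ss_between <- sum(group_n * (group_means - mean(data$Birth.Weight))^2)
ss_within <- sum((data$Birth.Weight - group_means[as.character(data$Maternal.Smoker)])^2)
c(ss_between, ss_within)                                   # approximately 24002 and 370056
K <- 2
(ss_between / (K - 1)) / (ss_within / (nrow(data) - K))    # F, approximately 76.02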