14.1 Assumptions
The linear regression model makes a number of assumptions. All inferences made from a model are contingent on these assumptions being correct. It is therefore important that we have statistical techniques (or diagnostic tools) to investigate these assumptions.
In practice, it is rare for all the assumptions of a statistical procedure to hold exactly. We may have evidence in the data, or prior knowledge about the data, that leads us to believe that the assumptions made by the model do not hold. This does not necessarily mean that the results from the model should be disregarded, since statistical procedures are robust to departures from assumptions in many settings. When conducting statistical analyses, it is a good idea to first try to establish to what extent the assumptions hold, and then to consider whether the methods used can be adapted so that the assumptions hold more closely. If no adaptation can be made, it is necessary to consider to what extent the results of the analysis can be trusted.
In this section we largely focus on diagnostic tools that can be used to identify assumption violations. Some pointers are given to possible adaptations and alternative techniques that can be used when assumptions are violated; however, issues of robustness are not considered in great detail. It is worth noting that, broadly speaking, the central limit theorem implies that departures from the normality assumption are less important for large datasets than for small ones: in large samples the estimated regression coefficients are approximately normally distributed even when the errors are not, so inferences about the coefficients remain approximately valid.
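As a brief illustration of this point, the following simulation is a minimal sketch of our own (the sample sizes and error distribution are invented for illustration, not taken from this text). It fits many regressions whose errors are strongly skewed rather than normal; with a reasonably large sample size, the slope estimates are nevertheless approximately normally distributed.

```python
import numpy as np
import statsmodels.api as sm
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical setup: centred exponential errors (skewness 2), so the
# normality assumption is clearly violated.
n, n_sims = 200, 2000
slopes = np.empty(n_sims)
for i in range(n_sims):
    x = rng.uniform(0, 10, size=n)
    y = 1.0 + 0.5 * x + (rng.exponential(scale=1.0, size=n) - 1.0)
    slopes[i] = sm.OLS(y, sm.add_constant(x)).fit().params[1]

# By the central limit theorem, the sampling distribution of the slope
# estimate is close to normal despite the skewed errors.
print("skewness of slope estimates:", round(float(stats.skew(slopes)), 3))  # near 0
print("mean slope estimate:", round(float(slopes.mean()), 3))               # near the true 0.5
```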
14.1.1 Assumptions of the linear regression model
The assumptions made by the linear regression model are as follows (a diagnostic code sketch follows the list):
Linearity: There is a linear relationship between the dependent variable \(Y\) and each of the independent variables. Here we are contrasting a linear relationship with a non-linear relationship, not with no relationship. A model in which one of the regression coefficients is zero can satisfy the assumptions of linear regression.
Normality: The error terms follow a normal distribution.
Homoscedasticity: The error variance is constant, i.e. the scatter of points around the true regression line has the same variance, irrespective of the value of \(x_i\). Violation of this assumption is termed heteroscedasticity.
Independence: The observations \(y_i\) are independent of one another.
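To make these assumptions concrete, the sketch below fits a model to simulated data and produces the two most common diagnostic plots (the data and the choice of statsmodels and matplotlib are ours, not prescribed by this text): a residuals-versus-fitted plot, which speaks to linearity and homoscedasticity, and a normal Q-Q plot, which speaks to normality.

```python
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)

# Simulated data satisfying all four assumptions: linear mean, normal
# homoscedastic errors, independent observations.
n = 100
x = rng.uniform(0, 10, size=n)
y = 2.0 + 0.5 * x + rng.normal(scale=1.0, size=n)

fit = sm.OLS(y, sm.add_constant(x)).fit()
resid = fit.resid

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Residuals vs fitted values: a patternless horizontal band suggests
# linearity and homoscedasticity are reasonable.
ax1.scatter(fit.fittedvalues, resid)
ax1.axhline(0, linestyle="--")
ax1.set_xlabel("Fitted values")
ax1.set_ylabel("Residuals")

# Normal Q-Q plot: points near the reference line suggest the
# normality assumption is reasonable.
sm.qqplot(resid, line="45", fit=True, ax=ax2)

plt.tight_layout()
plt.show()
```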
In this session we will focus on the first three assumptions. Violations of the independence assumption are often more apparent from the context of a study than from the data themselves. For example, if we carry out a study in which the blood pressure of each of 100 people is measured twice, and we then treat the resulting 200 measurements as independent in the statistical analysis, it is clear that the assumption of independence is violated.
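A small simulation of this scenario (our own sketch, with invented effect sizes) makes the consequence concrete: when the 200 measurements are analysed as if they were independent, the standard error reported by ordinary least squares understates the true variability of the slope estimate.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

n_people, n_sims = 100, 1000
slope = np.empty(n_sims)
reported_se = np.empty(n_sims)
for i in range(n_sims):
    x_person = rng.uniform(0, 10, size=n_people)
    u = rng.normal(scale=2.0, size=n_people)  # person-specific effect
    # Each person is measured twice; both measurements share u, so the
    # 200 observations are not independent.
    x = np.repeat(x_person, 2)
    y = 2.0 + 0.5 * x + np.repeat(u, 2) + rng.normal(size=2 * n_people)
    fit = sm.OLS(y, sm.add_constant(x)).fit()
    slope[i] = fit.params[1]
    reported_se[i] = fit.bse[1]

print("mean reported SE:   ", round(float(reported_se.mean()), 4))  # too small
print("actual SD of slopes:", round(float(slope.std()), 4))         # noticeably larger
```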
Notice that the normality and homoscedasticity assumptions concern the error terms, which can be thought of as the true residuals: deviations from the regression line defined by the population parameters. Since these true residuals can never be observed in practice, we have to use the observed residuals, obtained by replacing the population parameters with their estimates. In fact, the observed residuals are neither independent nor homoscedastic, but in most settings the departures from independence and homoscedasticity are very small. Consequently, we can proceed as if the observed residuals were the true residuals when investigating assumptions.
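This claim can be checked directly. Writing the observed residuals as \(e = (I - H)\varepsilon\), where \(H = X(X^\top X)^{-1}X^\top\) is the hat matrix, gives \(\operatorname{Var}(e) = \sigma^2(I - H)\): the variances \(\sigma^2(1 - h_{ii})\) differ across observations and the off-diagonal covariances are nonzero. The short sketch below (our own illustration on a simulated design) shows that the factors \(1 - h_{ii}\) are typically all close to 1.

```python
import numpy as np

rng = np.random.default_rng(2)

# A simulated design: 30 observations, intercept plus one covariate.
n = 30
x = rng.uniform(0, 10, size=n)
X = np.column_stack([np.ones(n), x])

# Hat matrix H = X (X'X)^{-1} X'.  Observed residuals e = (I - H) eps
# have Var(e) = sigma^2 (I - H): unequal variances, nonzero covariances.
H = X @ np.linalg.solve(X.T @ X, X.T)
leverage = np.diag(H)

# The variance factors 1 - h_ii are all close to 1 unless a point has
# extreme leverage, which is why the observed residuals can usually be
# treated as if they were the true errors.
print(np.round(1 - leverage, 3))
```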