14.5 Collinearity

A potential issue that can arise from including higher-order terms or interaction terms in a linear regression model is collinearity. Collinearity occurs when two or more of the independent covariates are correlated with one another. If the degree of correlation is high enough, it can cause the following problems, illustrated in the simulation sketch after this list:

  1. Estimated coefficients can swing in either direction: variables known to be important may have surprisingly small coefficient estimates, and variables known to be less important may have surprisingly large coefficient estimates.

  2. Increased variance of the estimated coefficients, thereby reducing the statistical power of the model.
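
The following is a minimal simulation sketch, assuming NumPy is available, of how collinearity inflates the variance of coefficient estimates: the same model is refitted across many simulated samples, once with uncorrelated covariates and once with strongly correlated ones. The correlation level (0.95), sample sizes, and true coefficients are illustrative choices, not part of the original text.

```python
import numpy as np

rng = np.random.default_rng(0)

def slope_sd(rho, n=100, n_sims=2000):
    """Empirical standard deviation of the OLS estimate of the first slope
    when the two covariates have correlation rho (illustrative simulation)."""
    cov = np.array([[1.0, rho], [rho, 1.0]])
    estimates = []
    for _ in range(n_sims):
        X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
        y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
        design = np.column_stack([np.ones(n), X])           # intercept + two covariates
        beta, *_ = np.linalg.lstsq(design, y, rcond=None)   # ordinary least squares fit
        estimates.append(beta[1])                           # slope of the first covariate
    return np.std(estimates)

print("SD of first slope estimate, correlation 0.0 :", round(slope_sd(0.0), 3))
print("SD of first slope estimate, correlation 0.95:", round(slope_sd(0.95), 3))
```

With correlated covariates the slope estimates spread much more widely from sample to sample, which is exactly the loss of precision described in point 2.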

Including higher-order terms or interaction terms can result in collinearity because a variable is inevitably correlated with its own powers and with interaction terms involving it. Having said that, collinearity is only a concern in particular situations.

Collinearity only affects the specific independent variables that are correlated with one another. Therefore, if the aim of the analysis is to estimate how \(X\) influences \(Y\) after adjusting for \(W\) and \(V\), and the only correlation is between \(W\) and \(V\), then collinearity is not a concern. Moreover, collinearity rarely affects the predicted outcomes, so if the aim of the analysis is to predict \(Y\) using data on \(X\), \(W\) and \(V\), then collinearity between any of the covariates is not a concern. Finally, the severity of the problems caused by collinearity increases with the degree of correlation, so if only moderate or weak correlation is present, collinearity is not a major concern.
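
The following is a minimal simulation sketch, assuming NumPy, of the point that collinearity rarely harms prediction: with highly correlated covariates the individual slope estimates vary substantially from sample to sample, yet the prediction at a typical covariate value barely changes. The correlation level, true coefficients, and prediction point (chosen to lie within the observed covariate pattern) are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)
rho, n, n_sims = 0.95, 100, 2000
cov = np.array([[1.0, rho], [rho, 1.0]])
x_new = np.array([1.0, 1.0, 1.0])       # intercept, x1, x2: a typical point in the data cloud

slopes, predictions = [], []
for _ in range(n_sims):
    X = rng.multivariate_normal([0.0, 0.0], cov, size=n)
    y = 1.0 + 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)
    design = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    slopes.append(beta[1])               # individual slope: unstable under collinearity
    predictions.append(x_new @ beta)     # fitted prediction: stable under collinearity

print("SD of the first slope estimate:", round(np.std(slopes), 3))
print("SD of the prediction at x_new :", round(np.std(predictions), 3))
```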

If we are in a situation where collinearity is causing problems, we could either remove some of the highly correlated variables or transform one of them. Examples of transformations include:

  1. Instead of using systolic and diastolic blood pressure as collinear predictor variables, use diastolic blood pressure and the difference (systolic minus diastolic blood pressure).

  2. Instead of using height and weight as predictor variables, use height and body mass index (weight/height\(^2\)).

  3. When fitting a quadratic regression model, use \(X\) and \((X-\bar{X})^2\) rather than \(X\) and \(X^2\) as covariates (illustrated in the sketch after this list).
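
The following is a brief sketch, assuming NumPy, of transformation 3: centring \(X\) before squaring greatly reduces the correlation between the linear and quadratic terms. The covariate range used here is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(10, 20, size=200)                       # an illustrative positive covariate

r_raw = np.corrcoef(x, x**2)[0, 1]                      # correlation between X and X^2
r_centred = np.corrcoef(x, (x - x.mean())**2)[0, 1]     # correlation between X and (X - Xbar)^2

print("corr(X, X^2)         :", round(r_raw, 3))
print("corr(X, (X - Xbar)^2):", round(r_centred, 3))
```

For a positive covariate such as this one, \(X\) and \(X^2\) are almost perfectly correlated, whereas \(X\) and \((X-\bar{X})^2\) are nearly uncorrelated, so the centred parameterisation avoids the collinearity while fitting the same quadratic curve.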