14.3 Dealing with violations of assumptions

So far, we have discussed diagnostic tools for identifying possible violations of the assumptions of a linear model. Identifying potential violations is only the first, and arguably the easiest, step in exploring how robust the results of fitting a model are. Here, we briefly describe some approaches that can be used to deal with violations. When these approaches do not work, more complex methods (beyond the scope of this lesson) may be needed to analyse the data.

14.3.1 Checking the data

Clearly, it is important that errors in data are eliminated as far as possible. In practice, ensuring that a large dataset is 100% error free may be impossible. Observations with large standardised residuals can potentially arise through data entry or coding errors and so a useful first step is to check such values with the data provider or original source of data, if available.
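
As a rough illustration of how candidate values might be shortlisted for checking, the sketch below fits a simple linear regression and lists the observations whose standardised residuals exceed a cut-off. The function names, the cut-off of 2 and the data are all made up for illustration (none of them comes from this lesson), and the leverage formula used assumes a model with a single covariate.

```python
import math


def standardised_residuals(x, y):
    """Internally standardised residuals from a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    # Residual variance estimate on n - 2 degrees of freedom.
    sigma2_hat = sum(e ** 2 for e in residuals) / (n - 2)
    # Leverage of each observation (single-covariate formula).
    leverage = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]
    return [e / math.sqrt(sigma2_hat * (1 - h)) for e, h in zip(residuals, leverage)]


def flag_for_checking(x, y, threshold=2.0):
    """Indices (0-based) of observations whose |standardised residual| exceeds threshold."""
    r = standardised_residuals(x, y)
    return [i for i, ri in enumerate(r) if abs(ri) > threshold]


# Hypothetical data: the seventh response (41.0) looks out of line with the rest.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 41.0, 16.2, 17.9, 20.1]

flagged = flag_for_checking(x, y)
print("Observations to check against the source:", flagged)
```

A flagged index is only a prompt to go back to the source; it does not by itself show that the value is wrong.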

14.3.2 Transformations

Sometimes it can be useful to transform the outcome variable, one or more of the covariates, or both. The transformed variables are then used in the analysis in place of the original variables. There are a number of possible motivations for this:

Transformations can be used to convert a non-linear relationship into a linear one. For example:

\[ y_i = \alpha x_i^{\beta} \;\Rightarrow\; \log(y_i) = \log(\alpha) + \beta \log(x_i). \]

Transformations can be used to improve the normality of residuals. For example, the Box-Cox transformation is a power transformation that is often used for this purpose.
Transformations can help stabilise the variance of residuals. For example, if \(\hat{\sigma}^2\) is proportional to \([E(Y)]^2\) then \(y^*=\log(y)\) is a useful variance-stabilising transformation. Alternatively, if \(\hat{\sigma}^2\) is proportional to \([E(Y)]^3\) then \(y^*=1/\sqrt{y}\) can be used.
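
To make the first motivation above concrete, here is a minimal sketch of fitting a power relationship by ordinary least squares on the log scale. The function name `fit_power_law` and the data are hypothetical, the data are generated without noise so that the true values are recovered exactly, and the approach assumes that every value of x and y is strictly positive.

```python
import math


def fit_power_law(x, y):
    """Estimate alpha and beta in y = alpha * x**beta via least squares on log(y) vs log(x)."""
    log_x = [math.log(v) for v in x]
    log_y = [math.log(v) for v in y]
    n = len(x)
    mean_lx = sum(log_x) / n
    mean_ly = sum(log_y) / n
    # Slope of the log-log regression estimates beta.
    beta = sum((a - mean_lx) * (b - mean_ly) for a, b in zip(log_x, log_y)) / sum(
        (a - mean_lx) ** 2 for a in log_x
    )
    # Intercept of the log-log regression estimates log(alpha).
    log_alpha = mean_ly - beta * mean_lx
    return math.exp(log_alpha), beta


# Hypothetical noise-free data generated from y = 3 * x**0.5.
x = [1.0, 2.0, 4.0, 8.0, 16.0]
y = [3.0 * v ** 0.5 for v in x]

alpha_hat, beta_hat = fit_power_law(x, y)
print(f"alpha = {alpha_hat:.3f}, beta = {beta_hat:.3f}")
```

Working on the log scale also changes the error structure, which connects to the variance-stabilising choices above. By a first-order Taylor approximation, \(\mathrm{Var}[g(Y)] \approx [g'(\mu)]^2 \, \mathrm{Var}(Y)\) with \(\mu = E(Y)\). If \(\mathrm{Var}(Y) = k\mu^2\) for some constant \(k\) (a symbol introduced here for illustration), then \(g(y)=\log(y)\) gives an approximate variance of \(k\); if \(\mathrm{Var}(Y) = k\mu^3\), then \(g(y)=1/\sqrt{y}\) gives approximately \(k/4\). In both cases the result no longer depends on \(\mu\).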

14.3.3 Sensitivity analyses

If we observe potentially problematic outliers, sensitivity analyses can be used to assess how problematic they are. This involves repeating the analysis after omitting the outlier (or group of outliers) and considering the extent to which the results are altered.

However, even if the outlier affects the results (and/or the assumptions), it is not a good idea to simply drop the data point. If it is not a data error, then it is a legitimate observation that should be included, and understanding why it is an outlier could be important. In most cases, it is preferable to report the results including all data points and to discuss the impact that removing the outlier had on them.
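
A sensitivity analysis of this kind can be sketched in a few lines: fit the model to all of the data, refit after omitting the suspect observation, and compare the estimates. The helper names and the data below are hypothetical, and a simple linear regression stands in for whatever model is actually being fitted.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a simple linear regression of y on x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


def leave_out(values, omit):
    """Copy of values without the positions (0-based) listed in omit."""
    return [v for i, v in enumerate(values) if i not in omit]


# Hypothetical data: the seventh response (index 6) is a suspected outlier.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 41.0, 16.2, 17.9, 20.1]
suspect = {6}

intercept_all, slope_all = fit_line(x, y)
intercept_omit, slope_omit = fit_line(leave_out(x, suspect), leave_out(y, suspect))

print(f"All observations:  slope = {slope_all:.2f}")
print(f"Outlier omitted:   slope = {slope_omit:.2f}")
print(f"Change in slope:   {slope_omit - slope_all:+.2f}")
```

With these made-up numbers the slope falls from about 2.49 to about 2.00 when the suspect point is omitted. In line with the advice above, the analysis of all ten points would usually remain the main result, with the comparison reported alongside it.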

Statistics for Health Data Science
