15.1 Regression modelling for binary outcomes

In many health applications, the outcome of interest is binary. Examples include 30-day mortality following surgery, 5-year survival following a diagnosis of breast cancer, and whether or not a particular side effect was experienced after taking a particular medication.

In this session, we explore a very commonly used technique in health data science: logistic regression. This can be seen as an extension of the linear regression model we have been exploring in the last few sessions. We will see that many aspects of linear regression modelling extend quite naturally to logistic regression modelling.

15.1.2 Why can’t we just use linear regression?

We developed our linear regression model for a continuous outcome. As we have seen, the linear regression model relies on a number of assumptions. Two of them, normality of residuals and homoscedasticity (constant residual variance), are particularly relevant to this discussion.

Normality of errors: The usual inference procedures for linear regression (calculating confidence intervals and p-values) assume that the errors follow a normal distribution. This assumption is necessarily violated with binary outcome data: for an observation with mean p, the error can only take the two values 1 − p or −p, so it cannot be normally distributed.

Homoscedasticity: The error variance is assumed to be constant. With binary data, however, the variance is a function of the mean: an outcome with mean p has variance p(1 − p). The variance therefore changes as the mean changes across observations, so this assumption is unlikely to hold.
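To see where the p(1 − p) formula comes from, recall the standard variance derivation for a Bernoulli outcome Y with success probability p:

$$
\operatorname{Var}(Y) = E[Y^2] - E[Y]^2 = p - p^2 = p(1 - p)
$$

since Y takes only the values 0 and 1, we have Y² = Y and hence E[Y²] = E[Y] = p. The variance peaks at p = 0.5 and shrinks towards 0 as p approaches 0 or 1, so a model whose mean varies with the covariates cannot have constant error variance.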

An additional problem, relating to the fitted values, arises when we attempt to use linear regression to model binary outcome data.

Fitted values can be impossible: The mean of a binary outcome is the probability that it is a “success”, using the terminology of Bernoulli trials. Because it is a probability, this mean must lie between 0 and 1. If we tried to use linear regression to model our binary outcome, we may find that some of the fitted values are below 0 or greater than 1, as the sketch below illustrates. This problem arises because the linear predictor, an additive function of the covariates, is unbounded: it can take any value on the real line, whereas a probability cannot.
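As a quick demonstration, here is a minimal sketch in Python using simulated data and statsmodels; the variable names and the true model are invented for illustration and are not taken from the session:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Simulate a binary outcome whose probability increases with age
n = 200
age = rng.uniform(20, 100, n)
true_p = 1 / (1 + np.exp(-(-6 + 0.1 * age)))  # true success probabilities
died = rng.binomial(1, true_p)                # 0/1 outcome

# Naively fit a linear regression to the 0/1 outcome
ols = sm.OLS(died, sm.add_constant(age)).fit()

# At the extremes of age, the fitted values typically fall
# below 0 or above 1 -- impossible values for a probability
print(ols.fittedvalues.min(), ols.fittedvalues.max())
```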

Logistic regression avoids these problems by relating the covariates (through the linear predictor) to a transformation of the mean, rather than to the mean itself.
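Specifically, logistic regression models the log odds of success, logit(p) = log(p / (1 − p)), as a linear function of the covariates; inverting this transformation always yields a probability strictly between 0 and 1. Continuing the hypothetical simulated example above, a logistic fit produces only valid fitted probabilities:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Same simulated data as in the linear regression sketch above
age = rng.uniform(20, 100, 200)
true_p = 1 / (1 + np.exp(-(-6 + 0.1 * age)))
died = rng.binomial(1, true_p)

# Logistic regression: logit(p) = b0 + b1 * age
logit_fit = sm.Logit(died, sm.add_constant(age)).fit()

# Fitted probabilities are guaranteed to lie strictly in (0, 1)
print(logit_fit.predict().min(), logit_fit.predict().max())
```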