17. The role of regression in different types of investigation

At the start of this section of the notes, we considered different types of investigation that might be of interest within a health data science project. We grouped these investigations into three classes: descriptive, predictive and causal.

Having explored various types of regression modelling, we now revisit the idea of the underlying investigation and consider the role of the regression model in different types of investigation.

We often use the same statistical tools to address research questions in investigations of different types. Regression is a key tool in all three types of investigation. In this short session we illustrate how the same regression model could be used in prediction and causal investigations, but that the output from the regression should be used and interpreted differently in each case.

17.1 Simple example

We focus on a simple (fictitious) observational study involving three variables: two binary explanatory variables, ‘maternal smoking status’ (\(X_{1}\) = 1: smoker, \(X_{1}\) = 0: non-smoker) and ‘maternal socioeconomic status’ (\(X_{2}\) = 1: low, \(X_{2}\) = 0: high), and a continuous outcome ‘birth weight’ (measured in grams). The assumed relationships between the three variables are summarised in the causal diagram in the figure below.

For the purposes of a simple illustration, we suppose that these are the only three variables at play in this ‘system’. In reality, of course, many other characteristics of the mother and her environment affect a baby’s birth weight, such as genetics, maternal diet and alcohol consumption, access to prenatal care, and other environmental features.

Consider a linear regression of \(Y\) on \(X_{1}\) and \(X_{2}\), i.e.

\[ Y = \beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2} + \epsilon \tag{1} \]

The estimated regression coefficients and corresponding 95% confidence intervals for this fictitious example are:

\[\begin{split} \begin{align*} \hat{\beta}_{0} &= 3227 \ \ (95\% \text{ CI}: \ 1603, \ 4851) \\ \hat{\beta}_{1} &= -341 \ \ (95\% \text{ CI}: \ -513, \ -169) \\ \hat{\beta}_{2} &= -214 \ \ (95\% \text{ CI}: \ -410, \ -18) \end{align*} \end{split}\]
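To make the example concrete, the sketch below shows one way such estimates could be obtained in Python. The data are simulated; the sample size, covariate prevalences and coefficients used in the simulation are hypothetical choices made only to roughly mimic the fictitious estimates above, and the `statsmodels` fit simply illustrates how model (1) might be fitted in practice.

```python
# A minimal sketch (hypothetical numbers throughout): simulate data consistent
# with the assumed causal diagram (X2 -> X1, X2 -> Y, X1 -> Y) and fit model (1).
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(seed=2024)
n = 500  # hypothetical sample size

# X2: maternal socioeconomic status (1 = low, 0 = high)
x2 = rng.binomial(1, 0.4, size=n)

# X1: maternal smoking, made more likely when socioeconomic status is low
# (this encodes the arrow from X2 to X1 in the causal diagram)
x1 = rng.binomial(1, 0.15 + 0.25 * x2)

# Y: birth weight in grams, depending on both X1 and X2 plus random variation
y = 3227 - 341 * x1 - 214 * x2 + rng.normal(0, 450, size=n)

df = pd.DataFrame({"birthweight": y, "smoking": x1, "low_ses": x2})

# Fit the linear regression of Y on X1 and X2, as in equation (1)
results = smf.ols("birthweight ~ smoking + low_ses", data=df).fit()
print(results.params)      # estimates of beta_0, beta_1, beta_2
print(results.conf_int())  # corresponding 95% confidence intervals
```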

We now consider how the output from this regression could be used in different investigation types.

17.2 Prediction

If the aim is to predict birth weight based on the two characteristics of the mother, this model allows us to do so: we obtain the expected value of \(Y\) given \(X_{1}\) and \(X_{2}\) (in this very simple example there are only 4 possible combinations of the explanatory variables).

In this prediction setting we do not, however, particularly care about the estimates of the regression coefficients. We should instead be concerned with the predictive performance of the model. This could be measured, for example, using \(R^2\), which measures the proportion of variability in the outcome that is explained by the statistical model.
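For illustration, the predicted mean birth weight for each of the 4 covariate combinations mentioned above can be read directly off the estimated coefficients:

```python
# Predicted mean birth weight for each combination of the two binary covariates,
# using the estimated coefficients quoted above.
b0, b1, b2 = 3227, -341, -214

for smoking in (0, 1):
    for low_ses in (0, 1):
        expected_bw = b0 + b1 * smoking + b2 * low_ses
        print(f"smoking={smoking}, low SES={low_ses}: "
              f"predicted birth weight = {expected_bw} g")
```

If the model were fitted with `statsmodels` as in the earlier sketch, \(R^2\) is available as `results.rsquared`.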

There are many details about how to appropriately assess and quantify the predictive performance of a prediction model which we do not discuss here.

17.3 Causality and explanation

Suppose instead that the aim is to assess the causal effect of maternal smoking (\(X_{1}\)) on birth weight (\(Y\)). In the simple setting shown in the causal diagram above, maternal socioeconomic status (\(X_{2}\)) is the only confounder of the association between \(X_{1}\) and \(Y\).

The regression model in equation (1) adjusts for \(X_{2}\), and hence, under the assumptions encoded in the causal diagram, the coefficient for \(X_{1}\) can be interpreted as the conditional causal effect of \(X_{1}\) on \(Y\).

We can interpret the estimate as follows: if all mothers in the study population had smoked, the mean birth weight would have been 341 grams lower than if all mothers in the study population had not smoked. This is referred to as an ‘average causal effect’. The 95% confidence interval indicates how precisely we believe we have estimated this causal effect; in this case it runs from -513 to -169, so it excludes 0 and contains only negative values.
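The sketch below makes this interpretation concrete using standardisation (sometimes called the g-formula): we compare the model-predicted mean birth weight if every mother had smoked with the predicted mean if no mother had smoked. The proportion of low-SES mothers used here is a hypothetical value; because model (1) contains no smoking-by-SES interaction, the contrast does not depend on it and simply reproduces the coefficient on \(X_{1}\).

```python
# Average causal effect of smoking by standardisation, using the estimated
# coefficients quoted above. p_low_ses is a hypothetical proportion of low-SES
# mothers; with no interaction term in the model its value does not affect
# the contrast.
b0, b1, b2 = 3227, -341, -214
p_low_ses = 0.4  # hypothetical, for illustration only

def mean_birthweight(everyone_smokes):
    """Model-predicted mean birth weight if X1 were set to the same value for all mothers."""
    x1 = 1 if everyone_smokes else 0
    return (1 - p_low_ses) * (b0 + b1 * x1) + p_low_ses * (b0 + b1 * x1 + b2)

ace = mean_birthweight(True) - mean_birthweight(False)
print(f"{ace:.0f}")  # -341: equal to the coefficient on X1, as expected with no interaction
```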

Here, we have not given any interpretation of the estimate of \(\beta_{2}\) because it was not relevant to our research question, even though it was important to include \(X_{2}\) in the model to adjust for confounding. In a more realistic setting, there will be many other variables that confound the association between \(X_{1}\) and \(Y\) and which would need to be accounted for to enable a causal interpretation of \(\beta_{1}\).

17.4 The “Table 2 Fallacy”

After adjusting for maternal socioeconomic status, maternal smoking was associated with a 341 gram lower mean birth weight. After adjusting for maternal smoking, low maternal socioeconomic status was associated with a 214 gram lower mean birth weight. However, \(\beta_{1}\) and \(\beta_{2}\) in model (1) do not have the same type of interpretation, because of the relationships between the three variables.

  • Interpreting \(\beta_2\): According to the causal diagram above, maternal smoking status is on the causal pathway from socioeconomic status to birth weight. Hence the parameter \(\beta_2\) represents the effect of socioeconomic status on birth weight that does not go through smoking status: this is a ‘direct effect’ rather than a ‘total effect’ (see the illustration after this list).

  • Interpreting \(\beta_1\): By contrast, \(\beta_{1}\) represents the total effect of smoking status on birth weight. We do not go into details about definitions of different types of effect. The aim here is simply to point out that the correct interpretation of the coefficients in the regression model in (1) depends on assumptions about the inter-relationships between the three variables, including how they are ordered in time.
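Without going into formal definitions, a small illustration may help. Suppose, purely for illustration, that the effect of socioeconomic status on smoking could also be described by a simple linear (probability) model with coefficient \(\alpha_{1}\); this extra model is an assumption added here and is not part of the example above. Under the causal diagram, the total effect of \(X_{2}\) on \(Y\) would then split into a direct part and a part acting through smoking:

\[\begin{split} \begin{align*} X_{1} &= \alpha_{0} + \alpha_{1}X_{2} + \epsilon_{1} \\ \text{total effect of } X_{2} \text{ on } Y &= \underbrace{\beta_{2}}_{\text{direct effect}} + \underbrace{\alpha_{1}\beta_{1}}_{\text{effect via smoking}} \end{align*} \end{split}\]

By contrast, no part of the effect of \(X_{1}\) on \(Y\) passes through \(X_{2}\), which is why \(\beta_{1}\) can be read as a total effect.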

In some (or perhaps many) epidemiological investigations that involve exploration of risk factors, estimates of regression coefficients from multivariable models such as that in (1) (and versions with many more explanatory variables) are presented alongside one another in a table, together with confidence intervals and p-values. They may then be interpreted as though all coefficients had the same meaning, ignoring possible inter-relationships between the variables and temporal ordering. As we have seen from the above example, this could be misleading. This problem has been referred to in the literature as the ‘Table 2 fallacy’, because the estimates of regression coefficients are often presented in ‘Table 2’ of a paper (where ‘Table 1’ is usually a table of descriptive statistics). See Westreich and Greenland (2013) for a description of the Table 2 fallacy; Bandoli et al. (2018) provide an example in the context of preeclampsia and preterm birth.

17.5 Other analysis approaches

Regression modelling is a fundamental part of the statistician’s toolbox and is used in many investigations of different types. We have used regression modelling to illustrate the connection between the analysis method and the underlying aim of the investigation.

However, regression models are not the only tool available. You will come across many other types of analysis method, such as clustering or neural networks, which can also be used in various types of investigation.