11.3 Properties of different types of investigation
11.3.1 Description
In a descriptive investigation the data are used to provide a quantitative summary of features of the population of interest, or in other words the data are summarised in a compact way.
Simple descriptive analyses involve calculating proportions of individuals with a particular characteristic (e.g. males and females; smokers and non-smokers), or estimating features of the distribution of continuous variables (e.g. mean and variance of weight or blood pressure). The resulting information is then presented using tables and data visualisation.
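As a minimal sketch of such a simple descriptive analysis, the snippet below computes two proportions and the mean and sample variance of a continuous variable. The five records and all variable names are invented purely for illustration; they do not come from any real dataset.

```python
from statistics import mean, variance

# Hypothetical records: (sex, smoker, systolic blood pressure in mmHg).
# These values are made up for illustration only.
records = [
    ("F", True, 128.0),
    ("M", False, 135.0),
    ("F", False, 118.0),
    ("M", True, 142.0),
    ("F", False, 122.0),
]


def proportion(flags):
    """Proportion of True values in an iterable of booleans."""
    flags = list(flags)
    return sum(flags) / len(flags)


# Proportions of individuals with a particular characteristic
prop_female = proportion(sex == "F" for sex, _, _ in records)
prop_smoker = proportion(smoker for _, smoker, _ in records)

# Features of the distribution of a continuous variable
sbp = [bp for _, _, bp in records]
mean_sbp = mean(sbp)
var_sbp = variance(sbp)  # sample variance (divisor n - 1)

print(f"Proportion female: {prop_female:.2f}")
print(f"Proportion smokers: {prop_smoker:.2f}")
print(f"Mean SBP: {mean_sbp:.1f} mmHg, variance: {var_sbp:.1f}")
```

In practice these summaries would usually be reported in a table alongside the sample size and the amount of missing data for each variable.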
Some descriptive analyses may extend to the use of more complex methods of analysis. For example, the research question may concern how individuals within a population cluster together in terms of their dietary habits, requiring clustering methods. It may be of interest to estimate the expected survival time post-disease diagnosis in the presence of censored survival times, which would require survival analysis techniques.
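To illustrate how censoring can be handled in a descriptive analysis, here is a sketch of the Kaplan-Meier estimator of the survival function, one commonly used survival analysis technique. It is chosen here as an example and is not the only option. The function name `kaplan_meier` and the toy follow-up data are assumptions made for illustration.

```python
def kaplan_meier(times, events):
    """Kaplan-Meier estimate of the survival function.

    times  : follow-up time for each individual
    events : 1 if the event was observed at that time, 0 if censored
    Returns a list of (time, estimated survival probability) pairs,
    one for each distinct time at which at least one event occurred.
    """
    data = sorted(zip(times, events))
    n_at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        n_events = 0
        n_leaving = 0
        # Gather everyone whose follow-up ends at time t
        while i < len(data) and data[i][0] == t:
            n_events += data[i][1]
            n_leaving += 1
            i += 1
        if n_events > 0:
            # Multiply by the conditional probability of surviving past t
            surv *= 1 - n_events / n_at_risk
            curve.append((t, surv))
        # Both events and censored individuals leave the risk set after t
        n_at_risk -= n_leaving
    return curve


# Hypothetical follow-up times (years) and event indicators
times = [2, 3, 3, 5, 8]
events = [1, 0, 1, 1, 0]
km_curve = kaplan_meier(times, events)
for t, s in km_curve:
    print(f"t = {t}: S(t) = {s:.2f}")
```

Censored individuals contribute to the number at risk up to their censoring time but never count as events, which is how the estimator uses their partial information.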
All investigations should start with some basic descriptive analysis to gain an understanding of the features of the data at hand. It is at this stage that we can uncover challenges such as missing data, gain insights into how certain variables are distributed and, where relevant, understand the correlations between key variables, including identifying collinearities. Some investigations then go on to the main research question, which goes beyond description, while others are entirely descriptive and do not proceed to other questions.
Huebner et al. (2019) provide useful guidance on ‘initial data analysis’. See also Spiegelhalter (2019) for an accessible discussion of summarising and communicating descriptions of data.
11.3.2 Prediction
Prediction is about using data on some features of individuals to predict other features, with the aim of predicting the outcome for new or future observations. More formally, prediction is concerned with mapping data on variables \(X_{1}\), \(X_{2}\), … , \(X_{p}\) to an outcome \(Y\). The prediction model could be developed using statistical models such as regression, or using approaches that would be described as machine learning algorithms.
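The mapping from predictors to outcome can be sketched with the simplest possible case: a linear regression with a single predictor (\(p = 1\)), fitted by least squares on some development data and then applied to a new individual. The data, the function names and the choice of age and systolic blood pressure are all hypothetical; a real prediction model would typically use many predictors and would be validated on data not used for fitting.

```python
def fit_simple_linear(x, y):
    """Least-squares fit of y = intercept + slope * x for one predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def predict(intercept, slope, x_new):
    """Predicted outcome for a new individual with predictor value x_new."""
    return intercept + slope * x_new


# Hypothetical development data: age (years) and systolic blood pressure (mmHg)
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 130, 137, 141]

intercept, slope = fit_simple_linear(age, sbp)

# Predict the outcome for a new individual aged 55
sbp_pred = predict(intercept, slope, 55)
print(f"Predicted SBP at age 55: {sbp_pred:.2f} mmHg")
```

Note that the fitted slope is used here only as a means of predicting \(Y\); nothing in the fitting procedure justifies reading it as the effect of changing age.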
Results from prediction investigations are used for a range of purposes, for example to inform people of their risk or prognosis, or to identify people at high risk of an adverse event and hence take action such as more frequent screening (though the investigation will not tell us whether such screening would be effective).
Prediction models are typically developed using observational data. A well-known example is the Framingham Risk Score, which provides predictions of a person’s 10-year risk of developing coronary heart disease (D’Agostino et al., 2008).
There is a huge literature on prediction in the medical setting. See for example the books by Riley et al. (2019) and Steyerberg (2019).
11.3.3 Causality and explanation
In causal investigations we seek to understand the causal effect of one or more variables on an outcome. Hernan et al. (2019) describe this as “Using data to predict certain features of the world as if the world had been different”. For a simple example of a causal investigation, consider a continuous outcome \(Y\) (e.g. blood pressure) and a binary treatment variable \(X\), where \(X = 1\) denotes treated and \(X = 0\) denotes untreated. A causal investigation asks how the mean of \(Y\) would differ if all individuals had \(X = 1\) compared with if all individuals had \(X = 0\). In other words, if we could change \(X\), what would be the expected change in \(Y\)?
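This contrast can be made concrete with a small simulation. The sketch below generates, for each individual, the outcome they would have if untreated (`y0`) and if treated (`y1`). Outside a simulation we only ever observe one of the two for each person. It then compares the true mean difference with the difference in observed means when treatment is assigned at random. The sample size, the effect of -10 mmHg and all names are assumptions chosen for illustration.

```python
import random

random.seed(2024)

n = 10_000
TRUE_EFFECT = -10.0  # assumed effect of treatment on blood pressure (mmHg)

# Outcome each individual would have if untreated (X = 0) and if treated (X = 1)
y0 = [random.gauss(140.0, 10.0) for _ in range(n)]
y1 = [y + TRUE_EFFECT for y in y0]

# Randomised treatment assignment: X = 1 with probability 0.5
x = [random.random() < 0.5 for _ in range(n)]

# Only one of the two outcomes is observed for each individual
y_obs = [a if treated else b for a, b, treated in zip(y1, y0, x)]


def mean(values):
    return sum(values) / len(values)


# Mean of Y if everyone had X = 1 minus mean of Y if everyone had X = 0
true_ace = mean(y1) - mean(y0)

# Under randomisation, the difference in observed means estimates this contrast
treated_outcomes = [y for y, treated in zip(y_obs, x) if treated]
untreated_outcomes = [y for y, treated in zip(y_obs, x) if not treated]
estimated_ace = mean(treated_outcomes) - mean(untreated_outcomes)

print(f"True average causal effect:  {true_ace:.2f}")
print(f"Estimate from observed data: {estimated_ace:.2f}")
```

The estimate is close to the true value here only because treatment was assigned independently of `y0` and `y1`; if assignment depended on them, the simple difference in means would generally be biased.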
Questions such as this are arguably simple to answer using a randomized controlled trial, where there is no confounding of the treatment-outcome association. However, issues of drop-out and non-compliance are important to consider. Historically, some have considered answering causal questions to lie only in the domain of randomized experiments. However, for many important questions randomized experiments are not feasible or ethical. It is now recognised that causality is often the goal of investigations using observational data. See for example the paper of Hernan (2018), who wrote “being explicit about the causal objective of a study reduces ambiguity in the scientific question, errors in the data analysis, and excesses in the interpretation of the results”. The field of ‘causal inference’ has developed in recent decades, with particular advances in recent years, to enable this.
Shmueli (2010) equates causality with ‘explanation’, meaning explanation of the mechanisms by which one (or more) variable affects another. However, Hernan et al. (2019) make the point that we may be able to say that \(X\) causes \(Y\) without understanding the underlying mechanism. For example, we may find strong evidence from a trial that a drug is effective for a given outcome, even though the precise biological mechanisms through which the effect is transmitted are not well understood.
The variable of interest in a causal investigation could be use of a medical treatment (a drug) or application of a procedure. More generally it could be an ‘exposure’ such as ‘smoking’ or ‘exercising for at least 30 minutes per day’. The ‘hypothetical intervention’ of interest should be (reasonably) well defined, even if we could never intervene on it in reality (e.g. it would be impractical, not to say unethical, to intervene on smoking status). See Hernan (2016) for a discussion of related issues.
11.3.4 Is there a fourth investigation type?
There is arguably a fourth investigation type, which is concerned with exploring how several explanatory variables \(X_{1}\), … , \(X_{p}\) are associated with an outcome \(Y\). This might be described as an “exploration of risk factors” investigation. It may involve univariable analyses, looking at the association of each explanatory variable (“risk factor”) individually with the outcome, and multivariable analyses, which look at the association of several variables with the outcome in a single model. These types of analysis are typically carried out using observational data, and many (or perhaps most) epidemiological studies are investigations of this type, at least historically.
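The difference between univariable and multivariable associations can be sketched with simulated data. In the illustration below, the outcome is generated from `x2` alone, while `x1` is merely correlated with `x2`. The univariable analysis shows an association between `x1` and the outcome, which largely disappears once `x2` is included in the same linear model. The data-generating mechanism, the function names and the use of linear regression are all assumptions made for this example.

```python
import random

random.seed(1)
n = 5000

# Simulated data: y depends on x2 only; x1 is correlated with x2
x2 = [random.gauss(0.0, 1.0) for _ in range(n)]
x1 = [v + random.gauss(0.0, 1.0) for v in x2]
y = [2.0 * v + random.gauss(0.0, 1.0) for v in x2]


def center(v):
    m = sum(v) / len(v)
    return [vi - m for vi in v]


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def univariable_slope(x, y):
    """Least-squares slope from regressing y on a single variable x."""
    xc, yc = center(x), center(y)
    return dot(xc, yc) / dot(xc, xc)


def multivariable_slopes(xa, xb, y):
    """Least-squares slopes from regressing y on xa and xb together."""
    a, b, yc = center(xa), center(xb), center(y)
    s_aa, s_bb, s_ab = dot(a, a), dot(b, b), dot(a, b)
    s_ay, s_by = dot(a, yc), dot(b, yc)
    det = s_aa * s_bb - s_ab ** 2
    slope_a = (s_bb * s_ay - s_ab * s_by) / det
    slope_b = (s_aa * s_by - s_ab * s_ay) / det
    return slope_a, slope_b


uni_x1 = univariable_slope(x1, y)
uni_x2 = univariable_slope(x2, y)
multi_x1, multi_x2 = multivariable_slopes(x1, x2, y)

print(f"Univariable:   x1 = {uni_x1:.2f}, x2 = {uni_x2:.2f}")
print(f"Multivariable: x1 = {multi_x1:.2f}, x2 = {multi_x2:.2f}")
```

Here we know how the data were generated, so the pattern is easy to interpret. With real observational data, neither the unconditional nor the conditional associations can be read as causal effects without further assumptions.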
These types of investigation can be useful for understanding associations between variables in the population of interest and, as such, some may consider these analyses to be descriptive. However, as we all know, association is not causation! These types of investigation often do not consider the relative temporal ordering of explanatory variables, which means that interpretation of estimated associations as causal effects can be misleading. There is recent emphasis in the epidemiological literature on more principled investigations which are more explicit about the aim of the investigation.
As in a prediction investigation, the interest is in several explanatory variables. However, unlike in a prediction investigation, the aim is to explore quantitatively the unconditional and conditional associations of the explanatory variables with \(Y\), rather than purely to predict \(Y\). Unlike in a causal investigation, there is not a particular focus on a single variable. However, there is often an attempt to discuss the associations as though they may be causal even though an explicit causal question has not been posed.
Investigators should be wary of over-interpreting findings from “exploration of risk factors” investigations. If we are really interested in addressing a causal question, we should be explicit about that and carry out our analysis and interpretation accordingly.