Investigations and the role of regression modelling

Data scientists don’t do statistics just for fun (although statistics is, of course, fun!). At the heart of every data science project is a question to be answered.

This section of the notes begins by thinking about the different types of investigation that might be carried out within a data science project. We consider three important classes of investigation: descriptive, predictive and causal.

We then move on to consider a commonly used family of statistical analyses: regression modelling. Three sessions introduce linear regression, beginning with the simplest type, simple linear regression, which involves a single explanatory variable. We then extend this to incorporate multiple explanatory variables, through multivariable linear regression modelling. We explore how to model various types of explanatory variables, including continuous, binary and categorical covariates, and discover how to include interactions and higher-order terms (which are needed to model non-linear relationships) in the regression model. The last of the linear regression sessions explores diagnostics to assess whether the underlying assumptions of the linear model hold in a particular dataset.
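As a preview of simple linear regression, the following is a minimal sketch of fitting a straight line by ordinary least squares using only the Python standard library. The function name `fit_simple_linear_regression` and the toy data are illustrative assumptions, not part of these notes; the later sessions develop the method properly.

```python
def fit_simple_linear_regression(x, y):
    """Return (intercept, slope) of the least-squares line y = intercept + slope * x.

    Hypothetical helper for illustration only.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sum of squared deviations of x, and sum of cross-products of deviations.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0:
        raise ValueError("x must not be constant")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Toy data (made up for illustration) lying exactly on the line y = 1 + 2x.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.0, 5.0, 7.0, 9.0, 11.0]
intercept, slope = fit_simple_linear_regression(x, y)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

Because the toy data contain no noise, the fitted line recovers the generating intercept and slope; with real data the estimates would only approximate the underlying relationship.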

These ideas are then extended to other settings in the remaining two sessions. First, we meet logistic regression, an extension of linear regression modelling to settings where the outcome variable is binary. Finally, we define the Generalised Linear Model (GLM), which is a generalisation of linear regression to a wide range of settings and can be seen as a way of unifying linear, logistic and Poisson regression models, as well as many other types of regression model. We explore Poisson regression as an important example of a GLM.
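One way to see the unification is to write each model in terms of a link function applied to the mean of the outcome. The notation below ($Y$ for the outcome, $X_1, \dots, X_p$ for the explanatory variables, $\beta_j$ for the coefficients, $g$ for the link function) is assumed here for illustration and may differ from the notation adopted in the later sessions; the links shown are the standard canonical choices.

```latex
% General form of a GLM: a link function g relates the mean to a linear predictor.
\[
  g\bigl(\mathbb{E}[Y \mid X_1, \dots, X_p]\bigr)
    = \beta_0 + \beta_1 X_1 + \dots + \beta_p X_p .
\]

% Three special cases, writing \mu = \mathbb{E}[Y \mid X_1, \dots, X_p]:
\begin{align*}
  \text{Linear regression (identity link):}  \quad & g(\mu) = \mu, \\
  \text{Logistic regression (logit link):}   \quad & g(\mu) = \log\!\left(\frac{\mu}{1 - \mu}\right), \\
  \text{Poisson regression (log link):}      \quad & g(\mu) = \log(\mu).
\end{align*}
```

In each case the right-hand side is the same linear predictor; only the link function and the assumed distribution of the outcome change.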

We conclude this section of the notes by returning to the idea of investigations and – armed with our new knowledge about regression modelling – consider the role of regression modelling in different types of investigations.