13.2 Multivariable linear regression
Multivariable linear regression extends the simple linear regression model to situations in which we wish to relate two or more independent variables to one outcome. Where there are multiple independent variables, we will refer to them as covariates.
There are a number of reasons why we might want to add more covariates to our linear regression model. Recall these two examples of questions we might want to answer using statistical models (given at the beginning of this lesson):
Does taking drug A reduce inflammation more than taking drug B in patients with arthritis?
Can we predict the risk of heart disease for our patients?
In the first example, we could use a statistical model with inflammation as the outcome and drug use as the independent variable of interest, but we may need to control (or adjust) for the confounding effects of other patient characteristics (age, gender, other medication use, etc.). In the second example, there are many different factors that could be associated with risk of heart disease (age, gender, lifestyle choices, etc.) and we may wish to include all such factors in a statistical model to predict heart disease.
Here, we introduce the multivariable linear regression model and describe how to estimate and interpret the parameters in the model using an example from the birthweight data.
Note
A note on terminology. There can be some confusion between the terms multivariable models and multivariate models. Multivariate models are those which have more than one outcome variable. Such models are beyond the scope of this module; we focus our attention on univariate models, which have only one outcome. Both simple linear regression models and multivariable linear regression models are considered univariate.
13.2.1 The multivariable linear regression model
Suppose we wish to relate an outcome (\(Y\)) to \(p\) predictor variables \((X_1, X_2, ..., X_p)\). The appropriate multivariable linear regression model is a straightforward extension of the simple linear regression model:

\[ y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + ... + \beta_p x_{pi} + \epsilon_i, \qquad \epsilon_i \sim N(0, \sigma^2) \]

where \(y_i\) is the value of the dependent variable for the ith participant and \(x_{ji}\) is the value of the jth predictor variable for the ith participant.
The parameters in the model are interpreted as follows:
\(\beta_0\) is the intercept. It is the expectation of \(Y\) when all the \(X_j\)’s are zero.
\(\beta_j\) is the expected change in \(Y\) for a 1 unit increase in \(X_j\) with all the other covariates held constant.
The \(\beta_j\)’s are the regression coefficients (otherwise known as partial regression coefficients). Each one measures the effect of one covariate controlled (or adjusted) for all of the others.
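To make the "holding all other covariates constant" interpretation concrete, the following Python sketch uses hypothetical fitted coefficients (all numbers are made up for illustration; the lesson itself uses R) and shows that increasing one covariate by 1 unit, with the other fixed, changes the expected outcome by exactly that covariate's coefficient:

```python
# Hypothetical fitted coefficients for a model with two covariates:
# E[Y] = b0 + b1*x1 + b2*x2   (values are illustrative, not from real data)
b0, b1, b2 = 2.0, 0.5, -1.5

def expected_y(x1, x2):
    """Expected outcome under the (hypothetical) fitted model."""
    return b0 + b1 * x1 + b2 * x2

# Increase x1 by 1 unit while holding x2 constant at 10:
change = expected_y(4, 10) - expected_y(3, 10)
print(change)  # 0.5, i.e. the partial regression coefficient b1
```

The difference equals \(\hat{\beta}_1\) regardless of the value at which \(X_2\) is held.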
13.2.2 The multivariable linear regression model in matrix notation
Similarly to the simple linear regression model, the multivariable linear regression model can be expressed using matrix algebra.
\[ \mathbf{Y} = \mathbf{X}\mathbf{\beta} + \mathbf{\epsilon}, \qquad \mathbf{\epsilon} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}) \]

In this formulation, \(\mathbf{X}\) is an \(n \times (p+1)\) matrix whose first column is a column of ones (corresponding to the intercept) and whose remaining columns contain the values of the \(p\) covariates; \(\mathbf{Y}\) and \(\mathbf{\epsilon}\) are vectors of length \(n\) whilst \(\mathbf{\beta}\) is a vector of length \((p+1)\).
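As a sketch of this construction (in Python with NumPy rather than R, and with arbitrary toy values), the design matrix is built by prepending a column of ones to the covariate columns:

```python
import numpy as np

# Toy data: n = 4 observations on p = 2 covariates (values are arbitrary).
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])

# Design matrix X: a leading column of ones (for the intercept) followed by
# one column per covariate, giving an n x (p + 1) matrix.
X = np.column_stack([np.ones_like(x1), x1, x2])
print(X.shape)  # (4, 3), i.e. n x (p + 1)
```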
13.2.3 Estimation of the parameters
The regression coefficients in multivariable linear regression can be estimated by minimising the residual sum of squares:

\[ SS_{RES} = \sum_{i=1}^{n} \left( y_i - \beta_0 - \beta_1 x_{1i} - ... - \beta_p x_{pi} \right)^2 \]
The closed form solution, obtained by solving the \((p+1)\) simultaneous equations that result from setting the partial derivatives of the above equation with respect to each parameter to zero, can be written succinctly using matrix notation:

\[ \mathbf{\hat{\beta}} = (\mathbf{X}'\mathbf{X})^{-1} \mathbf{X}'\mathbf{Y} \]
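A minimal numerical sketch of the closed-form solution, using NumPy and simulated data (not the birthweight data; the true coefficients below are chosen arbitrarily):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations from a known model with p = 2 covariates.
n, p = 200, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta_true = np.array([1.0, 2.0, -0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Closed-form least-squares estimate: beta_hat = (X'X)^{-1} X'y.
# (Solving the normal equations is preferred over explicitly inverting X'X.)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Agrees with NumPy's built-in least-squares routine.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # True
```

With this much low-noise data, `beta_hat` also lands very close to `beta_true`.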
\(\mathbf{\hat{\beta}}\) is an unbiased estimator of \(\mathbf{\beta}\). Its distribution is as follows:

\[ \mathbf{\hat{\beta}} \sim N\left( \mathbf{\beta}, \ (\mathbf{X}'\mathbf{X})^{-1} \sigma^2 \right) \]
This expresses the fact that the elements of \(\mathbf{\hat{\beta}}\) follow a multivariate normal distribution whose variances and covariances are given by \(\mathbf{(X'X)^{-1}\sigma^2}\).
It can also be shown that the following is an unbiased estimator for \(\sigma^2\):

\[ \hat{\sigma}^2 = \frac{SS_{RES}(\mathbf{\hat{\beta}})}{n - p - 1} = \frac{1}{n - p - 1} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2 \]

where \(\hat{y}_i\) is the fitted value for the ith participant.
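The estimator for \(\sigma^2\) can be sketched numerically as well (again in Python with simulated data; the true error variance of 4 is chosen for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate from a model with known error standard deviation 2 (sigma^2 = 4).
n, p = 500, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])
beta = np.array([0.5, 1.0, -1.0, 2.0])
y = X @ beta + rng.normal(scale=2.0, size=n)

# Fit by least squares, then estimate sigma^2 as RSS / (n - p - 1).
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p - 1)
print(sigma2_hat)  # close to the true value of 4
```

Dividing by \(n - p - 1\) rather than \(n\) corrects for the \(p + 1\) parameters estimated from the data, which is what makes the estimator unbiased.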
While it is useful to know how these parameters are estimated, in practice they are often obtained using statistical software. Next, we demonstrate how to perform multivariable regression in R using the birthweight data and discuss the interpretation of the estimated regression coefficients.