13.1 Categorical independent variables¶
We have explored simple linear regression with a continuous independent variable and with a binary variable. We now extend these ideas to include a categorical independent variable.
We will return to the baby example and use linear regression to explore the association between maternal Body Mass Index (BMI) category and the baby’s birthweight.
13.1.1 Dummy variables¶
We have height measured in inches and weight measured in pounds. BMI is obtained using the formula \(BMI = 703 \times \text{weight (lb)} / \text{height (in)}^2\). We will then categorise BMI according to the World Health Organisation’s classification.
We will define a categorical variable \(c_{i}\) denoting the \(i^{th}\) mother’s BMI category, defined as follows:

\[
c_i = \begin{cases} 1 & \text{if } BMI < 18.5 \text{ (underweight)} \\ 2 & \text{if } 18.5 \leq BMI < 25 \text{ (normal weight)} \\ 3 & \text{if } 25 \leq BMI < 30 \text{ (overweight)} \\ 4 & \text{if } BMI \geq 30 \text{ (obese)} \end{cases}
\]
Our BMI categorical variable has four categories. To distinguish between all four categories we need three dummy variables. We choose a baseline or reference group, which for us will be the underweight category. For each of the other categories, we create a dummy variable which indicates whether or not the woman is in that category. Specifically, we define our dummy variables as:

\[
W_{1i} = \begin{cases} 1 & \text{if } c_i = 2 \\ 0 & \text{otherwise} \end{cases}
\]

and

\[
W_{2i} = \begin{cases} 1 & \text{if } c_i = 3 \\ 0 & \text{otherwise} \end{cases}
\]

and

\[
W_{3i} = \begin{cases} 1 & \text{if } c_i = 4 \\ 0 & \text{otherwise} \end{cases}
\]
The R code below reads in the baby data and creates variables containing the mother’s BMI (Maternal.BMI) and the mother’s BMI category (Maternal.BMIcat).
data <- read.csv('https://www.inferentialthinking.com/data/baby.csv')
# Calculate maternal BMI (with conversion factor due to measurement in lb and in)
data$Maternal.BMI <- 703*data$Maternal.Pregnancy.Weight/(data$Maternal.Height)**2
# Categorise the BMI values
data$Maternal.BMIcat <- 1
data$Maternal.BMIcat[data$Maternal.BMI >= 18.5 & data$Maternal.BMI < 25] <- 2
data$Maternal.BMIcat[data$Maternal.BMI >= 25 & data$Maternal.BMI < 30] <- 3
data$Maternal.BMIcat[data$Maternal.BMI >= 30] <- 4
# Tabulate the BMI categories
table(data$Maternal.BMIcat)
1 2 3 4
84 932 124 34
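Although we will let R handle the dummy coding automatically via factor() below, the three dummy variables can also be constructed explicitly. The sketch below (assuming the data frame created above; the names W1, W2, W3 are ours) shows that a model fitted on the explicit dummies reproduces the factor-based fit:

```r
# Create the three dummy variables explicitly, with the underweight
# group (category 1) as the baseline
data$W1 <- as.numeric(data$Maternal.BMIcat == 2)  # normal weight
data$W2 <- as.numeric(data$Maternal.BMIcat == 3)  # overweight
data$W3 <- as.numeric(data$Maternal.BMIcat == 4)  # obese
# Fitting on these dummies gives the same estimates as factor(Maternal.BMIcat)
coef(lm(Birth.Weight ~ W1 + W2 + W3, data = data))
```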
13.1.2 The model¶
A linear regression model relating birthweight (\(Y\), the outcome) to the three dummy variables (\(W_1, W_2, W_3\)) representing the mother’s BMI category (\(C\), the categorical independent variable) is defined as:

\[
y_i = \beta_0 + \beta_1 W_{1i} + \beta_2 W_{2i} + \beta_3 W_{3i} + \epsilon_i, \quad \epsilon_i \sim N(0, \sigma^2)
\]

The equation above can also be written as follows:

\[
E(Y_i) = \begin{cases} \beta_0 & \text{if } c_i = 1 \\ \beta_0 + \beta_1 & \text{if } c_i = 2 \\ \beta_0 + \beta_2 & \text{if } c_i = 3 \\ \beta_0 + \beta_3 & \text{if } c_i = 4 \end{cases}
\]
This makes explicit the interpretation of the parameters in the model.
\(\beta_0\) is the expectation of \(Y\) when \(C=1\)
\(\beta_0 + \beta_1\) is the expectation of \(Y\) when \(C=2\). Hence \(\beta_1\) is the difference in the expectation of \(Y\) between groups defined by \(C=1\) and \(C=2\).
\(\beta_0 + \beta_2\) is the expectation of \(Y\) when \(C=3\). Hence \(\beta_2\) is the difference in the expectation of \(Y\) between groups defined by \(C=1\) and \(C=3\).
\(\beta_0 + \beta_3\) is the expectation of \(Y\) when \(C=4\). Hence \(\beta_3\) is the difference in the expectation of \(Y\) between groups defined by \(C=1\) and \(C=4\).
Notes
In this parameterisation of the model, the group defined by \(C=1\) is often referred to as the baseline group. There is no statistical reason why one group rather than another should be chosen as the baseline group. It can sometimes be desirable to re-parameterise a model of this type to estimate parameters representing differences in mean levels from a particular baseline group. In this example, for instance, we might instead want the group with normal weight to be our baseline group, in which case we would need to redefine our first dummy variable.
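In practice, R's relevel() function changes the reference level of a factor, achieving this re-parameterisation without redefining the dummy variables by hand. A minimal sketch, assuming the data frame created earlier (the variable name Maternal.BMIcat.f is ours):

```r
# Make the normal-weight group (category 2) the baseline instead
data$Maternal.BMIcat.f <- relevel(factor(data$Maternal.BMIcat), ref = "2")
# The intercept is now the mean birthweight in the normal-weight group,
# and the remaining coefficients are differences from that group
lm(Birth.Weight ~ Maternal.BMIcat.f, data = data)
```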
Note that, in contrast to the models that we have met so far, this has more than one variable in it (even though the three dummy variables together measure a single characteristic). Therefore, this is no longer strictly a simple linear regression model. It is an example of a multivariable linear regression model. We will discuss general theory for this model later. Broadly speaking, the ideas we have met in the context of simple linear regression extend to this more general model very naturally.
# Model 3: Relating birthweight to mother's BMI category.
model3<-lm(Birth.Weight~factor(Maternal.BMIcat), data=data)
summary(model3)
Call:
lm(formula = Birth.Weight ~ factor(Maternal.BMIcat), data = data)
Residuals:
Min 1Q Median 3Q Max
-64.828 -10.828 0.172 11.172 57.677
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 115.286 1.996 57.752 <2e-16 ***
factor(Maternal.BMIcat)2 4.543 2.084 2.180 0.0295 *
factor(Maternal.BMIcat)3 3.037 2.585 1.175 0.2404
factor(Maternal.BMIcat)4 8.626 3.719 2.320 0.0205 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 18.3 on 1170 degrees of freedom
Multiple R-squared: 0.006152, Adjusted R-squared: 0.003604
F-statistic: 2.414 on 3 and 1170 DF, p-value: 0.06511
\(\hat{\beta}_0 = 115.29\). This is interpreted as the estimated mean birthweight (in oz) of a baby with all “dummy” variables equal to 0, i.e. it is the estimated mean birthweight of babies in our baseline category (those with mothers who are underweight).
\(\hat{\beta}_1= 4.543\). The mean birthweight is estimated to increase by 4.5 oz per unit increase in the first “dummy” variable. A unit increase in the first dummy variable equates to moving from the underweight group to the normal weight group. So we can interpret this as the difference in mean birthweights between the group whose mothers have normal BMI and those whose mothers are underweight.
\(\hat{\beta}_2= 3.037\). The mean birthweight is estimated to increase by 3.0 oz per unit increase in the second “dummy” variable. A unit increase in the second dummy variable equates to moving from the underweight group to the overweight group. So we can interpret this as the difference in mean birthweights between the group whose mothers are overweight and those whose mothers are underweight.
\(\hat{\beta}_3= 8.626\). The mean birthweight is estimated to increase by 8.6 oz per unit increase in the third “dummy” variable. A unit increase in the third dummy variable equates to moving from the underweight group to the obese group. So we can interpret this as the difference in mean birthweights between the group whose mothers are obese and those whose mothers are underweight.
Overall, we see a pattern of higher maternal BMI being associated with higher birthweights, particularly for the group with obese mothers.
\(\hat{\sigma}=18.3\). The observed outcomes are scattered around the fitted values with a standard deviation of 18.3 oz.
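Because the model contains only the dummy variables for a single categorical variable, the fitted values are simply the sample mean birthweights within each BMI category. As a check (assuming the data frame created earlier), the coefficient estimates can be recovered directly from the group means:

```r
# Mean birthweight within each BMI category
group.means <- tapply(data$Birth.Weight, data$Maternal.BMIcat, mean)
group.means
# Differences from the baseline (underweight) group: these equal the
# estimated coefficients beta_1, beta_2 and beta_3
group.means - group.means["1"]
```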
We can obtain confidence intervals around the three BMI estimates as follows:
confint(model3, level=0.95)
|                          | 2.5 %       | 97.5 %     |
|--------------------------|-------------|------------|
| (Intercept)              | 111.3691529 | 119.202276 |
| factor(Maternal.BMIcat)2 | 0.4533602   | 8.631864   |
| factor(Maternal.BMIcat)3 | -2.0356770  | 8.109410   |
| factor(Maternal.BMIcat)4 | 1.3296865   | 15.922414  |
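These intervals can also be computed by hand from the model summary, using the usual estimate \(\pm\) t \(\times\) standard error construction with the residual degrees of freedom. A sketch for the first dummy variable, assuming model3 has been fitted as above:

```r
# 95% CI for the normal-weight vs underweight coefficient, by hand
est <- coef(summary(model3))["factor(Maternal.BMIcat)2", "Estimate"]
se  <- coef(summary(model3))["factor(Maternal.BMIcat)2", "Std. Error"]
est + c(-1, 1) * qt(0.975, df = df.residual(model3)) * se
```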
The R output from the model provides p-values for each of the three coefficients relating maternal BMI to birth weight. However, we are typically interested in the broad question of whether maternal BMI is related to birth weight, rather than whether an individual dummy variable is related to the outcome.
Therefore, when we have a categorical variable in a regression model, the hypothesis of interest usually relates to the combination of all dummy variables representing the categorical variable. An appropriate hypothesis test jointly tests the hypothesis that all coefficients for dummy variables are zero, i.e.
\(H_0: \beta_1 = 0, \beta_2 = 0, \beta_3 = 0\)
\(H_1: \text{at least one of } \ \beta_1, \beta_2, \beta_3 \neq 0\)
We use a partial \(F\)-test to test this hypothesis. Details are beyond the scope of the course, but are outlined in the appendix to this session. The R code to obtain the joint p-value is shown below.
# Remove maternal BMI from the model (i.e. a constant-only model)
model3_without<-lm(Birth.Weight~1, data=data)
anova(model3, model3_without)
| Res.Df | RSS      | Df | Sum of Sq | F        | Pr(>F)     |
|--------|----------|----|-----------|----------|------------|
| 1170   | 391633.5 | NA | NA        | NA       | NA         |
| 1173   | 394057.9 | -3 | -2424.344 | 2.414232 | 0.06510651 |
In this case we have a p-value of 0.065, indicating only weak evidence against the null hypothesis of no association between maternal BMI category and the baby’s birthweight.
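To make the connection to the output explicit, the F-statistic can be reconstructed from the two residual sums of squares in the table above. A sketch, assuming both models have been fitted as shown:

```r
rss0 <- sum(residuals(model3_without)^2)  # RSS of the constant-only model
rss1 <- sum(residuals(model3)^2)          # RSS of the model with BMI category
# Partial F-test: 3 extra parameters, 1170 residual degrees of freedom
Fstat <- ((rss0 - rss1) / 3) / (rss1 / df.residual(model3))
pf(Fstat, 3, df.residual(model3), lower.tail = FALSE)  # joint p-value
```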
13.1.3 Categorising continuous variables¶
In the above example, we have categorised a continuous variable (BMI) in order to demonstrate how a categorical variable should be included in a linear regression model. This is important to know, since there are many variables that are categorical by definition and may be required for a statistical analysis. For example: cancer stage, ethnicity, education level, etc. While these examples should be included as categorical variables in a linear regression model, it is not, in general, recommended to categorise a continuous variable in a linear model. We did so above purely for pedagogical reasons.
One of the problems with categorising continuous variables is that it is difficult to decide what the cut-off for each category should be. In the example above, however, there are widely used categorisations.
We often lose information by categorising continuous variables. We can often obtain a better and more parsimonious fit (using fewer parameters to describe the relationship) by modelling the continuous variable without categorisation. We will return to these ideas later.
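As an illustration of the alternative, BMI can be entered into the model directly as a continuous variable, using a single slope parameter in place of the three dummy-variable coefficients (the name model4 is ours, and the data frame is assumed to be the one created earlier):

```r
# Treat maternal BMI as continuous: one slope parameter instead of
# three dummy-variable coefficients
model4 <- lm(Birth.Weight ~ Maternal.BMI, data = data)
summary(model4)
```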