12.7 Example: binary independent variable

We now return to our second example, where we are interested in the association between birthweight and the mother’s smoking status. In exploratory analyses, we saw that mothers who do not smoke give birth to heavier babies, on average, than mothers who do smoke. We will now use a simple linear regression model to further explore this association.

The outcome variable is the same as for our previous example. A key difference, however, is that the independent variable is a binary variable.

12.7.1 The model

We have a continuous outcome variable and a binary independent variable. To include this binary variable in the model, we create a dummy variable that takes the value 1 if the mother smokes and 0 if the mother doesn’t smoke:

\[\begin{split} s_{i} = \begin{cases} 1 & \text{ if the $i^{th}$ baby's mother smokes} \\ 0 & \text{ if the $i^{th}$ baby's mother does not smoke} \end{cases} \end{split}\]
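As a minimal sketch of this coding in R, the snippet below builds the dummy variable by hand and checks that it matches the coding factor() produces. The data frame toy and its values are made up purely for illustration; the column name Maternal.Smoker and the labels "True"/"False" are assumed to mirror the dataset used later in this section.

```r
# Hypothetical toy data (NOT the real birthweight data)
toy <- data.frame(
  Birth.Weight    = c(120, 113, 128, 108, 136, 115),
  Maternal.Smoker = c("False", "True", "False", "True", "False", "True")
)

# Dummy variable s_i: 1 if the mother smokes, 0 if she does not
toy$s <- as.integer(toy$Maternal.Smoker == "True")

# factor() sorts the levels alphabetically, so "False" becomes the
# reference level and the design matrix gets a 0/1 column for "True"
X <- model.matrix(~ factor(Maternal.Smoker), data = toy)

# Check that the hand-made dummy matches the column created by factor()
all(X[, 2] == toy$s)
```

Because the reference level is the non-smoking group, the coefficient R reports for the "True" level plays the role of \(\beta_1\) in the model.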

We then define the following linear regression model:

\[ \text{Model 2: } y_i = \beta_0 + \beta_1 s_i + \epsilon_i \]

When including a binary (or categorical) variable in a linear regression in R, we can tell R to treat it as a factor variable using factor():

# Example 2: Investigating the relationship between birthweight and mother's smoking status.
data <- read.csv('https://www.inferentialthinking.com/data/baby.csv')
model2 <- lm(Birth.Weight ~ factor(Maternal.Smoker), data = data)
summary(model2)
Call:
lm(formula = Birth.Weight ~ factor(Maternal.Smoker), data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-68.085 -11.085   0.915  11.181  52.915 

Coefficients:
                            Estimate Std. Error t value Pr(>|t|)    
(Intercept)                 123.0853     0.6645 185.221   <2e-16 ***
factor(Maternal.Smoker)True  -9.2661     1.0628  -8.719   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 17.77 on 1172 degrees of freedom
Multiple R-squared:  0.06091,	Adjusted R-squared:  0.06011 
F-statistic: 76.02 on 1 and 1172 DF,  p-value: < 2.2e-16
  • \(\hat{\beta}_0 = 123.09\). This is interpreted as the estimated mean birthweight (in oz) of a baby with “dummy” variable equal to 0, i.e. it is the estimated mean birthweight of babies whose mothers do not smoke.

  • \(\hat{\beta}_1=-9.27\). The mean birthweight is estimated to decrease by 9.27oz per unit increase in the “dummy” variable. A unit increase in the dummy variable equates to moving from the non-smoking group to the smoking group, so we can interpret this as the estimated difference in mean birthweight between the two groups (smokers minus non-smokers).

  • \(\hat{\sigma}=17.77\). The observed outcomes are scattered around the fitted regression line with a standard deviation of 17.77oz.
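The interpretations above can be checked numerically. The following sketch uses a small made-up dataset (the vectors weight and smoker are hypothetical, not the real birthweight data) to show that, with a single 0/1 independent variable, the fitted intercept is the sample mean of the group coded 0 and the fitted slope is the difference between the two group means.

```r
# Hypothetical toy data (NOT the real birthweight data)
weight <- c(120, 113, 128, 108, 136, 115)  # birthweight in oz
smoker <- c(0, 1, 0, 1, 0, 1)              # dummy variable s_i

fit <- lm(weight ~ smoker)

# Sample means in each group
mean_nonsmoker <- mean(weight[smoker == 0])
mean_smoker    <- mean(weight[smoker == 1])

# Intercept: mean of the group with s_i = 0
all.equal(unname(coef(fit)[1]), mean_nonsmoker)

# Slope: difference in means (smokers minus non-smokers)
all.equal(unname(coef(fit)[2]), mean_smoker - mean_nonsmoker)

# Estimated residual standard deviation (sigma-hat)
sigma(fit)
```

With a binary independent variable the "regression line" simply joins the two group means, so the residual standard deviation describes the scatter of observations around their own group's mean.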

Exercises

  1. Perform a hypothesis test of the null hypothesis that there is no association between maternal smoking and birthweight. Write down the null hypothesis, the test statistic and the p-value. Interpret your p-value.

  2. Calculate (manually or using R) a 95% confidence interval for the difference in mean birthweight between babies whose mothers smoke and babies whose mothers do not.


Solution:

  1. The null hypothesis is \(H_0: \beta_1 = 0\). The test statistic (from the R output) is \(T=-8.72\). We can check that this is the ratio of the estimated slope to its standard error (-9.2661/1.0628 = -8.72). The p-value is \(p<2\times 10^{-16}\). In health applications, we typically write \(p<0.001\). We interpret this as strong evidence against the null hypothesis. Therefore, we can conclude that there is an association between birthweight and maternal smoking.

  2. This is a large sample, so an approximate 95% confidence interval can be obtained as the estimate plus or minus 1.96 times its standard error. This is:

\[ -9.2661 \pm 1.96 \times 1.0628 = -9.2661 \pm 2.0831 \]

This gives an interval from -11.35 to -7.18.

Alternatively, in R: confint(model2, parm=2, level=0.95)
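
The manual calculations in this solution can also be reproduced directly from the numbers reported in the R output, without refitting the model. This is only a sketch: the variable names est, se and df are introduced here for illustration, and their values are copied from the summary(model2) output above. The last line uses the exact t quantile rather than 1.96, which is the approach confint() takes for a linear model.

```r
# Values copied from the summary(model2) output above
est <- -9.2661   # estimated slope (difference in mean birthweight, oz)
se  <- 1.0628    # standard error of the slope
df  <- 1172      # residual degrees of freedom

# Test statistic: estimate divided by its standard error
t_stat <- est / se

# Two-sided p-value from the t distribution
p_value <- 2 * pt(-abs(t_stat), df = df)

# Approximate 95% CI using the normal quantile (about 1.96)
ci_normal <- est + c(-1, 1) * qnorm(0.975) * se

# 95% CI using the t quantile on 1172 degrees of freedom
ci_t <- est + c(-1, 1) * qt(0.975, df = df) * se

round(t_stat, 2)
round(ci_normal, 2)
round(ci_t, 2)
```

With 1172 degrees of freedom the t quantile is very close to 1.96, so the two intervals agree to two decimal places.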