12.6 Inference for the slope

Most commonly, we wish to conduct statistical inference on the estimated slope. Consequently, we focus our attention here on \(\beta_1\), but it is possible to apply the same methods to the intercept, \(\beta_0\).

For our statistical inference, we will view the values of the independent variable (\(x_1, x_2, ..., x_n\)) as fixed quantities.

12.6.1 Sampling distribution of estimated slope

Knowing the sampling distribution of \(\hat{\beta}_1\) allows us to perform hypothesis tests and construct confidence intervals for \(\beta_1\). Therefore, we now obtain that sampling distribution. The estimated slope, \(\hat{\beta}_1\), is a linear combination of the observed outcome values:

\[ \begin{align} \hat{\beta}_1 &= \sum_{i=1}^n \left(\frac{(x_i-\bar{x})}{\sum_{i=1}^n(x_i-\bar{x})^2}(y_i-\bar{y})\right) \end{align} \]

We can simplify this by substituting in \((y_i - \bar{y})=\beta_1(x_i-\bar{x})+(\epsilon_i-\bar{\epsilon})\), which follows from the model \(y_i = \beta_0 + \beta_1 x_i + \epsilon_i\) (the intercept \(\beta_0\) cancels when we subtract the mean \(\bar{y} = \beta_0 + \beta_1\bar{x} + \bar{\epsilon}\)). This allows us to write the estimated slope as a function of the random error (\(\epsilon\)):

\[ \begin{align} \hat{\beta}_1 &=\beta_1 + \sum_{i=1}^n \left(\frac{(x_i-\bar{x})}{\sum_{i=1}^n(x_i-\bar{x})^2}(\epsilon_i-\bar{\epsilon})\right) \end{align} \]

Since the estimated parameter is a linear combination of the \(\epsilon_i\), and \(\epsilon_i \overset{\small{iid}}{\sim} N(0, \sigma^2)\), the estimated parameter itself is normally distributed, with a distribution centred around the true value, \(\beta_1\). More specifically,

\[ \begin{align} \hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma^2}{SS_{xx}}\right), \qquad \mbox{with} \qquad SS_{xx} = \sum_{i=1}^n(x_i-\bar{x})^2 \end{align} \]

So the standard error of the slope is \(SE(\hat{\beta}_1) = \sqrt{\frac{\sigma^2}{SS_{xx}}}\). The standard error (and therefore also the variance) of \(\hat{\beta}_1\):

  • increases with the size of the error variance (as might be expected intuitively),

  • decreases with increasing sample size (larger sample sizes give more precise estimates) and

  • decreases as \(SS_{xx}\) increases (a wider range of \(x\) values leads to more precision in the slope estimate).
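
To build intuition, we can check this result by simulation. The following sketch (using made-up parameter values) repeatedly generates data from the model with fixed \(x\) values and compares the spread of the estimated slopes with the theoretical standard error:

# Simulation sketch (hypothetical parameter values) illustrating the
# sampling distribution of the estimated slope
set.seed(42)
beta0 <- 2; beta1 <- 0.5; sigma <- 3   # assumed true values
x <- seq(1, 20, length.out = 50)       # fixed x values
SSxx <- sum((x - mean(x))^2)
slopes <- replicate(5000, {
  y <- beta0 + beta1 * x + rnorm(length(x), 0, sigma)
  coef(lm(y ~ x))[2]                   # estimated slope
})
mean(slopes)          # close to the true slope, beta1
sd(slopes)            # close to the theoretical standard error
sqrt(sigma^2 / SSxx)  # theoretical standard error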

We have seen that the sampling distribution of \(\hat{\beta}_1\) follows a normal distribution. We can convert this to a standard normal distribution:

\[ \begin{align} \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\sigma^2/SS_{xx}}} \sim N(0,1) \end{align} \]

Of course, we do not know the true value of the error variance, \(\sigma\). When we replace this with our sample estimate, this changes the distribution above from a normal to a t-distribution. For large samples, these two distributions are very similar but for smaller samples the t-distribution has larger tails (suggesting that in smaller samples needing to estimate \(\sigma\) leads to more variability in the estimated slope, which makes sense intuitively).
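
We can see this in R by comparing quantiles of the two distributions (a quick illustration; the degrees of freedom shown are arbitrary):

# The t-distribution has heavier tails than the standard normal,
# but approaches it as the degrees of freedom grow
qnorm(0.975)          # 1.96
qt(0.975, df = 10)    # about 2.23 (small sample: heavier tails)
qt(0.975, df = 1000)  # about 1.96 (large sample: close to normal)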

Therefore, replacing \(\sigma\) by the estimate \(\hat{\sigma}\), we have:

\[ \begin{align} \frac{\hat{\beta}_1 - \beta_1}{\sqrt{\hat{\sigma}^2/SS_{xx}}} \sim t_{n-2} \end{align} \]

This t-distribution is the one we will use to obtain p-values and confidence intervals for the slope parameter.

12.6.2 Hypothesis testing

Typically, we are interested in assessing whether there is a relationship, or association, between the independent variable \(X\) and outcome \(Y\). Recall our model: \(E[Y | X=x] = \beta_0 + \beta_1 x\). Within the framework of this linear model, no association between \(X\) and \(Y\) would be reflected by a value of \(\beta_1 = 0\).

Therefore, we are typically interested in testing the null hypothesis \(H_0: \beta_1=0\) against the alternative \(H_1: \beta_1 \neq 0\).

Under our null hypothesis, the following test statistic follows a t-distribution, as we saw above:

\[ T = \frac{\hat{\beta}_1 - 0}{\sqrt{\hat{\sigma}^2/SS_{xx}}} \sim t_{n-2} \]

We now follow the familiar process from the session on hypothesis testing: we evaluate \(T\) in our particular sample and then calculate the probability of obtaining that value, or one more extreme, under a \(t\)-distribution with \(n-2\) degrees of freedom.

Notes

\(T\) is simply the estimate divided by its standard error. This is a familiar form of test statistic which we saw in the session about hypothesis testing.

Although we are typically interested in the null hypothesis \(H_0: \beta_1=0\), we could use the same approach to test other null hypotheses, such as \(H_0: \beta_1=5\). However, it is rare that we are interested in values other than 0, so by default the p-values reported by statistical software arise from testing the null hypothesis value of 0.

Example

Returning to our linear model relating birthweight to gestational days in the baby dataset, we will now test the null hypothesis that there is no association between gestational days and birthweight, i.e. \(H_0: \beta_1 = 0\) in Model 1.

First, we rerun the code to reproduce the linear model output.

# Load the baby dataset and refit the linear model
data <- read.csv('https://www.inferentialthinking.com/data/baby.csv')
model1 <- lm(Birth.Weight ~ Gestational.Days, data = data)
summary(model1)
Call:
lm(formula = Birth.Weight ~ Gestational.Days, data = data)

Residuals:
    Min      1Q  Median      3Q     Max 
-49.348 -11.065   0.218  10.101  57.704 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)      -10.75414    8.53693   -1.26    0.208    
Gestational.Days   0.46656    0.03054   15.28   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 16.74 on 1172 degrees of freedom
Multiple R-squared:  0.1661,	Adjusted R-squared:  0.1654 
F-statistic: 233.4 on 1 and 1172 DF,  p-value: < 2.2e-16

In the above output, the column Std. Error gives the standard errors of the estimated intercept and slope.

The columns t value and Pr(>|t|) give the test statistic and associated \(p\)-value for a hypothesis test, testing the null hypothesis that \(H_0: \beta_0 = 0\) in the top row and \(H_0: \beta_1 = 0\) in the bottom row.

Note that the t-value for the slope (which we call \(T\) in the discussion above) is 15.28. You can check that this is equal to the estimate divided by the standard error (0.46656/0.03054 = 15.277014).

To test the null hypothesis that \(H_0: \beta_1=0\) against the alternative \(H_1:\beta_1 \neq 0\), the test statistic is 15.28 and the associated \(p\)-value is \(<2\times10^{-16}\). This is a very small \(p\)-value and therefore the data provide strong evidence against the null hypothesis. Based on these results, we can conclude that birthweight is associated with length of pregnancy.
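
We can reproduce these values directly from the fitted model. The following check extracts the estimate and standard error from the summary table (df.residual() returns the residual degrees of freedom, here 1172):

# Recompute the test statistic and p-value for the slope by hand
est <- coef(summary(model1))["Gestational.Days", "Estimate"]
se  <- coef(summary(model1))["Gestational.Days", "Std. Error"]
t_stat <- est / se                     # 15.28, as in the summary
2 * pt(abs(t_stat), df = df.residual(model1), lower.tail = FALSE)
# To test a different null, e.g. H0: beta_1 = 5, use (est - 5) / se instead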

12.6.3 Confidence intervals for the regression coefficients

We saw above that:

\[ \begin{align} \frac{\hat{\beta}_1 - \beta_1}{\hat{SE}(\hat{\beta}_1) } \sim t_{n-2} \end{align} \]

with \(\hat{SE}(\hat{\beta}_1) = \sqrt{\hat{\sigma}^2/SS_{xx}}\). This leads to the following 95% confidence interval for \(\beta_1\):

\[ \hat{\beta}_1 \pm t_{0.025, n-2} \hat{SE}(\hat{\beta}_1) \]

where \(t_{0.025, n-2}\) is the 97.5\(^{th}\) percentile of a \(t\)-distribution with \(n-2\) degrees of freedom. For large samples, this value will be approximately 1.96. For smaller samples it will be a slightly larger number (reflecting additional imprecision in our estimate).
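
In R, this percentile can be obtained with qt(); for our example, with \(n-2 = 1172\) degrees of freedom:

# Exact t multiplier for a 95% confidence interval
qt(0.975, df = 1172)   # approximately 1.962, very close to 1.96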

If \(n\) is sufficiently large, the t-distribution is well approximated by a normal distribution. In this case, a 95% confidence interval can be found by:

\[ \hat{\beta}_1 \pm 1.96 \times \hat{SE}(\hat{\beta}_1) \]

Example: Calculate a 95% confidence interval for \(\beta_1\) (using the values given in the R output above):


Solution:

\[\begin{split} \hat{\beta}_1 \pm 1.96 \times \hat{SE}(\hat{\beta}_1) \\ 0.46656 \pm 1.96 \times 0.03054 \end{split}\]

which gives a 95% CI from 0.407 to 0.526.
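
The same arithmetic can be done in R, using the rounded values from the summary output:

# Approximate 95% CI for the slope from the summary estimates
0.46656 + c(-1, 1) * 1.96 * 0.03054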

The data are consistent with an increase in birthweight of between about 0.41 oz and 0.53 oz per additional day of gestation.

Since 0 does not lie within the interval, we conclude that there is evidence of an association between birthweight and length of pregnancy (as also indicated by the results of the hypothesis test).

Alternatively, we can obtain confidence intervals using confint in R. The option parm tells R which regression coefficients to provide confidence intervals for. Try omitting this option or changing it to value 1 to see what happens.

# Confidence intervals for the slope, beta_1
confint(model1, parm=2, level=0.95)
                     2.5 %    97.5 %
Gestational.Days 0.4066435 0.5264702
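
For reference, the variations suggested above look like this: omitting parm returns intervals for all coefficients, while parm = 1 selects the intercept.

# Intervals for all coefficients (intercept and slope)
confint(model1)
# Interval for the intercept only
confint(model1, parm = 1)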