7.4 Approximate confidence intervals for parameters estimated using large samples

You will encounter many different confidence intervals during your studies. Many of these rely on an asymptotic normal distribution, as described below.

There are many different confidence intervals and many approaches to calculating confidence intervals. We do not aim to give you a comprehensive list here. Below, we describe a few commonly used confidence intervals to give you a flavour. Please note: We do not expect you to memorise these formulae.

7.4.1 Normal-based confidence intervals

The Central Limit Theorem tells us that the sample mean of independent, identically distributed random variables with finite expectation and variance is approximately normally distributed when the sample size is large, with the approximation improving as the sample size tends to infinity.

In fact, the Central Limit Theorem means that most commonly encountered parameter estimators are approximately normally distributed when sample sizes are large. So we can follow a very similar approach to the one above to construct a confidence interval for any parameter whose estimator is approximately normal in large samples, giving an approximate 95% confidence interval of the form

\[ \mbox{Estimate} \pm 1.96 \times SE(\mbox{Estimator}) \]
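
The general recipe can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function name `normal_ci` and the example numbers (an estimate of 10 with standard error 2) are hypothetical. The multiplier is taken from the standard normal distribution, which gives approximately 1.96 for a 95% interval.

```python
from statistics import NormalDist


def normal_ci(estimate, se, level=0.95):
    """Return (lower, upper) for estimate +/- z * se.

    Hypothetical helper: assumes the estimator is approximately normal.
    """
    # z is the (1 - (1 - level) / 2) quantile of the standard normal,
    # roughly 1.96 when level = 0.95.
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return estimate - z * se, estimate + z * se


# Made-up inputs purely for illustration.
lower, upper = normal_ci(10.0, 2.0)
print(round(lower, 2), round(upper, 2))  # 6.08 13.92
```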

7.4.2 Proportions and rates

First, we need some notation.

Proportion

We are estimating a population proportion from a single observation from a binomial distribution.

Our observed data consist of one observation from \(X \sim binomial(n, \pi)\), with the realised (observed) value being \(X=k\).

Rate

We are estimating a population rate (per person-year), from the total number of events out of \(P\) person-years of observation.

Our observed data consist of one observation from \(X \sim Poisson(\lambda P)\). The realised (observed) value is \(X=d\).

Logarithm of rate

For the rate, we may wish to perform our calculations on the log scale. These confidence intervals are approximate; the approximation can work better following a transformation (e.g. the log). This is one example of that approach.

To do this, we need to define the log-rate, \(\nu = log(\lambda)\), where \(log\) denotes the natural logarithm.

Using this notation, we can write down the estimate of the parameter of interest, its standard error and an approximate 95% confidence interval. These are shown in the table below.

	Estimate of parameter	Standard Error	Approximate 95% Confidence Interval
Proportion	\(\hat{\pi} = \frac{k}{n}\)	\(\sqrt{\frac{\pi (1-\pi)}{n}}\)	\(\hat{\pi} \pm 1.96 \times \sqrt{\frac{\hat{\pi} (1-\hat{\pi})}{n}}\)
Rate	\(\hat{\lambda} = \frac{d}{P}\)	\(\frac{\lambda}{\sqrt{d}}\)	\(\hat{\lambda} \pm 1.96 \times \frac{\hat{\lambda}}{\sqrt{d}}\)
Log Rate	\(\hat{\nu} = log\left(\frac{d}{P}\right)\)	\(\sqrt{\frac{1}{\lambda P}}\)	\(\hat{\nu} \pm 1.96 \times \sqrt{\frac{1}{e^{\hat{\nu}} P}}\)
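
One way to see where the standard error of the log rate comes from is the delta method (a first-order Taylor approximation), sketched here under the Poisson assumption above. Since \(X \sim Poisson(\lambda P)\) has \(Var(X) = \lambda P\), the estimator \(\hat{\lambda} = X/P\) has

\[ Var(\hat{\lambda}) = \frac{Var(X)}{P^2} = \frac{\lambda}{P}. \]

Applying the approximation \(Var(g(\hat{\lambda})) \approx g'(\lambda)^2 \, Var(\hat{\lambda})\) with \(g(\lambda) = log(\lambda)\), so that \(g'(\lambda) = 1/\lambda\), gives

\[ Var(\hat{\nu}) \approx \frac{1}{\lambda^2} \times \frac{\lambda}{P} = \frac{1}{\lambda P}, \ \ SE(\hat{\nu}) \approx \sqrt{\frac{1}{\lambda P}}. \]

Replacing \(\lambda\) by \(\hat{\lambda} = d/P\) gives the estimated standard error \(\sqrt{1/d}\), which is the quantity used in the interval in the table.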

The three examples below use the formulae above to obtain approximate 95% confidence intervals for a proportion, a rate and a log rate.

Proportion

We want to estimate the population proportion of patients who experience a side effect from a particular drug.
In a clinical study of \(n=80\) patients given the drug, \(k=20\) experience a side effect.
Our estimate of the population proportion experiencing a side effect is \(\hat{\pi} = 20/80 = 0.25\).
Our 95% confidence interval for this proportion is:

\[ 0.25 \pm 1.96 \times \sqrt{\frac{0.25 (1-0.25)}{80}} \]

This gives a range of 0.155 to 0.345.
So our estimate of the proportion of patients who experience a side effect is: 0.25 (95% CI 0.155 to 0.345). Our best guess is that 25% of patients experience a side effect from this drug. We are 95% confident that the true proportion lies between 15.5% and 34.5%.
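
The arithmetic in this example can be checked with a short Python sketch. The function name `proportion_ci` is hypothetical; it simply applies the proportion row of the table, with the multiplier 1.96 as a default argument.

```python
import math


def proportion_ci(k, n, z=1.96):
    """Approximate CI for a proportion from k events in n trials.

    Hypothetical helper implementing p_hat +/- z * sqrt(p_hat (1 - p_hat) / n).
    """
    p_hat = k / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)  # estimated standard error
    return p_hat, p_hat - z * se, p_hat + z * se


# Side-effect example: 20 of 80 patients.
p_hat, lower, upper = proportion_ci(20, 80)
print(f"{p_hat:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```

Running this is a quick way to confirm the rounded limits quoted in the text.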

Rate

We want to estimate the rate of panic attacks among adults with a mild anxiety disorder.
Suppose we observe \(80\) patients with the disorder for 1 year, so \(P=80\). In total, these patients experience \(d=2\) panic attacks during the year.
Our estimate of the annual rate of panic attacks per person is \(\hat{\lambda} = 2/80 = 0.025\).
Our 95% confidence interval for the rate is:

\[ 0.025 \pm 1.96 \times \frac{0.025}{\sqrt{2}} \]

This gives a range of -0.0096 to 0.0596. This illustrates an important point: approximate confidence intervals sometimes contain impossible parameter values (the rate \(\lambda\) cannot be negative). To resolve this problem, we will re-do the calculations on the log scale.
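
The same calculation as a Python sketch, using a hypothetical helper `rate_ci` that applies the rate row of the table. It reproduces the negative lower limit, which makes the problem easy to see.

```python
import math


def rate_ci(d, person_years, z=1.96):
    """Approximate CI for a rate from d events in person_years of follow-up.

    Hypothetical helper implementing rate +/- z * rate / sqrt(d).
    """
    rate = d / person_years
    se = rate / math.sqrt(d)  # estimated standard error on the original scale
    return rate, rate - z * se, rate + z * se


# Panic-attack example: 2 events in 80 person-years.
rate, lower, upper = rate_ci(2, 80)
print(f"{rate:.4f} (95% CI {lower:.4f} to {upper:.4f})")
print("lower limit is negative:", lower < 0)
```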

Logarithm of the rate

The logarithm of the observed rate is \(\hat{\nu} = log(0.025) = -3.689\). So this is our estimate of the log rate.
Our 95% confidence interval for the log-rate is:

\[ -3.689 \pm 1.96 \times \sqrt{\frac{1}{0.025 \times 80}} \]

This gives a range of -5.075 to -2.303. This is an interval within which we are confident the log-rate lies. To obtain an interval on the original scale, we take the exponential transformation of each of these values:

\[ (e^{-5.075}=0.006, e^{-2.303}=0.0999). \]

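So our estimated rate and 95% confidence interval are: 0.025 (95% CI 0.006 to 0.0999). We are 95% confident that the true rate of panic attacks per year per person lies between 0.006 and 0.0999. Unlike the interval calculated on the original scale, this interval contains only positive values.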
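
A Python sketch of the log-scale calculation follows. The helper name `log_rate_ci` is hypothetical. It builds the interval for the log rate and then exponentiates both limits, so the result on the original scale can never be negative.

```python
import math


def log_rate_ci(d, person_years, z=1.96):
    """Approximate CI for a rate, computed on the log scale.

    Hypothetical helper: forms log(rate) +/- z * sqrt(1 / d), then
    back-transforms the limits with exp().
    """
    log_rate = math.log(d / person_years)  # natural logarithm
    se = math.sqrt(1 / d)  # sqrt(1 / (rate * P)) with rate * P = d
    lower_log = log_rate - z * se
    upper_log = log_rate + z * se
    return math.exp(log_rate), math.exp(lower_log), math.exp(upper_log)


# Panic-attack example: 2 events in 80 person-years.
rate, lower, upper = log_rate_ci(2, 80)
print(f"{rate:.4f} (95% CI {lower:.4f} to {upper:.4f})")
```

Note that the back-transformed interval is not symmetric around the estimated rate, which is expected after a non-linear transformation.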

7.4.2 The mean

In this subsection we consider estimating a population mean. Our observed data comprise \(n\) independent observations, \(x_1, x_2, ..., x_n\). We consider two possibilities:

Data are normally distributed, \(X_i \sim Normal(\mu, \sigma^2),\) for \(i=1,..,n\)
Data are not normally distributed.

In each case, the population mean and variance (the square of the population standard deviation) are:

\[ E[X] = \mu, \ \ Var(X) = \sigma^2 \]

The sample mean is \(\bar{x}\) and the sample standard deviation is \(s\). Our estimate of the population mean is just the sample mean: \(\hat{\mu} = \bar{x}\). And, as we have seen, the standard error is given by \(\frac{\sigma}{\sqrt{n}}\).

There are various ways of constructing a 95% confidence interval, depending on the situation. These are shown in the table below.

	Approximate 95% Confidence Interval
Small samples
- Normal distribution, known \(\sigma\)	\(\hat{\mu} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}\)
- Normal distribution, unknown \(\sigma\)	\(\hat{\mu} \pm t_{n-1} \times \frac{s}{\sqrt{n}}\)
Large samples
- Normal or not, known \(\sigma\)	\(\hat{\mu} \pm 1.96 \times \frac{\sigma}{\sqrt{n}}\)
- Normal or not, unknown \(\sigma\)	\(\hat{\mu} \pm 1.96 \times \frac{s}{\sqrt{n}}\)

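where \(t_{n-1}\) is the 97.5th percentile of the t-distribution with \(n-1\) degrees of freedom. The t-distribution is similar to the standard normal distribution but has heavier tails, so this number is larger than \(1.96\) for small \(n\) (for example, about 2.26 when \(n=10\)) and becomes very close to \(1.96\) for large samples.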
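
The intervals in the table differ only in the multiplier and in whether \(\sigma\) or \(s\) is used. A minimal Python sketch for the unknown \(\sigma\) case is given below. The function name `mean_ci` and the ten data values are made up for illustration, and the multiplier is passed in by hand (2.262 is the 97.5th percentile of the t-distribution with 9 degrees of freedom) rather than computed.

```python
import math
import statistics


def mean_ci(data, critical_value=1.96):
    """CI for a mean using the sample standard deviation.

    Hypothetical helper implementing x_bar +/- critical_value * s / sqrt(n).
    Pass a t-distribution percentile for small samples, 1.96 for large ones.
    """
    n = len(data)
    x_bar = statistics.mean(data)
    s = statistics.stdev(data)  # sample SD (divisor n - 1)
    se = s / math.sqrt(n)
    return x_bar, x_bar - critical_value * se, x_bar + critical_value * se


# Hypothetical measurements, n = 10, so the t multiplier has 9 degrees of freedom.
readings = [118, 125, 130, 121, 127, 135, 122, 129, 124, 119]
x_bar, lower, upper = mean_ci(readings, critical_value=2.262)
print(f"{x_bar:.1f} (95% CI {lower:.2f} to {upper:.2f})")
```

Calling `mean_ci(readings)` with the default multiplier gives the narrower large-sample interval, which would understate the uncertainty for a sample this small.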

For small samples, where the data are not normally distributed, the confidence interval assuming data are normally distributed often has reasonable performance, but other methods (e.g. bootstrap confidence intervals) may be advisable.
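
As an illustration of the bootstrap idea mentioned above, here is a minimal sketch of one common variant, the percentile bootstrap. It is an assumption that this variant is the one intended; several others exist. The function name `bootstrap_mean_ci`, the number of resamples, the seed and the skewed data are all hypothetical choices.

```python
import random
import statistics


def bootstrap_mean_ci(data, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean (illustrative sketch only)."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(data)
    # Resample the data with replacement and record the mean each time.
    means = sorted(
        statistics.fmean(rng.choices(data, k=n)) for _ in range(n_resamples)
    )
    alpha = 1 - level
    # Take the empirical alpha/2 and 1 - alpha/2 quantiles of the means.
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical right-skewed sample.
skewed = [1, 2, 2, 3, 3, 4, 5, 8, 13, 21]
lower, upper = bootstrap_mean_ci(skewed)
print(f"mean {statistics.fmean(skewed):.1f}, bootstrap 95% CI {lower:.1f} to {upper:.1f}")
```

With a sample this small, bootstrap intervals can themselves be unreliable, so this is best read as a demonstration of the mechanics.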

7.4.3 Comparing two groups

There are many ways of comparing outcomes between two groups. Two popular options are the difference in proportions for binary outcomes and the difference in means for continuous outcomes. Confidence intervals for other measures (e.g. the risk ratio, the odds ratio, the difference in medians, etc.) can also be obtained.

Difference in proportions

We are interested in estimating the population difference in proportions from two observations, one from each of two binomial distributions. Suppose our observed data consist of one observation from \(X_1 \sim binomial(n_1, \pi_1)\) and one from \(X_2 \sim binomial(n_2, \pi_2)\), with the realised values being \(X_1=k_1\) and \(X_2 = k_2\). We want to estimate the difference \(\delta = \pi_1 - \pi_2\).

We estimate the proportion in the first group by \(\hat{\pi}_1 = \frac{k_1}{n_1}\). Similarly, we estimate the proportion in the second group by \(\hat{\pi}_2 = \frac{k_2}{n_2}\). Then our estimate of the difference in proportions is \(\hat{\delta} = \hat{\pi}_1 - \hat{\pi}_2\).

The standard error for the difference in proportions is:

\[ \sqrt{\frac{\pi_1 (1-\pi_1)}{n_1} + \frac{\pi_2 (1-\pi_2)}{n_2}} \]

And we can obtain an approximate 95% confidence interval as:

\[\hat{\delta} \pm 1.96 \times \sqrt{\frac{\hat{\pi}_1 (1-\hat{\pi}_1)}{n_1} + \frac{\hat{\pi}_2 (1-\hat{\pi}_2)}{n_2}}\]
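
A Python sketch of this interval is shown below. The function name `diff_proportions_ci` and the counts (20 of 80 in group 1, 10 of 80 in group 2) are hypothetical and chosen only to illustrate the formula.

```python
import math


def diff_proportions_ci(k1, n1, k2, n2, z=1.96):
    """Approximate CI for pi_1 - pi_2 from two binomial observations.

    Hypothetical helper: the standard error adds the two estimated variances.
    """
    p1, p2 = k1 / n1, k2 / n2
    delta_hat = p1 - p2
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return delta_hat, delta_hat - z * se, delta_hat + z * se


# Made-up counts for two groups of 80 patients each.
delta_hat, lower, upper = diff_proportions_ci(20, 80, 10, 80)
print(f"{delta_hat:.3f} (95% CI {lower:.3f} to {upper:.3f})")
```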

Difference in means

We are interested in the difference in population means between two groups, based on \(n_1\) independent observations from group 1 and \(n_2\) from group 2. Suppose our observed data are \(n_1\) observations drawn from \(X_{1i} \sim Normal(\mu_1, \sigma^2),\) for \(i=1,..,n_1\) and \(n_2\) observations drawn from \(X_{2j} \sim Normal(\mu_2, \sigma^2),\) for \(j=1,..,n_2\), so that both groups share a common standard deviation \(\sigma\). We want to estimate the difference \(\delta = \mu_1 - \mu_2\).

The sample means are \(\bar{x}_1\) and \(\bar{x}_2\) and the sample standard deviations are \(s_1\) and \(s_2\).

We can obtain a pooled estimate of the standard deviation, if we are willing to assume that the two population standard deviations are equal, as follows:

\[ s = \sqrt{\frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2 }{n_1 + n_2 - 2}} \]

Our estimate of the difference in population means is: \(\hat{\delta} = \hat{\mu}_1 - \hat{\mu}_2 = \bar{x}_1 - \bar{x}_2\). This has standard error:

\[ SE(\hat{\delta}) = \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}} \]

Various confidence intervals can be obtained, depending on the setting. These are shown in the table below.

	Approximate 95% Confidence Interval
Small samples
- Normal distribution, known \(\sigma\)	\(\hat{\delta} \pm 1.96 \times \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\)
- Normal distribution, unknown \(\sigma\)	\(\hat{\delta} \pm t_{n_1+n_2-2} \times s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\)
Large samples
- Normal or not, known \(\sigma\)	\(\hat{\delta} \pm 1.96 \times \sigma \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\)
- Normal or not, unknown \(\sigma\)	\(\hat{\delta} \pm 1.96 \times s \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}\)

where \(t_{n_1+n_2-2}\) is the 97.5th percentile of the t-distribution with \(n_1 + n_2 - 2\) degrees of freedom. This number is somewhat larger than 1.96 for smaller samples (for example, about 2.31 with 8 degrees of freedom) and approximately 1.96 for larger samples.

Modified intervals that do not assume equality of standard deviation in the two groups also exist.
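
The equal-standard-deviation interval for a difference in means can be sketched in Python as follows. The function names `pooled_sd` and `diff_means_ci` and the two small data sets are hypothetical. The multiplier 2.306 is the 97.5th percentile of the t-distribution with \(5 + 5 - 2 = 8\) degrees of freedom, supplied by hand rather than computed.

```python
import math
import statistics


def pooled_sd(group1, group2):
    """Pooled SD, weighting each sample variance by its degrees of freedom."""
    n1, n2 = len(group1), len(group2)
    v1 = statistics.variance(group1)  # s_1 squared (divisor n1 - 1)
    v2 = statistics.variance(group2)  # s_2 squared (divisor n2 - 1)
    return math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))


def diff_means_ci(group1, group2, critical_value=1.96):
    """CI for mu_1 - mu_2, assuming equal SDs in the two groups (sketch)."""
    n1, n2 = len(group1), len(group2)
    delta_hat = statistics.mean(group1) - statistics.mean(group2)
    se = pooled_sd(group1, group2) * math.sqrt(1 / n1 + 1 / n2)
    return delta_hat, delta_hat - critical_value * se, delta_hat + critical_value * se


# Made-up data: two groups of five observations each.
group1 = [12, 14, 16, 18, 20]
group2 = [10, 11, 12, 13, 14]
delta_hat, lower, upper = diff_means_ci(group1, group2, critical_value=2.306)
print(f"{delta_hat:.1f} (95% CI {lower:.2f} to {upper:.2f})")
```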

Statistics for Health Data Science
