3.2 Useful continuous distributions
Below are several useful probability distributions for data science in health. Some of the information below is a repeat of the Maths refresher, but we include some practical applications of each distribution.
3.2.1 The normal distribution
The normal distribution is defined with the following probability density function:

\[ f(x) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left(-\frac{(x-\mu)^2}{2\sigma^2}\right) \]

for values \(x\) in \((-\infty, +\infty)\). If we have a random variable \(X\) that is normally distributed, we can specify this using \(X {\sim} N(\mu, \sigma^2)\). The expected value is given by \(E[X]=\mu\) and the variance is given by \(Var[X] = \sigma^2\).
A standard normal distribution has a mean of 0 and a variance of 1. A standard normal random variable is usually represented by \(Z {\sim} N(0,1)\) and is sometimes called the Z-score.
Much of statistics relies on the normal distribution, so it is an important distribution to be familiar with. We will see that the normal distribution has an important role to play in statistical inference. It is also sometimes a good distribution for directly modelling continuous variables, for example blood pressure.
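As a minimal sketch of working with the normal distribution in practice (the blood pressure parameters below are made up for illustration), scipy.stats can be used to compute probabilities and Z-scores:

```python
from scipy import stats

# Hypothetical example: systolic blood pressure modelled as N(120, 15^2) mmHg
mu, sigma = 120, 15
sbp = stats.norm(loc=mu, scale=sigma)

# Probability that a randomly chosen person has SBP above 140 mmHg
print(f"P(SBP > 140) = {sbp.sf(140):.3f}")

# Standardise an observation to a Z-score
x = 135
print(f"Z-score for {x} mmHg: {(x - mu) / sigma:.2f}")
```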
3.2.2 The log-normal distribution
The log-normal distribution is essentially a transformed version of the normal distribution, and has its own probability density function:

\[ f(x) = \frac{1}{x\sigma\sqrt{2\pi}} \exp\left(-\frac{(\ln x - \mu)^2}{2\sigma^2}\right) \]

for values \(x\) in \((0,+\infty)\). If a random variable \(X\) is log-normally distributed, then \(Y=\ln(X)\) has a normal distribution, and if \(Y\) is normally distributed then \(X=\exp(Y)\) has a log-normal distribution. These simple transformations mean that working with log-transformed data is the standard approach. The parameters \(\mu\) and \(\sigma\) refer to the mean and standard deviation of \(\ln(X)\), i.e. on the log scale. Consequently, the median of a log-normally distributed sample is \(\exp(\mu)\).
Many biological datasets are log-normally distributed: most measurements (height, weight, speed) will be above 0, and will often be right-skewed. A good approach to take with these sorts of data is to log the data and work on the log scale. Any inference should then be converted back to the natural scale. Sometimes measurements are sufficiently far above 0 that their distribution becomes more symmetric. In this case, it may not be necessary to assume that they are log-normal, and assuming normality may be acceptable.
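A minimal sketch of this workflow, using simulated data with made-up log-scale parameter values:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulated right-skewed measurements with made-up log-scale parameters
mu, sigma = 1.5, 0.6
x = rng.lognormal(mean=mu, sigma=sigma, size=10_000)

# Standard approach: log the data and work on the log scale
log_x = np.log(x)
print(f"mean of log(x): {log_x.mean():.2f} (close to mu = {mu})")

# Converting back to the natural scale: exp(mu) estimates the median
print(f"exp(mean of log(x)): {np.exp(log_x.mean()):.2f}")
print(f"sample median:       {np.median(x):.2f}")
```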
3.2.3 The \(\chi^2\) distribution
The \(\chi^2\) distribution is here because we will use its properties later in hypothesis testing. It originates from random samples of the standard normal: the sum of squared standard normal random variables follows a \(\chi^2\) distribution, with the degrees of freedom given by the number of standard normals being summed. It is not necessary here to know the density function or how its parameters are estimated. A variable which follows the chi-squared distribution can only take non-negative values.
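This relationship to the standard normal can be checked directly by simulation; a minimal sketch:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# The sum of k squared standard normals follows a chi-squared
# distribution with k degrees of freedom
k = 5
z = rng.standard_normal(size=(100_000, k))
sum_sq = (z ** 2).sum(axis=1)

# Compare the simulated mean with the theoretical mean of chi2(k), which is k
print(f"simulated mean:   {sum_sq.mean():.2f}")
print(f"theoretical mean: {stats.chi2(df=k).mean():.2f}")
```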
3.2.4 The t-distribution
Student’s t-distribution arises when a sample mean is standardised using the sample standard deviation rather than the (unknown) population standard deviation. The t-distribution has a complex density function which we shall not state here.
For now we note that the t-distribution has an additional parameter of sorts, known as the degrees of freedom (d.f.). The density function is similar to that of the standard normal, but the t-distribution has heavier tails. If \(X\) follows a t-distribution with \(\nu\) degrees of freedom, we write

\[ X \sim t_{\nu}. \]
The expectation and variance of a variable \(X\) which follows a t-distribution with \(\nu\) degrees of freedom are given by:
\(E[X] = 0\) for \(\nu>1\); undefined otherwise

\(Var[X] = \frac{\nu}{\nu-2}\) if \(\nu>2\); \(\infty\) for \(1<\nu\le2\); undefined otherwise
As the number of degrees of freedom increases the t-distribution gets closer and closer to the standard normal distribution.
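A quick way to see both the heavier tails and this convergence is to compare upper-tail probabilities; a minimal sketch using scipy.stats:

```python
from scipy import stats

# Upper-tail probability P(X > 2) for t-distributions with increasing
# degrees of freedom, compared with the standard normal
for df in [2, 5, 30, 100]:
    print(f"t with {df:>3} d.f.: P(X > 2) = {stats.t(df).sf(2):.4f}")
print(f"standard normal: P(Z > 2) = {stats.norm().sf(2):.4f}")
```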
3.2.5 The F distribution
The F distribution does not have a simple mathematical formula, but it is used extensively to test the equality of the variances of two normal populations (think ANOVA), and is used in linear regression.
For two normal populations with variances \(\sigma_1^2\) and \(\sigma_2^2\), two random samples of sizes \(n_1\) and \(n_2\) with corresponding sample variances \(s_1^2\) and \(s_2^2\) give the variable

\[ F = \frac{s_1^2/\sigma_1^2}{s_2^2/\sigma_2^2} \]

which follows an F distribution with \(n_1-1\) and \(n_2-1\) degrees of freedom.
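A minimal simulation sketch (with arbitrary illustrative values, and equal population variances so the \(\sigma^2\) terms cancel):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Two normal populations with equal variances, so s1^2/s2^2 should follow
# an F distribution with n1-1 and n2-1 degrees of freedom
n1, n2 = 10, 15
s1 = rng.normal(0, 2, size=(50_000, n1)).var(axis=1, ddof=1)
s2 = rng.normal(0, 2, size=(50_000, n2)).var(axis=1, ddof=1)
ratios = s1 / s2

# The simulated 95th percentile should match the F(n1-1, n2-1) quantile
print(f"simulated 95th percentile:     {np.percentile(ratios, 95):.2f}")
print(f"F({n1-1}, {n2-1}) 0.95 quantile: {stats.f(n1 - 1, n2 - 1).ppf(0.95):.2f}")
```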
3.2.6 The exponential distribution
The exponential distribution is defined with the probability density function:

\[ f(x) = \lambda e^{-\lambda x} \]

with parameter \(\lambda\), which is usually described as the rate. The support of the distribution is \([0,\infty)\), which means values of \(x\) are always non-negative.
The expected value is given by \(E[X]=\frac{1}{\lambda}\) and variance \(Var[X]=\frac{1}{\lambda^2}\).
The exponential distribution is really useful in statistics because it describes the time until an event occurs, provided the event happens at a roughly constant rate over time. Health-related examples include injuries, births and deaths (although in reality these do not all occur at a constant rate). The exponential distribution is important in methods such as survival analysis.
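As a minimal sketch (the rate below is a made-up value), note that scipy.stats parameterises the exponential by its scale, which is \(1/\lambda\):

```python
from scipy import stats

# Hypothetical example: events occurring at a constant rate of 2 per year
lam = 2.0
waiting_time = stats.expon(scale=1 / lam)  # scipy uses scale = 1/lambda

print(f"E[X]   = {waiting_time.mean():.2f} years (= 1/lambda)")
print(f"Var[X] = {waiting_time.var():.2f} (= 1/lambda^2)")

# Probability of waiting more than a year for the next event
print(f"P(X > 1) = {waiting_time.sf(1):.3f}")
```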
3.2.7 The uniform distribution
The uniform distribution is in some ways the simplest to conceptualise. A random variable that is uniformly distributed can have any value between the parameters \(a\) (min) and \(b\) (max) with equal probability:

\[ f(x) = \frac{1}{b-a} \quad \text{for } a \le x \le b \]
Outside of these limits, the probability density is 0. The expected value is \(E[X] = \frac{(a+b)}{2}\) and variance \(Var[X] = \frac{(b-a)^2}{12}\).
The uniform distribution is very commonly used when randomly allocating outcomes. An example in statistical modelling is stochastic infectious disease modelling: here several different events (transmission, death) may each have a corresponding probability, and one event needs to be selected from the options. A uniform distribution (where the maximum is the total probability of all events) is used to select which event occurs, as sketched below.
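A minimal sketch of this event-selection step, with made-up event rates:

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical event rates in a stochastic infectious disease model
rates = {"transmission": 0.3, "death": 0.1}
total = sum(rates.values())

# Draw from Uniform(0, total) and pick the event whose cumulative
# rate interval contains the draw
u = rng.uniform(0, total)
cumulative = 0.0
for event, rate in rates.items():
    cumulative += rate
    if u <= cumulative:
        print(f"selected event: {event}")
        break
```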