3.5 Joint distributions and correlations¶
We are often interested not in the distribution of a single variable but in the relationship between two or more variables. This requires us to understand the concepts of joint distributions and correlation.
Returning to the BMI dataset, a high BMI indicates that an individual is overweight and is likely to have a high percentage of body fat. Individuals with a high BMI may also be at greater risk of health conditions such as heart disease, which may be signalled by high blood pressure.
If we wish to address questions relating to two or more variables, we need to understand their joint distribution.
3.5.1. Joint distributions¶
If we have two random variables \(X\) and \(Y\), the joint cumulative distribution function (CDF) is

\[
F(x,y) = P(X \leq x, Y \leq y),
\]

regardless of whether \(X\) and \(Y\) are continuous or discrete. For continuous random variables the joint density function \(f(x,y)\) is non-negative and satisfies

\[
\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} f(x,y)\,dx\,dy = 1.
\]
3.5.2 Marginal distributions¶
We might sometimes want to think about the marginal density of, say, \(X\). This means we want the distribution of \(X\) irrespective of \(Y\), and consequently we will need to integrate over all possible values of \(Y\). The marginal CDF of \(X\), or \(F_X\), is

\[
F_X(x) = P(X \leq x) = F(x, \infty).
\]

From this, it follows that the density function of \(X\) alone, known as the marginal density of \(X\), is

\[
f_X(x) = \int_{-\infty}^{\infty} f(x,y)\,dy.
\]
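As a concrete sketch (not from the BMI data), consider the hypothetical joint density \(f(x,y) = x + y\) on the unit square. Integrating out \(y\) gives the marginal \(f_X(x) = x + \tfrac{1}{2}\), which we can check numerically in R:

```r
# hypothetical joint density f(x,y) = x + y on [0,1] x [0,1]
f <- function(x, y) x + y

# marginal density of X: integrate f(x, y) over all possible y
f_X <- function(x) integrate(function(y) f(x, y), lower = 0, upper = 1)$value

f_X(0.3)     # numerical marginal at x = 0.3
0.3 + 0.5    # analytic marginal x + 1/2 at x = 0.3

# sanity check: the joint density integrates to 1 over the unit square
integrate(function(x) sapply(x, f_X), lower = 0, upper = 1)$value
```

Both evaluations of the marginal at \(x = 0.3\) should agree (to numerical tolerance), and the double integral should return 1.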
Note that this is different to assuming that \(X\) is independent of \(Y\).
So what does this mean in practical terms? Returning to the BMI data, we can report that the average BMI (\(\mu_X\)) is 26.46 and the average body fat percentage (\(\mu_Y\)) is 35.31. If BMI and body fat were independent variables, knowing BMI would tell us nothing about body fat and vice versa. But plotting the data (and some common sense) tells us that this is not the case; if we know one, we can say quite a lot about the other. We could explore the correlation between the variables (more about this later), but we can also describe them together using a joint distribution. By defining them using a joint distribution we are saying nothing about cause and effect, just that they are dependent variables.
options(repr.plot.width=4, repr.plot.height=3)
library(ggplot2)

# BMI dataset
dat <- read.csv("Practicals/Datasets/BMI/MindsetMatters.csv")
head(dat)

# remove observations with no BMI data
dat <- dat[!is.na(dat$BMI),]

# scatter plot of BMI and body fat
ggplot(dat, aes(x=BMI, y=Fat)) + geom_point()

# report the mean of each variable
# note that some values of Fat are missing...we need na.rm=TRUE otherwise the estimate will be NA
mux <- mean(dat$BMI)
print(paste0("value of mu_x is ", round(mux, 2)))
muy <- mean(dat$Fat, na.rm=TRUE)
print(paste0("value of mu_y is ", round(muy, 2)))
| Cond | Age | Wt | Wt2 | BMI | BMI2 | Fat | Fat2 | WHR | WHR2 | Syst | Syst2 | Diast | Diast2 |
|------|-----|-----|-------|------|------|------|------|------|------|------|-------|-------|--------|
| 0 | 43 | 137 | 137.4 | 25.1 | 25.1 | 31.9 | 32.8 | 0.79 | 0.79 | 124 | 118 | 70 | 73 |
| 0 | 42 | 150 | 147.0 | 29.3 | 28.7 | 35.5 | NA | 0.81 | 0.81 | 119 | 112 | 80 | 68 |
| 0 | 41 | 124 | 124.8 | 26.9 | 27.0 | 35.1 | NA | 0.84 | 0.84 | 108 | 107 | 59 | 65 |
| 0 | 40 | 173 | 171.4 | 32.8 | 32.4 | 41.9 | 42.4 | 1.00 | 1.00 | 116 | 126 | 71 | 79 |
| 0 | 33 | 163 | 160.2 | 37.9 | 37.2 | 41.7 | NA | 0.86 | 0.84 | 113 | 114 | 73 | 78 |
| 0 | 24 | 90 | 91.8 | 16.5 | 16.8 | NA | NA | 0.73 | 0.73 | NA | NA | 78 | 76 |
So this joint distribution has a joint CDF, \(F(x,y)\), and a continuous joint density function \(f(x,y)\). The joint mean is the pair \((\mu_X, \mu_Y)\). What about the variance? Here we need to consider the variance of each variable and the covariance between \(X\) and \(Y\).
# covariance between variables
dat2 <- dat[!is.na(dat$Fat),]   # keep only rows with an observed Fat value
round(cov(x=cbind(dat2$BMI, dat2$Fat)), 3)
paste0("variance of BMI = ", round(var(dat2$BMI), 3))
paste0("variance of fat = ", round(var(dat2$Fat), 3))
|         | BMI    | Fat    |
|---------|--------|--------|
| **BMI** | 15.850 | 20.696 |
| **Fat** | 20.696 | 36.282 |
The covariance matrix is returned. The diagonals give the variance of each variable, and the off-diagonals the covariance, which here is positive, indicating a positive association.
3.5.3 Correlation¶
Correlation and covariance are closely related. Pearson's correlation coefficient is defined as

\[
\rho_{XY} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y},
\]

which rescales the covariance to lie between \(-1\) and \(1\).
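We can verify this relationship in R using simulated data (the variables below are illustrative, not from the BMI dataset):

```r
set.seed(1)
x <- rnorm(100)
y <- 0.6 * x + rnorm(100)           # y is positively associated with x

# Pearson's correlation is the covariance rescaled by the two standard deviations
rho_manual  <- cov(x, y) / (sd(x) * sd(y))
rho_builtin <- cor(x, y)
all.equal(rho_manual, rho_builtin)  # the two agree
```

The hand-computed ratio and R's built-in `cor()` give the same value.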
So this helps us predict BMI from body fat and vice versa. Examples of when this might be useful include:

- Imputing missing data
- Summarising many variables with one metric (more about this in the Machine learning module)
- Efficient sampling of distributions, as used in Markov chain Monte Carlo (MCMC) estimation
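As an illustration of the first of these uses, here is a very simple regression-based imputation on simulated data: predict the missing values of one variable from the other. This is only a sketch of the idea; in practice more principled approaches such as multiple imputation are preferred.

```r
set.seed(2)
# simulated stand-ins for two correlated measurements
bmi <- rnorm(50, mean = 26, sd = 4)
fat <- 10 + 0.9 * bmi + rnorm(50, sd = 3)
fat[sample(50, 5)] <- NA              # introduce some missing values

# fit a regression on the complete cases (lm drops rows with NA by default)...
fit <- lm(fat ~ bmi)

# ...and fill in the missing fat values from the fitted line
miss <- is.na(fat)
fat[miss] <- predict(fit, newdata = data.frame(bmi = bmi[miss]))
sum(is.na(fat))                       # no missing values remain
```

Because the variables are strongly correlated, the fitted line gives plausible fill-in values; with independent variables this approach would add nothing.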
3.5.4 Connections to regression modelling¶
Later sessions exploring regression modelling will provide a powerful and flexible approach to exploring and quantifying dependencies between variables.