15.2 Data used in our examples

We will use a simulated dataset designed to resemble electronic health records for 200,000 patients. The outcome we will consider is whether or not a patient is diagnosed with dementia. This example has an additional complexity: patients were followed up for different lengths of time, and a longer follow-up naturally leads to a higher probability of being diagnosed with dementia. In later modules, we will encounter survival analysis, which allows follow-up time to be accounted for. For now, we will ignore this aspect.
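
To see why follow-up length matters, the short sketch below computes the probability of having been diagnosed by the end of follow-up for several follow-up lengths. It assumes a constant diagnosis rate; both that assumption and the value of `rate` are hypothetical and chosen only for illustration, not estimated from the dataset.

```r
# Hypothetical constant rate of dementia diagnosis per person-year
# (an assumption for illustration only, not estimated from the data).
rate <- 0.002

# A few follow-up lengths, in years (also illustrative).
follow_up_years <- c(1, 5, 10, 20)

# Under a constant rate, P(diagnosed by time t) = 1 - exp(-rate * t).
prob_diagnosed <- 1 - exp(-rate * follow_up_years)

data.frame(follow_up_years, prob_diagnosed = round(prob_diagnosed, 4))
```

Under this simplified assumption the probability increases steadily with follow-up time, which is the pattern the binary outcome used in this session ignores.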

The code below reads in the dataset and displays the first few rows.

# we load the dataset and display its first lines
dementia <- read.csv("Practicals/Datasets/Dementia/dementia2.csv")
head(dementia)
id      prac  pr_lcd     sex  age  bmi   bmi_category            consultations  agegp  alcohol        ...  mortality  date_death  timetodementia  dementia  date_dementia  end_date   dob        rsample  vitd      lp
23189   142   08dec2009  1    53   20.4  Normal (18.5-<25)       12             50     3-6 units/day  ...  0                      NA              0                        08dec2009  01nov1941  1        NA        -0.8153054
92186   132   03feb2003  0    73   21.5  Normal (18.5-<25)       4              70     <2 units/day   ...  0                      NA              0                        03feb2003  16jan1928  1        NA        -1.2268275
187963  43    06jul2001  0    40   27.1  Overweight (25-<30)     0              40     <2 units/day   ...  0                      NA              0                        06jul2001  18jun1961  1        NA        -0.6602434
148379  215   08mar2012  1    40   20.9  Normal (18.5-<25)       3              40     <2 units/day   ...  0                      NA              0                        08mar2012  10feb1952  1        23.22692  -0.9507329
44194   225   02feb2011  1    92   32.5  Obese class I (30-<35)  10             90     Non drinker    ...  0                      NA              0                        02feb2011  09dec1912  1        NA        1.0403746
169915  175   02nov2011  1    55   26.3  Overweight (25-<30)     3              55     3-6 units/day  ...  0                      NA              0                        02nov2011  06oct1946  1        NA        -0.1080445

15.2.1 Exploratory analyses

The variables we will use during this session are:

  • id: a variable that identifies a patient

  • sex: a binary variable that gives the sex of the patient (coded \(0\) for men, \(1\) for women)

  • age: age in years of the patient at study baseline

  • bmi: Body Mass Index of the patient at study baseline

  • dementia: an indicator variable that equals \(1\) if the patient is diagnosed with dementia during follow-up, \(0\) if not.
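
Before modelling, it can help to confirm that these variables are coded as described. The sketch below does this on a toy data frame built from the six rows printed by head(dementia) above, restricted to the five variables listed; the names `toy` and `sex_f` are introduced here for illustration only. The same checks could be run on the full `dementia` data frame.

```r
# Toy data frame: the six rows shown by head(dementia), keeping only the
# five variables used in this session.
toy <- data.frame(
  id       = c(23189, 92186, 187963, 148379, 44194, 169915),
  sex      = c(1, 0, 0, 1, 1, 1),
  age      = c(53, 73, 40, 40, 92, 55),
  bmi      = c(20.4, 21.5, 27.1, 20.9, 32.5, 26.3),
  dementia = c(0, 0, 0, 0, 0, 0)
)

# Check that the binary variables only take the values 0 and 1.
stopifnot(all(toy$sex %in% c(0, 1)), all(toy$dementia %in% c(0, 1)))

# Count missing values in each column.
colSums(is.na(toy))

# Optional labelled copy of sex, following the 0 = men, 1 = women coding.
toy$sex_f <- factor(toy$sex, levels = c(0, 1), labels = c("Men", "Women"))
table(toy$sex_f)

# Basic summaries of the two continuous covariates.
summary(toy[, c("age", "bmi")])
```

Keeping the original 0/1 column and adding a labelled copy is one possible design choice; it leaves the numeric coding available while making tables easier to read.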

In this session the outcome of interest is dementia diagnosis, which we will treat as a binary variable. We are interested in modelling the relationship between dementia diagnosis and age, sex and BMI. Generally, we would expect older people to have a higher risk of being diagnosed with dementia, and women typically have a higher risk than men. The relationship with BMI is less well understood.

The code below tabulates dementia and sex and draws box-plots of age and BMI, separately by dementia diagnosis status.

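# Cross-tabulate sex (rows) against dementia diagnosis (columns; 1 = diagnosed)
# The result is stored as sex_by_dementia to avoid masking the base function table()
sex_by_dementia <- table(dementia$sex, dementia$dementia)
sex_by_dementia
# Row proportions: within each sex, the share without and with a diagnosis
prop.table(sex_by_dementia, 1)

# Box plots of baseline age and baseline BMI by dementia diagnosis, side by side
par(mfrow=c(1,2))
options(repr.plot.height=4, repr.plot.width=5)
boxplot(dementia$age ~ dementia$dementia, main="Age", xlab="Dementia diagnosis", ylab="Baseline age (years)")
boxplot(dementia$bmi ~ dementia$dementia, main="BMI", xlab="Dementia diagnosis", ylab="Baseline BMI")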
   
         0      1
  0 107981   1707
  1  88132   2180
   
             0          1
  0 0.98443768 0.01556232
  1 0.97586146 0.02413854
[Figure: two box plots side by side, "Age" (baseline age in years) and "BMI" (baseline BMI), each split by dementia diagnosis (0 or 1)]

From the output above, we see that dementia is fairly rare in this study population: 3,887 of the 200,000 patients (about 1.9%) received a diagnosis during follow-up. The proportion is 1.6% among males and slightly higher, at 2.4%, among females.
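
These percentages can be recomputed directly from the printed counts, without needing the dataset itself. The sketch below rebuilds the 2 x 2 table from the output above; the object names `counts`, `row_props` and `overall` are introduced here for illustration.

```r
# Counts copied from the table output above: rows are sex (0, 1),
# columns are dementia diagnosis (0, 1). matrix() fills column by column.
counts <- matrix(
  c(107981, 88132, 1707, 2180),
  nrow = 2,
  dimnames = list(sex = c("0", "1"), dementia = c("0", "1"))
)

# Row proportions: within each sex, the share without and with a diagnosis.
row_props <- prop.table(counts, margin = 1)

# Percentage diagnosed within each sex, to one decimal place.
round(100 * row_props[, "1"], 1)

# Overall percentage diagnosed, ignoring sex.
overall <- sum(counts[, "1"]) / sum(counts)
round(100 * overall, 1)
```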

The box-plots show that patients who received a dementia diagnosis during follow-up generally had a much higher age at baseline, as expected. The second box-plot perhaps hints at a slightly lower BMI among those diagnosed with dementia, but the relationship is much less evident than for age.
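
A box plot draws the median and quartiles of each group, so the same comparison can be made numerically. The sketch below is one possible way to do this; `summarise_by_group` is a hypothetical helper, not part of the course materials, and the toy values are made up purely to show the shape of the output.

```r
# Hypothetical helper: minimum, quartiles and maximum of a numeric
# variable within each level of a grouping variable.
summarise_by_group <- function(x, group) {
  t(sapply(split(x, group), quantile,
           probs = c(0, 0.25, 0.5, 0.75, 1), na.rm = TRUE))
}

# Made-up toy values (not taken from the dataset).
toy_age      <- c(40, 53, 55, 73, 78, 85, 92)
toy_dementia <- c(0, 0, 0, 0, 1, 1, 1)

# One row per diagnosis group, one column per quantile.
age_summary <- summarise_by_group(toy_age, toy_dementia)
age_summary

# With the real data, the equivalent calls would be, for example:
# summarise_by_group(dementia$age, dementia$dementia)
# summarise_by_group(dementia$bmi, dementia$dementia)
```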