15.2 Data used in our examples

We will use a simulated dataset that represents electronic health records for 200,000 patients. The outcome we consider is whether or not a patient is diagnosed with dementia. This example has an additional complexity: patients were followed up for different lengths of time, and a longer follow-up naturally leads to a higher probability of being diagnosed with dementia. In later modules, we will encounter survival analysis, which allows follow-up time to be accounted for. For now, we will ignore this aspect.
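To see why follow-up length matters, here is a minimal sketch. It assumes, purely for illustration, a constant yearly rate of diagnosis (`rate`) and a few follow-up lengths (`followup`). Neither value comes from the dataset, and real dementia risk is not constant over time. Under that assumption, the probability of being diagnosed within t years is 1 - exp(-rate * t).

```r
# Hypothetical values chosen for illustration only (not taken from the dataset)
rate <- 0.01                   # assumed constant yearly rate of diagnosis
followup <- c(1, 5, 10, 20)    # assumed follow-up lengths, in years

# Probability of a diagnosis within each follow-up period under a constant rate
prob_diagnosed <- 1 - exp(-rate * followup)

# Show follow-up length next to the corresponding probability
data.frame(followup, prob_diagnosed = round(prob_diagnosed, 3))
```

The probability rises with follow-up length even though the underlying rate is the same for everyone, which is why comparing patients with different follow-up times needs care.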

The code below reads in the dataset and displays the first few rows.

# we load the dataset and display its first lines
dementia <- read.csv("Practicals/Datasets/Dementia/dementia2.csv")
head(dementia)

id	prac	pr_lcd	sex	age	bmi	bmi_category	consultations	agegp	alcohol	...	timetodementia	end_date	dob	rsample	vitd	lp
23189	142	08dec2009	1	53	20.4	Normal (18.5-<25)	12	50	3-6 units/day	...	NA	08dec2009	01nov1941	1	NA	-0.8153054
92186	132	03feb2003	0	73	21.5	Normal (18.5-<25)	4	70	<2 units/day	...	NA	03feb2003	16jan1928	1	NA	-1.2268275
187963	43	06jul2001	0	40	27.1	Overweight (25-<30)	0	40	<2 units/day	...	NA	06jul2001	18jun1961	1	NA	-0.6602434
148379	215	08mar2012	1	40	20.9	Normal (18.5-<25)	3	40	<2 units/day	...	NA	08mar2012	10feb1952	1	23.22692	-0.9507329
44194	225	02feb2011	1	92	32.5	Obese class I (30-<35)	10	90	Non drinker	...	NA	02feb2011	09dec1912	1	NA	1.0403746
169915	175	02nov2011	1	55	26.3	Overweight (25-<30)	3	55	3-6 units/day	...	NA	02nov2011	06oct1946	1	NA	-0.1080445
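The date columns in this output (`pr_lcd`, `end_date`, `dob`) appear as strings such as `08dec2009`, so `read.csv` will not treat them as dates. The sketch below shows one possible way to convert them. It uses a toy data frame (`first_rows`, a hypothetical name) that copies three columns from the first three rows displayed above, and it assumes an English locale so that `%b` recognises the month abbreviations.

```r
# Toy data frame copying three columns from the first rows displayed above
first_rows <- data.frame(
  id = c(23189, 92186, 187963),
  end_date = c("08dec2009", "03feb2003", "06jul2001"),
  dob = c("01nov1941", "16jan1928", "18jun1961"),
  stringsAsFactors = FALSE
)

# Convert day-month-year strings to Date objects
# (%b matches abbreviated month names in the current locale; English assumed)
first_rows$end_date <- as.Date(first_rows$end_date, format = "%d%b%Y")
first_rows$dob <- as.Date(first_rows$dob, format = "%d%b%Y")

# Check the resulting column types
str(first_rows)
```

The same conversion could be applied to the corresponding columns of `dementia` if dates are needed in later analyses.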

Statistics for Health Data Science

15.2.1 Exploratory analyses