16.7 Poisson Regression Exampleยถ

16.7.1 Example data setยถ

For the purpose of illustration, we will simulate some data and pretend it comes from a clinical trial. We generate 100 participants ( ๐‘› ) and three variables. The first is a count variable representing the number of hospital admissions (counts) a participant has had in a year and it is created from a Poisson distribution with ๐‘™๐‘Ž๐‘š๐‘๐‘‘๐‘Ž=2 . The second is a categorical variable (country) with 4 groups representing the country a participant lives in (England, Northern Ireland, Scotland, Wales) and the last is a binary variable (treatment) representing which treatment arm the participant was randomised to. Letโ€™s start with simulating the data and looking at some descriptive statistics.

## Simulate Data
set.seed(42)
n<-100
lambda<-6
counts <- rpois(n, lambda)
country <-  factor(sample(1:4, n, replace=T), levels=1:4, labels=c("England","Northern Ireland","Scotland","Wales"))
treatment <- factor(gl(2,n/2), levels=1:2, labels=c("Active Arm", "Placebo Arm"))
df <- data.frame(treatment, country, counts)

Assume we wish to model counts using a GLM with treatment and country as predictors. We already know the admissions count variable follows a Poisson distribution as we have simulated the data directly from the distribution without adding noise, therefore we know a Poisson regression is suitable. To fit the model, we call the glm() function with the family set to โ€œpoissonโ€ and use the summary command to look at the output.

Note: We have used the option family=poisson. We could be more explicit and state the link function we want R to use by replacing this with family=poisson(link=log). Try re-running the command using family=poisson(link=identity). What is this doing? Is this a sensible/useful model?

set.seed(42)
summary(m1 <- glm(counts ~ treatment + country, family=poisson, data=df))
Call:
glm(formula = counts ~ treatment + country, family = poisson, 
    data = df)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-3.2467  -0.5783   0.0477   0.6381   2.2260  

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)    
(Intercept)              1.82733    0.08448  21.631  < 2e-16 ***
treatmentPlacebo Arm    -0.23187    0.08226  -2.819  0.00482 ** 
countryNorthern Ireland  0.17001    0.10822   1.571  0.11620    
countryScotland          0.14936    0.12396   1.205  0.22822    
countryWales             0.06669    0.11640   0.573  0.56670    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 130.13  on 99  degrees of freedom
Residual deviance: 120.53  on 95  degrees of freedom
AIC: 483.69

Number of Fisher Scoring iterations: 5

16.7.2 Example GLM outputยถ

The first part of the output gives information on deviance residuals. We would expect to see the deviance residuals to be approximately normally distributed if the model is correctly specified. Here we can see the median is close to 0 (0.05) and there does not appear to be any skewness as Q1 (quartile 1 = -0.58) and Q3 (quartile 3 = 0.64) have a similar distance from the median and so are the minimum and maximum.

The second part of the output gives the Poisson regression coefficients for each variable with their standard errors, z values, p-values. We interpret Poisson regression coefficients as if there was a one unit change in the predictor variable (if a continuous variable otherwise change from the reference category to the category listed) the regression coefficient tells us the effect on the logs of the expected counts (admission counts in our example), given the other variables in the model are held constant. The coefficient for treatment is -0.23 which tells use the expected log admissions count for being randomised to the active arm compared to the placebo arm is -0.23. The expected log admissions count for the other countries compared to England are all positive.

We can also see the regression estimate when all the variables in the model are evaluated at zero (or categorical reference group) and this is called the constant and labelled โ€œ(Intercept)โ€. In our model this would represent the expected log admissions count for participants in the placebo arm who live in England.

The standard errors are given which are used to calculate the z-value which in turn is used to calculate the p value. The null hypothesis for each p value is that the corresponding regression coefficient is zero given the rest of the variables in the model. The z value here is just the ratio of the coefficient to the standard error for example treatment we can see the estimate/standard error equals the z value: -0.23187/0.08226=-2.819. The z value follows a normal distribution and is tested against a two-sided alternative hypothesis that the coefficient is not equal to zero. We can see for treatment the p value is 0.005 and if we set out alpha significant level at ๐›ผ=0.05 we would reject the null hypothesis and conclude the Poisson regression coefficient for treatment is statistically different from zero, given country is in the model.

Lastly, at the bottom of the output, we have information on the residual deviance which can be used to perform a goodness of fit test for the overall model.

16.7.3 Poisson Regression Goodness of Fit Exampleยถ

At the bottom of the output we see the null deviance and residual deviance from the model. The residual deviance is 120.53 on 95 degrees of freedom (df). There are 100 observations in our model and 5 estimates which gives us 95 df (100-1df for treatment- 3df for each country โ€“ 1df for the constant) . To calculate the p-value for the deviance goodness of fit test we simply calculate the probability to the right of the deviance value for the chi-squared distribution on 95 df

pchisq(m1$deviance, df=m1$df.residual, lower.tail=FALSE)
0.0395607905595946

The null hypothesis is that our model is correctly specified. Here we can see the p value is 0.0396 which is significant if we set our level of significant at 0.05. We therefore have strong evidence to reject the null hypothesis. This result is expected as when creating the simulated data we made no relationship between any of the variables in the model, so we would expect a poor fit.