16.8 Common Problems in Poisson Regression

16.8.1 Problems

Two problems commonly arise when applying Poisson regression to count data, and both stem from departures from the assumptions of the Poisson distribution. The first is overdispersion and the second is zero inflation.

16.8.2 Overdispersion

Overdispersion occurs when the variance is larger than the mean, which violates the Poisson assumption that the two are equal. There are two main ways to handle overdispersion. The first is to use a negative binomial distribution instead (not covered here). The second is to fit a GLM by quasi-likelihood, also called a Quasi-Poisson regression.
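One common way to check for overdispersion is to fit an ordinary Poisson regression and compare the sum of squared Pearson residuals with the residual degrees of freedom. The sketch below is illustrative only: the data are simulated (the variable names, sample size, and negative binomial generator are assumptions, not part of this chapter's data), and the ratio it computes is a rough diagnostic, not a formal test.

```r
# Simulated (hypothetical) data: counts drawn from a negative binomial,
# so the variance exceeds the mean by construction.
set.seed(1)
n  <- 500
x  <- rnorm(n)
mu <- exp(0.5 + 0.3 * x)
y  <- rnbinom(n, mu = mu, size = 1)  # variance = mu + mu^2 / size

# Fit an ordinary Poisson regression.
fit_pois <- glm(y ~ x, family = poisson)

# Pearson dispersion statistic: sum of squared Pearson residuals
# divided by the residual degrees of freedom.
dispersion <- sum(residuals(fit_pois, type = "pearson")^2) / df.residual(fit_pois)
dispersion
```

A value close to 1 is consistent with the Poisson assumption, while a value well above 1 suggests overdispersion.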

16.8.3 Quasi-Poisson regression

A Quasi-Poisson regression is often fitted to handle overdispersion. It uses the same mean function and variance function as Poisson regression but allows the dispersion parameter \(\phi\) to differ from 1, so that the variance is \(\phi\) times the mean. In Poisson regression \(\phi\) is fixed at 1, which makes the mean and variance equal; in Quasi-Poisson regression \(\phi\) is not fixed and is estimated from the data. Quasi-Poisson regression gives the same coefficient estimates as the Poisson regression model, but inference is adjusted for overdispersion through the standard errors, which are scaled by \(\sqrt{\phi}\). To run a Quasi-Poisson regression in R, we just tell the glm() function that the family is “quasipoisson”.
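As a minimal sketch of this, the code below fits both models to simulated data and compares them. The data, variable names, and seed are assumptions made for illustration; only glm() and its poisson and quasipoisson families come from base R.

```r
# Simulated (hypothetical) overdispersed counts.
set.seed(2)
n <- 500
x <- rnorm(n)
y <- rnbinom(n, mu = exp(0.5 + 0.3 * x), size = 1)

# Same model formula, two different families.
fit_pois  <- glm(y ~ x, family = poisson)
fit_quasi <- glm(y ~ x, family = quasipoisson)

# The coefficient estimates agree between the two fits.
coef(fit_pois)
coef(fit_quasi)

# Estimated dispersion parameter from the Quasi-Poisson fit.
phi_hat <- summary(fit_quasi)$dispersion
phi_hat

# Standard errors: the Quasi-Poisson ones are the Poisson ones
# multiplied by sqrt(phi_hat).
se_pois  <- summary(fit_pois)$coefficients[, "Std. Error"]
se_quasi <- summary(fit_quasi)$coefficients[, "Std. Error"]
se_quasi / se_pois
sqrt(phi_hat)
```

Because only the standard errors change, the Quasi-Poisson p-values and confidence intervals are wider than the Poisson ones whenever the estimated dispersion is above 1.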

16.8.4 Zero inflation

Zero inflation occurs when the data contain more zeros than the Poisson distribution would predict. For example, if you counted how many occasions people drank alcohol in a month but your sample included a large number of non-drinkers, you would expect many counts of 0. A Zero-Inflated Poisson (ZIP) distribution can be thought of as being generated by two processes: the first generates only zeros, and the second generates counts from a Poisson distribution (which may themselves be zero). The resulting probabilities are:

\(P[\mathbf{Y}=0] = \pi + (1-\pi)e^{- \lambda }\),

\(P[\mathbf{Y}=k] = (1-\pi)\frac{\lambda^{k}e^{-\lambda}}{k!}\),

where \(k\) is a positive integer, \(\lambda\) is the expected Poisson count and \(\pi\) is the probability of an extra zero. The mean of a ZIP is \((1-\pi)\lambda\) and the variance is \(\lambda (1-\pi) (1+\pi \lambda)\).
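The mean and variance formulas can be checked numerically. The sketch below defines a small helper, dzip(), which is hypothetical and not part of base R, using a zero probability of \(\pi + (1-\pi)e^{-\lambda}\) and the Poisson probabilities scaled by \(1-\pi\) for positive counts. The parameter values are arbitrary choices for illustration.

```r
# Hypothetical helper: ZIP probability mass function.
dzip <- function(k, lambda, pi_zero) {
  # Poisson part, scaled by the probability of not being an extra zero.
  p <- (1 - pi_zero) * dpois(k, lambda)
  # Add the extra-zero mass at k = 0.
  p[k == 0] <- p[k == 0] + pi_zero
  p
}

lambda  <- 3     # expected Poisson count (arbitrary)
pi_zero <- 0.25  # probability of an extra zero (arbitrary)
k <- 0:200       # truncation point; the tail mass beyond it is negligible here

p <- dzip(k, lambda, pi_zero)

# Moments computed directly from the probabilities.
total    <- sum(p)
zip_mean <- sum(k * p)
zip_var  <- sum(k^2 * p) - zip_mean^2
c(total = total, mean = zip_mean, variance = zip_var)

# Moments from the closed-form expressions.
c(mean_formula = (1 - pi_zero) * lambda,
  var_formula  = lambda * (1 - pi_zero) * (1 + pi_zero * lambda))
```

With these values the variance (3.9375) is larger than the mean (2.25), which shows that zero inflation is itself a source of overdispersion.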

Unfortunately, the glm() function cannot fit a ZIP regression. To run one, you will need to use the “pscl” package, which fits a GLM with a binomial logit link to predict the excess zeros and a GLM with a Poisson log link to model the rest of the distribution.
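A minimal sketch of a ZIP fit is shown below, using the zeroinfl() function from the pscl package. The data are simulated, and the variable names, sample size, and parameter values are assumptions for illustration. In the model formula, the part before the | describes the Poisson count model and the part after it describes the zero model; here the zero model has an intercept only. The code runs only if pscl is installed.

```r
if (requireNamespace("pscl", quietly = TRUE)) {
  # Simulated (hypothetical) data: 30% extra zeros mixed with Poisson counts.
  set.seed(3)
  n <- 1000
  x <- rnorm(n)
  extra_zero <- rbinom(n, size = 1, prob = 0.3)
  counts <- rpois(n, lambda = exp(1 + 0.4 * x))
  dat <- data.frame(y = ifelse(extra_zero == 1, 0, counts), x = x)

  # Count model: y ~ x (Poisson, log link).
  # Zero model: intercept only (binomial, logit link).
  fit_zip <- pscl::zeroinfl(y ~ x | 1, data = dat, dist = "poisson")
  print(summary(fit_zip))

  # Coefficients of the count model, on the log scale.
  print(coef(fit_zip, model = "count"))

  # Estimated probability of an extra zero, back-transformed from the logit scale.
  print(plogis(coef(fit_zip, model = "zero")))
}
```

The zero-model coefficients are reported on the logit scale, so plogis() is used to convert the intercept into a probability.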