15.9 Model diagnostics¶
This material is not examinable and is provided for your information.
Many model diagnostics are available for logistic regression models. We touch on a few very briefly here.
15.9.1 Goodness-of-fit¶
Deviance¶
The deviance of a model \(M\) is a measure of the goodness-of-fit of the model. It is defined as

\[D_M = 2(l_S - l_M),\]

where \(l_M\) is the log-likelihood of model \(M\) and \(l_S\) is the log-likelihood of the saturated model (one which uses the maximum possible number of parameters without redundancies; this is the model with the best possible fit).
In general, higher values of deviance indicate worse model fit to the data. Two deviance statistics are often produced in output following logistic regression:
Null deviance: the deviance computed for the null model, i.e. the minimal model containing only an intercept.
Residual deviance: the deviance computed for the model that has just been estimated.
Note
When computing deviances of different models for the same dataset, the log-likelihood of the saturated model \(l_S\) is constant. Therefore, statistical software (including the output from glm) often reports the deviance in the simplified form \(-2 l_M\).
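As an illustration, here is a minimal Python sketch (with made-up outcomes and fitted probabilities, not data from this course) computing the null and residual deviances for ungrouped binary data. For 0/1 outcomes the saturated model fits each observation exactly, so \(l_S = 0\) and the deviance reduces to \(-2 l_M\):

```python
import math

def log_likelihood(y, p):
    """Bernoulli log-likelihood of outcomes y given predicted probabilities p."""
    return sum(yi * math.log(pi) + (1 - yi) * math.log(1 - pi)
               for yi, pi in zip(y, p))

def deviance(y, p):
    """For ungrouped binary data the saturated log-likelihood is 0, so D = -2 l_M."""
    return -2 * log_likelihood(y, p)

# Toy data: observed outcomes and probabilities from some hypothetical fitted model
y = [1, 0, 1, 1, 0, 1]
p_fitted = [0.9, 0.2, 0.7, 0.8, 0.3, 0.6]

# Null model: every fitted probability equals the overall success rate
p_null = [sum(y) / len(y)] * len(y)

print("Null deviance:    ", deviance(y, p_null))
print("Residual deviance:", deviance(y, p_fitted))  # smaller = better fit
```

The residual deviance is smaller than the null deviance whenever the covariates improve the fit over the intercept-only model, mirroring the two deviance statistics reported by glm.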
Akaike information criterion¶
The Akaike information criterion (AIC) quantifies model fit as a function of the likelihood and the number of parameters being estimated. It is defined as \(AIC = 2k - 2 l(\hat{\beta})\), where \(k\) is the number of parameters in the model and \(l(\hat{\beta})\) is the log-likelihood of the model computed at the estimated parameter values \(\hat{\beta}\).
The AIC is mainly used as a way to compare different models. The best model, on the scale of the AIC, is the one with the lowest AIC. (Note that some software, contrary to the glm function, computes the AIC as \(AIC = -2k + 2l(\hat{\beta})\), in which case the best model is the one with the highest AIC value.)

Since \(-2 l(\hat{\beta})\) is the deviance in the simplified form above, the AIC is the sum of the deviance and twice the number of parameters. By including the number of parameters, the AIC penalizes models that have too many parameters, thus avoiding the selection of overfitted models.
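The definition is simple enough to compute by hand; this short sketch (with hypothetical log-likelihood values, not taken from any real fit) shows how the parameter penalty can overturn a small gain in likelihood:

```python
def aic(log_lik, k):
    """Akaike information criterion: AIC = 2k - 2 l(beta_hat)."""
    return 2 * k - 2 * log_lik

# Hypothetical log-likelihoods for two nested models fit to the same data
aic_small = aic(log_lik=-40.0, k=2)   # intercept + one predictor
aic_large = aic(log_lik=-39.5, k=5)   # three extra predictors, tiny gain

print(aic_small)  # 84.0
print(aic_large)  # 89.0 -> the extra parameters are not worth the gain
```

Under the lowest-is-best convention, the smaller model would be preferred here despite its slightly worse log-likelihood.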
McFadden pseudo-\(R^2\)¶
For the linear regression model, the coefficient of determination (\(R^2\)) measures how much variability is explained by the model.
For the logistic regression model, several generalizations of the \(R^2\) measure have been proposed. Here, we will focus on McFadden's pseudo-\(R^2\), which is defined as follows:

\[R^2_{McF} = 1 - \frac{l_M}{l_0},\]

where \(l_M\) is the log-likelihood of the estimated model and \(l_0\) is the log-likelihood of the null model (containing an intercept only).
The rationale behind this measure is that when the estimated model explains the variability poorly, its log-likelihood is close to the null log-likelihood, so the ratio \(l_M / l_0\) is close to \(1\) and the McFadden pseudo-\(R^2\) is close to \(0\). Conversely, when the model explains the variability of the data well, its likelihood is close to \(1\), so \(l_M\) is close to \(0\) and the McFadden pseudo-\(R^2\) is close to \(1\). Note, however, that when applied to a classical linear regression model, the McFadden pseudo-\(R^2\) is not equivalent to the classical \(R^2\).
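The definition translates directly into code. A one-line sketch with hypothetical log-likelihood values (both negative, as log-likelihoods of discrete outcomes always are):

```python
def mcfadden_r2(l_model, l_null):
    """McFadden pseudo-R^2 = 1 - l_M / l_0, with l_M and l_0 both <= 0."""
    return 1 - l_model / l_null

# Hypothetical values: the fitted model halves the null log-likelihood
print(mcfadden_r2(l_model=-20.0, l_null=-40.0))  # 0.5
```

A model no better than the intercept-only model (\(l_M = l_0\)) gives \(0\); a model with likelihood approaching \(1\) (\(l_M \to 0\)) gives a value approaching \(1\).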
The Hosmer-Lemeshow test¶
The Hosmer-Lemeshow test is a classic approach to assessing the goodness-of-fit of a logistic regression model. The rationale of this test is to divide the vector of predicted probabilities \(\hat{\pi} = (\hat{\pi}_i)\), \(i=1,\dots,n\), into \(G\) groups of \(n_g\) subjects each, e.g. based on the quantiles of \(\hat{\pi}\). In each group, the mean of the predicted probabilities \(\bar{\pi}_g\) is compared to the observed proportion of successes. Formally, writing \(y_g\) for the number of successes in group \(g=1,\dots,G\), we have that
the observed values are
for Y = 1: \(y_g\)
for Y = 0: \(n_g - y_g\)
the predicted values are
for Y = 1: \(\bar{\pi}_gn_g\)
for Y = 0: \(n_g(1 - \bar{\pi}_g)\)
The Hosmer-Lemeshow test statistic is the chi-square statistic computed over all groups and both possible values of \(Y\),

\[H = \sum_{g=1}^{G} \left[ \frac{(y_g - n_g\bar{\pi}_g)^2}{n_g\bar{\pi}_g} + \frac{\big((n_g - y_g) - n_g(1-\bar{\pi}_g)\big)^2}{n_g(1-\bar{\pi}_g)} \right],\]

and has been shown to follow asymptotically a \(\chi^2\) distribution with \(G-2\) degrees of freedom under the null hypothesis of a correctly specified model. We stress, however, that this test is often criticized for several reasons. First, it is known to have low power. Second, its result can be sensitive to the choice of the number of groups \(G\), and this is even worse for small sample sizes.
The Hosmer-Lemeshow test statistic is not provided by glm, but it is available in the ResourceSelection package through the hoslem.test function.
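To make the grouping and comparison concrete, here is a simplified Python sketch of the statistic (illustrative only, not a substitute for hoslem.test: groups are formed by splitting the sorted predicted probabilities into equal-sized blocks, and the \(\chi^2_{G-2}\) p-value computation is omitted). All data below are made up:

```python
def hosmer_lemeshow(y, p, G=10):
    """Hosmer-Lemeshow chi-square statistic with G groups formed from
    the sorted predicted probabilities (a rough quantile grouping)."""
    pairs = sorted(zip(p, y))                    # sort subjects by predicted prob
    n = len(pairs)
    stat = 0.0
    for g in range(G):
        group = pairs[g * n // G:(g + 1) * n // G]
        if not group:
            continue
        n_g = len(group)
        obs_1 = sum(yi for _, yi in group)       # observed successes y_g
        exp_1 = sum(pi for pi, _ in group)       # expected successes n_g * pi_bar_g
        obs_0, exp_0 = n_g - obs_1, n_g - exp_1  # same quantities for Y = 0
        stat += (obs_1 - exp_1) ** 2 / exp_1 + (obs_0 - exp_0) ** 2 / exp_0
    return stat

# Toy example: outcomes roughly in line with the predicted probabilities
y = [0, 0, 1, 0, 1, 1, 1, 0, 1, 1]
p = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 0.95]
print(hosmer_lemeshow(y, p, G=2))  # small value -> no evidence of lack of fit
```

The returned value would be compared to a \(\chi^2\) distribution with \(G-2\) degrees of freedom, which is exactly what hoslem.test reports.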