Count data and GLMs: choosing among Poisson, negative binomial, and zero-inflated models
Ecologists commonly collect data representing counts of organisms. Generalized linear models (GLMs) provide a powerful tool for analyzing count data.1 The starting point for count data is a GLM with Poisson-distributed errors, but not all count data meet the assumptions of the Poisson distribution. Thus, we need to test if the variance is greater than the mean or if the number of zeros is greater than expected. Below, we will walk through the basic steps to determine which GLM to use to analyze your data.
First, we will define a few of the variables that are used repeatedly throughout the subsequent code.2 We are using an unrealistic sample size for most ecological studies because we do not want to be misled by a strange draw from any of the distributions.
n = 100 # No. of samples per treatment mean.
mean.A = 10 # Mean count for treatment A mean.
mean.B = 5 # Mean count for treatment B
nd = data.frame(Trt = c("A","B")) # Used in predict function
Poisson data
We generate random variates from a Poisson distribution with the rpois
function. Because mean equals variance in a Poisson distribution, we only need to specify the number of random variates and the expected mean of the distribution.
data.pois = data.frame(Trt = c(rep("A", n), rep("B", n)),
Response = c(rpois(n, mean.A), rpois(n, mean.B)))
Poisson model
Now, we run the GLM and set the error distribution to Poisson.
model.pois = glm(Response ~ Trt, data = data.pois, family = poisson)
summary(model.pois)
##
## Call:
## glm(formula = Response ~ Trt, family = poisson, data = data.pois)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.1496 -0.9504 -0.0378 0.4824 3.2260
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.31451 0.03143 73.63 <2e-16 ***
## TrtB -0.71311 0.05481 -13.01 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 424.22 on 199 degrees of freedom
## Residual deviance: 244.04 on 198 degrees of freedom
## AIC: 991.54
##
## Number of Fisher Scoring iterations: 4
We test for goodness-of-fit of the model with a chi-square test based on the residual deviance and degrees of freedom.
1 - pchisq(summary(model.pois)$deviance, summary(model.pois)$df.residual)
## [1] 0.01432758
The GOF test indicates that the Poisson model fits the data (p > 0.05). If this were your actual data, you could breathe a sigh of relief because you could stop here. Well, not quite here. You will still want to use the model to predict mean counts for each treatment and standard errors for each parameter.
cbind(nd,
Mean = predict(model.pois, newdata=nd, type="response"),
SE = predict(model.pois, newdata=nd, type="response", se.fit = TRUE)$se.fit)
## Trt Mean SE
## 1 A 10.12 0.3181193
## 2 B 4.96 0.2227079
Because we used a large sample size, the predicted means are similar to the expected means of 10 and 5.
Negative binomial data
Next we will use the MASS
package to generate random deviates from a negative binomial distribution, which involves a parameter, theta
, that controls the variance of the distribution.
library(MASS)
data.nb = data.frame(Trt = c(rep("A", n), rep("B", n)),
Response=c(rnegbin(n, mean.A, 5), rnegbin(n, mean.B, 5)))
Poisson model
We first test if a Poisson model fits this data. Remember that we are trying to simulate the steps you would take if you did not know the properties of the distribution (beyond knowing that you have integers bound at zero and infinity).
model.pois.2 = glm(Response ~ Trt, data = data.nb, family = poisson)
summary(model.pois.2)
##
## Call:
## glm(formula = Response ~ Trt, family = poisson, data = data.nb)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -3.6917 -1.4054 -0.3620 0.8198 3.4612
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.31550 0.03142 73.7 <2e-16 ***
## TrtB -0.69019 0.05437 -12.7 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 681.61 on 199 degrees of freedom
## Residual deviance: 510.72 on 198 degrees of freedom
## AIC: 1239.4
##
## Number of Fisher Scoring iterations: 5
As expected, the Poisson model does not fit the data (p < 0.05).
1 - pchisq(summary(model.pois.2)$deviance, summary(model.pois.2)$df.residual)
## [1] 0
Nonetheless, let’s take a look at the predicted values.
cbind(nd,
Mean = predict(model.pois.2, newdata = nd, type = "response"),
SE = predict(model.pois.2, newdata = nd, type = "response", se.fit = T)$se.fit)
## Trt Mean SE
## 1 A 10.13 0.3182766
## 2 B 5.08 0.2253886
Make a note of the SEs in this output because we will come back to this after we run a GLM based on a negative binomial error distribution.
Negative binomial model
model.nb = glm.nb(Response ~ Trt, data = data.nb)
summary(model.nb)
##
## Call:
## glm.nb(formula = Response ~ Trt, data = data.nb, init.theta = 4.796214,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.6322 -0.8423 -0.2079 0.4929 2.0477
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.31550 0.05543 41.776 < 2e-16 ***
## TrtB -0.69019 0.08441 -8.176 2.93e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(4.7962) family taken to be 1)
##
## Null deviance: 278.72 on 199 degrees of freedom
## Residual deviance: 211.37 on 198 degrees of freedom
## AIC: 1119.6
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 4.796
## Std. Err.: 0.818
##
## 2 x log-likelihood: -1113.607
The model estimates the dispersion parameter at about the value that we set for theta
(i.e., 5) when generating random variates.
1 - pchisq(summary(model.nb)$deviance, summary(model.nb)$df.residual)
## [1] 0.2449219
The GOF test indicates that the negative binomial model fits the data.
cbind(nd,
Mean = predict(model.nb, newdata = nd, type = "response"),
SE = predict(model.nb, newdata = nd, type="response", se.fit = T)$se.fit)
## Trt Mean SE
## 1 A 10.13 0.5614748
## 2 B 5.08 0.3234282
Here you see the ‘danger’ of ignoring overdispersion in the Poisson model. The SE estimates are lower for the Poisson model than for the negative binomial model, which increases the likelihood of incorrectly detecting a significant treatment effect in the Poisson model.
Zero-inflated Poisson data
Lastly, we will add more more layer of complication to the story. If you have lots of zeros in your data, and have determined that Poisson and negative binomial models do not fit your data well, then you should turn to zero-inflated models with either Poisson or negative binomial error distributions. We can use the VGAM
package to generate random variates from a zero-inflated Poisson distribution using the rzipois
function. The 3rd argument to the rzipois
function specifies the probability of drawing a zero beyond the expected number of zeros for a Poisson distribution with the specified mean. Here were are introducing a relatively small proportion of extra zeros and the same proportion for each treatment.
library(VGAM)
## Loading required package: stats4
## Loading required package: splines
data.zip = data.frame(Trt = c(rep("A", n), rep("B", n)),
Response = c(rzipois(n, mean.A, 0.2), rzipois(n, mean.B, 0.2)) )
Poisson model
We first fit the Poisson model.
model.pois.3 = glm(Response ~ Trt, data = data.zip, family = poisson)
summary(model.pois.3)
##
## Call:
## glm(formula = Response ~ Trt, family = poisson, data = data.zip)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4.0817 -1.3532 0.2291 1.0565 4.9748
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.11986 0.03465 61.18 <2e-16 ***
## TrtB -0.79279 0.06207 -12.77 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 901.13 on 199 degrees of freedom
## Residual deviance: 724.96 on 198 degrees of freedom
## AIC: 1318.2
##
## Number of Fisher Scoring iterations: 5
The Poisson model does not fit the data.
1 - pchisq(summary(model.pois.3)$deviance, summary(model.pois.3)$df.residual )
## [1] 0
The Poisson model also does not predict the correct mean counts.
cbind(nd,
Mean = predict(model.pois.3, newdata = nd, type = "response"),
SE = predict(model.pois.3, newdata = nd, type = "response", se.fit = T)$se.fit)
## Trt Mean SE
## 1 A 8.33 0.2886174
## 2 B 3.77 0.1941647
Negative binomial model
Next, we fit a negative binomial model.
model.nb.2 = glm.nb(Response ~ Trt, data = data.zip)
summary(model.nb.2)
##
## Call:
## glm.nb(formula = Response ~ Trt, data = data.zip, init.theta = 1.66980196,
## link = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.44488 -0.63375 0.09263 0.46263 2.27218
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.11986 0.08479 25.002 < 2e-16 ***
## TrtB -0.79279 0.12582 -6.301 2.96e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for Negative Binomial(1.6698) family taken to be 1)
##
## Null deviance: 296.90 on 199 degrees of freedom
## Residual deviance: 257.44 on 198 degrees of freedom
## AIC: 1120.8
##
## Number of Fisher Scoring iterations: 1
##
##
## Theta: 1.670
## Std. Err.: 0.271
##
## 2 x log-likelihood: -1114.840
The negative binomial model does not fit the data.
1 - pchisq(summary(model.nb.2)$deviance, summary(model.nb.2)$df.residual)
## [1] 0.002851282
And does not predict the correct means.
cbind(nd,
Mean = predict(model.nb.2, newdata = nd, type = "response"),
SE = predict(model.nb.2, newdata = nd, type = "response", se.fit = T)$se.fit)
## Trt Mean SE
## 1 A 8.33 0.7062943
## 2 B 3.77 0.3504530
Zero-inflated Poisson models
We load the pscl
package to fit the zero-inflated model. First, we fit a model where we assume that the probability of zero is the same for both treatments (with ~ Trt|1
).
library(pscl)
## Classes and Methods for R developed in the
## Political Science Computational Laboratory
## Department of Political Science
## Stanford University
## Simon Jackman
## hurdle and zeroinfl functions by Achim Zeileis
model.zip = zeroinfl(Response ~ Trt|1, data = data.zip)
summary(model.zip)
##
## Call:
## zeroinfl(formula = Response ~ Trt | 1, data = data.zip)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5936 -0.7083 0.1979 0.7286 4.5984
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.31828 0.03465 66.9 <2e-16 ***
## TrtB -0.71176 0.06297 -11.3 <2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3136 0.1747 -7.519 5.51e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 5
## Log-likelihood: -486.8 on 3 Df
The model output indicates that there are significantly more zeros than expected for a Poisson distribution.
cbind(nd,
Count = predict(model.zip, newdata = nd, type = "count"),
Zero = predict(model.zip, newdata = nd, type = "zero"))
## Trt Count Zero
## 1 A 10.158215 0.2118872
## 2 B 4.985456 0.2118872
The zero-inflated model predicts the correct mean counts and probability of zero. If we fit a zero-inflated model to test a treatment effect for both the counts and the zeros (with ~ Trt|Trt
),
model.zip.2 = zeroinfl(Response ~ Trt|Trt, data = data.zip)
summary(model.zip.2)
##
## Call:
## zeroinfl(formula = Response ~ Trt | Trt, data = data.zip)
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.7162 -0.7376 0.1380 0.7561 4.5705
##
## Count model coefficients (poisson with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.31828 0.03465 66.90 <2e-16 ***
## TrtB -0.71033 0.06283 -11.31 <2e-16 ***
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.5166 0.2603 -5.825 5.7e-09 ***
## TrtB 0.3904 0.3513 1.111 0.266
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Number of iterations in BFGS optimization: 7
## Log-likelihood: -486.2 on 4 Df
we see that there are significantly more zeros than expected, but that the probability of zero is not significantly different between the two treatments.
cbind(nd,
Count = predict(model.zip.2, newdata = nd, type = "count"),
Zero = predict(model.zip.2, newdata = nd, type = "zero") )
## Trt Count Zero
## 1 A 10.158143 0.1799684
## 2 B 4.992544 0.2448738
Zero-inflated negative binomial model
We can test for overdispersion in the count part of the zero-inflated model by specifying a negative binomial distribution.
model.zip.3 = zeroinfl(Response ~ Trt|1, data = data.zip, dist = "negbin")
summary(model.zip.3)
##
## Call:
## zeroinfl(formula = Response ~ Trt | 1, data = data.zip, dist = "negbin")
##
## Pearson residuals:
## Min 1Q Median 3Q Max
## -1.5717 -0.7001 0.1943 0.7219 4.5528
##
## Count model coefficients (negbin with log link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.31826 0.03625 63.954 < 2e-16 ***
## TrtB -0.71274 0.06495 -10.975 < 2e-16 ***
## Log(theta) 4.68242 1.75749 2.664 0.00772 **
##
## Zero-inflation model coefficients (binomial with logit link):
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -1.3159 0.1751 -7.516 5.64e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Theta = 108.0314
## Number of iterations in BFGS optimization: 8
## Log-likelihood: -486.6 on 4 Df
The estimated theta parameter is not significant indicating that the zero-inflated Poisson model is appropriate.
As mentioned previously, you should generally not transform your data to fit a linear model and, particularly, do not log-transform count data.↩︎
Click here to get all code from this post in a single file.↩︎