Poisson Regression
Poisson regression is a special case of the GLM, and its GLM components are as follows.
The Random Component
Each $Y_i$ is an independent realization from a Poisson distribution with rate parameter $\lambda_i$, and $y_i$ is the observed count.
$$ \begin{alignat*}{4} && Y_i &\overset{ind}{\sim} Poisson(\lambda_i) && \\ \end{alignat*} $$
Systematic Component
$$ \begin{alignat*}{4} && \eta_i &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} && \\ \end{alignat*} $$
The Link Function
In the context of the Poisson distribution, $E(Y)$ is the expected number of occurrences, and it is equal to $\lambda$ (i.e. $E(Y)=\lambda$). The canonical link function for Poisson regression is the natural log, $\ln(\lambda)$.
$$
\begin{alignat*}{4}
&& g(\lambda_i) &= \eta_i && \\
&& \ln(\lambda_i) &= \eta_i && \quad\text{plug in the canonical link function} \\
&& &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} && \\
&& \Rightarrow \lambda_i &= e^{\eta_i} && \\
&& &= \exp(\beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}) && \\
\end{alignat*}
$$
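As a concrete illustration of these three components, below is a minimal sketch (not part of the original notes) that fits a Poisson GLM with the canonical log link using Python's statsmodels; the data are simulated purely for illustration.

```python
# Minimal sketch: the three GLM components of Poisson regression in statsmodels.
# The data here are simulated purely for illustration.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                 # two hypothetical predictors
eta = 0.5 + 0.8 * X[:, 0] - 0.3 * X[:, 1]     # systematic component eta_i
y = rng.poisson(np.exp(eta))                  # random component: Y_i ~ Poisson(lambda_i)

# Link: log(lambda_i) = eta_i (the canonical log link is the default for Poisson)
result = sm.GLM(y, sm.add_constant(X), family=sm.families.Poisson()).fit()
print(result.params)    # estimated beta_0, beta_1, beta_2
print(result.mu[:5])    # fitted lambda_i = exp(eta_hat_i)
```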
Exposure and the Rate Parameter of Poisson Regression
The rate parameter $\lambda$ of a Poisson distribution is a count of events divided by some measure of exposure (a particular unit of observation, e.g. a time period or unit of area). In Poisson regression, exposure can be handled through an offset term. Think of $\lambda_i$ as $\frac{\mu_i}{e_i}$, where $\mu_i = E(Y_i)$ is the expected count and $e_i$ is the exposure of the $i$-th observation. We can then express the link function as follows, where $\ln(e_i)$ is called the offset:
$$
\begin{alignat*}{4}
&& g(\lambda_i) &= \ln\!\left(\frac{\mu_i}{e_i}\right) = \ln(\mu_i) - \ln(e_i) = \eta_i && \\
&& \Rightarrow \ln(\mu_i) &= \eta_i + \ln(e_i) && \quad\text{$\ln(e_i)$ enters the model as an offset} \\
\end{alignat*}
$$
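As a sketch of how the offset is supplied in practice, assuming hypothetical arrays `y` (counts), `X` (predictors) and `exposure` ($e_i$), statsmodels accepts $\ln(e_i)$ directly as an `offset` argument:

```python
# Sketch: Poisson regression with an exposure offset ln(e_i).
# Assumes hypothetical arrays y (counts), X (predictors), exposure (e_i > 0).
import numpy as np
import statsmodels.api as sm

result = sm.GLM(
    y,
    sm.add_constant(X),
    family=sm.families.Poisson(),
    offset=np.log(exposure),      # ln(mu_i) = eta_i + ln(e_i)
).fit()
print(result.params)              # the beta's model the rate lambda_i = mu_i / e_i
```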
Poisson Regression Parameter Estimation
We can use maximum likelihood estimation (MLE) to estimate the parameters of a Poisson regression.
$$ \begin{alignat*}{4} && P(Y_i=y_i) &= \dfrac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!}, \quad y_i=0,1,2,\cdots && \quad\text{PMF of the Poisson distribution} \\ \end{alignat*} $$
The joint PMF of independent Poisson observations is the product of the individual Poisson PMFs.
$$ \begin{alignat*}{4} && P(\overrightarrow{y}) &= \prod\limits_{i=1}^n \left[ \dfrac{e^{-\lambda_i}\lambda_i^{y_i}}{y_i!} \right] && \quad\text{joint PMF of independent Poisson observations} \\ \end{alignat*} $$
Recall that $\eta_i = \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}$; the joint PMF, viewed as a function of $\overrightarrow{\beta}$, is the likelihood function $L$:
$$
\begin{alignat*}{4}
&& L(\overrightarrow{\beta}) &= \prod\limits_{i=1}^n \left[ \dfrac{e^{-\lambda_i}\,\lambda_i^{y_i}}{y_i!} \right] && \\
&& &= \prod\limits_{i=1}^n \left[ \dfrac{\exp\{-e^{\eta_i}\}\,\exp\{y_i\eta_i\}}{y_i!} \right] && \quad\text{using the log link $\Rightarrow \lambda_i = e^{\eta_i}$} \\
&& &= \prod\limits_{i=1}^n \left[ \dfrac{\exp\{y_i\eta_i - e^{\eta_i}\}}{y_i!} \right] && \\
&& \ln L(\overrightarrow{\beta}) &= \ln\left( \prod\limits_{i=1}^n \left[ \dfrac{\exp\{y_i\eta_i - e^{\eta_i}\}}{y_i!} \right] \right) && \quad\text{take logs to get the log-likelihood $\ln L$} \\
&& &= \sum\limits_{i=1}^n \left[ y_i\eta_i - e^{\eta_i} - \ln(y_i!) \right] && \quad\text{note $\ln L$ is a function of the $\beta$'s, since each $\eta_i$ is} \\
\end{alignat*}
$$
The log-likelihood function of Poisson regression cannot be maximised for $\widehat{\vec{\beta}}$ in closed form by analytic (calculus) methods, but it can be maximised (approximated) by numerical methods.
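A minimal sketch of maximising this log-likelihood numerically is given below, using scipy's general-purpose optimiser rather than the iteratively reweighted least squares algorithm most GLM software uses; `X` (with an intercept column) and `y` are hypothetical data.

```python
# Sketch: numerically maximising the Poisson log-likelihood
#   ln L(beta) = sum_i [ y_i*eta_i - exp(eta_i) - ln(y_i!) ],  eta = X @ beta.
# Assumes a hypothetical design matrix X (intercept column included) and counts y.
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

def neg_log_lik(beta, X, y):
    eta = X @ beta
    return -np.sum(y * eta - np.exp(eta) - gammaln(y + 1))  # -ln L(beta)

beta_start = np.zeros(X.shape[1])                   # starting values
fit = minimize(neg_log_lik, beta_start, args=(X, y), method="BFGS")
beta_hat = fit.x                                    # numerical MLE of the beta's
```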
Interpretation of Poisson Regression Parameters
We formulated the Poisson regression using $\ln(\lambda)$ as the link function:
$$
\begin{alignat*}{4}
&& \eta &= \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p && \\
&& &= \ln(\lambda) && \\
&& \lambda &= \exp(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p) && \\
\end{alignat*}
$$
Therefore the regression parameters (the $\beta$'s) can be interpreted as follows:
- For the intercept $\beta_0$: $e^{\beta_0}$ is the mean of the response when all the predictors are set to zero.
- For the other parameters $\beta_j$: $e^{\beta_j}$ is the multiplicative increase in the mean of the response for a one-unit increase in $x_j$, holding all other predictors constant.
$$
\begin{alignat*}{4}
&& &\text{For a one-unit increase in $x_j$, the mean $\lambda = e^{\eta}$ becomes:} && \\
&& &e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_j(x_j+1) + \cdots + \beta_p x_p} && \\
&& =\ &e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_j x_j + \beta_j + \cdots + \beta_p x_p} && \\
&& =\ &e^{\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_j x_j + \cdots + \beta_p x_p} \cdot e^{\beta_j} && \\
&& =\ &e^{\eta} \cdot e^{\beta_j} && \quad\text{note $e^{\eta}$ is the original $\lambda$} \\
\end{alignat*}
$$
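For example, the rate ratios $e^{\hat{\beta}_j}$ can be read off a fitted model directly; this short sketch assumes a hypothetical fitted statsmodels result object `result` like the one above.

```python
# Sketch: interpreting Poisson regression coefficients as rate ratios.
# Assumes a hypothetical fitted GLM result object `result`.
import numpy as np

rate_ratios = np.exp(result.params)
# rate_ratios[0] = exp(beta_0): mean response when all predictors are zero
# rate_ratios[j] = exp(beta_j): multiplicative change in the mean response for a
#                  one-unit increase in x_j, holding the other predictors constant
print(rate_ratios)
```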
Deviance of Poisson Regression
The deviance of a GLM is expressed in terms of its log-likelihood function:
$$
\begin{alignat*}{4}
&& D &= -2\,\ln L(\hat{\beta}) && \\
&& &= -2\sum\limits_{i=1}^n \left[ y_i\hat{\eta}_i - e^{\hat{\eta}_i} - \ln(y_i!) \right] && \quad\text{for Poisson regression, plug in the Poisson log-likelihood} \\
\end{alignat*}
$$
We plug different estimates into the deviance formula of Poisson regression to calculate the deviance of each model of interest.
- For the deviance of the fitted model, we plug the fitted $\hat{\eta}_i$, which is equal to $\ln(\hat{\lambda}_i)$, into the deviance formula:
$$ \begin{alignat*}{4} && D_{p} &= -2\sum\limits_{i=1}^n \left[ y_i\ln(\hat{\lambda}_i) - \hat{\lambda}_i - \ln(y_i!) \right] && \\ \end{alignat*} $$
- For the deviance of the null model, there is only an intercept and no predictor variables, so the only estimate of the rate is the average of all $y_i$, i.e. $\hat{\lambda}=\overline{y}$.
$$ \begin{alignat*}{4} && \ln(\lambda) &= \eta && \quad\text{the Poisson link function} \\ && \ln(\overline{y}) &= \hat{\eta} && \quad\text{$\hat{\lambda}=\overline{y}$ under the null model} \\ \end{alignat*} $$
$$ \begin{alignat*}{4} && D_{null} &= -2\sum\limits_{i=1}^n \left[ y_i\ln(\overline{y}) - \overline{y} - \ln(y_i!) \right] && \quad\text{plug in $\hat{\eta}=\ln(\overline{y})$ for the null model} \\ \end{alignat*} $$
- The deviance of the saturated model is a conceptual benchmark against which the actual model is compared. It uses the maximum number of parameters possible (one for each observation), so it is over-fitted to the point of reproducing the observed data exactly, and its log-likelihood is the maximum theoretically possible for the sampled data. Since in the saturated model each data point is its own estimate, we have $\hat{\lambda}_i = y_i$ for each $i$.
$$ \begin{alignat*}{4} && D_{sat} &= -2\sum\limits_{i=1}^n \left[ y_i\ln(y_i) - y_i - \ln(y_i!) \right] && \\ \end{alignat*} $$
- The residual deviance is the difference between the deviance of the fitted model and the deviance of the saturated model. It can be shown that the residual deviance of a GLM follows a $\chi^2$ distribution with $n-(p+1)$ degrees of freedom:
$$
\begin{alignat*}{4}
&& D_{resid} &= D_{p} - D_{sat} && \\
&& &= \left\{ -2\sum\limits_{i=1}^n \left[ y_i\ln(\hat{\lambda}_i) - \hat{\lambda}_i - \ln(y_i!) \right] \right\} - \left\{ -2\sum\limits_{i=1}^n \left[ y_i\ln(y_i) - y_i - \ln(y_i!) \right] \right\} && \\
&& &= 2\sum\limits_{i=1}^n \left[ y_i\ln(y_i) - y_i\ln(\hat{\lambda}_i) - (y_i - \hat{\lambda}_i) \right] && \\
&& &= 2\sum\limits_{i=1}^n \left[ y_i\ln\!\left(\dfrac{y_i}{\hat{\lambda}_i}\right) - (y_i - \hat{\lambda}_i) \right] && \\
&& &\sim \chi^2\left(n-(p+1)\right) && \\
\end{alignat*}
$$
- It can be shown that the deviance residual of an individual data point $i$ of the fitted model, $d_i$, is as below, and that the $d_i$'s asymptotically follow a $Normal(0,1)$ distribution when the sample size is large and the counts are high:
$$ \begin{alignat*}{4} && d_i &= \operatorname{sign}(y_i-\hat{\lambda}_i)\,\sqrt{2\left( y_i\ln\!\left(\dfrac{y_i}{\hat{\lambda}_i}\right) - (y_i-\hat{\lambda}_i) \right)} && \\ \end{alignat*} $$
Therefore, if we plot $d_i$ against $\hat{\eta}_i$ and the deviance residuals are roughly symmetric, centred around 0, with no strong structure (funnel shape, curvature, or strong trend) in the plot, it is a good indication of a good fit.
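The deviances and deviance residuals above can be computed directly from the fitted rates; the sketch below assumes hypothetical arrays `y` (observed counts) and `lam_hat` (fitted $\hat{\lambda}_i$).

```python
# Sketch: fitted, null, saturated and residual deviance for a Poisson regression.
# Assumes hypothetical arrays y (observed counts) and lam_hat (fitted lambda_i).
import numpy as np
from scipy.special import gammaln

def poisson_deviance(y, lam):
    # D = -2 * sum( y_i*ln(lam_i) - lam_i - ln(y_i!) )
    return -2 * np.sum(y * np.log(lam) - lam - gammaln(y + 1))

D_fit = poisson_deviance(y, lam_hat)
D_null = poisson_deviance(y, np.full_like(lam_hat, y.mean()))
D_sat = poisson_deviance(y, np.where(y > 0, y, 1e-12))  # lambda_i = y_i, guarding y_i = 0
D_resid = D_fit - D_sat

# Deviance residuals; the y_i*ln(y_i/lambda_i) term is 0 when y_i = 0
y_safe = np.where(y > 0, y, 1.0)
d = np.sign(y - lam_hat) * np.sqrt(2 * (y * np.log(y_safe / lam_hat) - (y - lam_hat)))
# Plotting d against eta_hat = ln(lam_hat) should show no structure for a good fit.
```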
GLM Goodness of Fit Test
We can make use of the fact that the residual deviance follows a $\chi^2$ distribution to test the goodness of fit of a GLM model.
$$ \begin{alignat*}{4} && D_{resid} &\sim \chi_{n-(p+1)}^{2} && \\ \end{alignat*} $$
$$ \begin{alignat*}{4} && H_0 &: \text{The model with $p$ parameters fits well enough} && \quad\text{vs} \\ && H_1 &: \text{The model with $p$ parameters does not fit well enough} && \\ \end{alignat*} $$
If there is a large deviance difference between the fitted model and the 'perfect' (yet overfitted) saturated model, it means there is plenty of room for improvement in the fit (e.g. adding additional predictors that explain the variability better). So we conduct an upper-tail test using the $\chi^2$ statistic of the residual deviance: if the $\chi^2$ statistic is large (p-value smaller than $\alpha$), we reject the null hypothesis that the model with $p$ parameters adequately fits the data.
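A sketch of the corresponding upper-tail test in Python, assuming the hypothetical quantities `D_resid`, the sample size `n`, and the number of predictors `p` from the sketches above:

```python
# Sketch: deviance goodness-of-fit test.
# Assumes hypothetical D_resid (residual deviance), n (sample size), p (number of predictors).
from scipy.stats import chi2

df = n - (p + 1)
p_value = chi2.sf(D_resid, df)   # upper-tail probability P(chi2_df > D_resid)
if p_value < 0.05:
    print("Reject H0: evidence that the fitted model does not fit well enough")
else:
    print("Fail to reject H0: no evidence of lack of fit")
```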
Pearson's $\chi^2$ Test
Pearson's $\chi^2$ test is an alternative to the deviance test for a GLM. The general form of the Pearson $\chi^2$ statistic, with $n-(p+1)$ degrees of freedom, is given below. If the Pearson $\chi^2$ statistic is large (p-value smaller than $\alpha$), we reject the null hypothesis that the model with $p$ parameters adequately fits the data.
$$
\begin{alignat*}{4}
&& \chi^2\left(n-(p+1)\right) &= \sum\limits_{i=1}^n \dfrac{(o_i-E_i)^2}{E_i} && \\
&& &\quad\text{$o_i$ is the observed count for observation $i$} && \\
&& &\quad\text{$E_i$ is the expected count under the distribution assumed by the regression} && \\
\end{alignat*}
$$
$$ \begin{alignat*}{4} && H_0 &: \text{The model with $p$ parameters fits well enough} && \quad\text{vs} \\ && H_1 &: \text{The model with $p$ parameters does not fit well enough} && \\ \end{alignat*} $$
Therefore the Pearson $\chi^2$ statistic for a Poisson regression is
$$ \begin{alignat*}{4} && \chi^2\left(n-(p+1)\right) &= \sum\limits_{i=1}^n \dfrac{(y_i-\hat{\lambda}_i)^2}{\hat{\lambda}_i} && \\ \end{alignat*} $$
For an individual data point $i$ of the regression, the Pearson residual $p_i$ is as below; the $p_i$'s approximately follow a $Normal(0,1)$ distribution when the sample size is large and the counts are high:
$$ \begin{alignat*}{4} && p_i &= \dfrac{y_i-\hat{\lambda}_i}{\sqrt{\hat{\lambda}_i}} && \\ \end{alignat*} $$
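A sketch computing the Pearson $\chi^2$ statistic, its p-value, and the Pearson residuals, again assuming hypothetical `y`, `lam_hat`, `n` and `p`:

```python
# Sketch: Pearson chi-square statistic and residuals for a Poisson regression.
# Assumes hypothetical y (counts), lam_hat (fitted lambda_i), n (sample size), p (predictors).
import numpy as np
from scipy.stats import chi2

pearson_resid = (y - lam_hat) / np.sqrt(lam_hat)   # p_i, approx. Normal(0,1) for large counts
pearson_stat = np.sum(pearson_resid ** 2)          # sum of (y_i - lambda_i)^2 / lambda_i
p_value = chi2.sf(pearson_stat, df=n - (p + 1))    # upper-tail goodness-of-fit p-value
```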
Overdispersion
In a GLM, we have overdispersion when the variance of the count response is greater than its mean. It needs to be checked when doing Poisson regression because the assumed Poisson distribution mandates that $E(Y_i)=Var(Y_i)=\lambda_i$, and this assumption is often violated:
$$ \begin{alignat*}{4} && Var(Y_i) &> E(Y_i) && \\ \end{alignat*} $$
Real overdispersion:

- Zero inflation: there are more zero values in the response than expected from the assumed underlying distribution (e.g. insurance claims)
- Spatial/temporal correlation: the response depends on spatial/temporal structure that is not included in the model

Apparent overdispersion:

- Outliers: outliers due to errors in measurement or data collection
- Missing predictors: an explanatory variable is missing from the model
- Bad link function: the wrong link function is used
Detection of Overdispersion
Generalise the single-parameter Poisson regression model to a two-parameter model with an additional parameter $\phi$, known as the dispersion parameter. If $\phi$ is equal to $1$, the variance is $\lambda$ and the model is the same as a standard Poisson regression. If $\phi$ is greater than $1$, the variance is inflated to account for overdispersion.
$$ \begin{alignat*}{4} && E(Y_i) &= \lambda_i && \\ && Var(Y_i) &= \lambda_i\,\phi && \\ \end{alignat*} $$
We can estimate $\phi$ from the observed Pearson $\chi^2$ statistic of the sample by dividing it by its degrees of freedom, as below. If $\hat{\phi}$ is greater than $1$, we have evidence that there is overdispersion.
$$ \begin{alignat*}{4} && \hat{\phi} &= \dfrac{\sum\limits_{i=1}^n \dfrac{(y_i-\hat{\lambda}_i)^2}{\hat{\lambda}_i}}{n-(p+1)} && \\ \end{alignat*} $$
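$\hat{\phi}$ can be computed from the same quantities, or read off a fitted statsmodels result; a short sketch under the same hypothetical names:

```python
# Sketch: estimating the dispersion parameter phi.
# Assumes hypothetical y, lam_hat, n and p, or an already-fitted statsmodels GLM `result`.
import numpy as np

phi_hat = np.sum((y - lam_hat) ** 2 / lam_hat) / (n - (p + 1))

# Equivalent, from a fitted statsmodels GLM result object:
# phi_hat = result.pearson_chi2 / result.df_resid
print(phi_hat)   # values well above 1 suggest overdispersion
```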
To tackle overdispersion we can:
- Use quasi-likelihood methods to estimate the dispersion parameter. We adjust the standard errors of the estimated parameters, $\widehat{se(\hat{\beta}_j)}$, by multiplying them by the square root of the estimated dispersion parameter. Note that the quasi-likelihood method does not change the estimates of the regression parameters; it only adjusts their standard errors (see the code sketch after this list).
$$ \begin{alignat*}{4} && \widehat{se_{adj}(\hat{\beta}_j)} &= \widehat{se(\hat{\beta}_j)}\,\sqrt{\hat{\phi}} && \\ \end{alignat*} $$
- Use negative binomial regression. Unlike the quasi-likelihood method, negative binomial regression fits the data with a distribution that has genuine mean and dispersion parameters, so the $\beta$'s can be estimated by true maximum likelihood (see the sketch below).
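Below is a sketch of both remedies in statsmodels, assuming the hypothetical `y` and `X` used earlier. The quasi-Poisson-style fit keeps the same $\hat{\beta}$'s but rescales the standard errors by $\sqrt{\hat{\phi}}$; the negative binomial GLM fixes its dispersion parameter `alpha` at 1.0 purely for illustration (in practice it is estimated from the data).

```python
# Sketch: two remedies for overdispersion, assuming hypothetical y and X.
import statsmodels.api as sm

exog = sm.add_constant(X)

# 1) Quasi-likelihood style adjustment: same beta estimates, standard errors multiplied
#    by sqrt(phi_hat); scale='X2' uses the Pearson chi2 estimate of the dispersion.
quasi = sm.GLM(y, exog, family=sm.families.Poisson()).fit(scale='X2')
print(quasi.params)   # identical to the ordinary Poisson fit
print(quasi.bse)      # inflated standard errors

# 2) Negative binomial GLM: a genuine two-parameter count distribution.
#    alpha is fixed at 1.0 here purely for illustration.
nb = sm.GLM(y, exog, family=sm.families.NegativeBinomial(alpha=1.0)).fit()
print(nb.params)
```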