
Binomial Regression

Binomial Regression is a special case of GLM, and its GLM components are as follows.

The Random Component

Each $Y_i$ is an independent realization of a Binomial distribution with $y_i$ successes out of $n_i$ trials, where $p_i$ is the probability of success.

$$ \begin{alignat*}{4} && Y_i &\stackrel {ind} \sim Binomial(n_i,p_i) && \end{alignat*} $$

Systematic Component

$$ \begin{alignat*}{4} && \eta_i &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} && \text{where $p$ is the number of predictors,} \\ && & && \text{not to be confused with $p_i$ in the Random Component,} \\ && & && \text{which represents a different quantity} \end{alignat*} $$
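As a small illustration (not part of the original notes) of how the linear predictor is computed, here is a sketch in Python/NumPy; the design matrix `X` and the $\beta$ values are made-up numbers:

```python
import numpy as np

# Hypothetical design matrix: 5 observations, intercept column plus 2 predictors
X = np.array([[1.0, 0.5, 2.0],
              [1.0, 1.5, 0.0],
              [1.0, 2.0, 1.0],
              [1.0, 0.0, 3.0],
              [1.0, 1.0, 1.0]])
beta = np.array([-0.5, 0.8, 0.3])   # assumed (beta_0, beta_1, beta_2), for illustration only

eta = X @ beta                      # eta_i = beta_0 + beta_1*x_{i,1} + beta_2*x_{i,2}
print(eta)
```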

The Link Function

Usually in GLM, the link function, $g$, describes how the mean response, $\mu=E(Y)$, is linked to $\eta$. In the context of the Binomial Distribution, $E(Y)$ is the expected number of successes, which equals $np$ (i.e. $E(Y)=np$). However, we are usually more interested in estimating the probability of success, $p$, than in the expected count. Since $p$ can easily be deduced from $\mu$ and $n$ ($p=\frac{E(Y)} {n} = \frac{\mu} {n}$), we use the link function to link $\eta$ to $p$ instead of $\mu$.

$$ \begin{alignat*}{4} && g(p_i) &=\eta_i && \\ && &= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p} && \text{where $p$ is the number of predictors} \\ && & && \text{and $p_i$ is the probability of success} \end{alignat*} $$

  • The Logit function is the canonical link function for the binomial family in the GLM framework. The Logit function is defined as

$$ \begin{alignat*}{4} && g(p_i) &= ln (\frac {p_i} {1-p_i}) && \\ && &= \theta && \text{the Logit is equivalent to the $\theta$ parameter of the Binomial Distribution in its exponential family form} \\ && &\stackrel {set} = \eta_i && \text{as we set the Logit function as the link function linking $p_i$ to $\eta_i$, we have $\theta=\eta_i$} \\ && \frac {p_i} {1-p_i} &= e^{\eta_i} && \text{note the ratio $\frac{p_i} {1-p_i}$ is the definition of the odds, and it equals $e^{\eta_i}$,} \\ && & && \text{and $ln(\frac{p_i} {1-p_i})$ is the log-odds} \end{alignat*} $$

  Note:

  1. The Logit function is equivalent to the $\theta$ parameter of the Binomial Distribution in its exponential family form.

  2. By rearranging terms of the Logit function, we can express $p_i$ (i.e. the inverse of $g(p_i)$) in terms of $\theta$ as below. Also note that $p_i$ stays between $0$ and $1$: as $\theta$ approaches $-\infty$, $p_i$ approaches $0$; and as $\theta$ approaches $\infty$, $p_i$ approaches $1$.

$$ \begin{alignat*}{4} && p_i &= \frac {e^\theta} {1+e^\theta} && \text{Inverse of $g(p_i)$} \\ && &= \frac {e^{\eta_i}} {1+e^{\eta_i}} && \text{$\theta$ is $\eta_i$ as we set Logit as the link function} \end{alignat*} $$

  • Another function that can be used as the link function for Binomial Regression is the Probit function, which is the inverse of the CDF of the standard Normal Distribution.

$$ \begin{alignat*}{4} && g(p_i) &= \Phi^{-1}(p_i) && \\ && &\stackrel {set} = \eta_i && \end{alignat*} $$

  Note:

  1. If we use the inverse of the standard Normal CDF as the link function, the inverse of $g(p_i)$ is the standard Normal CDF, so the possible values of $p_i$ also lie between 0 and 1. (A numerical sketch of both link functions follows this list.)
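For concreteness, here is a minimal numerical sketch (in Python with NumPy/SciPy, which the original notes do not use) of the Logit and Probit links and their inverses, checking that both inverses map back into $(0, 1)$:

```python
import numpy as np
from scipy.special import expit, logit   # inverse-logit and logit functions
from scipy.stats import norm             # standard Normal CDF / inverse CDF for the Probit link

p = np.array([0.1, 0.5, 0.9])

# Logit link: g(p) = ln(p / (1 - p)); its inverse e^eta / (1 + e^eta) maps any real eta into (0, 1)
eta_logit = logit(p)                     # same as np.log(p / (1 - p))
p_back    = expit(eta_logit)

# Probit link: g(p) = Phi^{-1}(p); its inverse is the standard Normal CDF
eta_probit = norm.ppf(p)
p_back2    = norm.cdf(eta_probit)

print(np.allclose(p, p_back), np.allclose(p, p_back2))   # True True
```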

Binomial Regression Parameter Estimation

We can use the Maximum Likelihood Estimation (MLE) method to estimate the parameters of Binomial Regression.

$$ \begin{alignat*}{4} &&P(Y_i=y_i) &= {n_i \choose y_i}\ p_i^{y_i}\ (1-p_i)^{n_i-y_i} && \text{PMF of the Binomial Distribution} \end{alignat*} $$

The joint PMF of independent Binomial observations is the product of all the individual Binomial PMFs.

$$ \begin{alignat*}{4} &&P(\overrightarrow{y}) &= \prod\limits_{i=1}^n \left[ {n_i \choose y_i}\ p_i^{y_i}\ (1-p_i)^{n_i-y_i} \right] && \text{Joint PMF of Binomial Distribution} \\ && &= \prod\limits_{i=1}^n \left[ {n_i \choose y_i}\ {\left(\frac {e^{\eta_i}} {1+e^{\eta_i}}\right)}^{y_i}\ \left(1-\frac {e^{\eta_i}} {1+e^{\eta_i}} \right)^{n_i-y_i} \right] && \text{Using Logit as the link function we have $p_i = \frac {e^{\eta_i}} {1+e^{\eta_i}}$} \end{alignat*} $$

Recall that $\eta_i= \beta_0 + \beta_1 x_{i,1} + \beta_2 x_{i,2} + \cdots + \beta_p x_{i,p}$, so the joint PMF (which is also the likelihood function $L$) is a function of $\overrightarrow{\beta}$:

$$ \begin{alignat*}{4} &&L(\overrightarrow{\beta}) &= \prod\limits_{i=1}^n \left[ {n_i \choose y_i}\ {\left(\frac {e^{\eta_i}} {1+e^{\eta_i}}\right)}^{y_i}\ \left(1-\frac {e^{\eta_i}} {1+e^{\eta_i}} \right)^{n_i-y_i} \right] && \\ &&ln\ L(\overrightarrow{\beta}) &= ln \left(\prod\limits_{i=1}^n \left[ {n_i \choose y_i}\ {\left(\frac {e^{\eta_i}} {1+e^{\eta_i}}\right)}^{y_i}\ \left(1-\frac {e^{\eta_i}} {1+e^{\eta_i}} \right)^{n_i-y_i} \right] \right) && \text{taking the log of $L$ to get the log-likelihood function $ln\ L$} \\ && & && \\ && &=\cdots \ \cdots &&\text{simplifications not shown} \\ && & && \\ && &=\sum\limits_{i=1}^n \left[ y_i\ \eta_i - n_i\ ln(1+e^{\eta_i}) + ln{n_i \choose y_i} \right] && \text{note the log-likelihood function $ln\ L$ is a} \\ && & && \text{function of the $\beta$'s as $\eta_i$ is a function of the $\beta$'s} \end{alignat*} $$

The log-likelihood function of Binomial Regression cannot be maximised for $\hat{\vec{\beta}}$ analytically (by calculus), but the maximum can be approximated by numerical methods.
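Below is a minimal sketch of such a numerical maximisation, using SciPy's general-purpose optimiser on the log-likelihood derived above; the grouped data (`x`, `n`, `y`) are made-up values for illustration only:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit, gammaln

# Hypothetical grouped data: n_i trials and y_i successes at predictor value x_i
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n = np.array([20, 25, 30, 25, 20])
y = np.array([ 2,  6, 15, 19, 18])
X = np.column_stack([np.ones_like(x), x])          # design matrix with intercept column

def neg_log_lik(beta):
    eta = X @ beta
    log_binom = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)   # ln C(n_i, y_i)
    # ln L = sum_i [ y_i*eta_i - n_i*ln(1 + e^{eta_i}) + ln C(n_i, y_i) ]
    return -np.sum(y * eta - n * np.logaddexp(0.0, eta) + log_binom)

fit = minimize(neg_log_lik, x0=np.zeros(2), method="BFGS")
beta_hat = fit.x
print("beta_hat:", beta_hat)
print("fitted probabilities:", expit(X @ beta_hat))
```

The `np.logaddexp(0.0, eta)` call computes $ln(1+e^{\eta_i})$ in a numerically stable way.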

Interpretation of Binomial Regression Parameters

We formulated Binomial Regression using the Logit link function as below:

$$ \begin{alignat*}{4} && \eta &= \beta_0 + \beta_1 x_1 + \beta_2 x_{2} + \cdots + \beta_p x_{p} && \\ && &= \ln(\frac{p} {1-p}) && \end{alignat*} $$

Therefore the regression parameters (the $\beta$'s) can be interpreted as follows (provided that the Logit function is used as the link function):

  1. $\beta_0$ is the log-odds of success when all the predictors are zero.

  2. A unit increase in $x_j$, with all other predictors held constant, increases the log-odds of success by $\beta_j$; equivalently, it multiplies the odds of success by a factor of $e^{\beta_j}$ (see the derivation and the short sketch below).

$$ \begin{alignat*}{4} && &\text{For a 1 unit increase in $x_j$, the odds, $e^\eta$, becomes: } && \\ &&=&e^{\beta_0 + \beta_1 x_1 + \beta_2 x_{2} + \cdots +\beta_j(x_{j}+1) + \cdots + \beta_p x_{p}} && \\ &&=&e^{\beta_0 + \beta_1 x_1 + \beta_2 x_{2} + \cdots +\beta_j x_{j} + \beta_j + \cdots + \beta_p x_{p}} && \\ &&=&e^{\beta_0 + \beta_1 x_1 + \beta_2 x_{2} + \cdots +\beta_j x_{j} + \cdots + \beta_p x_{p}} \cdot e^{\beta_j} && \\ &&=&e^\eta \cdot\ e^{\beta_j} && \text{note $e^\eta$ is the odds} \\ && & && \\ && &\text{So the log-odds becomes: } && \\ && =&ln(e^\eta \cdot\ e^{\beta_j}) && \\ && =&\eta + \beta_j && \text{note $\eta$ is the log-odds} \end{alignat*} $$
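A quick numerical check of this interpretation, with made-up coefficient values for a single-predictor model (not taken from any fitted model):

```python
import numpy as np

beta0, beta1 = -1.0, 0.8       # assumed intercept and slope, for illustration only

def odds(x):
    """Odds of success p/(1 - p) = e^eta at predictor value x."""
    return np.exp(beta0 + beta1 * x)

x = 2.0
print(odds(x + 1) / odds(x))   # odds multiplier for a 1-unit increase in x ...
print(np.exp(beta1))           # ... equals e^{beta_1}
```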

GLM Deviance

GLM Deviance measures the goodness of fit of a GLM model; it is a generalization of the residual sum of squares used in linear regression, adapted to the likelihood-based framework of GLMs. Specifically, the GLM residual deviance compares the log-likelihood of the fitted model to that of a saturated model, which fits the data perfectly by using one parameter per observation. A smaller deviance (a smaller difference between the two log-likelihoods) implies a better fit.

The factor of $-2$ is included so that, under large-sample conditions, the deviance approximately follows a chi-squared distribution, enabling hypothesis testing on model fit.

The deviance is defined as:

$$ \begin{alignat*}{4} && D_{resid}&= -2 \left[ ln\ L(fitted) - ln\ L(saturated)\right] && \end{alignat*} $$

The GLM null model is a baseline model that includes only the intercept term and no predictors. It assumes that the response variable has the same expected value for all observations, typically estimating this value using the overall mean (for Gaussian models) or overall proportion (for binomial models). The null model is used to compute the null deviance, which quantifies how well this simple model fits the data:

$$ \begin{alignat*}{4} D_{null} = - 2 \left[ln\ L(null) - ln\ L(saturated) \right] \end{alignat*} $$

Comparing the null deviance to the deviance of a fitted model helps assess the explanatory power added by the predictors. This comparison is analogous to the sums of squares in standard Linear Regression; the correspondence is summarized in the table below (a short computational sketch follows the table).

| Concept | Linear Regression (Meaning) | Formula | GLM (Meaning) | Formula |
| --- | --- | --- | --- | --- |
| Total variation | Total variation in $y_i$ vs. mean | $\text{TSS} = \sum (y_i - \bar{y})^2$ | Fit of intercept-only model | $D_{\text{null}} = -2 \left[ ln\ L(\text{null}) - ln\ L(\text{saturated}) \right]$ |
| Model error | Unexplained variance | $\text{RSS} = \sum (y_i - \hat{y}_i)^2$ | Lack of fit of full model | $D_{\text{resid}} = -2 \left[ ln\ L(\text{fitted}) - ln\ L(\text{saturated}) \right]$ |
| Model gain | Improvement due to predictors | $\text{ESS} = \text{TSS} - \text{RSS}$ | Deviance reduction | $D_{\text{null}} - D_{\text{resid}}$ |
| Proportion fit | Proportion of variance explained | $R^2 = 1 - \frac{\text{RSS}}{\text{TSS}}$ | Proportion of deviance explained | $R^2_{\text{pseudo}} = 1 - \frac{D_{\text{resid}}}{D_{\text{null}}}$ |
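As a sketch of how these quantities can be obtained in practice, the snippet below fits a binomial GLM with the `statsmodels` package (an assumption; the original notes do not name any software) and reads off the residual and null deviances via its `deviance` and `null_deviance` attributes; the data are the same made-up values used in the earlier MLE sketch:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical grouped binomial data; endog columns are (successes, failures)
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n = np.array([20, 25, 30, 25, 20])
y = np.array([ 2,  6, 15, 19, 18])
endog = np.column_stack([y, n - y])
exog  = sm.add_constant(x)

res = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()   # Logit is the default link
pseudo_r2 = 1 - res.deviance / res.null_deviance                 # proportion of deviance explained
print(res.deviance, res.null_deviance, pseudo_r2)
```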

Deviance of Binomial Regression

In the context of Binomial Regression, we plug different values of the $\beta$'s into the log-likelihood function of Binomial Regression to calculate the respective deviances.

  • For the log-likelihood of the Fitted model, we plug the fitted $\beta$'s, i.e. $\hat{\vec{\beta}}$, into the log-likelihood.

  • For the log-likelihood of the Null model, there is no predictor variable, therefore $\beta_0=\eta_i$ for all $i$'s.

$$ \begin{alignat*}{4} && \beta_0&=\eta_i &&\text{$\eta_i$ is the same for all $i$'s} \\ && &=ln(\frac{p_i}{1-p_i}) &&\text{$\eta_i$ is the log-odds,} \\ && & &&\text{$\therefore$ the log-odds is the same for all $i$'s} \\ && & &&\text{$\therefore$ $p_i$ is the same for all $i$'s} \\ && &=ln(\frac{p}{1-p}) &&\text{write $p_i$ as $p$ as they are the same} \\ && &=ln(\frac{\overline{y}}{1-\overline{y}}) && \text{where $\overline{y}=\frac{\sum y_i}{\sum n_i}$} \end{alignat*} $$

  • The Saturated model is a conceptual benchmark against which the actual model is compared. It uses the maximum number of parameters possible (one for each observation), so it is over-fitted and reproduces the observed data exactly. Its log-likelihood is the maximum possible for the sampled data. Because deviances are defined as relative measures, we are measuring differences relative to the Saturated model, so it acts as the reference point, much like setting a zero point on a scale. (A computational sketch of these deviances follows this list.)
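The following sketch computes the three log-likelihoods and the resulting deviances directly from the binomial PMF; the data are the same made-up values as before, and `beta_hat` is a placeholder for whatever estimates the fitting step returned:

```python
import numpy as np
from scipy.special import expit, gammaln

def binom_log_lik(p, y, n):
    # Sum of binomial log-PMFs; assumes 0 < p_i < 1 (true here since 0 < y_i < n_i)
    log_binom = gammaln(n + 1) - gammaln(y + 1) - gammaln(n - y + 1)
    return np.sum(log_binom + y * np.log(p) + (n - y) * np.log(1 - p))

x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
n = np.array([20, 25, 30, 25, 20])
y = np.array([ 2,  6, 15, 19, 18])
beta_hat = np.array([0.0, 1.1])               # placeholder fitted coefficients, for illustration

p_fit  = expit(beta_hat[0] + beta_hat[1] * x) # Fitted model: p_i from the linear predictor
p_null = np.full(len(y), y.sum() / n.sum())   # Null model: one common p = sum(y_i) / sum(n_i)
p_sat  = y / n                                # Saturated model: p_i = y_i / n_i

ll_fit  = binom_log_lik(p_fit,  y, n)
ll_null = binom_log_lik(p_null, y, n)
ll_sat  = binom_log_lik(p_sat,  y, n)

D_resid = -2 * (ll_fit  - ll_sat)
D_null  = -2 * (ll_null - ll_sat)
print(D_resid, D_null, 1 - D_resid / D_null)  # deviances and pseudo R^2
```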

GLM Goodness of Fit Test

This is a goodness-of-fit test for a single fitted GLM; it tests whether the fitted model fits well enough overall.

For a GLM model with $p$ predictors, the total number of parameters is $(p+1)$ (including the intercept).

It can be shown that the residual deviance of a GLM approximately follows a $\chi^2$-distribution with $n-(p+1)$ degrees of freedom, provided the group sizes are reasonably large (roughly $n_i>5$) and the other GLM assumptions are met (this does not apply to Binomial Regression with a Bernoulli (0/1) response):

$$ \begin{alignat*}{4} && D_{resid} & \sim \chi_{n-(p+1)}^{2} && \end{alignat*} $$

The null hypothesis is that the model fits well (i.e. the residual deviance is small); if the residual deviance is too large (above the critical value), then we have statistical evidence that the model does not fit well.
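A minimal sketch of this test with SciPy, using made-up values for the residual deviance and degrees of freedom:

```python
from scipy.stats import chi2

D_resid, df = 7.4, 3                 # hypothetical residual deviance and df = n - (p + 1)

p_value  = chi2.sf(D_resid, df)      # P(chi^2_df >= D_resid); a small p-value indicates poor fit
critical = chi2.ppf(0.95, df)        # 5% critical value
print(p_value, critical, D_resid > critical)
```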