Gamma Regression in R

Overview

We will be summarizing information from Getting Started With Gamma Regression. The goal is to have a deep enough understanding of gamma regression to finish the analysis needed for our master's thesis.

Background Review

In statistics, we covered the normal distribution curve in great detail. The curve is described by two parameters: the mean and the standard deviation. In R, we can create a normal distribution curve and set its parameters using the dnorm() and plot() functions.

x <- seq(from = -1, to = 5, by = .15)
# plot the density values against x as a curve
plot(x, dnorm(x, mean = 2, sd = 1.2), type = "l")

We can think of the normal distribution as representing a population. We can use the function rnorm(), which draws n values at random from a population with the desired mean and standard deviation. Since these values are chosen at random, their calculated mean and standard deviation will generally be close to, but not exactly equal to, the mean and standard deviation of the population.

set.seed(99)
y <- rnorm(n = 1000, mean = 2, sd = 1.2)
mean(y)
sd(y)

Introduction to Generalized Linear Models

We can now take this information and move to generalized linear models. These models build on the concepts of regular linear regression but allow the outcome to follow distributions other than the normal. To use generalized linear models, we need to understand the different types of distributions and what link functions are. A normal distribution is referred to as gaussian, while the link function specifies how the mean of the outcome relates to the linear predictor (i.e., whether the data are transformed or not). With this in mind, we can estimate the mean and standard deviation of a sample by fitting an empty model.

m1 <- glm(y ~ 1, family = gaussian(link = "identity"))
s1 <- summary(m1)
s1
Empty model Output
  • The intercept 1.97108 is identical to the sample mean of vector y
  • The dispersion parameter 1.5065 is the estimated variance of the data
  • Taking the square root of the dispersion parameter (1.22742) produces the standard deviation of vector y, as verified in the snippet below.
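
As a quick check (not part of the original guide), we can confirm these bullet points directly from the summary object:

# The intercept equals the sample mean, and the dispersion parameter
# equals the sample variance (its square root equals the sample SD)
mean(y)
s1$dispersion
sqrt(s1$dispersion)
sd(y)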
Equation Notation

$$y_{i} = \beta_{0} + \epsilon_{i}, \quad \epsilon_{i} \sim N(0, \sigma^2)$$

Where

  • $y_{i}$ represents each observed value in vector y
  • $\beta_{0}$ represents the expected value of y, in this case the mean of y, 1.97108
  • $\epsilon_{i}$ represents the difference between the predicted value (the y-intercept) and the observed value for observation i
  • $\sigma^2$ represents the variance of the scores and of the errors, in this case 1.5065, which is the dispersion parameter in the model.

For this particular model, which includes no predictors, we can technically use the errors to obtain the variance and standard deviation of the scores and vice versa. Thus, if we wanted to know the sums of squares (SS), we could calculate them from either the scores or the errors, as the snippet below shows.
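
A small illustration (assuming the m1 fit from above):

# The residuals of the empty model are just y minus its mean, so the
# variance and sums of squares agree whether computed from scores or errors
var(y)
var(resid(m1))
sum((y - mean(y))^2)  # SS from the scores
sum(resid(m1)^2)      # SS from the errors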

Gamma Distribution

Unlike the normal distribution, which is described by the parameters mean ($\mu$) and standard deviation ($\sigma$), the gamma distribution uses the parameters shape ($\alpha$) and scale ($\sigma$). This is a bit confusing, since shape uses the Greek letter alpha while scale uses sigma, both of which already carry other meanings in statistics. One thing to note about gamma distributions is that they can look quite different from one another.

curve(dgamma(x, shape = 1, scale = 2), from = 0, to = 20, 
      ylab = "",
      main = "Two gamma distributions")
curve(dgamma(x, shape = 7.5, scale = 1), from = 0, to = 20,
      add = TRUE, col = "red")
legend("topright", legend = c("shape = 1, scale = 2",
                              "shape = 7.5, scale = 1"),
       col = c("black", "red"), lty = 1)

Even though we mentioned that the parameters of the gamma distribution are shape and scale, we can convert between them and the mean and variance using the following identities:

$$mean = \alpha\sigma$$
$$variance = \alpha\sigma^2$$
$$\alpha = \frac{mean^2}{variance}$$
$$\sigma = \frac{variance}{mean}$$

With these identities, we can define gamma distributions using means and variances (or standard deviations) instead of shapes and scales. We can draw from a gamma distribution using the rgamma() function, setting the mean and variance we want in advance and then recovering them from the sample.

set.seed(999)
m <- 100 # true mean
v <- 50  # true variance
y2 <- rgamma(n = 10000, shape = (m^2)/v, scale = v/m)
mean(y2)
var(y2)

So in practice we will have the mean and variance available from the scores of y2, and we can use that information to calculate the shape and scale of the gamma distribution.

# shape
(mean(y2)^2)/var(y2)
# scale
var(y2)/mean(y2)

Interestingly, the scores we generated from this gamma distribution look pretty normally distributed, with perhaps a slight skew to the right. This is because as the shape ($\alpha$) parameter gets bigger, the gamma distribution converges to a normal distribution. In our example the shape parameter is quite large: $(100^2)/50 = 200$.

Histogram of the Scores | QQ-Plot of the Scores
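A minimal sketch of how those two panels might be reproduced with base graphics (the exact plotting code is not shown in the guide):

# Histogram of the simulated scores
hist(y2, breaks = 50, main = "Histogram of the scores")
# Normal QQ-plot of the scores
qqnorm(y2)
qqline(y2)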

Running an Empty Gamma Regression

With the knowledge we have now, we can run a gamma regression by using glm() and setting the family argument to Gamma with the link set to "identity".

m2 <- glm(y2 ~ 1, family = Gamma(link = "identity"))
s2 <- summary(m2)
s2
Empty Gamma model Output
Equation Notation

$$y_{i} = \beta_{0} + \epsilon_{i}$$

Where

  • $y_{i}$ represents each observed value in vector y2
  • $\beta_{0}$ represents the expected value of y2, in this case the mean of y2, 99.95558
  • $\epsilon_{i}$ represents the difference between the predicted value (the y-intercept) and the observed value for observation i
  • $\phi$ (phi) represents the variance of the errors, in this case 0.0050971, which is the dispersion parameter in the model.
    • This is interesting because running var(resid(m2)) produces the same value as the dispersion parameter, which differs greatly from var(y2)

Along with using the formulas from above to calculate the shape, we can also use the dispersion parameter (phi) to obtain the same estimate: $\hat{\alpha} = 1/\phi$. The latter approach is the more useful one, since the former does not work when predictors are added to the model.

mean(y2)^2/var(y2)
shape <- 1/s2$dispersion
shape

To calculate the scale, we can divide the variance of y2 by its mean, or take the intercept (the mean) and divide it by the shape. Just as with the shape, the latter approach will be the more useful one, since the former only works for empty models.

var(y2)/mean(y2)
scale <- as.numeric(coef(m2))/shape
scale

Now that we have our shape and scale parameters, we can use them to graph the fitted distribution over the observed data using the curve() function.

Estimated Gamma Distribution Over the Distribution of Scores
  • Notice the created gamma distribution fits the distribution of the outcome pretty well.
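
A minimal sketch of how that overlay might be drawn, using the shape and scale objects computed above (the exact plotting code is not shown here):

# Density histogram of the observed scores with the fitted gamma density on top
hist(y2, freq = FALSE, breaks = 50,
     main = "Estimated gamma distribution over the scores")
curve(dgamma(x, shape = shape, scale = scale),
      add = TRUE, col = "red", lwd = 2)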

Running an Empty Gamma Regression With Different Link Settings

m3 <- glm(y2 ~ 1, family = Gamma(link = "inverse"))
summary(m3)

m4 <- glm(y2 ~ 1, family = Gamma(link = "log"))
summary(m4)
Empty Gamma Model Output (Inverse) | Empty Gamma Model Output (Log)
  • Notice that in these empty models the dispersion parameter has not changed compared to the "identity" link. What does change is the y-intercept estimate. We can convert these estimates back into the mean (as shown below):
  • 1/coef(m3) = 99.95558
  • exp(coef(m4)) = 99.95558
  • From this information, we can agree that keeping the link function as "identity" produces the easiest model to interpret. However, in some cases the fit with this approach may be problematic, and these alternative link functions can get around that.
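
These back-transformations can be verified directly in R:

# Convert the inverse- and log-link intercepts back to the mean of y2
1/coef(m3)
exp(coef(m4))
mean(y2)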

Real World Example

  • We will now deviate from the guide and compare a typical OLS regression against a GLM with a gamma distribution and a log link function. We will use the example data the guide provides.
URL <- 'http://static.lib.virginia.edu/statlab/materials/data/alb_homes.csv'
homes <- read.csv(file = URL)
  • Our goal is to create models that can explain the variance of the outcome totalvalue, which represents the cost of houses. As expected, this variable has a skewed distribution, with most house values at the left of the plot and a few extremely expensive houses in the right tail.
Distribution of the Outcome
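A quick sketch of how that plot might be produced (the guide shows only the figure):

# Histogram of the skewed outcome
hist(homes$totalvalue, breaks = 100,
     main = "Distribution of totalvalue")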
  • Now let's create two empty models with different assumptions for the distribution of the data.
# Create an empty model (gamma, log link)
mod0 <- glm(totalvalue ~ 1, data = homes, 
            family = Gamma(link = "log"))

# Create an empty model (gaussian, identity link)
mod0_gaussian <- glm(totalvalue ~ 1, data = homes, 
                     family = gaussian(link = "identity"))

Gaussian Distribution Assumption Model Output | Gamma Distribution Assumption Model Output
  • We can instantly see differences in the output of these models. The assumed distribution of the outcome appears in the Call line and in the dispersion parameter line. The y-intercept estimates also differ: the gaussian model produces the mean of totalvalue, while the gamma model produces the log of the mean, log(mean(homes$totalvalue)). This can be confirmed by exponentiating the gamma coefficient, exp(coef(mod0)), which reproduces the mean, as shown below.
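
A quick verification of those intercept claims:

# Gaussian intercept is the raw mean; gamma (log link) intercept is its log
coef(mod0_gaussian)
mean(homes$totalvalue)
coef(mod0)
exp(coef(mod0))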

  • We can also plot the residuals of the two models and compare them

Residuals of the Gaussian Model | Residuals of the Gamma Model
  • While not perfect, the GLM with the gamma distribution and log link function produces a much better QQ-plot than the gaussian model (a sketch of the comparison code follows).
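
A minimal sketch of how such a comparison might be drawn (note that resid() returns deviance residuals for a GLM by default):

# Side-by-side residual QQ-plots for the two empty models
par(mfrow = c(1, 2))
qqnorm(resid(mod0_gaussian), main = "Gaussian residuals")
qqline(resid(mod0_gaussian))
qqnorm(resid(mod0), main = "Gamma residuals")
qqline(resid(mod0))
par(mfrow = c(1, 1))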

  • Now let's introduce predictors into the model to see how their summary outputs compare.

One-Predictor Gaussian Model | One-Predictor Gamma Model
Residuals of the Gaussian Model 1 | Residuals of the Gamma Model 1
  • A bit disappointing: the gamma regression residuals still look somewhat off, but I still prefer them over the gaussian ones. Honestly, the comparison looks a lot better when the residuals are graphed as a histogram (hist()); there, the gamma model is clearly superior. A sketch of these one-predictor fits follows.
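
The wiki screenshots do not name the predictor, so finsqft (finished square feet) is an assumption here, and the model names are hypothetical:

# One-predictor versions of both models (predictor choice is an assumption)
mod1_gaussian <- glm(totalvalue ~ finsqft, data = homes,
                     family = gaussian(link = "identity"))
mod1_gamma <- glm(totalvalue ~ finsqft, data = homes,
                  family = Gamma(link = "log"))
# Histogram view of the gamma model's residuals, as mentioned above
hist(resid(mod1_gamma), breaks = 50)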
  • We will now create more complex models to continue the investigation of the differences between the two approaches
  • Using model comparison, we created models that contain the following predictors (a sketch of such a model appears after this list):
    • finsqft
    • esdistrict
    • age
    • lotsize
    • condition
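The wiki does not show the fitting code for these complex models, but the outlier snippet below references an object called mod5. A sketch consistent with that name, assuming mod5 is the gamma model (the gaussian analogue's name is hypothetical):

# Complex models with the five predictors listed above
mod5 <- glm(totalvalue ~ finsqft + esdistrict + age + lotsize + condition,
            data = homes, family = Gamma(link = "log"))
mod5_gaussian <- glm(totalvalue ~ finsqft + esdistrict + age + lotsize + condition,
                     data = homes, family = gaussian(link = "identity"))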
Residuals of the Complex Gaussian Model | Residuals of the Complex Gamma Model
  • It looks like one reason the models do not fit well is outliers. We can use Cook's distance to identify outliers and remove them from the respective models.
  • A popular rule for flagging influential values is the cutoff 4/(N-k-1). Read more about it here
# Remove outliers using Cook's distance
N <- nrow(homes)
k <- length(coef(mod5)) - 2

# Calculate Cook's distance for each observation
cooks_mod5 <- cooks.distance(mod5)

# Identify influential cases using the 4/(N-k-1) cutoff
influential_mod5 <- which(cooks_mod5 > (4 / (N - k - 1))) 

# Remove the influential cases from the data
homes2 <- homes[-influential_mod5, ]
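
To produce the post-removal residual plots, the gamma model would then be refit on the cleaned data. A sketch (the refit code is not shown in the wiki, and the object name is hypothetical):

# Refit the complex gamma model on the data with influential cases removed
mod5_no_outliers <- glm(totalvalue ~ finsqft + esdistrict + age + lotsize + condition,
                        data = homes2, family = Gamma(link = "log"))
summary(mod5_no_outliers)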
Residuals of the Complex Gaussian Model After Outlier Removal | Residuals of the Complex Gamma Model After Outlier Removal
  • After outlier removal we can clearly see that the complex gamma model is superior to the gaussian one. 109 influential cases were found for the gaussian model and 105 for the gamma model; overall that is roughly 3.6% of the data removed, which is not much.

How Does Outlier Removal Influence Estimates and Model Fit

Gamma Model Output With Outliers | Gamma Model Output Without Outliers
  • We see that removing the outliers did improve the fit of the model (though such a comparison should be made cautiously). We cannot do a direct model comparison because the numbers of observations differ; however, the AIC looks lower in the model without outliers than in the model where they were kept. Without getting too specific, there appear to be more significant predictors in the right-hand output than in the left.
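
Assuming the refit object from the sketch above, the AIC comparison mentioned here could be checked as follows:

# AICs are not strictly comparable across different data sets,
# but they echo the improvement seen in the output
AIC(mod5)
AIC(mod5_no_outliers)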