Generalized Additive Models
This entry addresses the use of Generalized Additive Models fit with gam() from the mgcv package in R. The tools shown here should only be used to analyze cross-sectional data, that is, datasets with no repeated measures. If you want to analyze a dataset containing repeated measures, please see the entry on Generalized Additive Mixed Effects Models.
The gam() function is useful because it models non-linear predictor effects through splines, making it a powerful tool when analyzing a dataset with non-linear effects. If you want to learn how to assess whether your effects are linear, please see the entry on testing non-linearity.
The best way to learn about GAMs is by example, so I will provide the relevant code for the different tools within gam(). The following lines of code show the basic use of a generalized additive model. The notation is similar to fitting any generalized linear model; the main difference is that we specify non-linear terms with the spline function, s(), inside the regression formula. This is a basic example:
Create the dummy data needed for these examples:
library(mgcv)
set.seed(1)
x1 <- rnorm(100, 20, 3)
x2 <- rnorm(100, 5, 2)
x2.2 <- (x2)^2
error <- rnorm(100, 0, 10)
y <- 3*x1 + 10*x2.2 + error
Specifying the model:
model <- gam(y ~ x1 + s(x2))
summary(model)
Family: gaussian
Link function: identity
Formula:
y ~ x1 + s(x2)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 278.1947     8.1994  33.929  < 2e-16 ***
x1            3.0458     0.4002   7.611 2.43e-11 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
        edf Ref.df    F p-value    
s(x2) 7.065  7.065 5790  <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.998
lmer.REML = 763.07 Scale est. = 106.55 n = 100
We can fit fixed or penalized splines using the fx argument of the s() function. The mechanism behind gam() is to construct several spline basis terms from your data and fit them simultaneously, so the final smooth is the sum of several terms in the same variable (analogous to having x1 + x1^2 + x1^3 in a linear regression).
Penalized splines are the default and tend to be more useful, since they let gam() find the optimal number of terms needed for the best overall fit to the data, giving you a spline that uses fewer degrees of freedom. Fixed splines will fit all of the generated terms and are costly in terms of degrees of freedom.
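One way to see this mechanism directly is to peek at the model matrix of the fit above: the columns generated for s(x2) are the individual basis functions whose weighted sum forms the smooth. This is just an illustrative look at the basis expansion, not a required step.
# After the intercept and x1, the remaining columns are the spline basis functions built for s(x2)
head(model.matrix(model))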
This is an example of fitting a fixed spline on the same data; notice the difference in the degrees of freedom for the smooth terms:
model <- gam(y ~ x1 + s(x2, fx = T))
summary(model)
Family: gaussian
Link function: identity
Formula:
y ~ x1 + s(x2, fx = T)
Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 300.1657     6.7852  44.239  < 2e-16 ***
x1            2.8795     0.3457   8.331 9.76e-13 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Approximate significance of smooth terms:
      edf Ref.df    F p-value    
s(x2)  10     10 4443  <2e-16 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
R-sq.(adj) = 0.998 Deviance explained = 99.8%
GCV = 121.23 Scale est. = 106.68 n = 100
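The penalized fit shown first used the default basis dimension for s(). If you want to give the penalized spline a larger (or smaller) basis to choose from, you can set it with the k argument; this is only a sketch on the same dummy data, and k = 20 is an arbitrary choice.
# Penalized spline with a larger basis; the penalty still shrinks the
# effective degrees of freedom down to what the data support
model.k <- gam(y ~ x1 + s(x2, k = 20))
summary(model.k)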
Fitting a categorical by non-linear interaction
You can fit a categorical by non-linear interaction in GAM in the same way you would in a regular linear regression, except that you need to pass your categorical variable as an ordered factor. Please note that this method of fitting in gam() only supports two groups; with more than two groups it will break, and you will need to rely on a different model.
If you pass the group as a plain (unordered) categorical variable, gam() will give you two individual splines and will not test the interaction term; it only tests the significance of each group's spline individually (i.e. you won't be able to tell whether there is a significant group by non-linear interaction, only whether there is a non-linear trend in each group). A sketch of that version appears after the example below.
This is an example of how to fit these interactions:
set.seed(1)
x2 <- rnorm(100, 5, 2)
error <- rnorm(100, 0, 10)
y1 <- 10*(x2)^2 + error                              # group 1: quadratic trend
y2 <- (x2)^3 + error                                 # group 2: cubic trend
x2 <- c(x2, x2)
y <- c(y1, y2)
category <- ordered(c(rep(1, 100), rep(2, 100)))     # the grouping variable, passed as an ordered factor
model2 <- gam(y ~ s(x2) + s(x2, by=category))        # s(x2, by=category) tests the group difference smooth
summary(model2)
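For comparison, this is a minimal sketch of the unordered-factor version mentioned above (the names group and model3 are just for illustration). Here gam() fits a separate spline per group, so the summary tests each group's smooth individually rather than the difference between them.
group <- factor(c(rep(1, 100), rep(2, 100)))     # plain (unordered) factor
model3 <- gam(y ~ group + s(x2, by = group))     # one smooth per level of group
summary(model3)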
Fitting a linear by non-linear interaction
To fit a linear by non-linear interaction, we need to rely on parametric bootstrap tests to obtain a valid p-value. An example of how to do this can be found in the Model Comparison wiki entry.
You should not try to implement linear by non-linear interactions without using the pbkrtest package, since the p-values reported by gam() are not reliable in this case.
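For reference, one way such an interaction can be written in mgcv is as a varying-coefficient term, where a continuous variable is passed to the by argument. This sketch only shows the model specification on regenerated dummy data; the p-value for the interaction should still come from the parametric bootstrap comparison described above, not from the gam() summary.
set.seed(1)
x1 <- rnorm(100, 20, 3)
x2 <- rnorm(100, 5, 2)
y <- 3*x1 + 10*(x2)^2 + rnorm(100, 0, 10)
# x1 enters linearly, but its coefficient is allowed to vary smoothly with x2
model.vc <- gam(y ~ s(x2) + s(x2, by = x1))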
Plotting GAMs
There are two different ways of plotting GAMs: vis.gam() and plot.gam(). Here are some example outputs:
Loading Data
set.seed(1)
x1 <- rnorm(100, 20, 3)
x2 <- rnorm(100, 5, 2)
x2.2 <- (x2)^2
error <- rnorm(100, 0, 10)
y <- 3*x1 + 10*x2.2 + error
model <- gam(y ~ x1 + s(x2))
Using vis.gam()
We start off with vis.gam(). This function creates a 3D surface of the model's predictions against two regressors of your choice. You can choose which regressors to plot with the view argument.
vis.gam(model, view=c("x1","x2"))
This is an example of the type of plots you can generate with vis.gam()
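vis.gam() also takes a few optional arguments that can make the surface easier to read; for example, theta and phi rotate the perspective plot, and plot.type = "contour" draws a contour map instead of a 3D surface. Two illustrative calls:
vis.gam(model, view=c("x1","x2"), theta=40, phi=20)      # rotate the 3D surface
vis.gam(model, view=c("x1","x2"), plot.type="contour")   # contour plot instead of a surface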
Using plot.gam()
The second way of plotting gam models is plot.gam(). This function creates a 2D plot of each smooth term, showing the estimated spline for that regressor. It creates all of the plots once you call the function, and you will need to press enter several times to step through them.
plot.gam(model)
This is an example of the type of plots that can be generated using plot.gam
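If you would rather see all of the smooth plots on a single page instead of stepping through them, plot.gam() takes a pages argument; shade = TRUE adds a shaded confidence band around each smooth. A short sketch:
plot.gam(model, pages=1, shade=TRUE)   # all panels on one page, with shaded confidence bands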
Reporting Results
The parametric coefficients can be reported in a similar fashion to a linear regression. I haven't been able to find any documentation of a standard way to report GAM results; however, in the past we have reported the p-values of the smoothing splines.
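If it helps when writing results up, the tables printed by summary() can also be extracted programmatically: summary() on a gam object returns the parametric coefficients in p.table and the smooth-term statistics (edf, F, p-value) in s.table. A minimal sketch:
gam.summary <- summary(model)
gam.summary$p.table   # parametric (linear) coefficients
gam.summary$s.table   # smooth terms: edf, Ref.df, F, p-value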