Generalized Additive Models - PennBBL/groupAnalysis GitHub Wiki

This entry addresses the use of Generalized Additive Models via gam() from the mgcv package in R. The tools shown here should only be used to analyze cross-sectional data, i.e. datasets with no repeated measures. If you want to analyze a dataset containing repeated measures, please take a look at the entry for Generalized Additive Mixed Effects Models.

The gam() function is useful because it allows predictors to be modeled non-linearly through splines, which makes it a powerful tool when analyzing a dataset with non-linear effects. If you want to learn how to assess whether your effects are linear, please take a look at the entry on testing non-linearity.

The best way to learn about using GAMs is by example, so I will provide the relevant code for the different tools within GAM. The following lines of code cover the basic use of a generalized additive model. The notation is similar to fitting any generalized linear model; the main difference is that we can specify non-linear terms with the spline function, s(), inside the regression formula. Here is an example:

Create the dummy data needed for these examples:

library(mgcv)

set.seed(1)
x1 <- rnorm(100, 20, 3)
x2 <- rnorm(100, 5, 2)
x2.2 <- (x2)^2
error <- rnorm(100, 0, 10)
y <- 3*x1 + 10*x2.2 + error

Specifying the model:

model <- gam(y ~ x1 + s(x2))
summary(model)

Family: gaussian 
Link function: identity 

Formula:
y ~ x1 + s(x2)

Parametric coefficients:
        Estimate Std. Error t value Pr(>|t|)    
(Intercept) 278.1947     8.1994  33.929  < 2e-16 ***
x1            3.0458     0.4002   7.611 2.43e-11 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
        edf Ref.df    F p-value    
s(x2) 7.065  7.065 5790  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.998   
REML = 763.07  Scale est. = 106.55    n = 100

We can fit fixed or penalized splines using the fx argument of the s() function. Under the hood, gam() expands each smooth term into a set of basis functions of your variable and estimates a coefficient for each, so the final smooth is a sum of several terms built from the same variable (analogous to including x1 + (x1)^2 + (x1)^3 in a linear regression).
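
As a rough illustration of that analogy (a sketch only; mgcv actually uses penalized regression spline bases, not raw polynomials), an unpenalized cubic fit in plain lm() also estimates one coefficient per term built from the same variable:

```r
# Unpenalized polynomial analogy (illustrative only; not mgcv's actual basis):
# one input variable expands into several terms, each with its own coefficient.
set.seed(1)
x2    <- rnorm(100, 5, 2)
error <- rnorm(100, 0, 10)
y     <- 10 * x2^2 + error

poly_fit <- lm(y ~ x2 + I(x2^2) + I(x2^3))
coef(poly_fit)  # intercept plus one coefficient per term
```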

Penalized splines are the default and tend to be more useful: a wiggliness penalty shrinks the fit, so the optimal amount of smoothing is estimated from the data and the resulting spline uses fewer effective degrees of freedom. Fixed splines fit all of the generated basis terms without penalization, which is costly in terms of degrees of freedom.

This is an example of fitting a fixed spline on the same data; notice the difference in the degrees of freedom for the smooth terms:

model <- gam(y ~ x1 + s(x2, fx = TRUE))
summary(model)

Family: gaussian 
Link function: identity 

Formula:
y ~ x1 + s(x2, fx = T)

Parametric coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) 300.1657     6.7852  44.239  < 2e-16 ***
x1            2.8795     0.3457   8.331 9.76e-13 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
      edf Ref.df    F p-value    
s(x2)  10     10 4443  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.998   Deviance explained = 99.8%
GCV = 121.23  Scale est. = 106.68    n = 100

Fitting a categorical by non-linear interaction

You can fit a categorical by non-linear interaction in GAM just the same way you would in a regular linear regression, except that you need to pass your categorical variable as an ordered factor. Please note that this method of fitting in gam only supports two groups; with more than two groups it will break and you will need to rely on a different model.

If you pass the group as a plain (unordered) factor, gam will give you two individual splines and will not test the interaction term, only the significance of each spline individually (i.e. you won't be able to tell whether there is a significant group by non-linear interaction, only whether there is a non-linear trend in each group).

This is an example of how to fit these interactions:

set.seed(1)
x2 <- rnorm(100, 5, 2)
error <- rnorm(100,0,10)
y1 <- 10*(x2)^2 + error
y2 <- (x2)^3 + error
x2 <- c(x2,x2)
y <- c(y1,y2)
category <- ordered(c(rep(1,100), rep(2,100)))
model2 <- gam(y ~ s(x2) + s(x2, by=category))
summary(model2)
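
To read off the interaction test from a fit like the one above, you can inspect the smooth-term table that summary() returns (s.table is a standard component of summary.gam objects; the second row, the by-variable difference smooth, tests the group difference in the non-linear trend):

```r
library(mgcv)

# Same dummy data as above: two groups with genuinely different curves
set.seed(1)
x2 <- rnorm(100, 5, 2)
error <- rnorm(100, 0, 10)
y1 <- 10 * x2^2 + error
y2 <- x2^3 + error
x2 <- c(x2, x2)
y  <- c(y1, y2)
category <- ordered(c(rep(1, 100), rep(2, 100)))

model2 <- gam(y ~ s(x2) + s(x2, by = category))

# Row 1: the overall smooth of x2; row 2: the ordered-factor difference
# smooth, whose p-value is the test of the group by non-linear interaction
summary(model2)$s.table
```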

Fitting a linear by non-linear interaction

In order to fit a linear by non-linear interaction, we need to rely on parametric bootstrap tests to obtain a valid p-value. An example of how to do this can be found in the Model Comparison wiki entry here.

You should not try to test linear by non-linear interactions without the pbkrtest package, since the p-values reported by gam are not reliable in this case.
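
As a rough sketch of the idea only (hypothetical variable names and dummy data; see the Model Comparison entry for the approach we actually use), a parametric bootstrap compares a model containing a linear-by-smooth interaction, written as s(x2, by = x1) for a continuous x1, against a null model without it, refitting both on data simulated from the null:

```r
library(mgcv)

# Dummy data with a genuine x1-by-x2 interaction (illustrative only)
set.seed(1)
x1  <- rnorm(100, 20, 3)
x2  <- rnorm(100, 5, 2)
y   <- 3 * x1 + 10 * x2^2 + x1 * x2 + rnorm(100, 0, 10)
dat <- data.frame(y, x1, x2)

# Null model vs. model with the linear-by-smooth (varying coefficient) term
m0 <- gam(y ~ x1 + s(x2), data = dat)
m1 <- gam(y ~ x1 + s(x2) + s(x2, by = x1), data = dat)
obs_stat <- deviance(m0) - deviance(m1)

# Parametric bootstrap: simulate Gaussian responses from the null fit
# (fitted values plus noise at the null's scale estimate), then refit both
n_boot <- 99  # use many more in practice
boot_stat <- replicate(n_boot, {
  dat$y_sim <- fitted(m0) + rnorm(nrow(dat), 0, sqrt(m0$sig2))
  b0 <- gam(y_sim ~ x1 + s(x2), data = dat)
  b1 <- gam(y_sim ~ x1 + s(x2) + s(x2, by = x1), data = dat)
  deviance(b0) - deviance(b1)
})

p_boot <- (1 + sum(boot_stat >= obs_stat)) / (1 + n_boot)
p_boot
```

The (1 + ...) / (1 + n_boot) form keeps the bootstrap p-value away from exactly zero, which is the usual convention for Monte Carlo tests.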

Plotting GAMs

There are two different ways of plotting GAMs: vis.gam() and plot.gam(). Here are some example outputs:

Loading Data

set.seed(1)
x1 <- rnorm(100, 20, 3)
x2 <- rnorm(100, 5, 2)
x2.2 <- (x2)^2
error <- rnorm(100, 0, 10)
y <- 3*x1 + 10*x2.2 + error
model <- gam(y ~ x1 + s(x2))

Using vis.gam()

We start off with vis.gam(). This function creates a 3D surface of two regressors of your choice against the model's prediction. You can choose which regressors to plot with the view argument:

vis.gam(model, view=c("x1","x2"))

This is an example of the type of plots you can generate with vis.gam()

Example plot generated from vis.gam()

Using plot.gam(model)

The second way of plotting gam models is plot.gam(). This function creates 2D plots of each smooth term's estimated effect against its regressor. It will create all of the plots once you call the function, and you will need to press enter several times to page through them.

plot.gam(model)

This is an example of the type of plots that can be generated using plot.gam

Example for plot.gam

Reporting Results

The parametric coefficients can be reported in a similar fashion to a linear regression. I haven't been able to find any documentation of a standard way to report the smooth terms; however, in the past we have reported the p-values of the smoothing splines.
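
For what it's worth, the numbers usually reported (spline p-values, effective degrees of freedom, adjusted R-squared) can be pulled straight out of the summary object; p.table, s.table, and r.sq are standard components of summary.gam (shown on the dummy model from above):

```r
library(mgcv)

# Refit the dummy model from the first example
set.seed(1)
x1 <- rnorm(100, 20, 3)
x2 <- rnorm(100, 5, 2)
error <- rnorm(100, 0, 10)
y <- 3 * x1 + 10 * x2^2 + error
model <- gam(y ~ x1 + s(x2))

s <- summary(model)
s$p.table   # parametric coefficients (report like a linear regression)
s$s.table   # smooth terms: edf, Ref.df, F, p-value
s$r.sq      # adjusted R-squared
```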