Testing for Nonlinearity in the Data - PennBBL/groupAnalysis GitHub Wiki

We can test for non-linearity in the data by using the restricted likelihood ratio test. In order to do this, we use the exactRLRT() function from the RLRsim R package. This function provides a restricted likelihood ratio test using simulated values from the finite sample distribution to test for testing whether the variance of the random effect is zero. In this case, In terms of generalized additive models, we need to either use gamm4() or gamm() from the mgcv package in order to test for non-linearity using the restricted likelihood ratio test.

#Testing Cross-sectional Data

For testing non-linearity in cross-sectional data, we can use exactRLRT(). In this case, the null will be that the variance of the random effect is equal to zero (or linearity of the model).

The mechanism behind gamm() and gamm4() is to fit the linear regressor as a fixed effect using a linear mixed effects model and the non-linear term will be fitted using the random effects, so if the variance of the random effects equal zero then there's no non-linearity in the fit.

The following code illustrates the implementation of the exactRLRT():

Create dummy dataset

set.seed(1)
x2 <- rnorm(100, 5, 2)
x2.2 <- (x2)^2
error <- rnorm(100, 0, 20)
y <- 10*x2.2 + error

Fit Model

model <- gamm(y ~ s(x2))

Test using the exactRLRT Function

exactRLRT(model$lme, nsim=300)

simulated finite sample distribution of RLRT.
	
(p-value based on 10000 simulated values)

data:  
RLRT = 5.4561, p-value = 0.006

In this case, we reject the null hypothesis in favor of the non-linearity of the model (which makes sense given that we have a quadratic regressor).

#Testing for longitudinal Data

For this section we will require the SPQ_VOL_TOY.csv dummy dataset. This first section deals with loading and cleaning the data and can be skipped.

Loading and cleaning data. This section can be skipped

# Read Data
################################################################################
data.analysis <- read.csv("TOY_VOL_DATA.csv")
data.analysis <- data.analysis[, -1]
data.analysis <- data.analysis[union(which(data.analysis$dx_group == "CRCR"), which(data.analysis$dx_group == "TDTD")), ]

# Define index of dependent variable
################################################################################
i=19
data.analysis <- data.analysis[!is.na(data.analysis$matedu), ]
var.name      <- names(data.analysis)[i]


# Getting rid of outliers, anyone that has a change above 3SD of the mean
# Getting rid of anyone that doesn't have the two timepoints
################################################################################
temp.outlier      <- data.analysis[, c(1,2,i)]
temp.outlier.wide <- reshape(direction="wide", temp.outlier, 
                             timevar = "timepoint", idvar = "bblid")
temp.outlier.wide$diff <- (temp.outlier.wide[, 2]- temp.outlier.wide[, 3])

outliers          <- temp.outlier.wide[is.na(temp.outlier.wide$diff), 1]
temp.outlier.wide <- temp.outlier.wide[!is.na(temp.outlier.wide$diff), ]
outliers <- union(outliers, temp.outlier.wide[which(temp.outlier.wide$diff > mean(temp.outlier.wide$diff) +     3*sd(temp.outlier.wide$diff)), 1])
outliers <- union(outliers, temp.outlier.wide[which(temp.outlier.wide$diff < mean(temp.outlier.wide$diff) - 3*sd(temp.outlier.wide$diff)), 1])
outliers <- union(outliers, 117595) #massICV is less for t2 than t1

for (j in 1:length(outliers)) {
  data.analysis <- data.analysis[-which(data.analysis$bblid == outliers[j]), ]
}

# Data Cleaning 
################################################################################

# Passing sex and diagnosis as ordered, doing further data cleaning
# Adjust day difference as 0 for t1
# Pass race as only white/other
# Female is higher here
data.analysis$sex      <-ordered(as.character(data.analysis$sex), level=c("male", "female"))
data.analysis$go2DaysFromGo1Scan <- as.numeric(as.character(data.analysis$go2DaysFromGo1Scan))
data.analysis$go2DaysFromGo1Scan[which(data.analysis$time == 1)]  <-  0
data.analysis$race[which(data.analysis$race == "black")]          <- "other"
data.analysis$race     <- as.factor(as.character(data.analysis$race))
data.analysis$dx_group <- ordered(as.character(data.analysis$dx_group))
data.analysis$dx_diag  <- ordered(as.character(data.analysis$dx_diag), 
                                  levels=c("CR","TD"))
data.analysis$dx_sex <- 0
data.analysis$dx_sex[which(data.analysis$sex == "male" & data.analysis$dx_diag == "TD")] <- 1

Testing for Non-linearity in the data

Define Formulas for the big and small models

covariates  <- " ~ s(age)"
covariates.small <- "~ + age"
m.big    <- as.formula(paste(var.name, covariates, sep=""))
m.small <- as.formula(paste(var.name, covariates.small, sep=""))

Run full model and Model with only the random effects to be tested

gam.big <- gamm4(m.big, data=data.analysis, random = ~ (1|bblid))
gam.smooth <- gamm4(m.big, data=data.analysis)

Get the design matrix and copy the normalized data for age

big.matrix <- getME(gam.big$mer, "X")
data.analysis$age <- big.matrix[, 2]

This step is necessary in order to be able to run the model. You need to unwrap the normalized fixed data for your variable of interest from the big model and pass this as a fixed effect in the reduced version. In this case, you need to call getME to obtain the fixed effect matrix and identify the column that corresponds to your variable of interest.

Run reduced model

gam.small <- gamm4(m.small, data=data.analysis, random = ~ (1|bblid))
summary(gam.small$mer)

Sanity Check Step

small <- getME(gam.small$mer, "X")
big <- getME(gam.big$mer, "X")
any(small != big)

If you extracted the normalized data from the big model and used it to run the small model, then you should get a false here. If you get true, exactRLRT will fail.

Run Exact RLRT

exactRLRT(gam.smooth$mer, gam.big$mer, gam.small$mer)