lme4: Mixed-effects modeling in R

Overview

This wiki page details the chapters from lme4: Mixed-effects modeling with R by Douglas M. Bates. The goal is to provide a more concise version of the book, highlighting the information that is most relevant and skipping over material that is not strictly needed. The emphasis is on how to use these models in practice and how to specify them correctly.

Chapter 1: A Simple, Linear, Mixed-effects Model

  • The difference between a fixed and random effect is more about the property of the levels of the categorical covariate than a property of the effects associated with them.
  • Secondly, there is a distinction between "fixed-effects parameters", which are actual parameters estimated by the model, and "random effects", which are technically not parameters. The variance of the random effects is a parameter.
  • Another way to differentiate fixed from random effects is to think about what would happen if more data (observations) were included. If the number of unique levels increases, that is characteristic of a random effect (ex: adding more students increases the number of levels of the student factor). If the levels do not increase (ex: Sex), that is characteristic of a fixed effect.
  • Mixed models incorporate both fixed and random effects- any model with a random effect must always have a fixed-effects component (at minimum the intercept).
  • We can visualize the variability of a factor (like one we will model as a random variable) prior to running the model to get a feel for between-factor variability and within-factor variability.
  • It is also good practice to see how many responses are nested within a factor (or several); this can easily be investigated with the xtabs() function.
  • When running mixed-effects models, the "effect" of a factor that we model as random is NOT of interest to us- if it were, we would model it as a fixed effect. Instead, we care much more about modeling or predicting the outcome while controlling for the between-factor and within-factor variability of the random effects.
  • Fun fact: it is more difficult to model small random effects than larger ones. Just because the model produces a 0 for the variance of a random effect does not necessarily mean that there is no variability between the levels of the factor. It just means that it is very weak, and therefore we can justify removing the variable as a random effect.
  • By default, when fitting a linear mixed model in R with the lmer() function, the estimates are produced through Restricted Maximum Likelihood (REML). This is the preferred approach since it produces more accurate variance estimates. However, it is not appropriate for model comparison- in those cases Maximum Likelihood (ML) is essentially required. Thus, models should all be fit with ML, compared to each other to find the best one, and then the best one refit using REML (see the short sketch after this list).
  • When fitting a simple model with one random effect, let's say modeling a factor with random intercepts, this will produce two types of variance, one that reflects between-factor variability and another that is the within-factor variability. When fitting an empty model, one without any fixed effects with the exception of the intercept, these estimated variances for the random effects basically indicate how much of the variability of the outcome can be attributed to the group aka cluster level (calculating the ICC).
  • When fitting mixed models, we can think of the model being estimated iteratively, with each iteration producing slightly different estimates. Eventually the estimates become more and more similar, and once they are extremely close the fitting stops and these estimates are displayed. This means the model was successful, i.e. it converged. You can print out these iterations by specifying verbose=10L within lmer().
  • As mentioned, the random effects themselves are not parameters, but their variance or standard deviation is. However, we can still calculate what are essentially 'estimates' for each level of the factor. While it is easier to conceptualize these as estimates, they are technically called conditional modes and are related to the point of maximum density, something a bit complicated that I don't understand. Anyway, we can extract this information using the ranef() function.
  • The ranef() function will always produce the conditional modes for all the levels of all random variables specified in the model.
  • Other terminology to be familiar with is the scalar random-effects term. Basically, if we have one random-effects term in the model, specified as (1|Factor), our random effect is scalar or simple. In the next chapters we will specify models with multiple simple scalar random-effects terms and models that go beyond that.
  • We can visualize the conditional modes and their corresponding 95% prediction intervals using the dotplot() function from the lattice package. However, when you have many levels for your random effects term, use the qqmath() function instead.
  • Overall, this chapter is more about how to fit simple random-effects models without any predictors. It also gives some information about what these models are essentially doing, why they are important, and some technical details about model comparison and extracting information from the random effects.
  • Disclaimer: A good chunk of information related to profile zeta was left out on purpose by me because a) who cares and b) it seems impractical to check with large models.
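
As a minimal sketch of the ML-then-REML workflow mentioned above (not code from the book; the object names are my own, and I borrow the sleepstudy data that appears in Chapter 3): fit the candidate models with ML, compare them, then refit the winner with REML for reporting.

# Hedged sketch: compare candidate models with ML, then refit the chosen one with REML
m_simple  <- lmer(Reaction ~ 1 + Days + (1 | Subject), sleepstudy, REML = FALSE)
m_complex <- lmer(Reaction ~ 1 + Days + (1 + Days | Subject), sleepstudy, REML = FALSE)
anova(m_simple, m_complex)                 # likelihood-ratio comparison requires ML fits
m_final <- update(m_complex, REML = TRUE)  # refit the preferred model with REML for reporting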

Datasets and code

Datasets to familiarize ourselves with are those provided by the lme4 package, namely Dyestuff and Dyestuff2. They both have the same variable names and the same number of observations- the main difference is that the former shows noticeable batch-to-batch variability (the grouping factor) while the latter does not.

These datasets can basically be used to investigate what is contributing to the variability in dyestuff yield. Since there is only one factor in the dataset, we can basically use this factor to see how different types of batches contribute to that variability and estimate what the average yield of dyestuff will be (the intercept).

# Load in packages
library(tidyverse)
library(ggplot2)
library(lme4)
library(lattice)

# Attach the Dyestuff datasets
data("Dyestuff")
data("Dyestuff2")

# View the dataset Dyestuff
str(Dyestuff)
head(Dyestuff)
summary(Dyestuff)
xtabs(~ Batch, Dyestuff)

# View the dataset Dyestuff2
str(Dyestuff2)
head(Dyestuff2)
summary(Dyestuff2)
xtabs(~ Batch, Dyestuff2)
Dyestuff descriptives Dyestuff2 descriptives

Below we can clearly see the distinction between the two datasets- not so much in terms of the scale of the units, but visually in how much less between-batch variability there is in Dyestuff2. This does not mean that there is none, but the variability in the response is clearly dominated by within-batch variability.
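The book's plotting code is not reproduced here; a rough ggplot sketch of my own that produces a comparable batch-to-batch view for each dataset could look like this:

# Hedged sketch: dot plots of Yield by Batch for each dataset (my own version, not the book's figure)
ggplot(Dyestuff, aes(x = Yield, y = reorder(Batch, Yield))) +
  geom_point() +
  labs(y = "Batch", title = "Dyestuff") +
  theme_classic()
ggplot(Dyestuff2, aes(x = Yield, y = reorder(Batch, Yield))) +
  geom_point() +
  labs(y = "Batch", title = "Dyestuff2") +
  theme_classic()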

Batch-to-Batch Variability Dyestuff Batch-to-Batch Variability Dyestuff2

Fitting the models

# Fitting the models
summary(fm01 <- lmer(Yield ~ 1 + (1|Batch), Dyestuff))
summary(fm02 <- lmer(Yield ~ 1 + (1|Batch), Dyestuff2))

Below are the outputs of the same model applied to the two datasets. Notice that these models basically do not have any predictors; however, there is a mandatory estimate for the intercept, which is the only fixed effect present. Both models used REML to produce estimates, as it is the default. We get estimates for the variance/standard deviation of Batch, the residual variance, and the estimate for the intercept- so three things are being estimated in these models. Additionally, we see the message 'boundary (singular) fit: see help('isSingular')' for the second model, which means that the model produced a 0 for the random-effects variance- this is common when the effect is very weak, and it can be used to justify its removal, saving a degree of freedom (or something like that).

Dyestuff Model Output Dyestuff2 Model Output
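Since this is an intercept-only model, the variance components can be turned into the ICC mentioned in the bullets above. A minimal sketch (using the fm01 fit above; the object names are my own):

# Hedged sketch: ICC for the Dyestuff model, i.e. the share of Yield variability attributable to Batch
vc <- as.data.frame(VarCorr(fm01))       # one row per variance component
between <- vc$vcov[vc$grp == "Batch"]    # between-batch variance
within  <- vc$vcov[vc$grp == "Residual"] # within-batch (residual) variance
between / (between + within)             # the ICC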

Visualizing the random effects

# Extracting random effects
ranef(fm01)$Batch
ranef(fm02)$Batch

# Plotting the random effects (conditional modes and their 95% prediction intervals)
dotplot(ranef(fm01, condVar=TRUE), strip = FALSE)
qqmath(ranef(fm01, condVar=TRUE), strip = FALSE)

We can use the ranef() function to produce conditional modes, one for each level of the factor functioning as the random variable and we can also plot it using dotplot() from the lattice package or qqmath(). We will do this for the Dyestuff dataset since it produced a non-zero variance estimate. Notice the plots below also produce 95% prediction intervals.

dotplot() of Dyestuff qqmath() of Dyestuff

Chapter 2: Models with Multiple Random-effects Terms

  • In this chapter we will be fitting models with more than just one random effects term. Specifically, we will be adding multiple simple scalar random-effects terms.
  • If we have two random effects terms (grouping factors) in a model, we can classify them as crossed, nested, or partially crossed.
  • Beginning with crossed, this means that there is data for the response variable in all levels of the first grouping factor, all levels of the second grouping factor, and all level combinations of both factors. The easiest way to check this is by using the xtabs() function, and seeing whether there are any 0's in any of the factor level combinations. If there are no 0's then the data is crossed, if there are no 0's and there is an equal number of observations across all level combinations, then we say it is also balanced.
  • However, it is rare and extremely unlikely that we will have a perfectly balanced dataset- which is one reason mixed models, which tolerate unbalanced and missing data, are so useful.
  • Nested factors are the opposite of crossed: the levels of one factor are not present in more than one level of the second factor. An easy example is a dataset where the response variable is student math test scores and the random-effects terms are students and teachers. If each teacher has a unique set of students, meaning there is no overlap in students between classrooms, then we would consider these nested factors, which when using xtabs() would produce several 0's across the combination levels of teachers and students.
  • Nested factors are also what give rise to calling mixed models hierarchical linear models or multilevel models. The nesting is what creates 'levels', where these do not correspond to the levels of a factor- instead they refer to predictors measured at the lower level (nested within the levels of a higher-order factor) and predictors attributed to the higher-order factor. This is confusing to think about, but it is a very common way to explain mixed models since it is a popular but special case of them.
  • It is very important to differentiate between crossed and nested factors in order to specify the mixed models correctly. Failing to do so will produce unintended random effect variance and model your outcome incorrectly.
  • However, it is also important to mention that most datasets, especially longitudinal datasets are going to contain partially crossed random effects- so it is also important to know what that means and how to specify that correctly into a model.

Crossed dataset

To demonstrate a dataset with crossed random effects, we will use the Penicillin dataset where there is one response variable 'diameter', which measures the diameter of space within a petri dish where bacteria did not grow, and two factors which are plate with 24 levels (24 petri dishes were prepared) and sample with 6 levels (6 different types of penicillin were used within each petri dish). Therefore, we see that the factors are crossed because all 6 samples of penicillin are present in all 24 petri dishes.

# Attach dataset
data("Penicillin")

# View the data
str(Penicillin)
summary(Penicillin)
xtabs(~ sample + plate, Penicillin)

Below we have some descriptions of the dataset. We can see the three variables that make up the Penicillin dataset and the two factors along with their respective numbers of levels. With summary() we can get some quick information about the spread and central tendency of the outcome (diameter). Lastly, the output from xtabs() shows that our dataset is both crossed and balanced.

Data description

Next we can visualize both factors simultaneously in relation to our outcome. Again, we are modeling these factors as random effects, so we are less interested in seeing which factor level is associated with an increase or decrease in the outcome. Instead, we care more about visually capturing the variability between factor levels- both for plate and for sample. Visually, of the two, sample seems to be the one that produces the most variability overall.
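The plotting code is not given in this section; a hedged ggplot sketch of my own, assuming we want one point per plate-sample combination with lines connecting each sample across plates, could be:

# Hedged sketch: diameter by plate, colored by sample, one line per sample
ggplot(Penicillin, aes(x = reorder(plate, diameter), y = diameter,
                       color = sample, group = sample)) +
  geom_point() +
  geom_line(alpha = 0.5) +
  labs(x = "Plate (ordered by mean diameter)") +
  theme_classic()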

Visualizing the Random Effects in Relation to Diameter

Fitting Model for Crossed dataset

Fitting the model is pretty easy in this case, all we do is just add two random effects terms.

summary(fm03 <- lmer(diameter ~ 1 + (1|plate) + (1|sample), Penicillin))

Below we show the output of the model that used REML. Here we can compare the variability estimates between both random effects terms. As expected, we can see that the variability of sample is much larger than that of plate, indicating that it plays a larger role in the outcome's variability.

Model Summary Output

Nested dataset

To introduce the nested dataset, we will be using the Pastes data. This dataset is interesting because it is purposely organized in a way where, if you do not know what you are doing, you will specify the model incorrectly. This is because the dataset has a variable named 'cask', which is an implicitly nested factor. This variable has three levels, but these levels SHOULD NOT be assumed to be the same across the levels of the second factor. Failing to know that will produce the wrong model specification. This variable was purposely created like this to be confusing, since these types of implicitly nested factors are common both in textbooks and in real datasets.

Anyway, this dataset looks at the strength of different pastes. The data were sampled by picking 10 deliveries at random. Each delivery has several casks (barrels), and within each cask is a bunch of paste. Three casks were picked at random from each delivery and labeled 'a', 'b', and 'c', but the labels are not related across deliveries at all (the implicitly nested factor). From each cask, the paste was then sampled and measured twice. The variable sample identifies each unique delivery-cask combination, and each sample has two strength measurements.

# Load the data for the nested random-effects example
data("Pastes")

# View the data
head(Pastes)
str(Pastes)
summary(Pastes)

# See how observations are distributed across the factor level combinations
xtabs(~ batch + sample, data = Pastes, sparse = TRUE) # zeros print as '.'
xtabs(~ cask + batch, data = Pastes)

Below we have descriptions of the data. We start with the summary() function, which gives us some descriptives of the spread and central tendency of the outcome (strength, as in how well the paste works). We see that there are four variables in this dataset: one response variable and three factors. Two of these factors are of interest to us (batch & sample) and the other is there to cause confusion (cask). However, we will show later how we can use cask to correctly specify the model for nested data.

Additionally, we show how we know the data are nested by using xtabs(): we can see all the zeros (printed as '.' to reduce the size of the table) in the level combinations of sample and batch. We also show the output of xtabs() using cask, and how this can create the illusion of crossed data when in reality it is not!

Description of the data

The graph below shows how the outcome (strength) varies across levels of batch; for example, batch E produces lower strength while batch A overall produces higher strength. Again, we do not necessarily care about the effects of batch, we are just pointing out that there is variability between the levels of this factor. We can also see differences across batches in the within-batch variability: strength is more spread out for batch I and much more consistent for batch A and batch J. Lastly, within a batch, the colors representing cask show that the two measurements from the same sample are much closer to each other than to measurements from other casks. This is very obvious in batch D, where the blues sit close together at the top and the reds sit close together at the bottom- indicating very little variability between strength measurements of the same sample.
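The code for this figure is not given in the chapter; a rough hedged sketch of my own that conveys the same information could be:

# Hedged sketch: strength by batch, with cask shown in color
ggplot(Pastes, aes(x = batch, y = strength, color = cask)) +
  geom_point(position = position_dodge(width = 0.4)) +
  theme_classic()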

Visualizing the Implicitly Nested Random Effects

Fitting the Model for Nested Data

# Fit the models (Maximum Likelihood)
summary(fm04 <- lmer(strength ~ 1 + (1|sample) + (1|batch), Pastes, REML=FALSE)) # Estimate the variance of batch
summary(fm04a <- lmer(strength ~ 1 + (1|sample), Pastes, REML=FALSE)) # Do not estimate the variance of batch
summary(fm04b <- lmer(strength ~ 1  + (1|cask:batch) + (1|batch), Pastes, REML=FALSE))

# Model comparison of including random effects of batch
anova(fm04a, fm04) # Not significant

First I want to deviate from the book to compare both ways of specifying a nested mixed model (aka hierarchical model or multilevel model). Essentially, we need to really think about what we want the model to produce, and that goes back to what our data basically are. We have two measurements for each sample of paste. There are exactly 30 unique samples in our dataset, so they should all have a random intercept. Additionally, these samples are nested within levels of delivery batch, meaning every sample belongs to a single delivery batch, and that should also be modeled. Therefore, we would expect the model to report 60 observations, 30 random intercepts (one for each unique sample), and 10 random intercepts (one for each delivery batch). Any values deviating from this mean the model was specified incorrectly.

Below we have two models that are essentially producing the same thing. In the first model, we include the variable sample as a random effect along with batch, thus producing the correct number of random intercepts. In the second model, we specify (1|cask:batch); this is basically the same as generating the sample variable within the model specification, because it produces the cask and batch level combinations, which have the same number of levels as sample.
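A quick hedged check of that claim (my own sketch, not from the book):

# Hedged sketch: cask:batch produces the same 30 groupings as the sample variable
length(levels(Pastes$sample))                                # 30 unique samples
nlevels(with(Pastes, interaction(cask, batch, drop = TRUE))) # also 30
# lme4 also accepts the nesting shorthand (1|batch/cask), which expands to (1|batch) + (1|batch:cask)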

Having sample as a Random Effect Term Indirectly Creating sample and Having it as a Random Effect Term

The next thing to discuss is the output of the random effects. Something very important, which I will not go into detail on here, is that the variance of the random effects for batch is very small compared to that of sample. On further inspection with profile zeta plots, it looks like there are instances where this estimate can be 0. This gives us enough probable cause to test whether this variable actually improves model fit at all. We can test this by specifying another model that does not have batch as a random effect and then comparing it to the model that does. If the p-value from the comparison is not significant, then the more complicated model does not significantly improve fit- thus the more parsimonious model would be the best one to interpret.

Below we showcase the model comparison using the anova() function, and the p-value was not significant. Thus, we should go with the simpler model that does not have batch as a random effect.

Model Comparison (Batch as Random Var vs as Removed)

Partially Crossed dataset

To examine a partially crossed dataset, we will be using the quite large InstEval dataset. This dataset is only brushed over in this chapter because it will be used again in the following chapter to examine fixed effects. This is basically a cross-sectional dataset where thousands of students rated a little over a thousand teachers in terms of how good they are (y). There are many factors in this dataset, some of which we can model as random effects and others as fixed effects. Following what the chapter outlined, we will only model random effects.

# Load the data
data("InstEval")

# View the data
str(InstEval)
summary(InstEval)

# Tabulate the outcome, the department-service combinations, and the student-teacher pairs
xtabs(~ y, InstEval)
xtabs(~ dept + service, InstEval)
xtabs( ~ s + d, InstEval, sparse = T)

Below we get some information about our dataset. We can see that there are 73,421 observations, which is a lot! And we have information from 7 variables. One of these is our outcome of interest (y), which contains students' ratings of teachers on a 5-point Likert scale. Given the very large number of observations, students clearly rated multiple times and teachers were rated multiple times- however, the data are not fully crossed, or else each teacher would have been rated by each student, and that is not the case. Additionally, the data are not nested, because that would mean each student could only rate one teacher, which is also not true here.

The summary for many of the variables below is not that important; the best look at the data definitely comes from str(). Lastly, we can see the frequency of the different ratings in the data. Even though there are 5 ordinal categories, given how many observations we have we can model this variable as continuous.

Data Description

While above we could logically deduce that our data are partially crossed, we can also easily verify this using the xtabs() function. Below we first see that there are many 0's in the level combinations, which could point to a nested structure. With this type of data, a nested structure would mean that each teacher's students were only taught by them, so a student could only rate one teacher. However, we see in row 3 (a student) that there are several 1's, indicating this student rated multiple teachers- thus there is no nesting in the data and instead we have partially crossed factors.

Frequency of Level Combinations Between Student and Teacher

Below is a table of the combinations of levels between Department and Service. This is shown because the textbook is going to use this combination of levels to generate random intercepts. It is not really explained why this was done, but it does show that we have a lot of freedom when specifying the random-effects terms of our model.

Department Service Interaction

Fitting the Model for Partially Crossed Data

Below we fit the model for the partially crossed data. We can see that three different terms were specified for the random effects, so their levels will all be given random intercepts and the variances of those intercepts will be estimated by the model. The first two terms are pretty straightforward: s is for student while d is for teacher. We also included dept:service, which gives every combination of the levels of those two factors its own intercept. As mentioned, I am not sure why this was specified; I think it was to prove a point later on in terms of model comparison.

# Fitting the model
summary(fm05 <- lmer(y ~ 1 + (1|s) + (1|d) + (1|dept:service), InstEval, REML=FALSE))

Below are the results of our empty model with three random-effects terms. We see that each one has its own estimated variance for the intercepts. Compared to the residual variance, they are quite small. Lastly, and most importantly, we can see exactly how many random intercepts were produced by the model: 2972 for students, 1128 for teachers, and 28 for department-service combinations. This looks correct, telling us that our model was specified correctly. This information should capture the 73,421 observations well (I think).

Model Summary Output

I will not show this part with code because it is not that important, but basically in the textbook another model was specified where (1|dept:service) was removed. The initial suspicion is that its variance is so small that it is probably not contributing anything to the model fit. However, that turned out not to be true, since the model comparison produced a significant p-value. The author says that this discrepancy is due in part to the large sample size (he literally means the observation count of 73,421 and calls that the sample size, not the number of students or teachers). The author then mentions that the researcher should decide whether or not this term is meaningful to model regardless of whether it was significant. Additionally, we can always change up the random effects if we want to- it does not have to be specified this exact way.

Chapter 3: Models for Longitudinal Data

  • This chapter focuses on longitudinal data, which is basically data taken from the same subjects across time periods.
  • We want to characterize the response across time both within subjects and between subjects (I am currently unsure exactly what this means; however, it makes sense to think of subjects as a factor where an average slope is produced by the fixed effects and the variance of the varying slopes is captured as a random effect, plus allowing intercepts to vary to capture differences in baseline performance, i.e. where the predictor is equal to 0).
  • A popular dataset to practice analyzing longitudinal data is the sleepstudy dataset, where there are average measures of reaction time for 18 subjects at 10 different days (time points), where each day the subject gets only 3 hours of sleep.
  • For this type of data, if the sample size allows it (a smaller sample size is preferred for this), we can plot the time vector on the x-axis, the response variable on the y-axis, and facet by subject (or the experimental unit). Each facet would show the within-subject relationship via a simple least-squares line fit. Additionally, it is good to arrange the plots by increasing or decreasing y-intercept, which could hint at a relationship between starting ability and the slope (ex: people who perform better on Day 1 are less likely to have a negative response to sleep deprivation). Also, make sure the x and y axes remain constant across the facets so comparisons are meaningful. This is known as the Trellis graphics principle of plotting. It can be done using functions from the lattice package, but I am unsure how to use them- they are covered in their own separate textbook.
  • Just like we have done for previous chapters, it is good to use xtabs() to identify if data are present for each subject across all time points.
  • Additionally, while not explicitly mentioned in the chapter, a couple of features are important to know when specifying these models. For starters, when modeling time with random slopes, it must be a numeric variable and not a factor. Also, to make sense of the varying slopes you must also include the slope as a fixed effect. Thus, at minimum, two fixed effects need to be specified in mixed models for longitudinal data.
  • By default, when adding random slopes to the model, lmer will correlate the slopes with the intercepts. The model can be specified to prevent this if justifiable; in cases where the correlation is close to 0 this is okay to do and will produce a simpler model.

Longitudinal dataset

This chapter only uses data from the sleepstudy dataset, so this is what we will be using to learn how to model this type of data.

# Load the data
data("sleepstudy")

# View the data
str(sleepstudy)
xtabs(~ Days + Subject, sleepstudy, sparse = T)

# Plot the data
sleepstudy %>%
  ggplot(aes(x = Days, y = Reaction)) +
  facet_wrap(~Subject) +
  geom_point() +
  geom_smooth(method = "lm", color = "black", se = F) +
  theme_classic()

Things get a little confusing with this dataset because it is the first time we see data with one factor (Subject) and two numeric variables. Yes, the variable Days in this dataset is numeric and should be thought of as such, mainly because we need it to produce a slope representing the relationship between itself and the outcome (reaction time). We can also think of this variable as a factor and investigate how much data was collected for each level of Subject and each level of Days. Thus, below we can see that this dataset, when the time variable (which again is numeric) is treated as a factor, can also be thought of as crossed. Yes, this is confusing.

Data Description

Below we can see the relationship between days and reaction time for each subject, which can serve as a visualization of the conditional modes (slopes) for each person (in the model these per-person values are not estimated directly). Additionally, we can see how subjects vary in baseline performance (the y-intercept), with some subjects being more resistant to the effects of sleep deprivation on Day 0 than others (as long as the Days predictor is not centered, the variation in the random intercepts will represent this- I think).

Within and Between Subject Reaction Times in Response to Days

Fitting the model (correlated random effects)

For this chapter, we are basically comparing two models: one with the random effects correlated and another where they are not. There is little explanation of the logic of anything else done in the specification of this model, so we will do our best to fill in those gaps.

The model specification below has Subject and Days in the random effects, where Subject is the grouping factor whose levels each get an intercept, and Days provides the slope between Days and Reaction for each level of Subject. Therefore, the Days slopes will be allowed to vary randomly, and the random intercepts of reaction time will also vary across subjects. In addition, two fixed effects are modeled: the first is the intercept, which has been included in every previous example, and now also Days, which will be the overall (mean) slope between days and reaction time.

summary(fm06 <- lmer(Reaction ~ 1 + Days + (1 + Days|Subject),
                      sleepstudy, REML=FALSE))

Below we see the model summary of what we specified with maximum likelihood. We see that for the random effects there are three variances present (along with their corresponding standard deviations). The first is the variance of the random intercepts; this tells us how much reaction time varies at Day 0 across subjects- it is pretty big in comparison to the residual variance, which is the variance of the errors not explained by the model (within-subject residuals). The next is the variance of the slopes for Days; notice that (Intercept) is not present in this row, indicating that this is slope variance. This value tells us how much we expect the slopes to differ from each other- this will make more sense when we explain the fixed effects.

For the fixed effects portion, we see that there are two of them. First we start with the intercept, which is 251 milliseconds (ms). This value represents the expected reaction time for all subjects on Day 0, aka before sleep deprivation is introduced. However, we know that subjects will vary in their performance on Day 0, so we can use information from the random effects to create a window of where we would expect most people's reaction times to be. The standard deviation of the random intercepts (SD = 23.780) gives us this information: if we multiply it by 2 and subtract/add it to the intercept, we get a range where roughly 95% of people would be expected to fall [203.44, 298.56]. Again, this only applies to the variation in performance on Day 0 and cannot be assumed for any of the other days. Next, we can do the same thing to calculate the expected subject-to-subject variation in the Days slopes. The fixed-effects coefficient of Days is 10.467 and the standard deviation of the random slopes is 5.717- thus, we would expect the slope between day and reaction time for 95% of people to fall between -0.967 and 21.901 ms/day. We can also interpret the residual variance: since it has a standard deviation of 25.592, the scatter around each person's line of best fit is on average about +/- 50 ms.
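As a sketch of the arithmetic above, pulling the numbers from the fitted object instead of copying them by hand (assuming fm06 from the code block above; the helper names are my own):

# Hedged sketch: approximate 95% ranges implied by the fixed effects and random-effect SDs
fe <- fixef(fm06)                         # population intercept and Days slope
vc <- as.data.frame(VarCorr(fm06))
sd_int   <- vc$sdcor[vc$grp == "Subject" & vc$var1 == "(Intercept)" & is.na(vc$var2)]
sd_slope <- vc$sdcor[vc$grp == "Subject" & vc$var1 == "Days" & is.na(vc$var2)]
fe["(Intercept)"] + c(-2, 2) * sd_int     # roughly 203 to 299 ms at Day 0
fe["Days"] + c(-2, 2) * sd_slope          # roughly -1 to 22 ms/day across subjects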

Lastly, it was mentioned that the random effects in this model are by default allowed to correlate. This is evident in the 0.08 correlation estimated by the model. This correlation basically means that as someone's intercept increases, we would expect their slope to also increase. The issue is that this correlation is very small and likely not meaningful- aka there is essentially no relationship between intercepts and slopes. So this gives us reason to fit a new model without this aspect.

Model Summary Output

Fitting the model (uncorrelated random effects)

Below we fit a model where we do not want the intercepts and the slopes to correlate. We do this by writing the random effects for Subject in two separate terms. The first part, (1|Subject), specifies a random intercept for each level of Subject. We then add (0+Days|Subject), which specifies a random Days slope for each level of Subject; the 0+ removes the intercept from this term so it is not duplicated, and keeping the two terms separate prevents the model from estimating a correlation between the random intercepts and slopes.

# Fitting the model (uncorrelated random effects)
summary(fm07 <- lmer(Reaction ~ 1 + Days + (1|Subject) +
                        (0+Days|Subject), sleepstudy, REML=FALSE))

# Model comparison
anova(fm07, fm06)

We won't spend too much time covering the details of the model below since we already covered most of them for the previous one. The main takeaways are that the fixed effects of this model and the correlated-random-effects model are identical, the variances of the estimated random effects are a little different but close, and the correlation between the two is no longer present.

Model Summary Output

Below we see that the p-value for the model comparison is not significant- thus this means we should go with the simple model where the random effects are not correlated with each other.

Model Comparison

Viewing the Random Effects (Conditional Modes)

It does not matter which of the two models from above is used- here is just a way to visualize what is going on with the random effects. We can use the ranef() function to extract the conditional modes for all of the random effects in the model. Additionally, we can easily view these along with their 95% prediction intervals using the dotplot() function.

ranef(fm07)
dotplot(ranef(fm07, condVar=TRUE), strip = FALSE)

Below we can visualize this information- we see that there are two types of random effects, one for the intercepts and one for the slopes, and that each was calculated for every level of Subject. In the dotplot() output, the x-axes are held the same, so we can see that the variation in the intercepts is obviously much larger than in the slopes. We also get the Subject labels (aka the actual subjects) and can use this information to get a gist of their overall susceptibility to sleep deprivation (the subjects with the lowest slopes).

ranef() output dotplot() output

Mixed Effects Model vs All-Fixed Effects Model

We will now make a direct comparison between an all-fixed-effects approach and the output of the mixed model to better understand what is going on. Starting with the all-fixed-effects approach, we can use the different approaches mentioned in the book, or we can simply run a for loop where the intercept and slope are calculated from each subject's data and only their data- meaning the data are subsetted to one subject, an OLS model is run, and the intercept and slope are extracted. This is done 18 times, and what we get are 18 intercepts and 18 slopes, showing each person's within-subject estimates calculated independently of the other subjects' data.
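Rather than writing that loop by hand, one compact option (a hedged sketch of my own; the book describes the same idea differently) is lmList() from lme4, which fits a separate OLS regression per group:

# Hedged sketch: 18 separate within-subject OLS fits, one per Subject
ols_fits <- lmList(Reaction ~ Days | Subject, sleepstudy)
coef(ols_fits)            # 18 intercepts and 18 slopes, each from one subject's data only
colMeans(coef(ols_fits))  # their averages are close to the mixed model's fixed effects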

With the mixed-models approach, we can do a lot more than this. We can use all of the observations in the data to help us estimate the fixed effects and the random effects. In this case, the fixed effects would be an average intercept and an average slope, which essentially represent population estimates. You could actually calculate these same estimates from the 18 fixed-effects models by averaging the intercepts and the slopes- but the point is the mixed model does this automatically and provides p-values, aka significance testing, for both.

An extremely important concept to take from this is that mixed models are NOT estimating separate slopes and intercepts for each person, or else their conditional modes would be identical to those of the fixed-effects models. Instead, when we calculate the standard deviation of the slopes and intercepts from the all-fixed-effects models, we see that these values are larger than the standard deviations produced by the mixed model. This is because a phenomenon known as shrinkage is occurring. Basically, mixed models are able to use information from all subjects, identify the common pattern among them ("borrow strength from each other's estimates"), and then use that information to pull in the slopes or intercepts of subjects that are somewhat outliers. In a way, it is almost like the model is correcting for potential noise in the data, via a process called partial pooling, to reduce overfitting. For more information on this, please visit: Shrinkage in Mixed Models.
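Continuing the sketch above (assuming the ols_fits object and the fm06 model), the spread of the independent OLS estimates can be compared with the spread of the shrunken per-subject estimates from the mixed model:

# Hedged sketch: shrinkage means the per-subject mixed-model estimates vary less than the OLS ones
apply(coef(ols_fits), 2, sd)       # SD of the 18 independent OLS intercepts and slopes
apply(coef(fm06)$Subject, 2, sd)   # SD of the mixed model's per-subject estimates (smaller)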

Visually we can see the "shrinkage" in the graphs below- one comes directly from the textbook and the other I created using ggplot() just to confirm that I understood what was going on. We basically have the estimates (intercepts, slopes) plotted for all subjects from the all-fixed-effects models, along with the conditional modes from the mixed model- so technically the outputs of 19 models are on display here. The point in the center is essentially the population estimate for the intercept and slope (in my graph the variables are centered, so this value is expressed as 0). We can see that the conditional modes are closer to the population estimates than the intercepts and slopes from the all-fixed-effects models, showing shrinkage in action. Basically, this is a good thing and shows how mixed models are better at capturing relationships in the data.

Figure from the book Recreated Figure using ggplot()

There is also a bit more to the story than mixed models simply pulling the intercepts and slopes closer to the population estimates. While that is the takeaway from the two plots above, we can see the finer detail of what is occurring in the plot below. Here we see the within-subject intercepts and slopes, generated from the 18 independent regression models, in black, along with the conditional modes generated from the one mixed-effects model. In addition, the population estimates for the intercept and slope are embedded in each of the facets.

The main highlight below is the comparison between the predictions for subject 330 and subject 337. For subject 337, what we do not see is the intercept and slope from the mixed model being dragged ("shrunk") closer to the population estimates. Instead, they basically overlap with the black line, the one produced by the single regression for that subject. Why is this the case? Because that specific line does a very good job of explaining the data (the R^2 there is pretty high), and moving it closer to the population estimates would drastically decrease its predictive accuracy. On the other hand, for subject 330 we see a different story. In this case, the OLS fit is not doing a good job of fitting the data, mainly because for this subject the data form something like a U-shaped pattern- likely a byproduct of noise. The mixed model can identify this and instead pull the slope and intercept to be more similar to the population estimates. This is a clear example of how the model is able to "borrow strength" from the other estimates, which then leads to "shrinkage". It is very cool math honestly.

Within-Subject Estimates vs Conditional Modes

Chapter 4: Building Linear Mixed Models

  • In this chapter we begin to explore different ways to specify mixed models, with each dataset having its own characteristics that give rise to different ways of investigating the relationships within the data.
  • The focus for this chapter is specifying more complex relationships for the random effects and to cover the interpretation of factors modeled as fixed effects, which is different from having continuous predictors that produce slopes.
  • Starting with the Machines dataset from the 'MEMSS' package, here we have 6 workers who were rated on their productivity (score) for three different machines. Each worker used all three machines three times, thus these data are crossed. When plotting score by machine and worker, we get a glimpse of an obvious interaction, mainly because one combination of worker and machine levels, specifically Worker 6 with Machine B, produces scores that greatly deviate from the other data points.
  • With the suspicion that there is an interaction present, we have different ways of addressing it. We can give each level combination of the two factors its own random intercept, or we could introduce random slopes. The latter requires many more parameter estimates; however, if the model comparison shows it fits the data best, then that should be the approach to take. Also, modeling random slopes for a factor, in contrast to a numeric variable like we used in Chapter 3, requires multiple data points for each level combination. If there is only one data point present, this approach will fail (xtabs() on the Machines data shows that there are three data points for each level combination, as mentioned).
  • It is also important to note that while it may seem that including more random effects is overcomplicating the model, technically (and we can prove this when using anova()) these models are much simpler than modeling the interactions as fixed effects, because fewer parameters are being estimated!
  • For the ergoStool dataset, we explore data with two factors and one continuous outcome, and we do so by producing three models: one with both factors modeled as random effects, one with Subject modeled as random while stool type is fixed, and one where both factors are fixed.
  • This is really more for learning purposes; for starters, it does not make practical sense to have both factors be random, and this is stated by the author as well: "Such a model (one with both factors modeled as random) may not be appropriate for these data where we wish to make inferences about these particular four stool types... if the levels of the Type factor are fixed and reproducible we generally incorporate the factor in the fixed-effects part of the model".
  • The only advantage of both factors being random is that it produces the 'simplest' model- as discussed, introducing factors into the fixed effects requires more parameters to be estimated, but this pro does not (in my opinion) make up for the con that we can't make any inferences about the effects of stool type, which are of interest.
  • The best model is the one with Subject as a random effect and stool type as a fixed effect. This will tell us whether different levels of stool type are significantly different from each other, while controlling for differences between subjects, not inflating our chances of a type 1 error, and producing estimates for the fixed effects that are more generalizable, which is the point of doing stats.
  • For fun, I also fit a model where Subject was modeled as random and Type was modeled as both fixed and random- this model needed one more parameter to be estimated and performed identically to the random-subject, fixed-type model- thus it completely sucks. However, I wonder if this output would have been different if there were multiple responses for each level combination of the factors instead of just one response.
  • Lastly, the third model is one with only fixed effects, so essentially it is not even an lmer() model anymore since there are no random effects at all (ignoring the residual variance). This model is weird, and the author states that statisticians would create it by regarding Subject as a blocking factor. So in this type of model, "we are not interested in the effects of the levels of the blocking factor- we only wish to accommodate this source of variability when comparing levels of the experimental factor, which is the Type factor."
  • From my own interpretation, since modeling fixed effects produces a more complicated model than modeling the factors as random, maybe this introduces a very aggressive form of control, where the full effect of Subject is controlled for so that we can obtain even more accurate estimates of the Type factor. Additionally, whatever effects we do find basically cannot be generalized at all, and I think only hold within the context of the study and its sample. Anyway, to run this model we can use the aov() function, which is what the author does, and the blocking factor must precede the experimental factors. We can then extract only the coefficients related to the Type factor and pairwise compare the levels using TukeyHSD().
  • In the following section, the author mentions that of the three approaches, the model with both random and fixed effects is "preferred, perhaps even necessary, if we wish to make inferences that apply to the population of potential users. Also, when we have unbalanced data or large samples, the flexibility of mixed models becomes important in stabilizing the estimation of parameters".
  • Because the bullet point above was the main take away from this section, we will skip the code for the ergoStool dataset entirely.
  • The next dataset is the classroom dataset, a large dataset with 1190 observations and 11 variables. In this section we learn a pretty cool trick: while in previous sections we used xtabs() to check whether there is full nesting between two factors, we can now use the combination of with() and isNested() to tell us if that is the case (see the short sketch after this list). If a factor is truly nested (not partially) within another factor, the output will return TRUE- this was confirmed with the 'Pastes' dataset and with a second version of the Pastes dataset that I modified to produce partial nesting, with the latter returning FALSE.
  • Unfortunately, after that the chapter does not give any more detail about this dataset; it introduces another dataset named ratbrain and explains nothing about it- it's like the chapter was never finalized.
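
As a minimal sketch of the isNested() check mentioned above, run on the original Pastes data (my own example, not the modified version described in the bullet):

# Hedged sketch: isNested(f1, f2) asks whether every level of f1 occurs within a single level of f2
with(Pastes, isNested(sample, batch))  # TRUE: each sample belongs to exactly one batch
with(Pastes, isNested(cask, batch))    # FALSE: the cask labels a/b/c are reused across batches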

Machines Dataset

As mentioned, the main purpose of this dataset is to practice specifying mixed models when we suspect an interaction between the levels of two factors that we want to model as random. To be clear, we really just want to model Worker as random, because we want the levels of this factor to be representative of some population, and it just so happens that we can extend this random effect by modeling Machine within the context of Worker. Notice this is very different from specifying Worker and Machine separately as random effects (this is Leo's thought process, not mentioned in the book, but it seems implied).

# Load in the data
library(MEMSS)
data("Machines")

# View the data
str(Machines)
summary(Machines)
xtabs(~ Machine + Worker, Machines) # Crossed

Below are some data descriptions of the Machines dataset. We see that there are three variables in the dataset: Worker, Machine, and score (the response variable). As mentioned, the levels of the two factors are crossed and the data are also balanced. Each worker used each machine three times, which means it is possible to model these variables using random slopes.

Data Description
# Plot the data (Sick version I made)
Machines$Repetition <- c(1,2,3)
ggplot(Machines, aes(x = Machine, y = score)) +
  geom_boxplot() +
  geom_line(aes(group = interaction(Worker, Repetition), color = Worker), alpha = 0.5) +
  geom_point(aes(color = Worker)) +
  theme_classic()

With some creative code to visualize the data, we can clearly see the interaction mentioned earlier. As a quick review, an interaction is when the responses for a level of one factor depend on the levels of another factor. Focusing on main effects, we see that Machine A produces the lowest scores, followed by Machine B and then Machine C. Looking at workers, we see that some consistently produce higher scores than others (ex: 3 > 5 > 1 > 4 > 2 > 6). The issue is that we would expect Worker 6's scores on Machine B to be lower than the other workers' but still much closer to the median score of Machine B- instead, they are even lower than that worker's scores on Machine A. This is the interaction- we should account for it to improve our model fit, and we have two reasonable approaches for doing so.

Visualizing the Interaction Between Two Factors

Fitting the Models

Below we fit three models, which as mentioned increase in complexity. The first model accounts for the fact that there are several observations for each level of Worker- failing to include this can increase the chances of making a type 1 error, since the model would otherwise treat all observations as independent of each other, which they are not. This first model is the simple one that does not test for an interaction between the levels of the factors. The second model introduces random slopes- it is kind of complicated and I don't fully understand it, but it will produce k-1 variances for the random slopes and several correlations between these slopes and the random intercepts- this is by far the most complicated way to investigate the interaction. The third model specifies only two random effects, one for Worker and the other for the Machine-by-Worker combinations- thus it is much simpler than the random-slopes model.

# Fitting models with and without interactions
fm08 <- lmer(score ~ Machine + ( 1 |Worker), Machines, REML=FALSE)
fm09 <- lmer(score ~ Machine + (Machine|Worker), Machines, REML=FALSE)
fm10 <- lmer(score ~ Machine + ( 1 |Worker) + (1|Machine:Worker), Machines, REML=FALSE)

# Model comparison
anova(fm08, fm09, fm10)

Below we compare the three models to each other and can see that fm10, the model with two random-effects terms, produces the better fit after accounting for the number of parameters estimated. As we can see, this model estimates only one more parameter than the first model, while the random-slopes model requires several more. This shows that accounting for the interaction in the data produced a better-fitting model than ignoring it!

Model Comparison

We are going to focus on just the random effects for now. Below we can see the three estimated variances for the random effects: one for Worker, one for the Worker-by-Machine combinations, and the residual variance. Comparing these values to each other, we can see that almost all of the variance is explained by the two random effects, leaving very little for the residual variance. This is a good thing- the model is doing a good job of capturing the variation in scores, which should make the interpretation of the fixed effects more accurate.
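A small hedged sketch of that comparison (my own, not from the book), expressing each variance component of fm10 as a share of the total:

# Hedged sketch: proportion of total variance attributed to each component of fm10
vc10 <- as.data.frame(VarCorr(fm10))
setNames(vc10$vcov / sum(vc10$vcov), vc10$grp)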

Best Fitting Model

Visualizing the Conditional Modes

# Checking random effects
dotplot(ranef(fm10, condVar=TRUE), strip = FALSE)

Honestly, I am not really sure what the main takeaway from the plots below is, besides that it is B:6 that is driving the difference- this is probably a cleaner way to visualize what we noticed in the graph above.

Worker Random Effects Machine:Worker Random Effects

Interpreting the fixed effects

Now that we have discussed the random effects in detail, we can switch over to making sense of the fixed effects. These are interpreted in the same way as they would be in a regular regression. In addition, we can also suppress the y-intercept; however, this should only be used to reduce the work of figuring out what the means are. Ideally we want to include the y-intercept so we can test the effects of the levels of the factor against the reference.

coef(summary(fm10))
fm10a <- update(fm10, . ~ . - 1)
coef(summary(fm10a))
AIC(fm10a, fm10)

Below we see the intercept (the expected mean score for Machine A), the effect of Machine B, and the effect of Machine C. When we say effect, this is basically the mean difference between that machine and Machine A. We see that the mean score for Machine B is 7.97 units larger than Machine A's and that the mean score for Machine C is 13.92 units larger than Machine A's. This type of comparison is known as dummy (treatment) coding, meaning two levels are each compared to one reference level and those two levels are not compared to each other. What is nice about having the model set up like this are the p-values that correspond to the levels of the factor: they show whether there is a significant difference between that level and the reference (keep in mind these p-values are not adjusted for multiple comparisons). We can also run a model where we suppress the y-intercept. All this does is estimate the mean for each level of the factor while not producing a y-intercept. Notice below that these means are equivalent to adding each effect to the y-intercept above. The downside to this output is that the p-values now represent whether the estimates are different from 0, which for our purposes is not informative.

The AIC() function was used to demonstrate that both approaches produce identical model fits! They are basically the same; one is not better than the other in any way.
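A quick hedged check of the equivalence described above, using the two fitted objects from the code block:

# Hedged sketch: reference intercept + effect reproduces the suppressed-intercept means
fixef(fm10)["(Intercept)"]                            # expected mean score for Machine A
fixef(fm10)["(Intercept)"] + fixef(fm10)["MachineB"]  # mean for Machine B, matches fm10a
fixef(fm10)["(Intercept)"] + fixef(fm10)["MachineC"]  # mean for Machine C, matches fm10a
fixef(fm10a)                                          # the same means estimated directly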

Using coef() to Show the Fixed Effects

ergoStool Dataset (Skipped- not important)

Classroom Dataset

Chapter 5 (Skipped- not important)

Chapter 6: Generalized Linear Mixed Models for Binary Responses