Example of ERP Decoding - ucdavis/erplab GitHub Wiki
To explain the essence of ERP decoding, we’re going to start with the Bae (2021) study. The goal of this study was to examine the time course of face perception, comparing the perception of individual face identities and the perception of different emotional expressions. The experimental paradigm is shown in Figure 1A. Participants were shown a sequence of faces in random order. The faces were photographs of four different people (four different “identities”), with each face showing four different emotional expressions (neutral, anger, fear, happiness), leading to a total of 16 different photographs. Each face was presented for 500 ms, followed by a 1000-ms interstimulus interval. Each of the 16 faces was presented a total of 40 times to each participant. Participants performed a task that required them to maintain the most recent face in working memory until the next face appeared.
A machine learning algorithm was trained on a subset of the trials to “decode” which of the four face identities was presented, collapsed across which emotion the face was expressing. The decoder was then tested on data that weren’t used for training, using the ERPs to guess which of the four identities was presented for each test case. Because there were four face identities, chance was 1/4 or 0.25. Figure 1B summarizes the decoding results. The X axis is time relative to stimulus onset, just like for an ERP plot. The Y axis is decoding accuracy (proportion correct; the probability that the decoder correctly guessed the identity of the face that produced a given averaged ERP). As you can see from the figure, decoding accuracy for face identity was near chance during the prestimulus interval but jumped up rapidly soon after stimulus onset.
This result indicates that the scalp voltages contained some information that could be used to distinguish the individual face identities from each other. That’s a miracle given that the electrodes are located on the skin, with a thick skull between the brain and the scalp! Moreover, the decoder was above chance even though the data were collapsed across the four different emotional expressions. This means that the scalp signal contained information about face identity that generalized across these four emotional expressions.
A separate decoder was trained to classify which of the four emotions was expressed in the photo, collapsed across the identity of the person in the photo. Again, chance was 1/4, and decoding accuracy was well above chance. However, decoding accuracy rose more slowly for emotional expression than for face identity, indicating that the emotional expressions are extracted more slowly than face identities. That’s a conclusion that relies on the high temporal resolution of the ERP technique.
This example shows that ERP decoding is remarkably powerful given the limited spatial resolution of ERP signals. If you had asked me 10 years ago whether we could decode face identities and emotional expressions from scalp ERPs, I would have said it was impossible. But it is quite possible! Our lab has also decoded many other things using ERPs, including (a) which of 16 different orientations a person is holding in working memory, (b) whether photographs of natural scenes are emotionally positive or emotionally negative, (c) the exact direction of coherent motion in arrays of moving dots, (d) orientations that are not consciously perceived due to binocular rivalry, and (e) the identities of letters and words.
Our decoding methods were developed by Gi-Yeul Bae when he was a postdoc in my lab (Bae & Luck, 2018; but see Bae & Luck, 2019 for a more appropriate statistical analysis approach). These methods were modified from methods that the Awh/Vogel lab developed for decoding alpha-band EEG oscillations (Foster et al., 2016) and methods that others have applied to MEG and EEG data (see, e.g., Grootswagers et al., 2017).
These methods were originally implemented via complex Matlab scripts. I was so impressed with the power of these methods that I decided that we should implement them in ERPLAB to make them more widely available. My goal was what I like to call Decoding for Everyone: software that would allow anyone to perform ERP decoding through the GUI, with no scripting required. Aaron Simmons did most of the actual programming. You will find that he did an amazing job of making ERP decoding easy, whether you are running it from the GUI or writing simple scripts to automate the process.
However, to use ERP decoding to provide solid answers to interesting questions, you need to understand the basics of how decoding works, along with the essential details of our implementation. If you don’t understand what you’re doing, it will be easy for you to find significant but bogus effects that are meaningless or even misleading. Fortunately, the essence of decoding is fairly simple when you break it down into its component parts. The next section will provide you with a basic understanding of decoding, and then the following sections will explain the details of our implementation.
1.1. The Essence of ERP Decoding
The goal of ERP pattern classification is to train a decoder to distinguish between two or more different classes of stimuli (or classes of behavioral responses) on the basis of the pattern of voltage across the electrodes (the scalp distribution). In the face identity decoding procedure in the study of Bae (2021), the decoder distinguished among four face identities on the basis of subtle differences in the scalp distributions for the different identities. Each face identity was a “class” and we had four different identity classes. When we decoded the emotional expressions, each expression was a class and we had four different classes of expressions.
To explain how decoding works, it’s easier to begin by discussing an example with only two classes. Here we’ll start by seeing how to decode whether the participant was viewing face identity 1 or face identity 2 (ID 1 or ID 2). Figure 2 shows decoding accuracy averaged over Participants 1-5. The shading around the decoding accuracy shows the standard error of the mean at each time point. Note that the original study of Bae (2021) used an unusually steep low-pass filter with a cutoff at 6 Hz, but this chapter will use a more typical low-pass filter with a half-amplitude cutoff at 20 Hz and a slope of 12 dB/octave.
To explain how this decoding worked, let’s start by imagining that we had only two electrodes, E1 and E2, and we were going to decode the face identity at 160 ms after stimulus onset. If we looked at the voltage at E1 and E2 at this time point for one trial with face ID 1, we could plot that pair of voltages as a dot in a two-dimensional space. This is shown in Figure 3A, which shows the voltage from E1 and E2 for many different trials of face ID 1 and face ID 2. Single-trial data are noisy, so there is quite a bit of trial-to-trial variation in the dot positions. However, if the scalp distribution differs at least slightly for ID 1 and ID 2, it should be possible to draw a decision line through the two-dimensional space that does a reasonable job of separating the trials for ID 1 and ID 2. As you can see in Figure 3A, there are more cases of ID 1 on one side of the line and more cases of ID 2 on the other side.
With real data we would have more than two electrodes, but this just gives us more dimensions. In the study of Bae (2021), there were 59 EEG electrodes. This gives us a 59-dimensional space, and each trial can be represented as a point in this 59-dimensional space. Instead of finding a line in a 2-dimensional space to separate the points for ID 1 and ID 2, we need a hyperplane in the 59-dimensional space. A 59-dimensional space is very difficult to imagine, but I find that it works quite well just to think about ERP decoding in two dimensions, so that’s how I will discuss decoding here. In the broader literature on machine learning, each of these dimensions is often called a “feature”.
There are many different machine learning algorithms that can be used to find the optimal line or hyperplane for distinguishing between two classes. The classic method is called linear discriminant analysis or LDA. A newer method is called a support vector machine or SVM. Each of these algorithms has advantages and disadvantages, but the SVM method tends to work better than the LDA method when the amount of data is modest, so we’ve used the SVM method in ERPLAB (although we plan to add LDA as an option in the future).
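The two-electrode picture can be made concrete with a small sketch. The Python code below simulates single-trial voltages at two hypothetical electrodes and finds a decision line with a nearest-centroid rule. This is deliberately not an SVM (and not what ERPLAB does); it is a much simpler linear classifier used here only because it fits in a few lines. The function names, the simulated class means, and the noise level are all assumptions made for illustration.

```python
import random

def simulate_trials(rng, mean, n_trials, noise_sd=1.0):
    """Simulate (E1, E2) voltages for n_trials single trials of one class."""
    return [(rng.gauss(mean[0], noise_sd), rng.gauss(mean[1], noise_sd))
            for _ in range(n_trials)]

def centroid(points):
    """Mean voltage at each electrode across trials."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def fit_decision_line(class1, class2):
    """Return (w, b) such that w . x + b > 0 means 'class 1'.

    The line is the perpendicular bisector of the segment joining
    the two class centroids (a nearest-centroid rule, not an SVM).
    """
    c1, c2 = centroid(class1), centroid(class2)
    w = tuple(u - v for u, v in zip(c1, c2))
    midpoint = tuple((u + v) / 2 for u, v in zip(c1, c2))
    b = -sum(wi * mi for wi, mi in zip(w, midpoint))
    return w, b

def predict(w, b, x):
    """Guess the class of one unlabeled case from the side of the line it falls on."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score > 0 else 2

rng = random.Random(0)
id1 = simulate_trials(rng, mean=(1.0, -1.0), n_trials=40)   # hypothetical ID 1 pattern
id2 = simulate_trials(rng, mean=(-1.0, 1.0), n_trials=40)   # hypothetical ID 2 pattern
w, b = fit_decision_line(id1, id2)
print("guess for (2.0, -2.0):", predict(w, b, (2.0, -2.0)))
print("guess for (-2.0, 2.0):", predict(w, b, (-2.0, 2.0)))
```

The same code works unchanged with 59 values per trial instead of 2; only the length of each tuple changes, which is the sense in which more electrodes just means more dimensions.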
1.2. Cross-Validation
Even if there is no consistent information about two classes in the scalp voltages, and the points from different trials differ from each other only because of noise, it will always be possible to find a decision line or hyperplane that divides the space so that there are more points from one class on one side of the line and more points from the other class on the other side of the line. For example, if I took all the single trials from face ID 1 and randomly divided them into two sets, an SVM would be able to find a decision hyperplane in which more than 50% of the points from one set were on one side of the hyperplane and more than 50% of the points from the other set were on the other side of the hyperplane. This is sometimes called “overfitting”. The solution to this problem is to use a subset of the data to train the decoder (i.e., to figure out the decision hyperplane) and then test the decoder with data that were left out of the training set. This is called cross-validation, and it is absolutely essential. If you don’t cross-validate properly, your decoding results will be meaningless. A lot of the work of decoding therefore goes into the cross-validation procedure.
In the example shown in Figure 3A, there are 40 trials shown for each class. We could train the decoder on 39 trials for each class and then test with the one trial for each class that was not used during training. The two dots that are surrounded by boxes represent the test cases. We send the set of voltages for a given test case to the decoder without telling the decoder whether it is from ID 1 or ID 2 (i.e., it is an “unlabeled” test case), and the decoder must “guess” which class it came from by determining which side of the decision hyperplane it falls on.
We want to have as much data as possible for training the decoder, and this is why we use almost all of the data for training and hold out only one case from each class for testing. However, if we have only two test cases, it is difficult to get a precise measure of decoding accuracy. In other words, with only two test cases, the decoder can either be 100% correct, 50% correct, or 0% correct. To address this problem, we can repeat the process 40 times, each time leaving out a different pair of trials for testing. This would be called a 40-fold cross-validation. On each fold, we train a new decoder with 39 trials from each class and test it with the one trial that was left out from each class.
We then count the number of correct guesses and incorrect guesses across the 40 folds to quantify decoding accuracy. We have two test cases for each fold, so this gives us 80 test cases across the 40 folds. The decoding accuracy could be 0/80, 1/80, 2/80, etc., which we could also express as 0%, 1.25%, 2.5%, etc. Because we have two classes, and they are equally probable, chance performance would be 50% correct.
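To see why cross-validation matters, it can help to run the procedure on data that contain no information at all. The sketch below draws two sets of 40 "trials" from the same noise distribution, then compares accuracy when testing on the training data with accuracy from the 40-fold leave-one-pair-out procedure described above. It uses a simple nearest-centroid classifier as a stand-in for an SVM, and the noise model and variable names are assumptions for illustration only.

```python
import random

N_ELECTRODES = 59   # number of features, as in the 59-electrode example
N_TRIALS = 40       # trials per class

def centroid(points):
    """Mean of each feature across a list of trials."""
    n = len(points)
    return [sum(p[i] for p in points) / n for i in range(len(points[0]))]

def predict(c1, c2, x):
    """Guess class 1 or 2 according to the nearer class centroid."""
    d1 = sum((xi - ci) ** 2 for xi, ci in zip(x, c1))
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c2))
    return 1 if d1 < d2 else 2

rng = random.Random(1)
# Pure noise: both "classes" are drawn from the same distribution.
set1 = [[rng.gauss(0, 1) for _ in range(N_ELECTRODES)] for _ in range(N_TRIALS)]
set2 = [[rng.gauss(0, 1) for _ in range(N_ELECTRODES)] for _ in range(N_TRIALS)]

# Without cross-validation: test on the same trials used for training.
c1, c2 = centroid(set1), centroid(set2)
n_correct = sum(int(predict(c1, c2, x) == 1) for x in set1)
n_correct += sum(int(predict(c1, c2, x) == 2) for x in set2)
train_accuracy = n_correct / (2 * N_TRIALS)

# 40-fold cross-validation: hold out one trial per class on each fold.
n_correct = 0
for fold in range(N_TRIALS):
    train1 = set1[:fold] + set1[fold + 1:]
    train2 = set2[:fold] + set2[fold + 1:]
    c1, c2 = centroid(train1), centroid(train2)
    n_correct += int(predict(c1, c2, set1[fold]) == 1)
    n_correct += int(predict(c1, c2, set2[fold]) == 2)
cv_accuracy = n_correct / (2 * N_TRIALS)

print(f"accuracy without cross-validation: {train_accuracy:.3f}")
print(f"cross-validated accuracy:          {cv_accuracy:.3f}")
```

With this many features and this few trials, the accuracy without cross-validation should come out well above 50% even though the two sets differ only by noise, whereas the cross-validated accuracy should land much closer to chance.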
1.3. Averaging, Folds, and Iterations
If you are trying to decode something subtle like face identities from scalp voltages, the single-trial data are usually too noisy to yield above-chance decoding accuracy. The example shown in Figure 3A has much more separation between the two classes than we would typically find with real single-trial EEG data. With real data, the distribution of dots for two classes would be almost completely overlapping. It is therefore usually necessary to average multiple trials from a given class into averaged ERP waveforms prior to decoding. However, we can’t just make one average from all the trials from one class and another average from all the trials of the other class, because this wouldn’t give us enough cases for training and testing. For example, if we had one average for ID 1 and one average for ID 2, this would give us only one case from each class for training the decoder, and we wouldn’t have any data left out for testing.
The solution is to divide the data for each class into multiple subsets of trials and make separate averages for each subset of trials. In the study of Bae (2021), for example, there were 40 trials for each combination of the 4 face identities and the 4 emotional expressions. If we are asking whether we can decode identity independent of emotional expression, we have 160 trials for ID 1 and 160 trials for ID 2 after collapsing across all the trials for a given face ID, irrespective of the emotional expression.
We can randomly divide these 160 trials into 10 sets of 16 trials for a given ID, and we can then make an averaged ERP for each of these 10 sets of trials. This is illustrated in Figure 3B, in which we have 10 cases of ID 1 and 10 cases of ID 2, each coming from an average of 16 trials. Note that averaging reduces random variability, so the individual dots are not spread out as much for the averages as they were for the single trials. This also reduces the overlap between the set of dots for ID 1 and the set of dots for ID 2.
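The random division of trials into averages can be sketched in a few lines of Python. The function name, the simulated voltages, and the choice to drop any leftover trials are assumptions made for this illustration, not a description of ERPLAB's internals.

```python
import random
import statistics

def make_averages(trials, n_averages, rng):
    """Randomly split single trials into n_averages equal sets and average each set.

    trials: list of per-trial voltage lists (one value per electrode).
    Returns a list of n_averages averaged voltage lists.
    """
    order = list(range(len(trials)))
    rng.shuffle(order)                        # random assignment of trials to sets
    per_avg = len(trials) // n_averages       # leftover trials (if any) are dropped
    n_elec = len(trials[0])
    averages = []
    for k in range(n_averages):
        subset = [trials[i] for i in order[k * per_avg:(k + 1) * per_avg]]
        averages.append([sum(t[e] for t in subset) / per_avg for e in range(n_elec)])
    return averages

rng = random.Random(0)
# 160 simulated single trials for one face ID at 2 electrodes (noise SD of 5 microvolts)
trials = [[rng.gauss(1.0, 5.0), rng.gauss(1.0, 5.0)] for _ in range(160)]
averages = make_averages(trials, n_averages=10, rng=rng)

single_trial_sd = statistics.stdev([t[0] for t in trials])
average_sd = statistics.stdev([a[0] for a in averages])
print(len(averages), "averages of", len(trials) // len(averages), "trials each")
print(f"spread at electrode 1: single trials {single_trial_sd:.2f}, averages {average_sd:.2f}")
```

The spread of the averages should be much smaller than the spread of the single trials, which is the reduced scatter of the dots described above.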
We then train the decoder using the voltages from 9 of the 10 averages from each class and test the decoder with the voltages from the 1 average from each class that was not used for training. In other words, this is a 10-fold cross-validation procedure. With two test cases for each fold (one for ID 1 and one for ID 2), this gives us a total of 20 tests. As a result, the decoding accuracy could be 0%, 5%, 10%, 15%, and so on. This is not very good resolution: it would be difficult to distinguish between 41% correct in one condition and 43% correct in another condition.
We can improve the resolution of our decoding accuracy by iterating this procedure many times, using a new random assignment of trials to averages on each iteration. For example, we might repeat the process 100 times, each with a different random allocation of trials to averages, and then average the decoding accuracy across the 100 iterations. With two cases for each fold, 10 folds, and 100 iterations, this would give us 2 x 10 x 100 = 2000 test cases, yielding possible decoding accuracy values of 0%, 0.05%, 0.10%, 0.15%, and so on. This yields much more precise and stable estimates of decoding accuracy.
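The arithmetic behind these resolution figures is simple enough to check directly. The helper below is a hypothetical function written for this illustration; it just multiplies out the number of test cases and reports the smallest possible step in percent correct.

```python
def accuracy_resolution(n_classes, n_folds, n_iterations=1):
    """Return (number of test cases, smallest accuracy step in percent)."""
    # One test case per class on each fold of each iteration.
    n_tests = n_classes * n_folds * n_iterations
    return n_tests, 100.0 / n_tests

scenarios = [
    ("single trials, 40 folds", (2, 40)),
    ("averages, 10 folds", (2, 10)),
    ("averages, 10 folds, 100 iterations", (2, 10, 100)),
]
for label, args in scenarios:
    n_tests, step = accuracy_resolution(*args)
    print(f"{label}: {n_tests} test cases, step of {step:.2f}%")
```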
When averaging is used, you must choose how many different folds to use (i.e., how many averages to make for each class), which then determines how many trials are averaged together in each averaged ERP. In the example I’ve been describing, we had 160 trials per class and used 10 folds, meaning that we had 10 averages in each class, each based on 16 trials. We could instead have 5 folds with 32 trials per average. Or we could have 20 folds with 8 trials per average. How should you choose the number of folds? If you have fewer folds, then you have cleaner averages but not as many training cases. If you have more folds, then you have more training cases but the data will be noisier. I know of no rigorous analyses of this for ERP data, but we have done quite a bit of informal testing. Our tentative conclusion is that it is best to have 10-20 trials per average, with as many folds as possible given that constraint (but a minimum of 3 folds). For example, in our first decoding paper, we had 40 trials per class, so we used 3 folds (3 averages per class) and 13 trials per average. However, we have not tested this rigorously, and the optimal balance of folds and trials may depend on the nature of the data being decoded, so I would not be surprised if our advice changed in the future.
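The trade-off between folds and trials per average is just integer division, as the short sketch below shows. The function name is hypothetical, and the assumption that trials which do not divide evenly are simply left unused is mine.

```python
def trials_per_average(n_trials_per_class, n_folds):
    """Return (trials in each average, trials left unused) for one class."""
    return divmod(n_trials_per_class, n_folds)

for n_trials, n_folds in [(160, 5), (160, 10), (160, 20), (40, 3)]:
    per_avg, unused = trials_per_average(n_trials, n_folds)
    print(f"{n_trials} trials, {n_folds} folds: "
          f"{per_avg} trials per average, {unused} unused")
```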
1.4. Multiclass Decoding
The decoding approach shown in Figure 3A assumes that there are two classes being distinguished. This approach can easily be extended to cases with more classes, such as decoding which of four face IDs was present. One way to do this is to simply decode each of the possible pairs of classes, as illustrated in Figure 4A. In this example, we would get the decoding accuracy for ID 1 versus ID 2, ID 1 versus ID 3, and so on. This is called one-versus-one decoding. We could then average the decoding accuracy across each of these pairs of comparisons to get an overall decoding accuracy value.
An alternative is to use the error-correcting output codes approach (Dietterich & Bakiri, 1995). This approach begins with one-versus-all decoding, in which each decoder learns to distinguish between one class and the other classes, as illustrated in Figure 4B. In our face example, we would train one decoder to distinguish between ID 1 and IDs 2, 3, and 4, another decoder to distinguish between ID 2 and IDs 1, 3, and 4, and so on. To train the decoder that distinguishes between face ID 1 and face IDs 2, 3, and 4, we would train using 9 of the 10 averages for ID 1 as one class and then 27 of the 30 averages for IDs 2, 3, and 4 as the other class. To test decoding accuracy, the data for a test case are fed into all of the decoders, and the outputs of the decoders are combined into a single decision that provides the best guess for that case. We find that this is the best approach for multiclass decoding, so it is ERPLAB’s default.
ERPLAB also allows you to select one-versus-one decoding. When this is selected, ERPLAB still uses the error-correcting output codes approach. That is, each test case is fed into all of the decoders, and the outputs of all the decoders are combined to make a single guess for every test case.
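The way several binary decoders are combined into one guess can be illustrated with a coding matrix. The sketch below builds one-versus-all and one-versus-one coding matrices for four classes and picks the class whose code word best matches a set of binary decoder outputs. It is a simplified illustration under my own assumptions: the function names are hypothetical, and real error-correcting output code implementations typically combine continuous decoder scores with a loss function rather than counting mismatches.

```python
from itertools import combinations

def one_vs_all_matrix(n_classes):
    """Rows = classes, columns = decoders; +1 = target class, -1 = all others."""
    return [[1 if c == d else -1 for d in range(n_classes)]
            for c in range(n_classes)]

def one_vs_one_matrix(n_classes):
    """One column per pair of classes; 0 = class not used by that decoder."""
    pairs = list(combinations(range(n_classes), 2))
    return [[1 if c == a else -1 if c == b else 0 for a, b in pairs]
            for c in range(n_classes)]

def decode(matrix, outputs):
    """Return the index of the class whose code word disagrees least with the outputs.

    outputs: one +1/-1 decision per decoder. Zeros in a code word are skipped.
    """
    def mismatches(code_word):
        return sum(1 for code, out in zip(code_word, outputs)
                   if code != 0 and code != out)
    losses = [mismatches(row) for row in matrix]
    return losses.index(min(losses))

ova = one_vs_all_matrix(4)
ovo = one_vs_one_matrix(4)
print(len(ova[0]), "one-versus-all decoders;", len(ovo[0]), "one-versus-one decoders")

# Only the second decoder says "my class"; the other three say "not my class".
print("one-versus-all guess: ID", decode(ova, [-1, 1, -1, -1]) + 1)
```

With four classes there are four one-versus-all decoders but six one-versus-one decoders, one for each pair of face IDs.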
1.5. Time Points, Participants, and Statistical Testing
In ERPLAB’s decoding procedure, we train and test a separate decoder at each time point. This takes advantage of the high temporal resolution of the ERPs and makes it possible to track the time course of information processing. For example, the Bae (2021) study found that information about face identity was decodable more rapidly than information about emotional expression.
Although a separate decoder is trained and tested at each time point, this does not mean that the results at one time point are independent of the results from other time points. EEG noise tends to be spread over multiple adjacent time points (which is technically called “autocorrelation” of the noise), and filtering causes additional spreading. It is therefore important not to assume independence in statistical tests. I will have more to say about that in a moment.
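The idea of a separate decoder at each time point can be sketched as a loop over time points, with a complete cross-validation run inside each pass. The simulation below is a toy under my own assumptions: a class-specific pattern appears only after stimulus onset, the classifier is a nearest-centroid rule rather than an SVM, and the noise is drawn independently at each time point, so it does not reproduce the autocorrelation found in real EEG.

```python
import random

def centroid(cases):
    """Mean of each electrode across a list of cases."""
    n = len(cases)
    return [sum(c[e] for c in cases) / n for e in range(len(cases[0]))]

def predict(c1, c2, x):
    """Guess class 1 or 2 according to the nearer class centroid."""
    d1 = sum((xi - ci) ** 2 for xi, ci in zip(x, c1))
    d2 = sum((xi - ci) ** 2 for xi, ci in zip(x, c2))
    return 1 if d1 < d2 else 2

def cv_accuracy(cases1, cases2):
    """Cross-validated accuracy, holding out one case per class on each fold."""
    n = len(cases1)
    correct = 0
    for fold in range(n):
        c1 = centroid(cases1[:fold] + cases1[fold + 1:])
        c2 = centroid(cases2[:fold] + cases2[fold + 1:])
        correct += int(predict(c1, c2, cases1[fold]) == 1)
        correct += int(predict(c1, c2, cases2[fold]) == 2)
    return correct / (2 * n)

rng = random.Random(2)
N_CASES, N_ELEC = 10, 4            # 10 averages per class, 4 hypothetical electrodes
times_ms = [-100, 0, 100, 200, 300]

def simulate(time_ms, sign):
    """One case: noise, plus a class-specific pattern only after stimulus onset."""
    signal = 3.0 * sign if time_ms > 0 else 0.0
    return [signal + rng.gauss(0, 1) for _ in range(N_ELEC)]

accuracy_by_time = []
for t in times_ms:
    cases1 = [simulate(t, +1) for _ in range(N_CASES)]
    cases2 = [simulate(t, -1) for _ in range(N_CASES)]
    # A new decoder is trained and tested from scratch at this time point.
    accuracy_by_time.append(cv_accuracy(cases1, cases2))

for t, acc in zip(times_ms, accuracy_by_time):
    print(f"{t:5d} ms: accuracy = {acc:.2f}")
```

Accuracy should scatter around 50% before the simulated onset and rise to near 100% afterward, because the simulated effect is far larger than anything you would see in real data.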
Decoding is done completely separately for each participant, and the participants can be treated as independent of each other (just as you would for any other kind of data analysis). In fact, a key reason why decoding can extract information that is not readily apparent in standard ERP analyses is that the decoding algorithm can find each individual participant’s ideal scalp distribution for distinguishing between the classes. By contrast, standard approaches work well only if there is a common scalp distribution across participants. For example, decoding doesn’t care if the difference between ID 1 and ID 2 is largest at Oz in Participant 1 and largest at Fz in Participant 2. However, standard approaches won’t work well in this situation.
Now let’s turn to statistical testing. The data used for statistical analysis will consist of a decoding accuracy value at each time point for each participant (or possibly a set of values for each of multiple conditions, as in the data shown in Figure 1). There are many possible approaches to statistical analysis for these data, but here are some common scenarios:
- In the simplest case, you will have one group of participants and two or more classes, and the basic question will be whether decoding accuracy is significantly different from chance at each time point. This can be addressed by computing a one-sample t test against chance at each time point and then using a correction for multiple comparisons (e.g., the false discovery rate correction).
- You might instead have multiple conditions for each participant, with the same set of classes in each condition. For example, you might have 4 face identities (4 classes) and vary whether the faces are attended or ignored (the two conditions). In this case, the basic question is usually whether decoding accuracy differs between conditions, which you can test using paired-samples t tests at each time point (again with a correction for multiple comparisons). This was the approach used to compare decoding accuracy for face identity and emotional expression in Figure 1.
- Another possibility is that you have two groups of participants (e.g., a patient group and a control group), with the same set of classes for each group. You would decode the classes in each participant and then compare decoding accuracy across the two groups with an independent-samples t test at each time point (again with a correction for multiple comparisons). However, there are some special considerations for comparing groups, especially if they may differ in signal-to-noise ratio (see Bae et al., 2020 for a detailed discussion). If you had 3 groups instead of 2 groups, you could do a one-way ANOVA at each time point instead of an independent-samples t test at each time point.
- For each of the cases described above, you could compute mean decoding accuracy across a set of time points and compare this mean decoding accuracy against chance, across conditions, or across groups.
- When our decoding method is used without any fancy variations (such as training at one time point and testing at a different point), below-chance decoding accuracy can only happen as a result of noise. It is therefore appropriate to use 1-tailed statistical tests when comparing against chance, which will increase your statistical power. However, 2-tailed tests should be used when comparing across conditions or groups unless you have a strong a priori hypothesis about which condition or group should have greater decoding accuracy.
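As a rough illustration of the simplest scenario (one group, comparison against chance at each time point), the sketch below computes a one-sample t statistic and applies the Benjamini-Hochberg false discovery rate procedure to a list of p values. The function names and the example numbers are hypothetical. Converting a t statistic into a p value requires the t distribution, which is not in Python's standard library, so in practice that step would come from your statistics package.

```python
import math
import statistics

def one_sample_t(accuracies, chance):
    """t statistic (df = n - 1) for mean decoding accuracy versus chance."""
    n = len(accuracies)
    se = statistics.stdev(accuracies) / math.sqrt(n)
    return (statistics.mean(accuracies) - chance) / se

def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the test survives FDR correction."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that the k-th smallest p value is <= (k / m) * q.
    k_max = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            k_max = rank
    significant = [False] * m
    for i in order[:k_max]:
        significant[i] = True
    return significant

# Hypothetical decoding accuracies for 5 participants at one time point (chance = 0.5)
accuracies = [0.55, 0.60, 0.50, 0.65, 0.70]
print(f"t = {one_sample_t(accuracies, chance=0.5):.3f}")

# Hypothetical p values, one per time point, obtained from a statistics package
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
print(benjamini_hochberg(p_values))
```

Each time point gets its own t test, and the correction is then applied across the whole set of time points rather than to each test separately.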
ERPLAB does not perform statistical tests. Instead, you can export the decoding accuracy values to a text file, which you can then import into your favorite statistical analysis package.