Interpretation of Decoding Accuracy and Important Confounds

What does it mean if decoding accuracy is above chance? And what does it mean if decoding accuracy differs across time points, conditions, or groups? This section will answer these questions. It will also address two common potential confounds: differences in trial-to-trial variability and differences in the number of trials. If you don’t understand these confounds, you are likely to draw invalid conclusions. In the following subsections, we will discuss three key principles for interpreting ERP decoding results.

The best way to think about decoding accuracy is as an index of the information content of the scalp ERP signal. That is, decoding accuracy is monotonically related to the amount of information (in the pattern of voltage over the scalp) about which class was presented. If you find that decoding is significantly above chance, then you can conclude that the scalp ERP signal contained information about which class was presented. If you find that decoding accuracy is greater in one condition than in another, you can conclude that the scalp ERP signal contained more information about which class was presented in the first condition than in the second condition. If you find that decoding accuracy is greater in one group of participants than in another group, you can conclude that the scalp ERP signal contained more information about which class was presented in the first group than in the second group. However, these conclusions are valid only if the noise level and number of trials have been properly controlled.

[Figure 5. Panels A and B show the same two classes with their centroids far apart versus shifted close together; Panel C shows the effect of greater trial-to-trial variability.]

To understand this better, let’s think about the factors that determine decoding accuracy. One key factor is how different the scalp distributions are for the two classes. This is illustrated in Panels A and B of Figure 5. The sets of dots for the two classes are the same in these two panels, except that they’ve been shifted closer together in Panel B. As a result, only 7 of the 10 dots for each class fall on the correct side of the decision line in Panel B, whereas 9 of the 10 dots for each class were on the correct side of the line in Panel A. (Note, however, that 70% decoding accuracy is actually quite good, and Panel B shows that decoding can pick up on quite subtle differences between classes.)
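To make the geometry concrete, here is a minimal simulation of the idea in Panels A and B. This uses Python with scikit-learn rather than ERPLAB, and all of the numbers are invented for illustration: the same cloud of "dots" is easy to decode when the centroids are far apart and harder when they are shifted closer together.

```python
# Minimal sketch of Figure 5, Panels A and B: identical clouds of simulated
# averaged-ERP "cases", decoded at two different centroid separations.
# All numbers are made up; this is not ERPLAB's implementation.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_cases = 100                                        # simulated cases per class
noise = rng.normal(0, 1.0, size=(2 * n_cases, 2))    # shared trial-to-trial noise
labels = np.repeat([0, 1], n_cases)

for separation in (3.0, 1.0):    # "Panel A" (far apart) vs. "Panel B" (close)
    # Shift each class's centroid; the dots themselves are identical.
    shift = np.where(labels[:, None] == 0, -separation / 2, separation / 2)
    X = noise + shift
    acc = cross_val_score(SVC(kernel="linear"), X, labels, cv=5).mean()
    print(f"separation {separation}: decoding accuracy = {acc:.2f}")
```

Running this shows high accuracy at the larger separation and noticeably lower (but still above-chance) accuracy when the centroids are moved closer together.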

The other main factor that determines decoding accuracy is trial-to-trial variability in the ERPs. This is illustrated in Panel C of Figure 5, which shows how greater variability in the data used to train and test the decoder leads to poorer decoding accuracy. Because we typically average multiple trials together prior to decoding, and averaging reduces variability, we can reduce the variability by increasing the number of trials being averaged together. So, if you want to decode something subtle, you should record the cleanest possible data and have as many trials as possible. In my lab, we put extra effort into recording clean data in our decoding experiments. For example, because the signal-to-noise ratio is influenced by electrode impedances (Kappenman & Luck, 2010), we take extra time during the electrode application procedure in our decoding experiments to reduce the electrode impedances to below 20 kΩ.

I like to think of this in terms of the cross-validation procedure, in which averaged ERPs created from one subset of the trials are used to train the decoder and averaged ERPs from a separate subset of trials are used to test the decoder. Any differences between the averaged ERPs used to train the decoder and the averaged ERPs used to test the decoder may “shift” the test cases to the wrong side of the decision line, reducing our decoding accuracy. One very different trial may be enough to produce a change in the averaged ERP that leads to a large reduction in your decoding accuracy, especially if you don’t have a large number of trials per average.
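Here is a minimal sketch of that cross-validation scheme, again in Python with invented numbers (this is not ERPLAB's actual implementation): single trials are split into subsets, each subset is averaged into one "case", and a decoder is trained on some averaged cases and tested on the held-out ones. It also shows how increasing the number of trials per average improves accuracy.

```python
# Sketch of cross-validation with averaged ERPs: each fold's trials are
# averaged into one "case", and we hold out one case per class for testing.
# Trial counts, noise level, and class patterns are all invented.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_chan, n_folds = 16, 5
class_means = rng.normal(0, 0.3, size=(2, n_chan))   # true scalp patterns

def averaged_cases(trials_per_average):
    """Simulate single trials, then average each fold's trials into one case."""
    X, y = [], []
    for c in (0, 1):
        for fold in range(n_folds):
            trials = class_means[c] + rng.normal(0, 2.0,
                                                 size=(trials_per_average, n_chan))
            X.append(trials.mean(axis=0))   # averaging shrinks noise by sqrt(n)
            y.append(c)
    return np.array(X), np.array(y)

for n_trials in (5, 40):
    accs = []
    for _ in range(50):                     # many simulated "experiments"
        X, y = averaged_cases(n_trials)
        for test_fold in range(n_folds):    # leave one averaged case per class out
            test = np.array([test_fold, n_folds + test_fold])
            train = np.setdiff1d(np.arange(2 * n_folds), test)
            clf = SVC(kernel="linear").fit(X[train], y[train])
            accs.append(clf.score(X[test], y[test]))
    print(f"{n_trials} trials per average: accuracy = {np.mean(accs):.2f}")
```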

With these factors in mind, we can return to the idea that decoding accuracy is an index of the amount of information in the pattern of voltage over the scalp about which class was presented. If the differences between the voltage patterns for the different classes are large relative to the variability within each class, then the voltage patterns contain a lot of information about which class was presented. You can also think of decoding accuracy as reflecting the signal-to-noise ratio, where the signal is the difference in the voltage patterns produced by the different classes and the noise is the trial-to-trial variability in these voltage patterns. Consequently, decoding accuracy can be a very useful metric, because we’re often interested in the ability of the brain to form consistently different neural representations of different stimuli (e.g., a consistent difference between the neural representation of person 1 and person 2), and the concept of consistency is well captured by the ratio of the signal to the trial-to-trial variability in that signal.
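In signal-detection terms, this intuition can be written compactly. Assuming roughly Gaussian trial-to-trial variability (a heuristic summary, not a formula that ERPLAB computes), the discriminability of two classes and the effect of averaging are:

```math
d' = \frac{\lVert \mu_1 - \mu_2 \rVert}{\sigma}, \qquad \sigma_{\mathrm{avg}} = \frac{\sigma}{\sqrt{n}}
```

where μ₁ and μ₂ are the mean scalp patterns for the two classes, σ is the trial-to-trial variability, and n is the number of trials in each average. Decoding accuracy increases monotonically with d′, which is why both larger class differences and more trials per average improve decoding.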

2.1. Potential Confound: Differences in Noise or Number of Trials Between Conditions or Groups

Although it can be a good thing that decoding accuracy reflects the signal-to-noise ratio, this also creates the potential for confounds because much of the noise (the trial-to-trial variability) arises from non-neural sources (e.g., muscle noise, skin potentials, movement artifacts). This leads to our first principle of interpreting ERP decoding results:

  • Principle 1: If you are comparing the decoding accuracy across different time periods, across different experimental conditions, or across different groups of subjects, any differences in noise level could produce differences in decoding accuracy.

For example, imagine that you are trying to decode face identity in people when they are fully rested and when they are sleep-deprived. Movement artifacts that are unrelated to face identity might be larger when people are sleep-deprived, producing greater trial-to-trial variability in the voltage at a given latency. This difference in artifacts could cause decoding accuracy to be lower when people are sleep-deprived, even if their face representations are not impacted by the amount of sleep.

You might think you could solve this problem by rejecting trials with artifacts. However, rejecting trials reduces the number of trials per average, and the noise level of an average depends on the number of trials it contains, so decoding accuracy can be influenced by artifact rejection. This leads to our second principle of ERP decoding:

  • Principle 2: Differences in the number of trials across experimental conditions or groups of participants will cause differences in decoding accuracy between these conditions or groups. It is usually necessary to equate the number of trials across groups and conditions.

For example, if more trials are rejected because of artifacts or behavioral errors in a patient group than in a control group, this could cause lower decoding accuracy in the patient group. Similarly, if you are comparing children of different ages, younger children might have more trials rejected because of movement artifacts than the older children, and this could cause lower decoding accuracy in the younger children. You need to equate the number of trials across groups or conditions to obtain valid decoding results.

Although these issues sometimes arise with traditional ERP methods, the noise level and number of trials typically have a much larger impact on ERP decoding accuracy than on most standard ERP analyses. As a result, you must be extra careful to avoid confounds related to the noise level or the number of trials when performing decoding.

ERPLAB’s decoding tools allow you to equate the number of trials used for decoding, which we call “flooring” the number of trials (because the group or condition with the smallest number of trials provides a floor that we use across groups or conditions).
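If you are preparing data outside of ERPLAB, the flooring operation itself is just random subsampling down to the smallest trial count. Here is a minimal sketch in Python; the variable names and data layout are invented for illustration, and the same function would floor classes as well as groups or conditions.

```python
# Minimal sketch of "flooring": randomly subsample every group/condition
# down to the smallest available trial count. Illustrative only; use
# ERPLAB's own flooring option for real analyses.
import numpy as np

rng = np.random.default_rng(2)

def floor_trial_counts(trials_by_condition):
    """Subsample each condition, at random, down to the smallest trial count."""
    n_floor = min(len(t) for t in trials_by_condition.values())
    floored = {}
    for name, trials in trials_by_condition.items():
        keep = rng.choice(len(trials), size=n_floor, replace=False)
        floored[name] = trials[np.sort(keep)]
    return floored

# e.g., 143 usable trials in one condition, 98 in the other -> keep 98 of each
conditions = {"rested": rng.normal(size=(143, 16)),
              "sleep_deprived": rng.normal(size=(98, 16))}
floored = floor_trial_counts(conditions)
print({name: data.shape[0] for name, data in floored.items()})   # both 98
```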

ERPLAB has no way to control the amount of trial-to-trial variability, so that’s up to you. This is particularly challenging when comparing decoding accuracy for different groups of participants (e.g., a patient group and a control group). Gi-Yeul Bae and I faced this issue when we wanted to compare decoding accuracy for people with schizophrenia and matched control subjects. To address this issue, we developed a method for quantifying the signal, the noise, and the signal-to-noise ratio for decoding (Bae et al., 2020). It’s not a perfect method, but it’s certainly better than nothing. If you come up with a better method, please let me know!

2.2. Potential Confound: Differences in Noise or Number of Trials Between Classes

In addition to worrying about differences between groups or conditions in the number of trials, you also need to worry about differences in the number of trials across the classes being decoded within a group or condition. This is because decoding accuracy becomes artificially inflated when the classes differ in the number of trials (or in the noise level). In the face identity decoding experiment, for example, if we rejected more trials because of artifacts for ID 1 than for the other IDs, we would have different numbers of trials across the four IDs, and this would artificially inflate the decoding accuracy.

In general, differences in noise level or number of trials across classes can lead to above-chance accuracy even when the classes produce equivalent single-trial brain signals. This is illustrated in Figure 6, which shows two classes that have the same average scalp distribution but differ in variability. In other words, the average position of the dots in the two-dimensional space (the centroid) is the same for the blue dots in Class 1 and the red dots in Class 2, but the dots are more spread out in Class 2 than in Class 1. This typically arises when there are fewer trials for one class than for another, leading to noisier averaged ERPs in one class than in the other.

When the trial-to-trial variability is greater for one class than for the other (usually because of a difference in the number of trials per average), the class with greater variability will produce more training and test cases with extreme values. Consequently, any case that falls farther away from the centroid is more likely to belong to the class with greater variability. The decoder will pick up on this regularity, resulting in a decision line that is relatively far away from the centroid. This decision line will lead to above-chance decoding accuracy even though the classes differ only in noise level. Panels A and B of Figure 6 show two different decision lines that would work well with the same data. Decoding accuracy will never be perfect in these scenarios, but it can certainly be above chance. This leads to our third principle of decoding:

  • Principle 3: Differences between classes in trial-to-trial variability or in the number of trials will artificially inflate the decoding accuracy. It is therefore necessary to equate the number of trials used for decoding across classes.

We have found that even modest differences in the number of trials per class will lead to bogus above-chance decoding accuracy. For example, imagine that we are excluding trials with behavioral errors or artifacts, so there are a few more trials for Class 1 than for Class 2 in some participants and a few more trials for Class 2 than for Class 1 in other participants. This can cause above-chance decoding accuracy in those participants, regardless of which class contains more trials. Moreover, because decoding is performed separately for each participant, this is still a problem even if the average number of excluded trials across participants is the same for the two classes.

[Figure 6. Two classes with the same centroid but different variability; Panels A and B show two different decision lines that yield above-chance accuracy.]
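Here is a toy demonstration of the situation in Figure 6, in Python with arbitrary simulated numbers rather than real ERP data: both classes have exactly the same centroid, but the class with greater variability can still be decoded above chance by a linear decision boundary that sits away from the centroid.

```python
# Toy demonstration of Principle 3: identical centroids, unequal variability
# (as when one class has fewer trials per average), yet above-chance decoding.
# All values are arbitrary; the point is only that noise differences decode.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)
n_cases = 200
# Class 1: many trials per average -> low-variance cases around the centroid.
class1 = rng.normal(0, 1.0, size=(n_cases, 2))
# Class 2: few trials per average -> the SAME centroid, but more spread.
class2 = rng.normal(0, 3.0, size=(n_cases, 2))
X = np.vstack([class1, class2])
y = np.repeat([0, 1], n_cases)

acc = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
print(f"decoding accuracy with identical centroids: {acc:.2f}")
```

In this simulation the accuracy reliably lands well above .5, even though the two classes differ only in variability, because the decoder learns that extreme cases tend to belong to the high-variability class.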

The solution to this problem is simple but often painful: You need to subsample at random from the available trials for the class with more trials to equate the number of trials across classes (i.e., “flooring” the number of trials across classes). If you have two classes of stimuli that are equally probable, and you don’t need to exclude many trials because of artifacts or incorrect behavioral responses, equating the number of trials for the two classes might mean excluding only a few extra trials for each participant. However, the more classes you have, the greater the probability that one of them will have quite a few trials rejected. For example, if you have 20 classes, you might reject an average of 10% of trials per class, but you might have one class in which 50% of the trials were excluded. You’d then need to exclude 50% of the trials from every class prior to decoding. Alternatively, you might design an experiment in which you intentionally make one class frequent and the other class infrequent, in which case you will need to exclude a large proportion of trials from the frequent class before decoding to equate the number of trials per class. It can be very painful to exclude so many trials given that you know this will decrease your decoding accuracy.

However, it is essential to equate (or “floor”) the number of trials before decoding. ERPLAB therefore includes this as a “strongly recommended” option. You can disable this option, but you will get a strongly worded warning message, because there is almost never a good justification for disabling it.