Theory

HMM-MAR produces a probabilistic segmentation of multichannel data into states, or brain networks, that are characterized by a certain signature, which contains spatial and (depending on the choice of the state distribution) spectral information. The inference procedure simultaneously estimates when the states occur and the parameters that describe their probability distribution. This section aims to provide a short theoretical introduction to what this means and how it is done. Of note, the toolbox is named HMM-MAR for historical reasons, but other types of state probability distributions are available besides the MAR (multivariate autoregressive model). Currently, the available observation models are

  • The multivariate autoregressive model, or MAR.
  • The Gaussian model, with state-specific mean and covariance, with state-specific mean only, or with state-specific covariance only. The covariance itself can be diagonal or a full matrix.
  • The time-delay embedded HMM, where each state can be interpreted as a truncated Gaussian process.
  • A decoding model, where each state is a decoding model relating brain data and stimulus or action.
  • An encoding model, where each state is an encoding model relating brain data to stimulus features, which can be inverted to give decoding-style predictions.
  • A PCA model, where each state is a PCA projection, simultaneously encoding a time-varying functional-connectivity pattern and a time-varying dimensionality reduction.

This section, however, places some emphasis on the MAR, because the other models (except the PCA model) can be considered specific cases of the MAR (e.g. the Gaussian model is a MAR with autoregressive order equal to zero). A general reference, which also contains details of the inference, is Vidaurre et al. (2018). To go straight to the practicalities (including ideas of what to do with the HMM results), visit the User Guide.
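As a rough illustration of how these choices map onto the toolbox configuration (a hedged sketch: the option names below are those used in the User Guide, which should be consulted for the exact conventions and defaults):

```matlab
options = struct();
options.K = 8;                  % number of states, common to all variants

% MAR observation model: autoregressive order > 0
options.order = 5;

% Gaussian observation model (a MAR of order 0), with state-specific
% mean and full covariance:
% options.order = 0;
% options.zeromean = 0;         % 0 = model the mean; 1 = mean fixed to zero
% options.covtype = 'full';     % or 'diag' for a diagonal covariance

% Time-delay embedded (TDE) model, defined by a window of lags:
% options.order = 0;
% options.embeddedlags = -7:7;
```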

Model description

The HMM-MAR has two key ingredients: the Hidden Markov Model (HMM) and the multivariate autoregressive (MAR) model. An HMM describes a time series as a sequence of states, where each state has its own model of the observed data (i.e. the observation model), and the probability of being in a given state at each time point depends on which state was active at the previous time point. The transition probabilities (how likely it is to be in state j given that state i was active just before) are also estimated by the model.

The observation model that gives the toolbox its name is the MAR model, which characterizes the behaviour of the time series through linear historical interactions between the observed signals from different brain regions. In other words, a MAR model aims to predict the multichannel signal value at each time point as a linear combination of the signal values at previous time points. MAR models are able to characterize the frequency structure of the data, and, by being multivariate, they can capture interactions (e.g. coherence) between multiple brain regions. Each state is therefore associated with a different set of multi-region autoregression coefficients describing the neural oscillations. When the order of the MAR model is set to zero, the states are modelled as Gaussian distributions, in which case the segmentation is based on instantaneous activation and connectivity. For instance, the Gaussian distribution (of which the MAR distribution can be considered an extension) is what we used to model fMRI data in Vidaurre et al. (2017b). Other state distributions are also available: for instance, the time-delay embedded distribution described in Vidaurre et al. (2017c), or the leading eigenvector dynamics analysis approach introduced by Cabral et al. (2017). We refer to those papers for details, and to the User Guide for instructions on how to set up the HMM to use them.
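For reference, the state-specific MAR model described above can be written in a standard form (the notation here is generic and may differ from that of the toolbox papers):

$$
\mathbf{x}_t \;=\; \mathbf{w}_0^{(k)} \;+\; \sum_{l=1}^{P} \mathbf{x}_{t-l}\, A_l^{(k)} \;+\; \boldsymbol{\epsilon}_t,
\qquad \boldsymbol{\epsilon}_t \sim \mathcal{N}\!\left(\mathbf{0}, \Sigma^{(k)}\right),
$$

where $\mathbf{x}_t$ is the vector of channel values at time $t$, $k$ is the state active at time $t$, $\mathbf{w}_0^{(k)}$ is the state mean, $A_l^{(k)}$ are the state's autoregression coefficient matrices, $P$ is the MAR order, and $\Sigma^{(k)}$ is the state noise covariance. With $P = 0$, only the mean and covariance remain and the state reduces to a Gaussian distribution.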

We provide a full Bayesian formulation of the model, where every element of the model is described in terms of its probability distribution. Unfortunately, there is no closed-form, analytical solution for the values of these parameters given the data; on these grounds, we use variational Bayes inference, described elsewhere. There is also the possibility of using stochastic variational inference, which is convenient when the input data set is very large. The output of the inference is then the HMM-MAR model, including the states' description and the transition probabilities, together with the state time courses, that is, when each state is active. A supplementary output is the Viterbi path, defined as the most likely sequence of states. This is conceptually different from the state time courses, which relate to the states that are individually most probable at each time point. Also, the Viterbi path is a non-probabilistic assignment: unlike the state time courses, which are probabilities, the Viterbi path assigns each time point categorically to one and only one state. For a more detailed description of the Viterbi path and the Viterbi algorithm, refer for example to Bishop's book.
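As a minimal sketch of how these outputs are obtained in practice (assuming the hmmmar interface described in the User Guide; variable and option names are indicative):

```matlab
% data: (total time points x channels) matrix, or a cell with one entry per subject
% T:    vector with the number of time points of each session/trial
options = struct();
options.K = 6;             % number of states
options.order = 3;         % MAR order (0 gives a Gaussian observation model)

% hmm:   the estimated model (state distributions and transition probabilities)
% Gamma: state time courses (probability of each state at each time point)
% Xi:    joint probabilities of consecutive states
% vpath: Viterbi path (hard assignment of one state per time point)
[hmm,Gamma,Xi,vpath] = hmmmar(data,T,options);
```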


MAR parametrisation

MAR models have an order parameter P, which defines how far back in time we go to predict each time point. The most typical MAR model uses a complete parametrization, where the prediction of each time point t uses lags (t-1,t-2,...,t-P). The crucial parameter is then the order P. Especially when the number of regions is high, it is easy for the MAR to overfit the data. That is, a relatively simple MAR model can explain a lot of the variance, which will induce the HMM-MAR inference process to drop most of the states of the model by letting a few (or even one) dominant states control the entire time series, thus hindering the discovery of quasi-stationary connectivity networks. There are a number of ways to get around this issue:

  • To use a low order, for example 3, and, if needed, increase that value and see how the segmentation changes.
  • To work on a low-dimensional version of the data using PCA (see below).
  • To avoid using the cross-channel terms, such that each state is characterized by its own spectra but not by the cross-spectra.

How to implement these options is detailed in the User Guide. In general, the best choice depends on the data and typically requires some trial and error for a first-time analysis. The use of the free energy (described below) as an automatic model selection criterion is also an option.
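An indicative sketch of how these mitigations might be configured (option names as in the User Guide; check there for the exact conventions):

```matlab
% 1) Keep the MAR order low, and only increase it if needed
options.order = 3;

% 2) Work on a PCA-reduced version of the data
%    (a value in (0,1) is read as the proportion of variance to retain,
%     an integer as the number of components)
options.pca = 0.90;

% 3) Drop the cross-channel autoregression terms via options.S
%    (a channels-by-channels matrix; see the User Guide and the
%    'Targeting specific connections' section below for the encoding)
```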


Targeting specific connections

The HMM-MAR toolbox allows certain connections to be clamped to a global value so that they do not drive the state transitions. This can be used to investigate the transient dynamics of a particular set of connections. For example, we can force the HMM dynamics to be driven only by the spectral power of each channel separately by setting the cross-channel connections to zero; we can focus on coherence modulations by holding the diagonal elements of the autoregression coefficient matrices to a fixed maximum-likelihood value; or we might have a particular hypothesis in which we are interested only in the connections depicted in green in the figure below.

[Figure: network with the connections of interest highlighted in green]
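A hypothetical sketch of how such a hypothesis might be encoded (here assuming that options.S is the channels-by-channels matrix that marks which autoregression coefficients are modelled state-specifically; the exact value convention is given in the User Guide):

```matlab
ndim = 4;                         % number of channels/regions in this toy example
S = zeros(ndim);                  % hypothetical encoding: 0 = clamped, 1 = modelled
S(logical(eye(ndim))) = 1;        % keep the within-channel (diagonal) terms
S(1,2) = 1; S(2,1) = 1;           % plus one specific connection of interest
options.S = S;
```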


Model selection

In order to determine the number of HMM states, K, a possible strategy is to fix a maximum number and let the Bayesian inference discard some of the states if there is insufficient evidence to support their presence in the data. The free energy, a quantity that approximates the model evidence, can also be used to choose K. For the other parameters (such as the order), we can use the free energy or leave-one-out cross-validation, where we first run the model on part of the data and then estimate the state time courses for the left-out data (without modifying the states), obtaining the data likelihood or the fractional error. Note that cross-validation cannot be used to choose the number of states because, as the model has more states to choose from at each time point, it will always prefer the solution with the highest K. Cross-validation is perhaps more principled than the free energy, but computationally more demanding, especially when the number of trials/segments/subjects is high.

The toolbox provides functions for both computing the free energy and performing cross-validation.
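A minimal sketch, assuming the hmmfe and cvhmmmar functions with the interfaces suggested in the User Guide (names and argument orders should be checked there):

```matlab
% Free energy of an already-trained model
[hmm,Gamma,Xi] = hmmmar(data,T,options);
fe = hmmfe(data,T,hmm,Gamma,Xi);

% Cross-validated error for a given configuration (e.g. to compare MAR orders)
options.cvfolds = 10;        % assumed name for the number of folds
mcv = cvhmmmar(data,T,options);
```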

It is important to understand that the free energy does not have any biological motivation. If the free energy prefers a certain number of states, that does not mean that there is a discrete number of things in the brain and that we are finding that number. The free energy is just a trade-off between accuracy in representing the data and model complexity; that is, it is an information criterion, not a biological one. Having a free-energy-optimal number only means that the amount of data allows for that level of complexity without much overfitting. As a consequence, if we augment the data, or make the time series longer, the free energy will simply ask for more states. In the case of M/EEG data, for example, it is common for the free energy to decrease (improve) almost indefinitely as we increase the number of states. This is purely a consequence of M/EEG data having fine temporal resolution and many time points. In these cases, it is often preferable to choose the number of states using reproducibility criteria; for example, if we split the data into two halves, are the states reproducible across halves? (See this as an example.)
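An indicative sketch of such a split-half check (assuming a Gaussian observation model, and that the getFuncConn helper returns the covariance matrix of a given state; matching states across runs may in practice require more care):

```matlab
% Train the same configuration on the two halves of the data
hmm1 = hmmmar(data1,T1,options);
hmm2 = hmmmar(data2,T2,options);

K = options.K;
R = zeros(K);                        % similarity between states across halves
for k1 = 1:K
    C1 = getFuncConn(hmm1,k1);       % covariance of state k1, first half
    for k2 = 1:K
        C2 = getFuncConn(hmm2,k2);   % covariance of state k2, second half
        R(k1,k2) = corr(C1(:),C2(:));
    end
end
% High values of max(R,[],2) suggest that each state found in the first half
% has a close counterpart in the second half
```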


Spectral description of the states

Because of the MAR correspondence with the frequency domain, the state segmentation is driven by the spectral content of each separate channel and the spectral relations between channels. Once the HMM-MAR model is trained and we have the state time courses, we can estimate the spectral content of the model, including the following measures:

  • Power spectral density (PSD)
  • Coherence
  • Partial directed coherence (PDC)
  • Phase

The estimation of these quantities can be done by one of these two methods:

  1. A parametric estimation, using the MAR coefficients.
  2. A nonparametric estimation, using a novel statewise version of the multitaper introduced in (Vidaurre et al., 2016). This would be the way to do it for example if the states were specified to be Gaussian distributions (i.e. MAR order equal to zero).

In the parametric case, we can use the MAR coefficients as returned by the model. These coefficients correspond to a maximum a posteriori estimate (which means that they have been regularised by the prior), and can have an incomplete parametrization. For the purposes of spectral estimation, we prefer to re-estimate the MAR coefficients using maximum likelihood, as is standard in the signal processing literature, and to use a complete parametrization, i.e. lags (t-1,t-2,...,t-P). This is the default behaviour of the toolbox.

In the non-parametric case, we instead take advantage of the inferred HMM state time courses as temporal windows for estimating the state-specific spectral properties using a weighted version of the multitaper. PDC does not have a closed-form solution in the non-parametric case, so we need to use the so-called Wilson factorization, as described in Vidaurre et al. (2016) and, previously, in Jachan et al. (2008).
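A minimal sketch of both estimations (assuming the hmmspectramar and hmmspectramt interfaces described in the User Guide; option names are indicative):

```matlab
options_spec = struct();
options_spec.Fs = 250;          % sampling frequency in Hz
options_spec.fpass = [1 45];    % frequency range of interest

% 1) Parametric: re-estimate the MAR coefficients by maximum likelihood
%    (passing [] instead of the trained hmm triggers the re-estimation)
options_spec.order = 15;
spectra_mar = hmmspectramar(data,T,[],Gamma,options_spec);

% 2) Non-parametric: statewise weighted multitaper
spectra_mt = hmmspectramt(data,T,Gamma,options_spec);
```

The returned structures contain the state-specific estimates (power, coherence, PDC, phase); see the User Guide for the exact field names.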

In general, because of its benefits in terms of accuracy and robustness, and because the parametric method requires specifying the order P, the non-parametric method is recommended unless there is a strong frequency component in the state time courses themselves, in which case the spectral content of the state time courses can contaminate the final estimation. To verify this, the same multitaper tool can be used on the state time courses to inspect for possible peaks in their spectrum.


Semisupervised learning

So far, we have described the HMM-MAR in a completely unsupervised setting. An alternative use of the toolbox is for prediction. The method can be used when we have pre-specified the value of the hidden state for some segments of the time series, such that these assignments remain fixed throughout the learning process. In a typical application, we would have labelled and unlabelled trials, the objective being to assign labels to the unlabelled ones. This can be considered a case of semisupervised learning and is of potential interest, for instance, in the field of brain–computer interfaces.

In the figure, the coloured segments represent periods of time where the states have been fixed, each colour being a different state, while the uncoloured part represents the segment where the HMM-MAR needs to estimate the state time courses.

[Figure: semisupervised setting, with fixed (coloured) and to-be-estimated (uncoloured) state segments]


Sign disambiguity

Source-localized MEG recordings suffer from an undesirable property: the equivalent current dipole model used in many source reconstruction techniques (e.g. beamforming) is unable to determine the polarity of a dipolar source, given that a dipole with a particular orientation generates the same magnetic field pattern at the sensor array as one with the same orientation but opposite polarity. As such, the signs of the estimated source-space time courses are arbitrary across cortical locations and across different recording sessions or subjects. As a result, the covariance between a pair of time courses may be positive in some sessions and negative in others, which means that data from different sessions or subjects cannot be straightforwardly compared or averaged at the group level. Whereas this is not a problem when we use the amplitude envelope of the signals (Baker et al., 2014), it is a potential issue for MAR models, where we want to work with the raw time courses.

In the HMM-MAR toolbox, the sign of each brain area is chosen based on the assumption that the covariance between each pair of brain areas (channels) has the same sign across sessions (segments of data with independent source reconstructions). Under this assumption, the correct combination of signs is the one that maximizes the absolute sum of the covariances across all pairs of channels and sessions, as the magnitude of this sum is highest when the signs agree across sessions. The gain function that we aim to maximize is represented in the figure.
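One way to write such a gain function, consistent with the description above (the notation here is generic, not necessarily that of the toolbox papers), is

$$
G(\mathbf{s}) \;=\; \sum_{i \neq j} \left| \sum_{n} s_i^{(n)}\, s_j^{(n)}\, C_{ij}^{(n)} \right|,
$$

where $C^{(n)}$ is the covariance matrix of session $n$, and $s_i^{(n)} \in \{-1,+1\}$ is the sign assigned to channel $i$ in session $n$; the combination of signs maximizing $G(\mathbf{s})$ is the one retained.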

[Figure: gain function maximized by the sign-disambiguation procedure]

To achieve further robustness, we can include the lagged autocovariance matrices (for a certain number of lags) in the gain function, at the expense of some computation time.
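A minimal usage sketch (assuming the findflip and flipdata functions referred to in the User Guide; option names such as maxlag are indicative):

```matlab
options_sf = struct();
options_sf.maxlag = 4;          % number of lags included in the gain function

% flips: (no. of sessions x no. of channels) matrix of estimated sign assignments
flips = findflip(data,T,options_sf);
data = flipdata(data,T,flips);  % apply the signs before running the HMM
```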