Normalization, transformation and models - jsgounot/metagenomic-pipelines GitHub Wiki

This is mostly a summary page to help me with MaasLin2 usage. Please see this page first (related to this post) for that specific case.

This page ends up being a general summary of all these methods, which are far from intuitive (at least for me). You can still look at this MaasLin2 page for combination possibilities.

Generalized linear model

This first paragraph exists because I thought at one point about redoing MaasLin2 in Python. Someone tried to do it in R and did not find the same results. I wanted to do this for two reasons:

  1. Make things easier to launch without an external call
  2. Be able to debug some situations, like this one

It does not seem impossible at all; here are some resources for GLM online:
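As a minimal illustration (my own sketch, not MaasLin2 code), ordinary least squares is the simplest GLM (Gaussian family, identity link) and can be fitted in closed form in pure Python:

```python
def fit_simple_lm(x, y):
    """Minimal ordinary least squares fit of y = a + b*x.
    OLS is the simplest GLM (Gaussian family, identity link);
    other families, like those used by MaasLin2's non-LM models,
    need an iterative fit (IRLS) or a library such as statsmodels."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sum(
        (xi - mx) ** 2 for xi in x
    )
    a = my - b * mx
    return a, b

print(fit_simple_lm([0, 1, 2, 3], [1, 3, 5, 7]))   # -> (1.0, 2.0)
```

Reproducing MaasLin2 exactly would also require matching its normalization, transformation and multiple-testing steps, not just the model fit.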

Compositional data

Although seldom acknowledged explicitly, count data generated by sequencing platforms exist as compositions for which the abundance of each component (e.g. gene or transcript) is only coherently interpretable relative to other components within that sample. This property arises from the assay technology itself, whereby the number of counts recorded for each sample is constrained by an arbitrary total sum (i.e. library size). Consequently, sequencing data, as compositional data, exist in a non-Euclidean space that, without normalization or transformation, renders invalid many conventional analyses, including distance measures, correlation coefficients and multivariate statistical models.

Compositional data have two unique properties. First, the total sum of all component values (i.e. the library size) is an artifact of the sampling procedure (van den Boogaart and Tolosana-Delgado, 2008). Second, the difference between component values is only meaningful proportionally [e.g. the difference between 100 and 200 counts carries the same information as the difference between 1000 and 2000 counts (van den Boogaart and Tolosana-Delgado, 2008)].

Source: Understanding sequencing data as compositions: an outlook and review

Compositional data is any data where the numbers make up proportions of a whole. For example, the number of heads and tails you obtain from flipping a coin say 10 times is compositional data. The main feature of these data is that they are 'closed' or constrained in the sense that if I tell you we got 6 heads from flipping a 2-sided coin, you know that there are 4 tails without me providing you with any further information. In probability terms, this means not all components in the data are independent of each other. Therefore, compositional data do not exist in the Euclidean space, but in a special constrained space called the simplex.

Source: Relative vs Absolute: Understanding Compositional Data with Simulations

Normalization

Normalization is the process by which systematic variability is identified and removed. The choice of normalization depends on several factors: Is your input data abundance values or raw counts? Can you use this normalization with your statistical model (for example, does it require positive values)? How well does the normalization usually work with this kind of data?

TSS (Total Sum Scaling) [Abundance]

This method removes technical bias by simply dividing each feature count by the total library size, yielding the relative proportion of counts for that feature.

There are a couple undesirable properties for TSS. First, it is not robust to outliers, which are disproportionately large counts that do not reflect the underlying true abundance. Outliers have frequently been observed in sequencing samples due to technical artifacts such as preferential amplification by PCR (Aird et al., 2011). Several outliers could lead to the overestimation of the library size if not properly addressed. Second, it creates compositional effects: non-differential features will appear to be differential due to the constant-sum constraint (Tsilimigras and Fodor, 2016; Mandal et al., 2015; Morton et al., 2017). Compositional effects are much stronger when the differential features are highly abundant or their effects are in the same direction (not balanced). An ideal normalization method should thus capture the invariant part of the count distribution and be robust to outliers and differential features.

Source: GMPR: A robust normalization method for zero-inflated count data with application to microbiome sequencing data - for RNA data
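The scaling itself is a one-liner; a minimal sketch in plain Python:

```python
def tss_normalize(counts):
    """Total Sum Scaling: divide each feature count by the sample's
    library size to obtain relative proportions summing to 1."""
    library_size = sum(counts)
    return [c / library_size for c in counts]

sample = [100, 300, 600]       # raw feature counts for one sample
print(tss_normalize(sample))   # -> [0.1, 0.3, 0.6]
```

Note how a single outlier count would inflate `library_size` and shrink every other proportion, which is exactly the weakness discussed above.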

CLR (Centered Log-Ratio) [Abundance]

CLR transforms the count data with respect to an in-sample reference. Together with the Additive Log-Ratio (ALR) transformation, it is part of the CoDA (Compositional Data Analysis) toolbox and provides a way to work with compositional data such as abundance values, as an alternative to spike-in normalization. Compared to ALR, the user does not need to identify a reference feature: this transformation uses the geometric mean of the features (such as genes) or components as the reference, and therefore all results have to be interpreted with respect to the geometric mean.

At this stage, it is tempting to translate this assumption to mean that the geometric mean of the genes does not change between control and treatment. Maybe the geometric mean changes, maybe it does not, there is no way to know for sure without orthogonal information beyond the relative counts from sequencing. Most users of DESeq2 and other Differential Expression tools fall for this trap and conclude any significant changes called by the algorithms to mean significant changes in the absolute counts. Instead, these are just significant changes with respect to the geometric mean of all components.

Source: Relative vs Absolute: How to Do Compositional Data Analyses. Part β€” 2

Original paper: The Statistical Analysis of Compositional Data
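A minimal CLR sketch in plain Python. The pseudocount used to avoid log(0) for absent features is my own arbitrary choice; how to handle zeros in CLR is itself a debated topic:

```python
import math

def clr(abundances, pseudocount=1e-6):
    """Centered log-ratio: log of each component relative to the
    geometric mean of all components in the sample. The pseudocount
    (an arbitrary choice here) keeps zeros defined."""
    log_vals = [math.log(a + pseudocount) for a in abundances]
    gmean_log = sum(log_vals) / len(log_vals)   # log of the geometric mean
    return [lv - gmean_log for lv in log_vals]

transformed = clr([0.1, 0.3, 0.6])
print(transformed)   # centered values: they sum to ~0 by construction
```

The sum-to-zero property is the practical face of the caveat above: every CLR value is relative to the geometric mean, not an absolute quantity.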

CSS (Cumulative Sum Scaling) [Abundance - Count]

CSS is based on the assumption that the count distributions in each sample are equivalent for low-abundance genes up to a certain threshold q̂_j^l, which is calculated from the data [20]. First, the median of each l-th quantile across all samples is calculated. The threshold q̂_j^l is set as the largest quantile where the difference between the sample-specific quantiles is sufficiently small (measured based on the distance to the median quantile). Note that the threshold is set to be at least the 50th percentile. The normalization factor for sample j is then computed as the sum over the gene counts up to the threshold q̂_j^l.

Source: Comparison of normalization methods for the analysis of metagenomic gene abundance data

Our first contribution is a novel normalization technique, the cumulative sum scaling (CSS) normalization, which corrects the bias in the assessment of differential abundance introduced by total-sum normalization (TSS), the most commonly used approach. TSS normalizes count data by dividing feature read counts by the total number of reads in each sample, i.e., converts feature counts to appropriately scaled ratios. TSS has been shown to incorrectly bias differential abundance estimates in RNAseq data derived through high-throughput technologies [15, 16] since a few measurements (e.g., taxa or genes) are sampled preferentially as sequencing yield increases, and have an undue influence on normalized counts.

Original paper: Differential abundance analysis for microbial marker-gene surveys
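A heavily simplified sketch of the idea. The real metagenomeSeq implementation chooses the quantile threshold adaptively from the data; here it is fixed at the 50th percentile purely for illustration:

```python
def css_factor(counts, quantile=0.5):
    """Simplified Cumulative Sum Scaling sketch: the normalization
    factor for a sample is the cumulative sum of its counts up to a
    chosen quantile of the positive counts, so extreme outlier
    counts do not inflate the factor (unlike the TSS library size).
    The real implementation derives the quantile from the data."""
    positive = sorted(c for c in counts if c > 0)
    cutoff = positive[int(quantile * (len(positive) - 1))]
    return sum(c for c in counts if c <= cutoff)

sample = [0, 2, 3, 5, 40, 1000]   # one extreme outlier count
print(css_factor(sample))         # -> 10 (2 + 3 + 5; outlier excluded)
print(sum(sample))                # -> 1050 (what TSS would divide by)
```

The contrast between the two printed numbers shows why CSS is more robust to a few preferentially amplified features than TSS.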

TMM (Trimmed Mean by M-value) [Count]

This normalization works by inferring an ideal (i.e. unchanged) reference from a subset of transcripts based on the assumption that the majority of transcripts remain unchanged across conditions. Here, the reference was chosen to be a weighted and trimmed mean (Robinson and Oshlack, 2010), although others have proposed using the median over the transcripts as the reference (Anders and Huber, 2010). The TMM normalizes data to an effective library size based on the principle that if counts are evaluated relative to (i.e. divided by) an unchanged reference, the original scale of the data is recovered. In the language of compositional data analysis, this approach is described as an attempt to 'open' the closed data, and is often criticized on the basis that 'there is no magic powder that can be sprinkled on closed data to make them open' (Aitchison, 2003). Yet, if the data were open originally (and only incidentally closed by the sequencing procedure), this point of view is perhaps extreme.

Source: Understanding sequencing data as compositions: an outlook and review

TMM calculates the normalization factor N_j using robust statistics, based on the assumption that most genes are not differentially abundant and should, on average, be equal between the samples [21]. First, a sample r is chosen as reference. For each sample j, genes are filtered based on their mean abundance and fold-change between the sample and the reference. An adjustment f_j^(r) is then calculated as the mean of the remaining log fold-changes weighted by the inverse of the variance. TMM normalization was performed using the edgeR Bioconductor package (version 3.10.5), which, by default, trims 30% of log fold-changes and 5% of mean abundances [29].

Source: Comparison of normalization methods for the analysis of metagenomic gene abundance data

Original paper: A scaling normalization method for differential expression analysis of RNA-seq data
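A heavily simplified sketch of the M-value trimming idea. The real edgeR implementation also trims on mean abundance (A-values), weights by inverse variance, and handles many edge cases, all omitted here:

```python
import math

def tmm_factor(sample, reference, trim=0.3):
    """Simplified TMM sketch: log2 fold-changes (M-values) of sample
    vs reference proportions for features observed in both, with the
    most extreme fraction trimmed from each tail, then averaged.
    edgeR additionally trims on A-values and weights by inverse
    variance; both are omitted in this sketch."""
    n_s, n_r = sum(sample), sum(reference)
    m_values = sorted(
        math.log2((s / n_s) / (r / n_r))
        for s, r in zip(sample, reference) if s > 0 and r > 0
    )
    k = int(len(m_values) * trim / 2)   # features trimmed per tail
    kept = m_values[k:len(m_values) - k] if k else m_values
    return 2 ** (sum(kept) / len(kept))

# Two samples differing only in sequencing depth: the factor is ~1,
# i.e. no correction is needed beyond the library size itself.
print(tmm_factor([10, 20, 30], [100, 200, 300]))   # -> 1.0
```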

MaasLin2 comparison

As a final evaluation, we assessed the impact of various normalization schemes on the associated statistical modeling, evaluating all combinations of normalizations appropriate for each applicable method (Supplementary Fig. S1C, Methods). Focusing on the best-performing linear models, we found that model-based normalization schemes such as relative log expression (RLE [33]) as well as data-driven normalization methods such as the trimmed mean of M-values (TMM [34]) and cumulative sum scaling (CSS [21]) led to good control of FDR, but they also led to a dramatic reduction in statistical power (Supplementary Figs. S3, S5). In contrast, TSS showed the best balance of performance among all tested normalization procedures, leading to more powerful detection of differentially abundant features. These results have potential implications for other analyses in addition to differential abundance testing, as normalization is usually the first critical step before any analysis of microbiome data, and an inappropriate normalization method may severely impact post-analysis inference. In summary, our synthetic evaluation indicates that TSS normalization, although simplistic in nature, may be superior to other normalization schemes especially in the context of feature-wise differential abundance testing (and more generally for multivariable association testing, as described later), in addition to community-level comparisons as previously described. See here

Transformation

A transformation modifies the distribution of the values.

LOG

Simply apply a log transform to the data. The log transformation reduces or removes the skewness of the original data. The important caveat here is that the original data has to follow, or approximately follow, a log-normal distribution. Otherwise, the log transformation won't work.

The log-transformation is widely used in biomedical and psychosocial research to deal with skewed data. This paper highlights serious problems in this classic approach for dealing with skewed data. Despite the common belief that the log transformation can decrease the variability of data and make data conform more closely to the normal distribution, this is usually not the case. Moreover, the results of standard statistical tests performed on log-transformed data are often not relevant for the original, non-transformed data. We demonstrate these problems by presenting examples that use simulated data. We conclude that if used at all, data transformations must be applied very cautiously. We recommend that in most circumstances researchers abandon these traditional methods of dealing with skewed data and, instead, use newer analytic methods that are not dependent on the distribution of the data, such as generalized estimating equations (GEE). (source)
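Mechanically, the transformation itself is trivial; a minimal sketch, where the pseudocount added to keep zeros defined is an arbitrary choice that can itself influence downstream results:

```python
import math

def log_transform(values, pseudocount=1.0):
    """Log-transform with a pseudocount so zeros stay defined.
    The pseudocount (1, 0.5, half the smallest non-zero value...)
    is an arbitrary choice with downstream consequences."""
    return [math.log(v + pseudocount) for v in values]

print(log_transform([0, 9, 99]))   # -> [0.0, log(10), log(100)]
```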

LOGIT

The logit transformation is the log of the odds ratio, that is, the log of the proportion divided by one minus the proportion. The base of the logarithm isn't critical, and e is a common base. The effect of the logit transformation is primarily to pull out the ends of the distribution. Over a broad range of intermediate values of the proportion (p), the relationship of logit(p) and p is nearly linear. One way to think of this is that the logit transformation expands the ends of the scale, such that small differences in p (say, going from 0.98 to 0.99) have a larger difference on the logit scale. (source)
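A quick sketch illustrating the tail-expanding behavior described above:

```python
import math

def logit(p):
    """Log-odds: log(p / (1 - p)); undefined at exactly 0 or 1."""
    return math.log(p / (1 - p))

# The same 0.01 step in p is much larger near the boundary,
# showing how the logit expands the ends of the scale.
print(logit(0.99) - logit(0.98))   # ~0.70
print(logit(0.51) - logit(0.50))   # ~0.04
```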

AST (Arc Sine Square Root Transformation)

The arcsine transformation (also called the arcsine square root transformation, or the angular transformation) is calculated as two times the arcsine of the square root of the proportion. In some cases, the result is not multiplied by two (Sokal and Rohlf 1995). Multiplying by two makes the arcsine scale go from zero to pi; not multiplying by two makes the scale stop at pi/2. The choice is arbitrary. The effect of the arcsine transformation is similar to the logit, in that it pulls out the ends of the distribution, but not to the extent that the logit does. (source)
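A minimal sketch of the transformation and its [0, pi] range:

```python
import math

def ast(p):
    """Arcsine square root transformation: 2 * asin(sqrt(p)),
    mapping proportions in [0, 1] onto [0, pi]."""
    return 2 * math.asin(math.sqrt(p))

print(ast(0.0), ast(0.5), ast(1.0))   # -> 0.0, pi/2, pi
```

Dropping the factor of two, as some authors do, simply caps the scale at pi/2 instead of pi.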

Comparison

For regression, the logit transformation is preferred for three reasons (Warton and Hui 2011). First, the logit scale covers all of the real numbers instead of being limited to a particular range. For example, just as proportion is limited to 0–1, the arcsine square root scale is limited to 0 to pi. In contrast, the limits of the logit scale are negative infinity and positive infinity. This is particularly important where prediction is needed, as having a bounded scale could give nonsensical results (e.g., more than 100% or less than 0%). Second, the logit scale is more intuitive in that it is the log-odds. This is particularly useful in interpreting slopes from a logistic regression, in which the logit transformation is central. Third, the logit scale correctly models the relationship between the mean and variance in binomial data, where variance is p(1-p)/n.

In multivariate studies, like ordination or cluster analysis, the arcsine transformation is preferred. For ecological data, proportions of 0% are common, such as when a species doesn't occur in a sample. Values of 100% are also possible, such as when only a single species is present in a sample. In these cases, the range of the logit scale becomes a problem because values of negative infinity will occur whenever a species is absent from a sample and values of positive infinity will arise in any monospecific sample. Although one could add a small value to prevent a zero proportion or subtract a small value to prevent a proportion of one, such values are arbitrary and the effect of the chosen value would have to be evaluated. An arcsine square root transformation would be more straightforward for these types of problems. Finally, because both transformations are essentially linear over the range of 0.3–0.7, neither transformation is necessary if all of your data falls in this range. (source)
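A quick numeric check of the near-linearity claim: equal steps in p between 0.3 and 0.7 produce nearly equal steps on both transformed scales:

```python
import math

def logit(p):
    return math.log(p / (1 - p))

def ast(p):
    return 2 * math.asin(math.sqrt(p))

# Equal 0.1 steps in p over 0.3-0.7 give nearly equal steps on
# both transformed scales, i.e. both are close to linear there.
for lo, hi in [(0.3, 0.4), (0.4, 0.5), (0.5, 0.6), (0.6, 0.7)]:
    print(f"{lo:.1f}-{hi:.1f}: logit step {logit(hi) - logit(lo):.3f}, "
          f"ast step {ast(hi) - ast(lo):.3f}")
```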

Model

LM (Linear Models) [Non Count]

Assumes a linear relationship between the predictors and the (normalized and transformed) outcome values.

CPLM (Compound Poisson Linear Models) [Non Count - Positive]

Please see this paper for the cplm package on CRAN.

Apart from that, LM is the only model that works on both positive and negative values (following normalization/transformation), and (as per our manuscript), it is generally much more robust to parameter changes (which are typically limited for non-LM models). Regarding whether to use LM, CPLM, or other models, intuitively, CPLM or a zero-inflated alternative should perform better in the presence of zeroes, but based on our benchmarking, we do not have evidence that CPLM is significantly better than LM in practice. MaasLin2 tutorial

NEGBIN (Negative binomial) [Count]

Negative binomial regression is used to test for associations between predictor and confounding variables on a count outcome variable when the variance of the count is higher than the mean of the count. Negative binomial regression is interpreted in a similar fashion to logistic regression with the use of odds ratios with 95% confidence intervals. Just like with other forms of regression, the assumptions of linearity, homoscedasticity, and normality have to be met for negative binomial regression. Source

ZINB (Zero-inflated Negative Binomial) [Count]

Kind of explicit: a negative binomial model with an additional component that accounts for excess zeros, which are common in microbiome count data.

General paper

Papers