Compositionality and You

Introduction

Hi! Bioinformatics, the statistical analysis of DNA sequencing data, is hard. This page is meant to be an introduction to the main overarching problems that cloud these analyses, so that the specific problems with each analysis type can be described without rehashing the same thing over and over again.

Read me

I'm going to present a high level overview of this sorta stuff, the highlights! But many people have gone much more in-depth into the matter. Here are a few papers that I've found to be really good references and worth the read if you have a minute:

Microbiome Datasets Are Compositional: And This Is Not Optional

A primer and discussion on DNA-based microbiome data and related bioinformatics analyses

Identifying biases and their potential solutions in human microbiome studies

Next gen sequencing

Next generation sequencing (NGS) is great! It allows for massively parallel DNA sequencing; this means you can sequence many, many different DNA sequences at the same time. These sequences can be of a specific section of DNA, or barcode, targeted by primers (referred to as metabarcoding or the marker gene approach), or the sequences can be based on any random DNA floating around in the sample (referred to as whole genome sequencing (WGS)). Metabarcoding is considerably cheaper (~$42 per sample) than WGS (~$200 per sample), and thus metabarcoding is commonly used despite the fact that WGS can generate very similar data. Both of these techniques are considered "short read": individual sequences tend to be 150-200 base pairs (bp) in length, so when the forward and reverse reads are merged we can get a combined read around 400 bp in length.

The alternative to this is "long read" sequencing, which can generate sequences thousands of bp long. Previously, long read sequencing had a considerably higher error rate than short read sequencing (1-5% compared to 0.06-0.24% per base); however, modern advances have made significant improvements to long read sequencing and some companies now boast a 0.1% error rate.

Regardless, we are working with metabarcoding data, which has its own set of challenges.

Primers and parables

When we perform metabarcoding we've chosen specific primers to target specific regions of the genome. These regions are typically variable regions, and which variable regions matter differs depending on the organisms you care about (fungal versus bacterial species) and the taxonomic depth you care about (did these species diverge ten years ago, thousands of years ago, or millions of years ago?).

Bacterial and archaeal primers

Bacterial and archaeal species have variable regions in the 16S ribosomal RNA (rRNA) gene. There are 9 variable regions in this barcoding region (referred to as V1-V9). The full region is about 1.5 kbp long, so short read sequencing cannot "pick up" the entire thing. This means that typically one or two of the variable regions must be picked for a specific metabarcoding application. Unfortunately, these variable regions are not equally variable across all species, so by choosing your primers you are creating systematic bias in which species you sequence. The primers you choose also influence which species are amplified due to GC content, the number of mismatches, and so on. Choosing which primer set you use is a critically important decision that affects the conclusions you come to and which papers you can compare your results to (i.e. you should only compare your results directly with other experiments that use the same primers as you do).

Fungal primers

Fungal primers are a bit different than bacterial primers; there are two regions commonly used: internal transcribed spacer 1 (ITS1) and ITS2. As with bacterial and archaeal primers, the choice of which region to use will fundamentally change which species you detect.

How to decide on a primer set?

To reiterate, the primers you choose are terribly important to the validity and comparability of your study. So where do you start? If you are working with fungal primers I quite enjoyed this paper. If you are working with bacterial species perhaps this paper, and for archaeal species this one.

A quick side point

Some people have started to combine long and short read technologies to get around the problems with purely short read sequencing. The applications of this are beyond the scope of this work but feel free to check it out!

Systematic problems in NGS data

The major problem with NGS data is that the number of reads metabarcoding generates is rather arbitrary, driven more by the machine and sequencing conditions than by anything biological; further, this data does not fit standard statistical distributions in all contexts and contains a lot of zeros. Overall, this means the common statistical tests used in most other data analyses are invalid with microbiome datasets, and an approach that accounts for compositionality and sparsity is necessary.

Sparsity

Most of the taxa found across samples will be present in only a couple of samples; this means there are a lot of zeros within our taxa tables. This is called sparsity, and it will impact the choice of differential abundance methods, ordination methods, and other statistical tests used. Naught all zeros in sequence count data are the same is a good introduction to this concept, give it a skim, especially figure 1.
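
As a concrete illustration, here is a minimal Python sketch (using a made-up taxa count table, not real data) of how you might measure sparsity as the fraction of zero entries:

```python
import numpy as np

# Hypothetical taxa count table: rows are samples, columns are taxa.
# Real tables often have thousands of taxa and are mostly zeros.
counts = np.array([
    [120,  0, 35,   0, 0],
    [  0, 88,  0,   0, 4],
    [ 15,  0,  0, 210, 0],
])

sparsity = np.mean(counts == 0)           # fraction of zero entries
prevalence = np.mean(counts > 0, axis=0)  # fraction of samples each taxon appears in

print(f"Sparsity: {sparsity:.0%}")        # Sparsity: 60%
print("Per-taxon prevalence:", prevalence)
```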

Compositionality

Briefly, compositionality refers to data that is constrained to an arbitrary constant sum. This is exactly the situation NGS datasets, which have an arbitrary total number of reads, fall under! It means that the value of any one feature depends on the values of the other features; therefore, when comparing samples you can only compare the relative abundances of features (i.e. taxa) to each other, as a comparison of raw counts would be a comparison of two arbitrary values!
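
To make this concrete, here is a small Python sketch with made-up counts: taxa B and C are identical in both samples, yet their relative abundances differ, because taxon A changed and everything is constrained to sum to one.

```python
import numpy as np

# Two hypothetical samples: taxa B and C have identical counts in both,
# only taxon A differs (say it amplified better in sample 2).
sample_1 = np.array([100, 50, 50])   # taxa A, B, C
sample_2 = np.array([400, 50, 50])

rel_1 = sample_1 / sample_1.sum()
rel_2 = sample_2 / sample_2.sum()

# B and C did not change in absolute terms, but their relative abundances
# drop because everything is forced to sum to 1.
print(rel_1)   # [0.5  0.25 0.25]
print(rel_2)   # [0.8  0.1  0.1 ]
```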

So what's to be done?

Just for fun, there is no standardization on best practices for microbiome data!!! whoooo!!!! The first thing to do is use a compositional approach to analyzing your data; this means applying a centered log-ratio transformation (or a related transformation) to your data before analyzing it with statistical tests that are affected by compositionality. How this is applied will be discussed individually within each statistical methodology.

Another quick side point

Want to skirt around some of these problems with compositionality and whatnot? Well, you can turn your relative abundance data into absolute abundances with other data!! For my project I will be using count data (where I physically count the number of microbes present) and combining it with metabarcoding data so I can compare absolute abundances of bacteria and archaea!
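
As a hedged sketch of the idea (the numbers and the cells/mL measurement here are hypothetical, not from my project), combining a per-sample total count with metabarcoding relative abundances is just a rescaling:

```python
import numpy as np

# Hypothetical inputs: relative abundances from metabarcoding and an
# independently measured total microbial count (e.g. from direct counting).
relative_abundance = np.array([0.50, 0.30, 0.20])   # taxa A, B, C in one sample
total_cells_per_ml = 2.0e6                          # hypothetical count data

# Scaling the relative abundances by the measured total gives estimated
# absolute abundances (cells/mL) per taxon.
absolute_abundance = relative_abundance * total_cells_per_ml
print(absolute_abundance)   # [1000000.  600000.  400000.]
```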

The centered log-ratio transformation and more!

A lot of this will be based on A review of normalization and differential abundance methods for microbiome counts data, so maybe give it a look through! A variety of normalization methods have been used throughout the history of microbiome statistical analysis, and a lot of them are still supported by common microbiome analysis packages, so it's important to have a brief understanding of the different options and why the centered log-ratio transformation is the best choice.

The centered log-ratio transformation

The centered log-ratio (clr) transformation is used within a compositional data analysis (CoDA) approach. For each sample, the clr takes the geometric mean of the read counts across taxa and expresses each taxon's count as the log of its ratio to that per-sample average, so comparisons are relative to that average rather than to an arbitrary total. Note: since logarithms are used, zeros in the dataset pose problems. Therefore, it is common to apply the clr to a dataset where 1 has been added to every count to remove all zeros (this is referred to as adding a pseudocount). Alternatively, many people (220 citations on Web of Science) have started applying a robust clr (rclr) transformation to deal with the zeros (and therefore sparsity). The rclr designates each zero as a missing value and uses matrix completion to estimate the count that should have been observed for each taxon with an observed value of zero. This is similar to pseudocounting; however, it uses information from other samples rather than imposing a uniform baseline. All in all, the clr addresses compositionality, which makes standard multivariate analysis techniques valid.
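
Here is a minimal numpy sketch of the clr with a pseudocount (made-up counts; packages such as scikit-bio provide their own implementations, so treat this as an illustration rather than the canonical version):

```python
import numpy as np

def clr(counts, pseudocount=1):
    """Centered log-ratio transform of one sample's taxa counts.

    A pseudocount is added first so that zeros do not break the logarithm.
    """
    x = np.asarray(counts, dtype=float) + pseudocount
    log_x = np.log(x)
    # Subtracting the mean of the logs is the same as dividing by the
    # geometric mean and then taking the log.
    return log_x - log_x.mean()

sample = [120, 0, 35, 0, 4]                 # hypothetical taxa counts for one sample
print(clr(sample))                          # each value is relative to the sample's geometric mean
print(np.isclose(clr(sample).sum(), 0.0))   # clr values always sum to ~0
```

(The rclr, in contrast, skips the pseudocount and leaves the zeros as missing values to be handled downstream.)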

Rarefying

"Rarefying is a popular but widely criticized technique". Need I say more? Rarefying works by randomly discarding reads until all samples have the same amount of reads. Rarefying is just bad, never discard your data! It has the potential to reduce statistical power and it doesn't address the problems with compositionality either!

Scaling

The simplest way to address differences in read counts across samples is to apply total sum scaling (TSS), which is simply converting counts to relative abundances (much like we did earlier in our relative abundance graphs). Unfortunately, a few highly abundant taxa, systematically selected for by earlier choices (i.e. primers), can have a strong influence on differential abundance tests performed on TSS-normalized data, and TSS does not address compositionality either.
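
A quick Python sketch of TSS on a made-up count table, just to show it is nothing more than dividing each sample by its total:

```python
import numpy as np

# Hypothetical taxa count table: rows are samples, columns are taxa.
counts = np.array([
    [5000,  120,  30],
    [ 900, 4000,  10],
])

# Total sum scaling: divide each sample by its total read count so that
# every row sums to 1, i.e. relative abundances.
tss = counts / counts.sum(axis=1, keepdims=True)
print(tss)
print(tss.sum(axis=1))   # [1. 1.]
```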