Single cell RNA seq - settylab/single-cell-primers GitHub Wiki

Introduction

Single-cell RNA-seq is one of the great technological advances that has revolutionized the field. The scale and complexity of the data has allowed us to pose questions that were previously not possible. Our lab predominantly use the 10X Chromium system for scRNA-seq data generation. A quick introduction is available here.

The 10X system is based droplet microfluidics solution for single-cell RNA-seq. Two papers: InDrops1 & Drop-seq2 pioneered the use of microfluidics for single-cell RNA-seq profiling. Each cell is co-encapsulated in a microfluidic droplet with a primer bead which contains a unique identifier for cell and transcript. The cells are lysed within droplets and polyA-mRNA captured by the bead primers. As a result, each mRNA is tagged by a cell barcode, which is unique to the cell and an unique molecular identifier (UMI), which is approximately unique to each cell and gene combination. The tagged mRNAs are pooled across the cells for library prep.

While scRNA-seq offers tremendous promise, the data has plenty of caveats that we should be aware of:

The most important issue with scRNA-seq is drop-outs - only ~5-10% of the transcripts in a cell are captured by the primer beads and therefore the gene expression counts are a significant underestimate of the true transcriptome in each individual cell. Dropouts manifest in two important ways: (a) genes that are expressed might show up in the count matrix as 0 counts, (b) non-zero counts in the gene expression matrix is an underestimate of the true expression of the gene.
Ambient RNA: Ambient RNA is mRNA in the solution when cells are being encapsulated. This is typically a major issue when there is extensive cell death.
Doublets: The encapsulation of cells into beads is driven by a Poisson process. Therefore, there is a non-zero probability of two (or more) cells being encapsulated into the same microfluidic droplet creating doublets. Doublet rate increases with higher number of cells.
3' technologies: The data we typically use only capture the 3'ends of poly-adenylated transcripts.

A detailed description of single-cell RNA analysis is in 3. This primer presents the approaches/tools that have worked best in our lab.

Count matrix generation

10X CellRanger is used to generate a cells X genes count matrix. Typically, the Genomics Core runs the pipelines and count matrices are one of the deliverables. See here for example and description of outputs from the CellRanger pipeline. The link shows the output for the PBMC 10k dataset from 10X, which we will use in this primer. Important to examine the summary file to check if there are warnings or errors with the sample. The starting point for our analysis is filtered_feature_bc_matrix.h5 which is a H5 file representing gene expression. CellRanger performs a number of filtering steps to generate this matrix and is a good starting point for our analysis and preprocessing.

Scanpy

We extensively use scanpy for preprocessing and analysis. scanpy has excellent tutorials and documentation and as mentioned before, this primer picks out the settings and tools that has worked best in our hands.

Preprocessing

Cell filtering

Molecule count and doublets

While 10X does provide a filtered matrix, the filtering is permissive. A plot of the molecule count distribution per cell is shown below.

Low molecule "cells" are a result of ambient RNA and very high molecule cells are most likely doublets. We will manually set thresholds to remove the outlier cells. While this is rather subjective, it works reasonably well in practice. EmptyDrops is a method to automatically remove empty droplets but gives mixed results. Doublets can be detected with different methods: DoubletDetection and scrublet both work well in practice. Both methods are based on creation of artificial doublets in the data to detect true doublets. The set thresholds are shown below.

When molecule count distribution does not provide a clear picture, it is good to consider the gene count distribution as well.

The molecule count distribution after filtering is shown below.

Mitochondiral filter

Based on the project, another important filter is to remove cells with high mitochondrial content, an indicator of dying/dead cells. A plot of total molecule counts -vs- fraction of mitochondrial content is shown below.

The cell death rate is very low here - not surprising for PBMCs. We typically remove any cells with >20% of mitochondrial molecules. Cells below are colored by density.

Normalization

Molecule counts need to be normalized across cells given the large differences. Typically, normalization is performed by dividing the counts by total molecule counts for each cell. The data is then log transformed - note it is important to use a pseudo-count of 0.1 and not 1. See commentary regarding this here. There are more sophisticated methods to normalize data such as scran and scTransform, which also work well in practice.

Gene selection

The cellranger count matrix provides information has entries for all genes, including these that are not detected across any cells. We typically first remove genes that are detected in <10 cells. We then do a more active feature selection and choose the most highly variable genes in the data. While the number is subjective, 1500-2500 genes are typically sufficient to derive the signal in the data. The signal in the data can be masked if gene selection is not performed since cells artificially appear more similar to each other because of house keeping genes.

Analysis

PCA

PCA is the first dimensionality reduction step after preprocessing. PCA is an essential step to overcome the noise due to dropouts. We take advantage of the covariate structure among genes to identify "meta-genes" that are represented by principal components. While these are linear combinations of genes, they provide a good starting point to perform downstream analysis. The number of components can be chosen by either the knee point or fraction of variance explained (E.g.: 85%)

Clustering and differential expression

Clustering of cells can be performed using leiden or phenograph. Clustering of cells is useful for both annotating cell types and for differential expression. While scanpy does provide a test for differential expression, MAST performs takes a more statistically robust approach and performs better. MAST is written in R and a simple usage is documented here.

Diffusion maps and trajectory analysis

Diffusion maps are a non-linear dimensionality reduction technique that have been very powerful in describing the underlying manifold in single-cell technique. Diffusion maps are particularly useful for modeling pseudo-time trajectories from single-cell data. We compute diffusion maps using the Palantir package. Palantir is a trajectory detection algorithm developed by the lab. More details and usage is available here. A general introduction and benchmarking to trajectory detection algorithms is here.

Imputation

Imputation is technique used for correcting the drop-outs in the data. While a number of techniques are available for imputation, the use of imputation should be done in a judicious and careful manner. We typically use MAGIC imputation and use it predominantly for gene expression visualization and gene trend estimation. Imputed data is also powerful for annotating cell types using known gene markers.

Visualization

UMAPs

UMAPs have emerged as a powerful visualization for single-cell data. UMAPs can be computed using scanpy using principal components as the starting point. UMAPs with cells colored by log10 molecule count for the PBMC data is shown below.

This is an important QC plot to ensure that normalization worked properly and there is no bias based on molecule count distribution.

The same UMAP now colored by leiden clusters is shown below.

Force directed layouts

Force directed layouts are an alternative visualization which have proven to be very powerful for visualizing data that are continuous trajectories. See here for a demonstration of how force directed layouts can be used to accurately represent the structure in the data. The choice of kernel is vitally important to derive an accurate representation of the data. We use a density adaptive kernel as described in the MAGIC and Palantir papers

While force directed layouts are not suitable for datasets with discrete clusters, an example visualization for the PBMC data is shown below (colored by leiden clusters).

Gene expression

Imputed gene expression can be visualized on UMAPs/FDLs to identify cell types. Expression of few marker genes is shown below .

Gene signatures

Gene expression signatures are very useful to annotate cell types or understand behavior of signaling pathways. The aggregate expression of signature can typically better inform cell types rather than expression of individual genes.

Cell types

Cell type annotations can also be visualized on UMAPs.

Batch effect correction

Batch effect correction might be necessary when comparing multiple samples. The best way to diagnose batch effects is to perform the standard analysis and plot the batches on UMAPs/FDLs. Cells separated by batch on UMAPs is a strong indication of batch effects. Harmony, Scanorama andmnnCorrect all work well. All of these methods are available through scanpy.

This paper has a thorough comparison of different batch effect correction methods.

Word of caution: When dealing with patient data, differences between patients might be true biological effects and not batch effects - this needs to be assessed on a case-by-case basis.

Sample notebook - Putting it all together

Jupyter notebook detailing all the analysis for the PBMC RNA dataset is available here

Alternative approaches

Deep learning approaches such as scVI provide an all-in-one generative model for imputation, batch effect correction and cell type label transfer. The key drawback of such approaches is that the reduced dimensional space from the auto encoder are bit of a black box with no easy to interpret what the dimensions represent. There is a lot of on-going work in understanding this black box and making the deep learning models more interpretable.

Additional reading

Additional tutorials for the TFCB course are available at Lecture 18. Recording of the course lecture are available at Lecture 18 and Lecture 19

A more technical treatment of the biases and correction approaches to single-cell RNA-seq data is available here.