B2 IV: scRNAseq - bcfgothenburg/HT23 GitHub Wiki

Course: HT23 Bioinformatics 2 (SC00038)

The purpose of these exercises is to introduce you to the post-processing of scRNA-seq data. You will assess the quality of the data and filter accordingly. Then you will follow the current standard for expression analysis of scRNAseq data.

Your task is to identify clusters of genes with similar expression as well as possible marker genes for these clusters that could be significant for mammary epithelial cell development.

Due to the limit in time, you won't be processing raw data, but rather interpreting the results from a pre-generated count data.

The Data

Today there are a number of sources for both raw single cell data and processed count data for running similar types of analysis available through 10X Genomics as well as other resources such as ArrayExpress. The dataset that you will work with comes from the scRNAseq package from Bioconductor which contains gene-level counts for a collection of public scRNAseq datasets as SingleCellExperiment objects (ready to be processed in R).

Specifically, we will work with mouse mammary gland single-cell RNAseq data from Bach et al. (2017). Characterising mammary epithelial cells and their regulation during development is crucial for profiling breast cancer transcriptomics thus today we will explore the different developmental stages and the marker genes that might be distinguish as significant.

features.tsv

barcodes.tsv

Ensembl ID	Gene symbol
ENSMUSG00000089699	Gm1992
ENSMUSG00000102343	Gm37381
ENSMUSG00000025900	Rp1
ENSMUSG00000109048	Rp1
ENSMUSG00000025902	Sox17

Cell
AAACCTGAGGCCATAG-1
AAACCTGCATCGGGTC-1
AAACGGGTCAAACGGG-1
AAAGATGAGATAGCAT-1
AAAGATGAGATATGCA-1
AAAGATGAGTACGTTC-1

matrix.mtx

	cell_1		cell_2	cell_3	cell_4	cell_5	...	cell_n-2	cell_n
Gene_1	0	0	0	5	1	0	...	1	0
Gene_2	0	0	0	3	1	0	...	0	0
Gene_3	1	0	0	0	0	1	...	0	1
Gene_4	1	0	1	0	1	0	...	1	0
Gene_5	0	1	0	0	671	8	...	0	0

scRNAseq counts from the standard CellRanger pipeline will provide you with the files represented in the tables above where the barcodes file contains cellular barcodes present in dataset, a feature file has the IDs of quantified genes, and a matrix file contains the count values where rows are associated the gene IDs and columns correspond to the cellular barcodes. Since we are using data from a SingleCellExperiment object, we will have the same information provided in a slightly different format.

You will follow the typical steps in the preprocessing of scRNAseq data for downstream analysis and interpreting the quality metrics. There are many single cell RNA data analysis tools available both in and outside of R. Some popular R packages include SingleCellExperiment, Monocle, slingshot, scater and the current standard for single cell analysis which we will use today is Seurat. This practical follows a similar workflow to what is outlined in the Seurat vignettes.

Most of the code is copy/paste. Modify when asked, this will be indicated in CAPITAL letters

Read data

First install the following libraries (if you already have them just load them):

library(Seurat)if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")

BiocManager::install("scRNAseq")
BiocManager::install("ensembldb")
BiocManager::install("scater")

install.packages("dplyr")
install.packages('Seurat')

The, load them:

library(dplyr)
library(Seurat)
library(scRNAseq)
library(ensembldb)
library(scater)

Now we can load the mammary epithelial cell counts from the scRNAseq package in R:

# Obtain scRNAseq data for all samples in Bach et. al 2017
sce <- BachMammaryData(samples= c("NP_1", "NP_2", "G_1", "G_2", "L_1", "L_2", "PI_1", "PI_2"))
sce

Let's look a some attributes of the dataset:

dim tells you the rows and columns of the dataset
rownames shows the name of the gene that is being quantified
colData gives the different meta-data (or annotation) we have for each value, in this case we know the Barcode, the Sample and the Condition.

Q1. How many cells are there in the dataset?

Q2. What are the different conditions we are testing? Hint: use unique(sce$Condition)

Seurat has interoperability between different single cell data formats, in this case with a SingleCellExperiment object. If we were to try to convert our dataset directly we would see an error since our dataset is raw counts only. In order to use our SingleCellExperiment data with Seurat, we need to create an assay for logcounts. This is a typical example of data wrangling!

# Perform logNormCounts on SingleCellExperiment object
sce <- logNormCounts(sce)

# Original counts
sce@assays@data@listData[["counts"]][8:18,1:5]

# Log counts
sce@assays@data@listData[["logcounts"]][8:18,1:5]

# Original gene annotation
head(rownames(sce))

# Replace Ensembl IDs with GeneName and EnsemblID
rownames(sce) <- paste0(rowData(sce)$Symbol, ".", rowData(sce)$Ensembl)

# Modified gene annotation
head(rownames(sce))

# Convert SingleCellExperiment object to SeuratObject
sce.to.seurat <- as.Seurat(sce, 
                           counts = "counts", 
                           data   = "logcounts")
sce.to.seurat

# Create the standard RNA assay object from your counts
sceAssay <- CreateAssayObject(counts = sce.to.seurat@assays$originalexp@counts)
sce.to.seurat[["RNA"]] <- sceAssay

Now we can follow the standard pre-processing workflow for scRNA-seq data in Seurat.

QC and Filtering

First we must select and filter cells from the dataset using some QC Metrics that are commonly used in single cell analysis:

Number of genes in each cell (nFeature_RNA), molecules that have at least one UMI count
Total number of molecules within each cell (nCount_RNA), the total counts per cell
Percentage of reads that are mitochondrial (percent.mt)

# Calculate the percentage of reads that map to the mitochondrial genome.
sce.to.seurat[["percent.mt"]] <- PercentageFeatureSet(sce.to.seurat, pattern = "^mt.")
head(sce.to.seurat[["percent.mt"]])
 
# Visualize the QC metrics in order to determine the appropriate thresholds for removing unwanted cells.
VlnPlot(sce.to.seurat, features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), ncol = 3)

# OPTIONAL
# If the above seems unclear, the FeatureScatterPlot is another way to visualize the QC metrics

FeatureScatter(sce.to.seurat, 
               feature1 = "nCount_RNA", 
               feature2 = "percent.mt")

FeatureScatter(sce.to.seurat, 
               feature1 = "nCount_RNA", 
               feature2 = "nFeature_RNA")

Q3. What does the percentage of mitochondrial reads tell you about the cell?

Q4. What does a low nFeature_RNA indicate?

Q5. What does a high nCount_RNA indicate?

Based on these plots, we need to decide on suitable thresholds to filter our data. Each dataset is generated under different conditions and the same assumptions may not apply to all experiments, so take a little time to understand if what you see makes sense.

Think about the thresholds you would choose. Set up at least one threshold more besides the MAX_percent.mt:

MAX_nFeatures: ____
MIN_nFeatures: ____
MAX_nCount: ____
MIN_nCount: ____
MAX_percent.mt: ____

Run the following code replacing the values you chose under # Thresholds. Remember to remove the # of the values you will be using in both under # Thresholds and the subset function:

# Thresholds
# MAX_nFeature = ADD_YOUR_THRESHOLD
# MIN_nFeature = ADD_YOUR_THRESHOLD
# MAX_nCount   = ADD_YOUR_THRESHOLD
# MIN_nCount   = ADD_YOUR_THRESHOLD
MAX_percent.mt = ADD_YOUR_THRESHOLD

# Use your suggested thresholds for the QC metrics to subset the data without the unwanted cells.
sceDataSubset <- subset(sce.to.seurat, 
                        subset = percent.mt < MAX_percent.mt 
                            #  & nFeature_RNA > MIN_nFeature  
                            #  & nFeature_RNA < MAX_nFeature
                            #  & nCount_RNA > MIN_nCount
                            #  & nCount_RNA < MAX_nCount 
                        )
sceDataSubset

VlnPlot(sceDataSubset, 
        features = c("nCount_RNA", "nFeature_RNA", "percent.mt"), 
        ncol     = 3)

Q6. Why do we want to remove potential outliers?

Q7. How many cells and genes did we start with and how many were removed after filtering?

Normalization

Once we have removed unwanted cells we next normalize the data:

# Original values 
sceDataSubset[["originalexp"]]@data[8:18,1:5]

# The default values are provided here for clarity
sceDataSubset <- NormalizeData(sceDataSubset, 
                               normalization.method = "LogNormalize", 
                               scale.factor         = 10000)

# Normalized values 
sceDataSubset[["originalexp"]]@data[8:18,1:5]

Q8. Why do we Normalize the data?

Finding Variable Features

In order to shrink the data further for downstream analysis and highlight biological signals, we next need to perform feature selection to identify a subset of our genes that have high variation from cell to cell. The vst method and 2000 top genes are used by default, however, different methods can be useful depending on the research question:

sceDataSubset <- FindVariableFeatures(sceDataSubset, 
                                      selection.method = "vst", 
                                      nfeatures        = 2000)

# Identify the 10 most highly variable genes
top10 <- head(VariableFeatures(sceDataSubset), 10)
top10

# plot variable features with labels
LabelPoints(plot   = VariableFeaturePlot(sceDataSubset), 
            points = top10, 
            repel  = TRUE, 
            xnudge = -2, 
            ynudge = -2)

Make a note of the top10 genes.

Q9. Do you think 2000 variable features is reasonable for the number of features we started with? What about after filtering?

Scaling

The last step before dimensional reduction is scaling the data so that the mean expression across cells is 0, the variance is 1, and so that highly-expressed genes do not overpower the results. There are other options to remove unwanted sources of variation such as SCTransform but the basic ScaleData function is used for simplicity.

sceDataSubset <- ScaleData(sceDataSubset, 
                           features = rownames(sceDataSubset))

GetAssayData(sceDataSubset, slot = "scale.data")[8:18,1:5]

Dimensionality reduction

Now that the data has been normalised, scaled and we found variable features, we can perform dimensionality reduction on the data in order to cluster gene expression profiles.

The first step in reducing the dimensionality of your data is a PCA using the variable features as input.

# Current reductions
sceDataSubset@reductions

sceDataSubset <- RunPCA(sceDataSubset, 
                        features = VariableFeatures(sceDataSubset))

# Current reductions
sceDataSubset@reductions

There are a few ways to visualize the cells and features in the PCA such as VizDimReduction(), DimPlot(), and DimHeatmap() which you can explore if you have time.

# Visualize top genes associated with reduction components
VizDimLoadings(sceDataSubset, 
               dims      = 1:4, 
               reduction = "pca", 
               nfeatures = 20)

# 2D scatter plot of the dimensional reduction
DimPlot(sceDataSubset, 
        reduction = "pca", 
        split.by  = "Condition")

# OPTIONAL
# you can use "Sample" instead of "Condition"
# and group.by instead of split.by

# Heatmap featuring cells and genes sorted by their principal component scores
DimHeatmap(sceDataSubset, 
           dims     = 6, 
           cells    = 100, 
           balanced = TRUE, 
           ncol     = 3)

These visualizations are useful to explore our data and have an idea of how many PCs we need to include for downstream analyses.

Q10. Select one visualization and describe it.

Determing the dimensionality of your data

Next we need to remove excess noise from too many principal components for later clustering and expression analysis. Besides the methods above, we can visualize significant PCs and rank the PCs in order to determine the appropriate number of dimensions for which to use in further processing. One of the methods to accomplish this is ElbowPlot():

ElbowPlot(sceDataSubset, 
          reduction = "pca")

Q11. According to the ElbowPlot, around which PCs do we see that the majority of true variance is captured?

Finding the true dimensionality of a dataset can be challenging, it is an error and trial task. There is a statistical based method (JackStrawPlot()) that can give you a better idea of which PCs to select in more complex data analysis. It is a slow method so you can come back to it later on:

# OPTIONAL
# Perform a statistical based PC ranking

#sceDataSubset <- JackStraw(sceDataSubset, 
                            num.replicate = 100, 
                            dims          = 20)

#sceDataSubset <- ScoreJackStraw(sceDataSubset, 
                                 dims = 1:20)

#JackStrawPlot(sceDataSubset, 
               dims      = 1:15, 
               reduction = "pca", 
               xmax      = 0.5, 
               ymax      = 1)

Clustering

Now we will use the PCs you selected previously to create a distance matrix to separate cells into groups of similar feature expression patterns.

PC = ADD_YOUR_SELECTED_pc

sceDataSubset <- FindNeighbors(sceDataSubset, 
                               dims = 1:PC)

sceDataSubset <- FindClusters(sceDataSubset, 
                              resolution = 0.5)

table(Idents(sceDataSubset)) # n clusters with the amount of cells

The resolution dictates the amount of clusters you want to obtain, values above 1.0 gives you more clusters, while values below 1.0 generates less clusters.

Q12. How many clusters did you generate with resolution = 0.5?

Non-linear Dimensional Reduction

tSNE and UMAP are two non-linear dimensionality reduction algorithms that are typically used following PCA to place similar cells together and reduce in the dimension space based on the PCs identified previously.

# Non-linear reduction
sceDataSubset <- RunUMAP(sceDataSubset, 
                         dims = 1:PC)

sceDataSubset <- RunTSNE(sceDataSubset, 
                         dims = 1:PC)

# Current reductions
sceDataSubset@reductions

# Visualize the seurat clusters
DimPlot(sceDataSubset, 
        reduction = "umap")

DimPlot(sceDataSubset, 
        reduction = "tsne")

Going back to the paper

Q13. Did you identify the same amount of clusters? If not, try using different parameters for both dimensions and resolution. Comment on how the results differ

To visualize the separation of cells by condition, we would need to change the current active identity as shown below:

# Current identity
unique(Idents(sceDataSubset))

# Change identity to the conditions
Idents(sceDataSubset) <- "Condition"

# Current identity
unique(Idents(sceDataSubset))

# Visualize the seurat clusters
DimPlot(sceDataSubset)

# Return Idents to seurat_clusters for rest of analysis
Idents(sceDataSubset) <- "seurat_clusters"

Q14. Given the above plot, do you think the parameters used for variable features, dimensional reduction and clustering accurately represent the transcriptomic profile of our MEC data?

Variable Features

In previous sections, we identified highly variable features (or genes) and used those to inform the PCA and later dimensional reduction. Let's take a look at these features and continue the the transcriptomic profiling of the MEC data.

Plotting the Features

As a reminder, let's plot the top 10 most variant genes:

LabelPoints(plot   = VariableFeaturePlot(sceDataSubset), 
            points = top10, 
            repel  = TRUE, 
            xnudge = -2, 
            ynudge = -2)

Now, let's visualize where these top features are located within our clusters.

FeaturePlot(sceDataSubset, 
            features = top10, 
            ncol     = 3) + 
            theme_minimal(base_size = 8)

Q15. How are the genes distributed?

Finding Marker genes

Now we can identify differentially expressed features that define our clusters. There are two main functions and any number of combinations to identify marker genes in their respective clusters. Due to our time limit we cannot explore all of these but, if you are interested you can take a look at the different clusters and parameters.

First we can look at all the features that identify cluster 1,

# Require a feature to be detected at a minimum percent in either group of cells
cluster1.markers <- FindMarkers(sceDataSubset, 
                                ident.1 = 1, 
                                min.pct = 0.25)

head(cluster1.markers, 
     n = 5)

Let's look at all the markers that define every cluster, but only at the positive markers. This can take a while, so have some coffee/lunch in the meantime:

mce.markers <- FindAllMarkers(sceDataSubset, 
                              only.pos        = TRUE, 
                              min.pct         = 0.25, 
                              logfc.threshold = 0.25)

mce.markers %>%
    group_by(cluster) %>%
    top_n(n = 2, wt = avg_log2FC)

mce.markers

Q16. Do these markers match the top10 features you looked at previously?

Visualizing Markers

It is important to consider the research question when deciding how to visualize your markers. To explore how a specific marker is expressed in different clusters, the ViolinPlot can again be useful. Let's try with one of the markers identified in the paper,Krt18:

# In our dataset, we pasted together the GeneName and EnsemblID so we will need to search for the name in our original dataset from above
grep("^Krt18.*", 
     rownames(sceDataSubset), 
     value = TRUE, 
     perl  = TRUE)

VlnPlot(sceDataSubset, 
        features = "Krt18.ENSMUSG00000023043")

# This same information can be visualized in our cluster map
FeaturePlot(sceDataSubset, 
            features = "Krt18.ENSMUSG00000023043")

Q17. Which cluster(s) appear to show the most expression for this feature? Does that match with the paper?

To get an overview of all the markers we identified in our dataset, we can use a heatmap to visualize the top 5 markers for each cluster.

mce.markers %>%
    group_by(cluster) %>%
    top_n(n = 5, wt = avg_log2FC) -> top5_cluster
head(mce.markers)

DoHeatmap(sceDataSubset, 
          features = top5_cluster$gene) + 
          NoLegend() +
          theme(text = element_text(size = 10))

Q18. Were you able to identify the same markers as they did in the paper? What could have resulted in these differences?

Great! now you have a better idea on how to analyze single cell data!

Home: Bioinformatics 2

Developed by Alina Orozco, 2021.