GEx scRNAseq analysis - bcfgothenburg/HT23 GitHub Wiki

Course: HT23 Gene expression analysis using R (SC00034)

For this exercise you we will be using a mouse brain dataset from two different conditions; one with a transgene APPS/PS1 and one WT. You will classify the cells into different cell types and then compare the expression patterns between APPS/PS1 and WT in Microglia cells. If you are interested, have a look at the article by Van Hove, et. al based on these datasets.

Good Luck!

Install packages and load dataset

There are many software packages for analyzing single cell RNA data; Monocle, Scanpy, SLICER, slingshot, and many others. The software we will use today is one of the most common one, Seurat. Seurat takes a count matrix as input and uses several statistical methods to perform clustering and differential expression on the data. This practical is partly based on the Satija Lab tutorial Seurat Vignette.

First you have to load the packages dplyr and Seurat using the library() function.
If you don't have them go ahead and install them using install.packages().
Check that you have the correct version of Seurat, should be 4.something. To do this you can run sessionInfo()
Download the Gene-cell count matrix for "16M old APPS/PS1 and WT aggregate" from CANVAS or from the Brain immune atlas. Unzip the folder (tar xvzf or unzip).
Read in the count data into R using the Seurat function Read10X().
Look at the data to see the structure and dimensions.
Create a Seurat object where counts and all metadata will be stored using createSeuratObject(). When creating the object, remove genes that are expressed in fewer than 3. cell, and remove cells that have less then 200 genes expressed. You can change the numbers if you want.
Take a look at the object again.

Q1. How many genes and cells where removed?

QC quality control

This is a very important step in the analysis. It is here you remove most of the cells and genes that can create biases due to cell duplicates, cell lysis and cells with very few reads in total. The best way of getting to know your data is by plotting.

Calculate the percentage of mitochondrial gene expression for every cell by using PercentageFeatureSet(). Make sure to use the correct pattern "^mt-", in some datasets it can be uppercase and in some lowercase. Assign the new data as metadata in the object object["percent.mt"](/bcfgothenburg/HT23/wiki/"percent.mt") <-. If you are unsure you used the correct pattern, one way to check is to use the sum() function, if the sum is 0 then it's probably not correct.

Now do the plotting

Look at the meta data using head(object[](/bcfgothenburg/HT23/wiki/)). Use the column names when plotting.
Plot the 1) number of genes (nFeature_RNA), 2) number of counts (nCount_RNA) per cell and 3) mt-percentage (percent.mt) per cell using VlnPlot(). If you want to plot all in the same plot then specify the column names as a list features = c('colname','colname','colname') and add ncol = 3 in the end.

Now we will try to decide which cells that might be duplicates or destroyed and needs to be filtered out. Make two new plots:

Use FeatureScatter() to plot the percentage mt on y-axis and counts on x-axis
Do the same but with number of Genes on y-axis.

Have a good look at the scatter plots and try to decide on where you want to put the threshold for filtering. Remember that we want to get rid of the cells with high and low unique feature count, and also cells with high mt-expression.

Filtering

Filter out the cells using the thresholds you decided on based on the plots. Use the Seurat function subset() to do this.

Q2. How many cells and genes do we have left in our dataset?

Normalize and scale the data

Now let's normalize the data so that we can compare the cells to each other.

Normalize the data using normalizeData() function, and use the "Log Normalize" method.

Next we will find a subset of Genes that exhibit the most cell-to-cell variation in the dataset. Default in Seurat is to find 2000 genes, this can however be changed. This genes will then be used for downstream analysis only. This speeds up the analysis and highlights the most important signals in the dataset and reduces the noise.

Use the function FindVariableFeatures() to find these genes. Select vst as selection.method.

Save the top 10 genes by running this command top10 <- head(VariableFeatures(object), 10)

Plot all the features and highlight the variable features in red, this is automatically done using the function VariableFeaturePlot().
Try to add the top10 gene labels to the plot by using LabelPoints(). Fill in the code below:

plot1 <- ...(...)            
LabelPoints(plot = plot1, points = ...)

Now its time to scale the data as an pre-process step before PCA.

Scale the data using ScaleData() . It will by default scale all the most variable genes. Here we would like to scale all genes in our object. Add the argument features=rownames(Object).

Q3. Why do you scale and normalize your data?

Dimensional Reduction using PCA

New its time to reduce the dimensions on the scaled data and visualize the data.

Use RunPCA() with default settings to reduce the dimensions.
Make a Dim-Plot using DimPlot() and specify reduction method.

One way to look at batch effects in our data is by looking at the PCA coloured based on the samples. To do this we need to add sample info to our meta data. Usually different samples will get a different tag on the cell barcodes.

Take a look at the cell barcodes and see if you can see the tags. One way is by using the head(colnames()) function.

The barcodes is usually a "-" followed by a number. In this case, all barcodes ending with -1 is the WT, and the ones ending with -2 is the APP/PI1 sample.

Add this info to the meta data in our object by first adding the numbers sapply(strsplit(colnames(brain_object), split='-'), '[[',2). Remember that you add this meta data by assigning it to object$new_meta.
Look to check that you have the meta data in the object head([email protected]).
Them make a new plot but this time add group.by=.

You can visualize the different PCs by plotting a heat map for every PC.

Now its time to determine how many PCs to use for downstream analysis. To many PCs ads more noise to the data, and to few do not give as enough information. One way of deciding how many PCs to use is by making an elbow plot which shows how many standard variations that is explained in every PC.

Plot an elbow plot using function ElbowPlot().

The number you choose depends on the biological question you have. More PCs might find rare subtypes of cells, but might also add some noise to your analysis.

Q4. How many PCs would you decide to continue with? And why?

Clustering

Now it's time to cluster our cells. Seurat uses a graph-based method to cluster the cells based on the PCs you choose to use.

Start clustering your cells by creating the graph using the function FindNeighbors(), define how many dimensions (in this case PCs) you would like to use.
Cluster your cells using FindClusters(). Here you have to decide on what resolution to use. Normally setting the resolution between 0.2-1.2 gives a good result. Lower resolution gives you fewer clusters and a higher resolution give you more clusters. One suggestion is to try with different parameters and look at the UMAP!

Visualize the clusters using non-linear dimensional reduction technique UMAP (if UMAP doesn't work for you use tSNE instead RunTSNE()).

Run RunUMAP() using the same number of dims as before.
Plot using DimPlot(). Don't forget to specify the reduction method.

Q5. What resolution do you think was the best one? And why?

Differential Expression

When you are satisfied with the clusters you have, then it's time to do cluster classification and differential expression (DE) analysis. You can get DE genes for one cluster compared to another, one clusters compared to all the other clusters or one condition compared to another. Here we will try to get the DE genes for the different conditions APP/PS1 and WT in one specific cell type.

Try and see if you can classify any of your clusters as a specific cell type based on marker genes. One cell type is fine, but if you want to, go ahead and classify more. A hint is to use this Van Hove et al article to find marker genes.

Use the plot function FeaturePlot() to visualize where this markers are expressed.
When you think you have the correct cluster classified as your cell type, then add this info as new meta data

brain_object_14_03$orig_clusters <- Idents(brain_object_14_03) # Save the original clusters as meta first

This is an example on how you can add the new annotations as meta:

clusters <- c('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12') # Create a vector with the original clusters
new_class <- c('0', '1', '2', '3', '4', '5', 'T-cells', '7', '8', '9', '10', '11', '12') # Create vector with the new annotations     
brain_object_14_03$new_annotation <- plyr::mapvalues(x = brain_object_14_03$orig_clusters, from = clusters, to = new_class) # And add as meta data

Use DimPlot() to plot the new annotations. Add group.by= to specify the new annotations.

In the following step you will subset the cell type you are interested in and then perform DE analysis.

Use the subset() function with the argument idents= to create a new object with only this cell type. (If R gives you an option about which subset to use; choose Seurat).
Before running the DE analysis we need to change the Idents of this object (that means the main meta data that seurat uses for analysis). We are going to use the conditions (APP/PS1 and WT) in the comparison and need to add this as the object Ident.

Idents(Object) <- '...' # add the condition meta name instead of ...

Perform differential expression analysis using the function FindAllMarkers(). To speed ap the analysis also add only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.001. Assign the markers to markers.

Remember that

ID 1 = WT
ID 2 = APP/PS1

Take a look at the top 10 positive markers for APP/PS1 and WT top10 <- markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
Visualize the expression of one/some of these genes using VlnPlot(), FeaturePlot() or DotPlot().
Do a heat map using DoHeatmap() showing the expression for the top10 genes of each condition by adding features=top10$gene as argument.

Q6. In this cell type, do APP/PS1 differ from WT? What are the top positive marker genes?

Well done everyone! Now you have a good understanding about what information you can get out from singe-cell RNA-seq data.

Home: Gene expression using R

Developed by Vanja Börjesson, 2019. Revised by Vanja Börjesson 2020 and 2021.