GEx scRNAseq analysis
Course: HT23 Gene expression analysis using R (SC00034)
For this exercise you will be using a mouse brain dataset from two different conditions: one with the APP/PS1 transgene and one wild type (WT). You will classify the cells into different cell types and then compare the expression patterns between APP/PS1 and WT in microglia cells. If you are interested, have a look at the article by Van Hove et al., which is based on these datasets.
Good Luck!
Install packages and load dataset
There are many software packages for analyzing single-cell RNA data: Monocle, Scanpy, SLICER, slingshot, and many others. The software we will use today, Seurat, is one of the most common. Seurat takes a count matrix as input and uses several statistical methods to perform clustering and differential expression on the data. This practical is partly based on the Satija Lab tutorial Seurat Vignette.
- First, load the packages `dplyr` and `Seurat` using the `library()` function.
- If you don't have them, go ahead and install them using `install.packages()`.
- Check that you have the correct version of Seurat; it should be 4.x. To do this you can run `sessionInfo()`.
- Download the gene-cell count matrix for "16M old APPS/PS1 and WT aggregate" from CANVAS or from the Brain immune atlas. Unzip the folder (`tar xvzf` or `unzip`).
- Read the count data into R using the Seurat function `Read10X()`.
- Look at the data to see the structure and dimensions.
- Create a Seurat object, where counts and all metadata will be stored, using `CreateSeuratObject()`. When creating the object, remove genes that are expressed in fewer than 3 cells, and remove cells that have fewer than 200 genes expressed. You can change the numbers if you want.
- Take a look at the object again.
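The loading and filtering steps above can be sketched as follows. This is only an illustration: the folder path in the comment is hypothetical, and a small simulated matrix (`counts`, `toy_object` are made-up names) stands in for the real data so that the effect of `min.cells` and `min.features` is easy to see.

```r
library(Seurat)

# In the exercise the matrix would come from the unzipped 10x folder, e.g.
# counts <- Read10X(data.dir = "path/to/unzipped/folder")   # hypothetical path

# Toy stand-in: 300 genes x 30 cells of Poisson counts, with 5 genes never expressed
set.seed(1)
counts <- matrix(rpois(300 * 30, lambda = 2), nrow = 300,
                 dimnames = list(paste0("gene", 1:300), paste0("cell", 1:30)))
counts[1:5, ] <- 0
counts <- Matrix::Matrix(counts, sparse = TRUE)
dim(counts)        # genes x cells before filtering

# Keep genes seen in at least 3 cells and cells with at least 200 detected genes
toy_object <- CreateSeuratObject(counts = counts, project = "toy",
                                 min.cells = 3, min.features = 200)
dim(toy_object)    # genes x cells after filtering
```

Comparing the two `dim()` results is one way to see how much was removed.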
Q1. How many genes and cells were removed?
QC quality control
This is a very important step in the analysis. It is here you remove most of the cells and genes that can create biases due to cell doublets, cell lysis and cells with very few reads in total. The best way of getting to know your data is by plotting.
- Calculate the percentage of mitochondrial gene expression for every cell using `PercentageFeatureSet()`. Make sure to use the correct pattern, `"^mt-"`; in some datasets the prefix is uppercase and in others lowercase. Assign the new data as metadata in the object with `object[["percent.mt"]] <-`. If you are unsure whether you used the correct pattern, one way to check is to use the `sum()` function: if the sum is 0, the pattern is probably not correct.
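As a minimal sketch of this step, assuming a simulated object (`toy_object`, a made-up name) that contains three mouse-style mitochondrial genes, the calculation and the pattern check could look like this:

```r
library(Seurat)

# Toy stand-in: 3 mitochondrial genes (lowercase "mt-" prefix) plus 97 other genes
set.seed(1)
genes <- c("mt-Nd1", "mt-Co1", "mt-Atp6", paste0("gene", 1:97))
counts <- matrix(rpois(100 * 20, lambda = 3), nrow = 100,
                 dimnames = list(genes, paste0("cell", 1:20)))
toy_object <- CreateSeuratObject(counts = Matrix::Matrix(counts, sparse = TRUE))

# Percentage of counts per cell that come from genes matching the pattern
toy_object[["percent.mt"]] <- PercentageFeatureSet(toy_object, pattern = "^mt-")
sum(toy_object$percent.mt)                         # above 0: the pattern matched

# Checking which gene names a pattern matches is another quick sanity check
grep("^mt-", rownames(toy_object), value = TRUE)   # the three mt- genes
grep("^MT-", rownames(toy_object), value = TRUE)   # wrong case: no matches
```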
Now do the plotting
- Look at the meta data using `head(object[[]])`. Use the column names when plotting.
- Plot 1) the number of genes (nFeature_RNA), 2) the number of counts (nCount_RNA) per cell and 3) the mt-percentage (percent.mt) per cell using `VlnPlot()`. If you want to plot all in the same plot, specify the column names as a vector, `features = c('colname','colname','colname')`, and add `ncol = 3` at the end.
Now we will try to decide which cells might be doublets or damaged and need to be filtered out. Make two new plots:
- Use `FeatureScatter()` to plot the percentage mt on the y-axis and counts on the x-axis.
- Do the same but with the number of genes on the y-axis.
Have a good look at the scatter plots and try to decide on where you want to put the threshold for filtering. Remember that we want to get rid of the cells with high and low unique feature count, and also cells with high mt-expression.
Filtering
- Filter out the cells using the thresholds you decided on based on the plots. Use the Seurat function `subset()` to do this.
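The QC plots and the filtering step could be sketched as below. The object is simulated (cells of increasing sequencing depth), and the thresholds are purely illustrative assumptions; the ones you pick for the brain dataset should come from your own plots.

```r
library(Seurat)

# Toy stand-in: 3 mitochondrial genes plus 297 others, and 40 cells whose
# sequencing depth increases from very shallow to deep
set.seed(1)
genes <- c("mt-Nd1", "mt-Co1", "mt-Atp6", paste0("gene", 1:297))
depth <- seq(0.1, 3, length.out = 40)
counts <- matrix(rpois(300 * 40, lambda = rep(depth, each = 300)), nrow = 300,
                 dimnames = list(genes, paste0("cell", 1:40)))
toy_object <- CreateSeuratObject(counts = Matrix::Matrix(counts, sparse = TRUE))
toy_object[["percent.mt"]] <- PercentageFeatureSet(toy_object, pattern = "^mt-")

# QC plots: three violins side by side, then the two scatter plots
VlnPlot(toy_object, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)
FeatureScatter(toy_object, feature1 = "nCount_RNA", feature2 = "percent.mt")
FeatureScatter(toy_object, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")

# Filtering with made-up thresholds chosen for the toy data only
filtered <- subset(toy_object,
                   subset = nFeature_RNA > 100 & nFeature_RNA < 280 & percent.mt < 10)
dim(toy_object)   # before filtering
dim(filtered)     # after filtering
```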
Q2. How many cells and genes do we have left in our dataset?
Normalize and scale the data
Now let's normalize the data so that we can compare the cells to each other.
- Normalize the data using the `NormalizeData()` function, and use the "LogNormalize" method.
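To make the "LogNormalize" method concrete: each count is divided by the total counts of its cell, multiplied by a scale factor (10,000 by default) and transformed with log(1 + x). The sketch below first does this by hand on a tiny made-up matrix, then shows the corresponding Seurat call on `pbmc_small`, the small example dataset bundled with SeuratObject, which stands in for the brain object here.

```r
library(Seurat)

# Manual version of the calculation on a tiny made-up matrix (genes x cells)
counts <- matrix(c(1, 0, 3,
                   2, 2, 0), nrow = 3,
                 dimnames = list(c("geneA", "geneB", "geneC"), c("cell1", "cell2")))
totals <- colSums(counts)                                # total counts per cell
manual <- log1p(sweep(counts, 2, totals, "/") * 10000)   # scale to 10,000, then log(1 + x)
manual

# The same method applied to a Seurat object
data("pbmc_small", package = "SeuratObject")
pbmc_small <- NormalizeData(pbmc_small, normalization.method = "LogNormalize",
                            scale.factor = 10000)
```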
Next we will find a subset of genes that exhibit the most cell-to-cell variation in the dataset. The default in Seurat is to find 2000 genes, but this can be changed. Only these genes will be used for downstream analysis. This speeds up the analysis, highlights the most important signals in the dataset and reduces the noise.
- Use the function `FindVariableFeatures()` to find these genes. Select `vst` as `selection.method`.
Save the top 10 genes by running this command: `top10 <- head(VariableFeatures(object), 10)`
- Plot all the features and highlight the variable features in red; this is done automatically by the function `VariableFeaturePlot()`.
- Try to add the top10 gene labels to the plot by using `LabelPoints()`. Fill in the code below:
plot1 <- ...(...)
LabelPoints(plot = plot1, points = ...)
Now it's time to scale the data as a pre-processing step before PCA.
- Scale the data using `ScaleData()`. By default it scales only the variable genes found in the previous step. Here we would like to scale all genes in our object, so add the argument `features = rownames(object)`.
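A minimal sketch of variable-feature selection followed by scaling, again using the bundled `pbmc_small` example object as a stand-in. Because that object only has 230 genes, `nfeatures = 50` is used here instead of the default 2000; this number is an assumption for the toy data, not a recommendation.

```r
library(Seurat)
data("pbmc_small", package = "SeuratObject")

# Select the most variable genes with the vst method
pbmc_small <- FindVariableFeatures(pbmc_small, selection.method = "vst", nfeatures = 50)
top10 <- head(VariableFeatures(pbmc_small), 10)
top10

# Scale every gene in the object, not only the variable ones
pbmc_small <- ScaleData(pbmc_small, features = rownames(pbmc_small))
```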
Q3. Why do you scale and normalize your data?
Dimensional Reduction using PCA
Now it's time to reduce the dimensions of the scaled data and visualize it.
- Use `RunPCA()` with default settings to reduce the dimensions.
- Make a dimension plot using `DimPlot()` and specify the `reduction` method.
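The PCA step could look roughly like this on the bundled `pbmc_small` example object. The exercise uses the default settings of `RunPCA()`; `npcs = 10` and `nfeatures = 50` are set here only because the stand-in dataset is very small.

```r
library(Seurat)
data("pbmc_small", package = "SeuratObject")

# PCA runs on the scaled values of the variable features
pbmc_small <- FindVariableFeatures(pbmc_small, nfeatures = 50, verbose = FALSE)
pbmc_small <- ScaleData(pbmc_small, features = rownames(pbmc_small), verbose = FALSE)
pbmc_small <- RunPCA(pbmc_small, npcs = 10, verbose = FALSE)

# Plot the cells on the first two PCs
DimPlot(pbmc_small, reduction = "pca")
```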
One way to look at batch effects in our data is by looking at the PCA coloured based on the samples. To do this we need to add sample info to our meta data. Usually different samples will get a different tag on the cell barcodes.
- Take a look at the cell barcodes and see if you can spot the tags. One way is by using `head(colnames(object))`.
The tag is usually a "-" followed by a number. In this case, all barcodes ending with -1 are from the WT sample, and the ones ending with -2 are from the APP/PS1 sample.
- Add this info to the meta data in our object by first extracting the numbers with `sapply(strsplit(colnames(brain_object), split='-'), '[[', 2)`. Remember that you add this meta data by assigning it to `object$new_meta`.
- Check that you have the meta data in the object using `head(object@meta.data)`.
- Then make a new plot, but this time add `group.by=`.
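To see what the barcode splitting does, here is a small base R sketch on made-up barcodes. The barcode sequences and the names `sample_id` and `condition` are invented for illustration; the mapping of 1 to WT and 2 to APP/PS1 follows the text above.

```r
# Made-up barcodes in the "<sequence>-<sample number>" form described above
barcodes <- c("AAACCTGAGCTA-1", "AAACGGGTCAGT-1", "TTTGTCAAGGCA-2")

# Split each barcode at "-" and keep the second piece (the sample tag)
sample_id <- sapply(strsplit(barcodes, split = "-"), `[[`, 2)
condition <- ifelse(sample_id == "1", "WT", "APP/PS1")
data.frame(barcode = barcodes, sample_id, condition)

# With a Seurat object this would become, for example:
# object$sample_id <- sapply(strsplit(colnames(object), split = "-"), `[[`, 2)
# DimPlot(object, reduction = "pca", group.by = "sample_id")
```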
You can visualize the different PCs by plotting a heat map for every PC.
Now it's time to determine how many PCs to use for downstream analysis. Too many PCs add noise to the data, and too few do not give us enough information. One way of deciding how many PCs to use is by making an elbow plot, which shows the standard deviation explained by each PC.
- Plot an elbow plot using the function `ElbowPlot()`.
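As a short sketch, assuming the bundled `pbmc_small` example object (which already carries a PCA), the elbow plot and the numbers behind it can be inspected like this. `ndims = 10` is an arbitrary choice for the small stand-in dataset.

```r
library(Seurat)
data("pbmc_small", package = "SeuratObject")

# Standard deviation of each PC, as a plot and as numbers
ElbowPlot(pbmc_small, ndims = 10, reduction = "pca")
pc_sd <- Stdev(pbmc_small, reduction = "pca")
round(pc_sd, 2)
```

The "elbow" is the point after which the standard deviations level off.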
The number you choose depends on the biological question you have. More PCs might find rare subtypes of cells, but might also add some noise to your analysis.
Q4. How many PCs would you decide to continue with? And why?
Clustering
Now it's time to cluster our cells. Seurat uses a graph-based method to cluster the cells based on the PCs you choose to use.
- Start clustering your cells by creating the graph using the function `FindNeighbors()`, and define how many dimensions (in this case PCs) you would like to use.
- Cluster your cells using `FindClusters()`. Here you have to decide on what resolution to use. Normally, setting the resolution between 0.2 and 1.2 gives a good result. A lower resolution gives you fewer clusters and a higher resolution gives you more clusters. One suggestion is to try different parameters and look at the UMAP!
Visualize the clusters using the non-linear dimensional reduction technique UMAP (if UMAP doesn't work for you, use tSNE instead with `RunTSNE()`).
- Run `RunUMAP()` using the same number of dims as before.
- Plot using `DimPlot()`. Don't forget to specify the reduction method.
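The clustering and UMAP steps could be sketched as follows on the bundled `pbmc_small` example object. The choice of 10 PCs and the three resolutions are assumptions made for illustration; with the brain dataset you would use the number of PCs you picked from the elbow plot.

```r
library(Seurat)
data("pbmc_small", package = "SeuratObject")

dims_use <- 1:10   # illustrative choice of PCs

# Build the nearest-neighbour graph, then cluster at a few resolutions
pbmc_small <- FindNeighbors(pbmc_small, dims = dims_use, verbose = FALSE)
for (res in c(0.4, 0.8, 1.2)) {
  pbmc_small <- FindClusters(pbmc_small, resolution = res, verbose = FALSE)
  cat("resolution", res, ":", nlevels(Idents(pbmc_small)), "clusters\n")
}

# UMAP on the same PCs; the plot is coloured by the last clustering that was run
pbmc_small <- RunUMAP(pbmc_small, dims = dims_use, verbose = FALSE)
DimPlot(pbmc_small, reduction = "umap")
```

Each call to `FindClusters()` also stores its result as a meta data column, so the different resolutions can be compared afterwards with `group.by=`.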
Q5. What resolution do you think was the best one? And why?
Differential Expression
When you are satisfied with the clusters you have, it's time to do cluster classification and differential expression (DE) analysis. You can get DE genes for one cluster compared to another, one cluster compared to all the other clusters, or one condition compared to another. Here we will try to get the DE genes between the conditions APP/PS1 and WT in one specific cell type.
Try and see if you can classify any of your clusters as a specific cell type based on marker genes. One cell type is fine, but if you want to, go ahead and classify more. A hint is to use this Van Hove et al article to find marker genes.
- Use the plot function `FeaturePlot()` to visualize where these markers are expressed.
- When you think you have the correct cluster classified as your cell type, add this info as new meta data:
brain_object_14_03$orig_clusters <- Idents(brain_object_14_03) # Save the original clusters as meta first
This is an example of how you can add the new annotations as meta data:
clusters <- c('0', '1', '2', '3', '4', '5', '6', '7', '8', '9', '10', '11', '12') # Create a vector with the original clusters
new_class <- c('0', '1', '2', '3', '4', '5', 'T-cells', '7', '8', '9', '10', '11', '12') # Create vector with the new annotations
brain_object_14_03$new_annotation <- plyr::mapvalues(x = brain_object_14_03$orig_clusters, from = clusters, to = new_class) # And add as meta data
- Use `DimPlot()` to plot the new annotations. Add `group.by=` to specify the new annotations.
In the following step you will subset the cell type you are interested in and then perform DE analysis.
- Use the `subset()` function with the argument `idents=` to create a new object with only this cell type. (If R gives you an option about which subset to use, choose Seurat.)
- Before running the DE analysis we need to change the Idents of this object (that is, the main meta data that Seurat uses for analysis). We are going to use the conditions (APP/PS1 and WT) in the comparison and need to set this as the object Ident.
Idents(Object) <- '...' # add the condition meta name instead of ...
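A minimal sketch of these two steps, assuming the bundled `pbmc_small` example object: its first cluster stands in for the cell type of interest, and its `groups` meta data column stands in for the condition column. Both are placeholders, not the names used in the brain dataset.

```r
library(Seurat)
data("pbmc_small", package = "SeuratObject")

levels(pbmc_small)                    # current identities (cluster labels)
keep <- levels(pbmc_small)[1]         # stand-in for your annotated cell type

# New object containing only the cells with that identity
one_type <- subset(pbmc_small, idents = keep)

# Switch the identities to a meta data column ('groups' is a stand-in for condition)
Idents(one_type) <- "groups"
table(Idents(one_type))
```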
- Perform differential expression analysis using the function `FindAllMarkers()`. To speed up the analysis, also add `only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.001`. Assign the markers to `markers`.
Remember that
- ID 1 = WT
- ID 2 = APP/PS1
- Take a look at the top 10 positive markers for APP/PS1 and WT: `top10 <- markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)` (in Seurat 4 the fold-change column is named `avg_log2FC`).
- Visualize the expression of one or some of these genes using `VlnPlot()`, `FeaturePlot()` or `DotPlot()`.
- Make a heat map using `DoHeatmap()` showing the expression of the top10 genes of each condition by adding `features = top10$gene` as an argument.
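The DE steps above can be sketched on the bundled `pbmc_small` example object. Its `letter.idents` meta data column (two groups, A and B) is used here purely as a stand-in for the condition column; in the exercise the object would be your subsetted cell type with the condition set as Ident.

```r
library(Seurat)
library(dplyr)
data("pbmc_small", package = "SeuratObject")

# Stand-in for the condition: a two-level meta data column
Idents(pbmc_small) <- "letter.idents"

# Positive markers for each group compared with the other
markers <- FindAllMarkers(pbmc_small, only.pos = TRUE,
                          min.pct = 0.25, logfc.threshold = 0.001)

# Top 10 genes per group by log2 fold change
top10 <- markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)
top10

# The heat map uses scaled data, so make sure all genes are scaled first
pbmc_small <- ScaleData(pbmc_small, features = rownames(pbmc_small), verbose = FALSE)
DoHeatmap(pbmc_small, features = top10$gene)
```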
Q6. In this cell type, do APP/PS1 differ from WT? What are the top positive marker genes?
Well done everyone! Now you have a good understanding of what information you can get out of single-cell RNA-seq data.
Developed by Vanja Börjesson, 2019. Revised by Vanja Börjesson 2020 and 2021.