BMA231 VII: Expression analysis - bcfgothenburg/HT24 GitHub Wiki
Course: HT24 Next Generation Sequencing data analysis with clinical applications (BMA231)
The aim of this practical is to introduce you to a general workflow to analyze bulk RNAseq data. To be able to work within the allocated time, you will be processing just a portion of the real dataset to illustrate the overall workflow. For the interpretation of the results, you will be provided with the results of the entire dataset. This will be indicated accordingly.
You will be working with data from an article, where Pierce et. al. compared bulk RNA sequencing from pedriatic and adult patients with COVID-19. They provided direct evidence of a more vigorous early mucosal immune response in children compared with adults, suggesting that this contributes to favorable clinical outcomes.
Their entire data is available at GEO under GSM5251610, we will be working only with a subset to exemplify how to generate the counts matrix, and we will be using the entire dataset for the differential expression (DE) analysis.
Data | File |
---|---|
Raw data (complete dataset) | all_raw_data.html |
Aligned data (complete dataset) | all_aligned_data.html |
Alignments (partial dataset) |
SRR14267546_chr18_star.bam , SRR14267547_chr18_star.bam , SRR14267548_chr18_star.bam , SRR14267549_chr18_star.bam
|
Counts matrix (complete dataset) | covid_counts.tsv |
Metadata (complete dataset) | covid_metadata.tsv |
You find this files in CANVAS under the Expression analysis
module:
RNAseq_alignments.zip
MultiQC_expression.zip
covid_counts.tsv
covid_metadata.tsv
GRCh38_90.bed
As we have learnt, the first thing to do when you receive data is to check its quality. Then trim the data based on quality, and after inspection, map the reads to a reference genome.
It is also important to get familiar with any extra information related to our data, what we call metadata
. Have a look at the sample information file covid_metadata.tsv
. There are 15 samples from adults and 6 from children, where 12 are males and 9 are females. This metadata will be important to assess any biases in the DE analysis.
Click here for output
These samples were analyzed with FastQC in Galaxy (all_raw_data.html
). Then they were filtered with TrimGalore! using default parameters. Since we are dealing with RNA, the mapping was done with STAR, an aligner that can handle the non-contiguous transcript structure of RNAseq experiments (also with default parameters). Some statistics were generated with samtools
, and all the resulting files were compiled in a MultiQC
html report (all_aligned_data.html
).
Inspect the all_raw_data.html
report.
Q1. What can you conclude about the initial quality of the reads?
Focus onBasic Statistics
,Per base sequence quality
,Per sequence quality scores
,Per base sequence content
,Sequence Length Distribution
,Sequende Duplication Levels
andAdapter Content
Click here for output
30.9 M is the least amount of reads, it is enough to proceed.
We directly see we have duplications, typical for an RNAseq experiment
All sample quality is ok, maybe a little too good, so maybe there was already some trimming
All samples fail the per base sequence content.
There is a bias a the beginning of the reads, which in this case is
expected and shows enzymatic shearing
For the sequence duplication levels, all fail
there is a high rate of duplication, but this is expected since it
is RNA sequencing data, which shows how much a gene is being expressed,
thus the more duplications the more expression
RNAseq tends to have different sizes of cDNA fragments.
This can be seen in the sequencing of the adapter since the cycles
(length of the read) are always fixed.
For instance if a cDNA fragment is 50 bp, and we are sequencing 100 cycles,
50 bp will be from the adapter.
Adapter trimming is needed to remove this artifact
Now look at the all_aligned_data.html
report:
Q2. What is the highest number of sequences that were removed during the filtering process?
Are there any low mapping samples? If there are, what would be your approach to investigate the source of the low amount of reads mapped?
Looking at theXY counts
, can you determine how many are females and how many are males? Do they match the annotation we have from thecovid_metadata.tsv
?
Looking at theMapped reads per contig
, do all the samples have the same coverage over the chromosomes? If not, mention some examples
Looking at theSTAR
summary, what do you think about the aligment of all samples?
Click here for output
The higher number of sequences removed accounts for a 3.7% (SRR1426764)
so not many reads were removed in general
there are 3 samples with low mapping number of sequences:
53 (baby) 41.8%
62 (pool sample) 36.8%
63 (pool sample) 16.0%
We could map the reads to other organisms,
or use some reads and blast them to identify what are the
non-mapping reads and try to figure out if
there was any contamination
Listed are females according to the metadata.
All have low Y content (homologous sequences between X and Y)
except for those indicated with a %
SRR14267547
SRR14267551
SRR14267554
SRR14267558
SRR14267559
SRR14267560
SRR14267564 - 5%
SRR14267565 - 8%
SRR14267566
If this is relevant for the analysis this needs further investigation
in terms of mapped read, chr14 seems to have higher coverage
in some samples than others,
the same in chr21 and MT, after the analysis we could check if
statistically significant DE genes contribute to this behaviour
As seen before, there are 3 samples with low mapping %,
and we see here that it is due to too short reads
For this part, you will be using Galaxy.
- Log in into your account
- Create a history called
RNAseq practical
. You do this in the rightHistory panel
.
You will be analyzing only 4 of the 21 samples with data from chr18, since we just want to exemplify the workflow. The data was extracted from the complete alignments.
-
Click on
Upload Data
on the left menu -
The
Download from Disk or Web
window will open -
Click on
Choose local file
-
Select all of four BAM files
-
Genome -->
Human Dec. 2013 (GRCh38/hg38) (hg38)
-
Click on
Start
Click here for output
-
Wait until the data has been uploaded
- You will see that both of your jobs were added on the
History panel
- You will see that both of your jobs were added on the
RSeQC is a package that provides a number of useful modules to evaluate RNA-seq data. Here we will check:
- the distribution of reads across known gene features (
Read distribution
), - check if there is any bias in coverage (
Gene body coverage
) and, - check the strandness (
Infer Experiment
)
which will help us to correctly set the parameters when we do the gene counting. And then we'll gather results with MultiQC
.
For most of these calculations we need a file describing the genes within our genome. This is usually a tab delimited file (BED format
) with information about where the genes/exons are located. To make this quicker, use the hg38.90_chr18.bed
which contains information only from chr18.
-
Click on
Upload Data
on the left menu -
The
Download from Disk or Web
window will open -
Click on
Choose local file
-
Select hg38.90_chr18.bed file
-
Type -->
bed12
-
Genome -->
Human Dec. 2013 (GRCh38/hg38) (hg38)
-
Click on
Start
Click here for output
- Search for
Read Distribution
in theTools panel
- Click on the tool
- Set the following parameters as indicated:
-
Input BAM/SAM file
- Select the icon for
Multiple datasets
- Select all four alignments -->
*chr18_star.bam
- Select the icon for
-
Reference gene model -->
hg38.90_chr18.bed
- Click Run Tool
Click here for output
-
Input BAM/SAM file
-
Search for
Gene Body Coverage (BAM)
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Run each sample separately, or combine mutiple samples into one plot -->
Combine multiple samples into a single plot
-
Input .bam file(s)
- Select the icon for
Multiple datasets
- Select all four alignments -->
*chr18_star.bam
- Select the icon for
-
Reference gene model -->
hg38.90_chr18.bed
- Click Run Tool
Click here for output
-
Run each sample separately, or combine mutiple samples into one plot -->
-
Search for
Infer experiment
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Input BAM file
- Select the icon for
Multiple datasets
- Select all four alignments -->
*chr18_star.bam
- Select the icon for
-
Reference gene model -->
hg38.90_chr18.bed
- Click Run Tool
Click here for output
-
Input BAM file
-
Search for
MultiQC
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Which tool was used generate logs? -->
RSeQC
-
RSeQC output --> Type of RSeQC output? -->
Read distribution
-
RSeQC read distribution: stats output --> all
Read Distribution on data X and data X
files -
Click on Insert RSeQC ouptut
-
Type of RSeQC output? -->
Gene body coverage
-
RSeQC gene body_coverage: stats file -->
Gene Body Coverage (BAM) on data X and data X, and others: stats (TXT)
files -
Click on Insert RSeQC ouptut
-
Type of RSeQC output? -->
Infer experiment
-
RSeQC infer experiment: configuration output --> all
Infer Experiment on data X and data X: RNA-seq experiment configuration
files -
Report title -->
Aligned data extra stats
-
Custom comment -->
Aligned data extra stats
-
Click Run Tool
Click here for output
-
Q3. To which region are most of the reads mapping? Why?
What can you conclude frome the Gene body coverage plot?
Which library protocol was used for these samples (stranded, reverse stranded or unstranded)?
Click here for output
Our reads mapped mostrly towards introns. Depending on the
protocol used, if we are using total RNA
then we will be having immature transcripts and non-coding
that are not really counted in RSeQC, alternatively maybe
the version of the genome used to map vs the one used to
check the distribution are differents (this would be the
first to check)
We see a Bias towards 5', this can be due to the library
prepartion, like ribo minus and stranded protocols
We have a reverse stranded protocol since most of the
reads were antisense. This is important to know for the
counting process
Let's count the number of reads that are aligned towards the genes so we can assess gene expression. There are multiple tools to extract the gene counts, here we will be using featureCounts.
-
Search for
featureCounts
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Alignment file
- Select the icon for
Multiple datasets
- Select all four alignments -->
*chr18_star.bam
- Select the icon for
-
Specify strand information -->
Stranded (Reverse)
-
Gene annotation file -->
featureCounts built-in
-
Select built-in genome -->
hg38
-
Does the input have read pairs? -->
Yes, paired-end and count them as 1 single fragment
-
Read filtering options
-
Minimum mapping quality per read -->
20
-
Minimum mapping quality per read -->
-
Click Run Tool
Click here for output
-
You will get 2 output files per sample:
-
One labeled as Counts:
featureCounts on data X: Counts
-
Click on the
eye icon
This file contains all the genes and their corresponding counts (or expression values).
Most are ceros because we only have data from chr18Click here for output
-
-
The second file is labeled as Summary:
featureCounts on data X: Summary
- Click on the
eye icon
This file tells you how the different reads (or fragments in our case) were found according to the gene location.Click here for output
- Click on the
Remember, you can use MultiQC
to gather this information and have an overall visual representation of all the samples.
Click here for output
Now let's gather the information to create the count matrix, which will be the input in the DE analysis.
-
Search for
Column Join on multiple datasets
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Tabular files
- Select the icon for
Multiple datasets
- Select all four Counts files -->
featureCounts on data X: Counts
- Select the icon for
-
Identifier column -->
1
-
Number of header lines in each input file -->
1
-
Add column name to header -->
No
- Click Run Tool
Click here for output
-
Tabular files
The output file will look like:
As you see, the identifiers of the genes are ENTREZ IDs. Let's change it to a more meaningful name (for us):
-
Search for
annotateMyIDs
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
File with IDs -->
Column join on data X, data X, and others
-
File has header? -->
Yes
-
Organism -->
Human
-
ID Type -->
Entrez
-
Output columns --> Select
ENSEMBL
,ENTREZID
andSYMBOL
(just for demonstration purposes) - Click Run Tool
Click here for output
-
File with IDs -->
The output file will contain columns with different identifiers (the ones you chose).
Click here for output
Now we need to merge both tables:
-
Search for
Join two Datasets
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Join -->
Column join on data X, data X, and others
-
using column -->
Column: 1
(GeneID column) -
with -->
annotateMyIds on data X: Annotated IDS
-
and column -->
Column 1: Entrez
(EntrezID) -
Keep lines of first input that do not join with second input -->
Yes
-
Keep lines of first input that are incomplete -->
No
-
Fill empty columns -->
No
-
Keep the header lines -->
Yes
- Click Execute
Click here for output
-
Join -->
The output file will contain all columns from both tables.
Click here for output
And finally, we select the important columns:
-
Search for
Cut
in theTools panel
-
Click on the tool
-
Set the following parameters as indicated:
-
Cut columns -->
c8,c2,c3,c4,c5
-
Delimited by -->
Tab
-
From -->
Join two Datasets on data X and data X
- Click Run Tool
Click here for output
-
Cut columns -->
This last table is the input for the DE analysis, where each column is a sample, each row is a gene and each value is the expression value (raw counts).
Click here for output
Now you will use DESeq2
, an R package that estimates variance-mean dependence in count data from high-throughput sequencing assays and tests for differential expression based on a model using the negative binomial distribution.
In most of the cases, you should copy/paste
the code. There will be some instances where you will be asked to decide on a value/threshold/color and this will be highlighted in bold within the instructions and in CAPS
within the code.
covid_counts.tsv
contains the complete dataset of samples (21 in our case) covering all the annotated genes in the human genome. Each column represents a sample while each row represents a gene.
covid_metadata.tsv
is a metadata file. This contains information about the samples that can help us later on to investigate batch effects or confounding elements, among others.
Start R
in your local computer.
Select your working directory (where you saved the annotated files): Session -> Set Working Directory -> Choose directory
Let’s load the packages we will be using:
# Install them first if you don't have them already
#
# if (!require("BiocManager", quietly = TRUE))
# install.packages("BiocManager")
# BiocManager::install("DESeq2")
# BiocManager::install('EnhancedVolcano')
#
# install.packages("pheatmap")
# install.packages("knitr")
# install.packages("tidyverse")
# Load required packages
library(DESeq2) # DE analysis
library(pheatmap) # heatmaps
library(knitr) # tables
library(ggplot2) # plots
library(EnhancedVolcano) # volcano plot
Now let's read the counts and the metadata. In the code below, replace the text with your file name. Make sure the name is within double quotes:
# Reading the counts data
counts<-read.table("WRITE_THE_NAME_OF_THE_COUNTS_FILE", sep="\t",
header=T, row.names=1, as.is=T)
# Inspecting the file
head(counts)
Click here for output
SRR14267546 SRR14267547 SRR14267548 SRR14267549 SRR14267550
DDX11L1 0 0 0 0 0
WASH7P 25 145 65 48 100
MIR6859-1 0 23 15 5 31
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267551 SRR14267552 SRR14267553 SRR14267554 SRR14267555
DDX11L1 0 0 4 0 0
WASH7P 94 97 41 65 67
MIR6859-1 6 8 2 4 34
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267556 SRR14267557 SRR14267558 SRR14267559 SRR14267560
DDX11L1 0 0 0 0 0
WASH7P 69 70 90 59 60
MIR6859-1 21 12 2 4 6
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267561 SRR14267562 SRR14267563 SRR14267564 SRR14267565
DDX11L1 0 0 0 0 0
WASH7P 67 26 8 67 174
MIR6859-1 3 0 2 0 26
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267566
DDX11L1 0
WASH7P 135
MIR6859-1 38
MIR1302-2HG 0
MIR1302-2 0
FAM138A 0
# Reading the sample information (metadata)
sampleInfo<-read.table("WRITE_THE_NAME_OF_THE_METADATA_FILE", sep="\t",
header=T, row.names=1, as.is=T)
sampleInfo
Click here for output
group sex age
SRR14267546 adult male 19.00
SRR14267547 pediatric female 10.00
SRR14267548 pediatric male 9.00
SRR14267549 pediatric male 10.00
SRR14267550 adult male 20.00
SRR14267551 adult female 30.00
SRR14267552 adult male 19.00
SRR14267553 pediatric male 0.42
SRR14267554 pediatric female 1.00
SRR14267555 adult male 71.00
SRR14267556 pediatric male 5.00
SRR14267557 adult male 58.00
SRR14267558 adult female 89.00
SRR14267559 adult female 54.00
SRR14267560 adult female 76.00
SRR14267561 adult male 71.00
SRR14267562 adult male 40.00
SRR14267563 adult male 59.00
SRR14267564 adult female 58.00
SRR14267565 adult female 20.00
SRR14267566 adult female 82.00
Let's create some summary graphs, run the following code:
# sex distribution
# Creating a data frame with frequencies
sex <- data.frame(round(100*table(sampleInfo$sex)/nrow(sampleInfo),2))
colnames(sex) <- c("sex","freq")
sex
# Plotting pie chart
ggplot(sex, aes(x="", y=freq, fill=sex))+
geom_bar(stat = "identity") +
coord_polar("y", start=0) +
theme_void() +
geom_text(aes(label = paste0(freq, "%")), position = position_stack(vjust=0.5))
Click here for output
sex freq
1 female 42.86
2 male 57.14
Q4. What is the distribution between adults and children? Replace
sex
withgroup
in the previous code:
Click here for output
# groupdistribution
# Creating a data frame with frequenciesgroup <- data.frame(round(100*table(sampleInfo$group)/nrow(sampleInfo),2))
colnames(group) <- c("group","freq")
group
# Plotting pie chart
ggplot(group, aes(x="", y=freq, fill=group))+
geom_bar(stat = "identity") +
coord_polar("y", start=0) +
theme_void() +
geom_text(aes(label = paste0(freq, "%")), position = position_stack(vjust=0.5))
group freq
1 adult 71.43
2 pediatric 28.57
Now let's see how the ages are distributed across the groups, using a heatmap:
# age distribution
ggplot(sampleInfo, aes(x=age, fill=group)) +
geom_histogram( bins=10)
Click here for output
Q5. Create a boxplot showing the age distribution per group. Use
ggplot(... , aes(x=... , y=...)) + geom_boxplot()
Replace the...
for the data variable, the column you want in the x axis and the column you want in the y axis. You could have a look at this tutorial if you need help.
Click here for output
# age distribution
ggplot(sampleInfo,
aes(x=group, y=age, fill=group)) +
geom_boxplot()
Now we need to add the group information so we can test differences between them, and create an object (using the DESeqDataSetFromMatrix()
function) that merges the counts
data with the metadata
:
# Defining the sample grouping
condition <- data.frame(condition=sampleInfo$group)
condition
Click here for output
condition
1 adult
2 pediatric
3 pediatric
4 pediatric
5 adult
6 adult
7 adult
8 pediatric
9 pediatric
10 adult
11 pediatric
12 adult
13 adult
14 adult
15 adult
16 adult
17 adult
18 adult
19 adult
20 adult
21 adult
# Creating the DESeqDataSet
dds <- DESeqDataSetFromMatrix(countData=counts,
colData=condition,
design =~condition)
dds
Click here for output
class: DESeqDataSet
dim: 56648 21
metadata(1): version
assays(1): counts
rownames(56648): DDX11L1 WASH7P ... AC213203.1 FAM231D
rowData names(0):
colnames(21): SRR14267546 SRR14267547 ... SRR14267565 SRR14267566
colData names(1): condition
# Having a look at the data
head(assay(dds))
Click here for output
SRR14267546 SRR14267547 SRR14267548 SRR14267549 SRR14267550
DDX11L1 0 0 0 0 0
WASH7P 25 145 65 48 100
MIR6859-1 0 23 15 5 31
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267551 SRR14267552 SRR14267553 SRR14267554 SRR14267555
DDX11L1 0 0 4 0 0
WASH7P 94 97 41 65 67
MIR6859-1 6 8 2 4 34
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267556 SRR14267557 SRR14267558 SRR14267559 SRR14267560
DDX11L1 0 0 0 0 0
WASH7P 69 70 90 59 60
MIR6859-1 21 12 2 4 6
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267561 SRR14267562 SRR14267563 SRR14267564 SRR14267565
DDX11L1 0 0 0 0 0
WASH7P 67 26 8 67 174
MIR6859-1 3 0 2 0 26
MIR1302-2HG 0 0 0 0 0
MIR1302-2 0 0 0 0 0
FAM138A 0 0 0 0 0
SRR14267566
DDX11L1 0
WASH7P 135
MIR6859-1 38
MIR1302-2HG 0
MIR1302-2 0
FAM138A 0
The next step is a quality check, we need to assess the overall similarity between the groups of samples (since we expect them to cluster based on their expression levels).
One way is through a PCA plot. For this we will use the rlogTransformation()
function to transform the count data to the log2scale, and then the plotPCA()
function for the plotting:
# Data transformation
rld <- rlogTransformation(dds)
rld
Click here for output
class: DESeqTransform
dim: 56648 21
metadata(1): version
assays(1): ''
rownames(56648): DDX11L1 WASH7P ... AC213203.1 FAM231D
rowData names(7): baseMean baseVar ... dispFit rlogIntercept
colnames(21): SRR14267546 SRR14267547 ... SRR14267565 SRR14267566
colData names(2): condition sizeFactor
# actual data
head(assay(rld))
Click here for output
SRR14267546 SRR14267547 SRR14267548 SRR14267549 SRR14267550
DDX11L1 -1.923273 -1.922569 -1.934395 -1.921665 -1.932248
WASH7P 5.115100 6.761642 5.595771 5.712984 6.064725
MIR6859-1 2.171499 3.607337 3.058519 2.725726 3.592356
MIR1302-2HG 0.000000 0.000000 0.000000 0.000000 0.000000
MIR1302-2 0.000000 0.000000 0.000000 0.000000 0.000000
FAM138A 0.000000 0.000000 0.000000 0.000000 0.000000
SRR14267551 SRR14267552 SRR14267553 SRR14267554 SRR14267555
DDX11L1 -1.932519 -1.932224 -1.672928 -1.922498 -1.925831
WASH7P 5.996572 6.036189 5.947714 5.968002 5.890281
MIR6859-1 2.644762 2.770045 2.564589 2.627999 3.828211
MIR1302-2HG 0.000000 0.000000 0.000000 0.000000 0.000000
MIR1302-2 0.000000 0.000000 0.000000 0.000000 0.000000
FAM138A 0.000000 0.000000 0.000000 0.000000 0.000000
SRR14267556 SRR14267557 SRR14267558 SRR14267559 SRR14267560
DDX11L1 -1.923994 -1.925799 -1.932200 -1.923333 -1.915023
WASH7P 5.976639 5.932493 5.965437 5.850565 6.132832
MIR6859-1 3.505530 3.096222 2.343601 2.617128 2.914208
MIR1302-2HG 0.000000 0.000000 0.000000 0.000000 0.000000
MIR1302-2 0.000000 0.000000 0.000000 0.000000 0.000000
FAM138A 0.000000 0.000000 0.000000 0.000000 0.000000
SRR14267561 SRR14267562 SRR14267563 SRR14267564 SRR14267565
DDX11L1 -1.921163 -1.900424 -1.874740 -1.922823 -1.925989
WASH7P 6.039762 5.812426 5.567783 5.986319 6.831103
MIR6859-1 2.549226 2.280471 3.031342 2.173275 3.615185
MIR1302-2HG 0.000000 0.000000 0.000000 0.000000 0.000000
MIR1302-2 0.000000 0.000000 0.000000 0.000000 0.000000
FAM138A 0.000000 0.000000 0.000000 0.000000 0.000000
SRR14267566
DDX11L1 -1.931788
WASH7P 6.374360
MIR6859-1 3.760624
MIR1302-2HG 0.000000
MIR1302-2 0.000000
FAM138A 0.000000
# Plotting PCA
plotPCA(rld)
Click here for output
Another way to evaluate how similar or different the samples are based on the overall difference on gene expression is plotting the euclidean distances. First we calculate the euclidean distance of the transformed data using the dist()
function and then we plot the matrix with the pheatmap()
function. In the code below, select the colors to be used in the heatmap. Make sure the name is within double quotes:
# Calculating distance among samples
sampleDist <- dist(t(assay(rld)))
sampleDist
Click here for output
SRR14267546 SRR14267547 SRR14267548 SRR14267549 SRR14267550 SRR14267551 SRR14267552 SRR14267553 SRR14267554 SRR14267555 SRR14267556
SRR14267547 137.84122
SRR14267548 101.33582 104.08855
SRR14267549 116.25124 117.75414 84.03942
SRR14267550 118.95525 138.91810 115.45593 135.38819
SRR14267551 126.61332 102.38889 88.91388 104.67426 135.16465
SRR14267552 115.93196 111.28757 99.08627 114.79165 108.45224 107.62522
SRR14267553 130.16385 103.09936 95.87417 105.15393 138.53522 93.67601 108.30340
SRR14267554 124.06040 126.10440 115.11415 117.65762 117.06748 120.20511 104.17597 122.99441
SRR14267555 137.64428 135.01956 126.05113 135.38028 130.61603 108.97371 119.38040 129.47393 126.50817
SRR14267556 146.33015 121.30757 110.67487 105.75208 163.40416 100.15707 127.60392 107.80711 132.00976 125.89138
SRR14267557 116.31309 128.10088 111.63899 118.54195 103.25506 117.55195 99.37439 124.46284 87.41683 111.24994 134.91532
SRR14267558 112.45035 121.35258 94.90413 101.53486 117.76473 107.25188 98.56184 116.12105 83.65582 120.08287 121.07971
SRR14267559 131.35214 120.52444 117.85610 124.36931 118.88494 119.29375 112.64062 122.20457 105.31552 128.62755 142.48848
SRR14267560 177.84567 150.66197 163.17784 155.96052 188.13818 143.43391 157.23966 152.02485 153.95972 131.63780 135.42347
SRR14267561 122.01652 126.20096 116.73750 125.49290 109.65263 119.86396 105.72186 129.29009 105.47264 107.22889 138.53505
SRR14267562 111.49063 131.50682 107.90759 122.49883 105.25880 134.68345 113.82406 131.08162 120.43249 144.10853 155.87432
SRR14267563 117.67961 134.36849 113.00920 113.09377 114.58256 132.35531 113.74973 133.96773 98.36067 133.70682 138.52119
SRR14267564 122.04712 126.11685 115.93985 119.71505 122.18646 126.55867 110.26409 130.22562 103.07408 132.53655 140.53813
SRR14267565 133.48310 135.49428 121.20408 129.81212 127.15459 138.11683 109.02993 136.27081 106.06974 135.77075 140.56196
SRR14267566 127.67278 135.15633 122.97428 130.45261 111.05707 132.62800 109.08286 137.59213 95.75607 107.58586 142.69650
...
# Creating a matrix
sampleDistMtx <- as.matrix(sampleDist)
sampleDistMtx
Click here for output
SRR14267546 SRR14267547 SRR14267548 SRR14267549 SRR14267550 SRR14267551 SRR14267552 SRR14267553 SRR14267554 SRR14267555 SRR14267556
SRR14267546 0.0000 137.8412 101.33582 116.25124 118.9553 126.61332 115.93196 130.16385 124.06040 137.6443 146.3302
SRR14267547 137.8412 0.0000 104.08855 117.75414 138.9181 102.38889 111.28757 103.09936 126.10440 135.0196 121.3076
SRR14267548 101.3358 104.0885 0.00000 84.03942 115.4559 88.91388 99.08627 95.87417 115.11415 126.0511 110.6749
SRR14267549 116.2512 117.7541 84.03942 0.00000 135.3882 104.67426 114.79165 105.15393 117.65762 135.3803 105.7521
SRR14267550 118.9553 138.9181 115.45593 135.38819 0.0000 135.16465 108.45224 138.53522 117.06748 130.6160 163.4042
SRR14267551 126.6133 102.3889 88.91388 104.67426 135.1646 0.00000 107.62522 93.67601 120.20511 108.9737 100.1571
SRR14267552 115.9320 111.2876 99.08627 114.79165 108.4522 107.62522 0.00000 108.30340 104.17597 119.3804 127.6039
SRR14267553 130.1639 103.0994 95.87417 105.15393 138.5352 93.67601 108.30340 0.00000 122.99441 129.4739 107.8071
SRR14267554 124.0604 126.1044 115.11415 117.65762 117.0675 120.20511 104.17597 122.99441 0.00000 126.5082 132.0098
SRR14267555 137.6443 135.0196 126.05113 135.38028 130.6160 108.97371 119.38040 129.47393 126.50817 0.0000 125.8914
SRR14267556 146.3302 121.3076 110.67487 105.75208 163.4042 100.15707 127.60392 107.80711 132.00976 125.8914 0.0000
SRR14267557 116.3131 128.1009 111.63899 118.54195 103.2551 117.55195 99.37439 124.46284 87.41683 111.2499 134.9153
SRR14267558 112.4504 121.3526 94.90413 101.53486 117.7647 107.25188 98.56184 116.12105 83.65582 120.0829 121.0797
SRR14267559 131.3521 120.5244 117.85610 124.36931 118.8849 119.29375 112.64062 122.20457 105.31552 128.6276 142.4885
SRR14267560 177.8457 150.6620 163.17784 155.96052 188.1382 143.43391 157.23966 152.02485 153.95972 131.6378 135.4235
SRR14267561 122.0165 126.2010 116.73750 125.49290 109.6526 119.86396 105.72186 129.29009 105.47264 107.2289 138.5351
SRR14267562 111.4906 131.5068 107.90759 122.49883 105.2588 134.68345 113.82406 131.08162 120.43249 144.1085 155.8743
SRR14267563 117.6796 134.3685 113.00920 113.09377 114.5826 132.35531 113.74973 133.96773 98.36067 133.7068 138.5212
SRR14267564 122.0471 126.1169 115.93985 119.71505 122.1865 126.55867 110.26409 130.22562 103.07408 132.5366 140.5381
SRR14267565 133.4831 135.4943 121.20408 129.81212 127.1546 138.11683 109.02993 136.27081 106.06974 135.7707 140.5620
SRR14267566 127.6728 135.1563 122.97428 130.45261 111.0571 132.62800 109.08286 137.59213 95.75607 107.5859 142.6965
# Adding colors according to the sample group
annotation_col<-data.frame(Group=sampleInfo$group)
rownames(sampleInfo) -> rownames(annotation_col)
annotation_col
Click here for output
condition
SRR14267546 adult
SRR14267547 pediatric
SRR14267548 pediatric
SRR14267549 pediatric
SRR14267550 adult
SRR14267551 adult
SRR14267552 adult
SRR14267553 pediatric
SRR14267554 pediatric
SRR14267555 adult
SRR14267556 pediatric
SRR14267557 adult
SRR14267558 adult
SRR14267559 adult
SRR14267560 adult
SRR14267561 adult
SRR14267562 adult
SRR14267563 adult
SRR14267564 adult
SRR14267565 adult
SRR14267566 adult
# Defining the colors to use - SELECT THE COLORS TO BE USED
annotation_colors <- list(Group = c(adult="TYPE_YOUR_COLOR_HERE", pediatric="TYPE_YOUR_COLOR_HERE"))
annotation_colors
Click here for output
annotation_colors <- list(Group = c(adult="deeppink3", pediatric="cadetblue2"))
annotation_colors
$Group
adult pediatric
"deeppink3" "cadetblue2"
# Plotting the distances
pheatmap(sampleDistMtx, cluster_rows=sampleDist, cluster_cols=sampleDist,
main="Sample distances", fontsize_row=12,
annotation_col=annotation_col, annotation_colors=annotation_colors)
Click here for output
Q6. What do you conclude from the clustering? Do the PCA and the heatmap reflect each other? Are there any sample outliers?
Click here for output
Most of the adult samples cluster together,
the PCA reflects a quite similar clustering as the heatmap
Would be good to investigate the Sample SRR14267554
(female, 1 year old) as it doesn't cluster with the
pediatric samples
As for the differential expression testing we will use the DESeq()
function. This performs three steps:
- Compute a scaling factor for each sample to account for differences in read depth and complexity between samples
- Estimate the variance among samples
- Test for differences in expression among groups
# DE test
dds <- DESeq(dds)
head(dds)
Click here for output
class: DESeqDataSet
dim: 6 21
metadata(1): version
assays(6): counts mu ... replaceCounts replaceCooks
rownames(6): DDX11L1 WASH7P ... MIR1302-2 FAM138A
rowData names(23): baseMean baseVar ... maxCooks replace
colnames(21): SRR14267546 SRR14267547 ... SRR14267565 SRR14267566
colData names(3): condition sizeFactor replaceable
# Extracting the results
res <- results(dds)
head(res)
Click here for output
log2 fold change (MLE): condition pediatric vs adult
Wald test p-value: condition pediatric vs adult
DataFrame with 6 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
DDX11L1 0.28342 2.3988810 3.290818 0.728962 0.466025 NA
WASH7P 67.20353 0.0400727 0.297838 0.134545 0.892971 0.977001
MIR6859-1 9.85364 0.1083398 0.775606 0.139684 0.888910 0.976695
MIR1302-2HG 0.00000 NA NA NA NA NA
MIR1302-2 0.00000 NA NA NA NA NA
FAM138A 0.00000 NA NA NA NA NA
It is important to remember that since we are doing a lot of tests, as many tests as genes in our experiment, we need to correct our pvalues
. DESeq()
already did this while testing for differences among our groups, so let's create a histogram of these adjusted pvalues
and an MA-plot
to inspect if there is any enrichment of DE genes. In the code below, select a number that gives the breakpoints between histogram cells, and the colors to use in the histogram. Make sure the name is within double quotes:
# Padjusted histogram - SELECT THE BREAKS AND COLORS TO USE
hist(res$padj, breaks=TYPE_A_NUMBER_HERE, col="TYPE_A_COLOR_HERE", border="TYPE_A_COLOR_HERE", main="child_vs_adult", xlab="padj")
Click here for output
hist(res$padj, breaks=100,
col="skyblue", border="slateblue",
main="child_vs_adult", xlab="padj")
# MA-plot
plotMA(res, main="child_vs_adult")
Click here for output
Every point in the MA plot
is a gene. Genes that have a padj < 0.1 (default significance level threshold) will be seen in blue/red. If you see any triangles in the plot, then those genes are outside the limits of the graph.
Run plotMA()
again, but add alpha=0.01
to show only genes with a padj<0.01.
Click here for output
# MA-plot
plotMA(res, main="child_vs_adult", alpha=0.01, ylim=c(-7,8))
Q7. What does the histogram and the MA plot tell you?
Click here for output
From the histogram, we can see that most of the genes
are not differentially expressed,
but there are some (<0.01) that are different among
the groups
From the MA plot, we can see even if these differences
are up or down regulated, based on the foldchange
There are genes that have very low counts across all the samples. These can't be tested for DE so they end up with an NA padj
, therefore these need to be removed:
# checking how many genes we have
nrow(res)
Click here for output
[1] 56648
# checking how many genes have NA
sum(is.na(res$padj))
Click here for output
[1] 34111
# removing the NA padj
res <- res[ ! is.na(res$padj), ]
# checking how many genes have
nrow(res)
Click here for output
[1] 22537
Q8. How many genes were removed?
Click here for output
34111/56648 = 0.6021572
We can filter our significant genes based on the false discovery rate
(FDR). This will remove genes with no actual difference in expression. In the code below, select an FDR threshold. Make sure the number is not within double quotes:
# Selecting significant genes - SELECT THE THRESHOLD TO USE
sig <- which(res$padj < TYPE_AN_FDR_THRESHOLD_HERE)
sig
Click here for output
sig <- which(res$padj < 0.01)
[1] 123 150 260 1006 1022 1076 1122 1248 1313 1404 1654 2055 2150 3503 3615 4388 4435 4491 4557 5350
[21] 5525 5547 5647 5836 5871 6349 6350 6381 6577 6591 6643 6671 6677 6701 6719 7148 7270 7278 7541 7734
[41] 7995 8082 8397 8480 8941 9576 10017 10018 10070 10166 10167 10211 10310 10324 10379 10609 10649 10845 10896 11484
[61] 11502 11531 11532 11534 11673 11675 12264 12677 12679 12689 13027 13512 13573 14282 14321 14386 16398 16578 16707 16708
[81] 16710 16711 16935 16959 16970 17024 17103 17355 17401 17533 19412 19503 19879 21063 21167 21920 22066
# counting the number of genes
length(sig)
Click here for output
[1] 97
# Extracting the data based on the significant genes
res_sig <- res[sig,]
res_sig
Click here for output
log2 fold change (MLE): condition pediatric vs adult
Wald test p-value: condition pediatric vs adult
DataFrame with 97 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
AJAP1 138.8402 -2.75858 0.609040 -4.52940 5.91520e-06 0.002898063
TNFRSF9 87.2684 3.17481 0.732593 4.33366 1.46648e-05 0.005330646
CLCNKB 238.8856 -3.25651 0.617002 -5.27797 1.30626e-07 0.000486443
CLCA2 30.9664 -2.51632 0.596012 -4.22193 2.42216e-05 0.007376789
GBP1 3033.9747 1.99502 0.476120 4.19016 2.78756e-05 0.007952296
... ... ... ... ... ... ...
TGM2 3041.7041 1.67199 0.397186 4.20960 2.55823e-05 0.00768731
CYP2B6 419.8018 -1.99239 0.486809 -4.09276 4.26265e-05 0.00990385
BCAM 443.7774 -1.33175 0.297989 -4.46913 7.85370e-06 0.00345258
SEC14L3 931.7518 -2.41949 0.576828 -4.19448 2.73496e-05 0.00795230
CACNA1I 52.4173 2.29684 0.522092 4.39931 1.08597e-05 0.00429367
# Sorting the data based on foldchange
res_sig_sorted<-res_sig[order(res_sig$log2FoldChange),]
res_sig_sorted
Click here for output
log2 fold change (MLE): condition pediatric vs adult
Wald test p-value: condition pediatric vs adult
DataFrame with 97 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
YWHAEP1 64.7893 -6.93362 1.03373 -6.70735 1.98185e-11 4.46649e-07
SALL1 12.1914 -6.59538 1.50047 -4.39554 1.10500e-05 4.29367e-03
NECAB1 32.7997 -6.18288 1.48307 -4.16898 3.05962e-05 8.52884e-03
ISLR 17.6368 -6.15124 1.31620 -4.67348 2.96132e-06 2.36016e-03
DPYS 197.6045 -4.96200 1.08441 -4.57578 4.74448e-06 2.59006e-03
... ... ... ... ... ... ...
IDO1 1893.8767 3.23386 0.618710 5.22677 1.72493e-07 4.86443e-04
BATF3 40.2398 3.26007 0.630897 5.16736 2.37426e-07 4.86443e-04
KDR 48.3249 3.41633 0.594774 5.74391 9.25149e-09 5.21252e-05
TRIM31 894.8934 3.42141 0.587097 5.82766 5.62091e-09 4.22262e-05
TNIP3 123.9705 3.62437 0.685931 5.28386 1.26488e-07 4.86443e-04
Replace log2FoldChange
with padj
. You will have your list sorted by padj. The top genes may or not be the same.
Click here for output
res_sig_sorted<-res_sig[order(res_sig$padj),]
res_sig_sorted
log2 fold change (MLE): condition pediatric vs adult
Wald test p-value: condition pediatric vs adult
DataFrame with 97 rows and 6 columns
baseMean log2FoldChange lfcSE stat pvalue padj
<numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
YWHAEP1 64.7893 -6.93362 1.033734 -6.70735 1.98185e-11 4.46649e-07
AKR1C3 333.0180 -1.41125 0.225245 -6.26539 3.71898e-10 4.19073e-06
TRIM31 894.8934 3.42141 0.587097 5.82766 5.62091e-09 4.22262e-05
KDR 48.3249 3.41633 0.594774 5.74391 9.25149e-09 5.21252e-05
CLCNKB 238.8856 -3.25651 0.617002 -5.27797 1.30626e-07 4.86443e-04
... ... ... ... ... ... ...
HBG1 150.4307 -2.88671 0.703055 -4.10595 4.02650e-05 0.00965374
SOCS2-AS1 12.4797 2.80607 0.683309 4.10659 4.01537e-05 0.00965374
TSKU 568.5625 -1.69189 0.413217 -4.09442 4.23226e-05 0.00990385
FHOD3 176.1418 -2.28741 0.558885 -4.09281 4.26181e-05 0.00990385
CYP2B6 419.8018 -1.99239 0.486809 -4.09276 4.26265e-05 0.00990385
To see the top 20 most down-regulated genes:
sig_up_20 <- res_sig[ order(res_sig$log2FoldChange, decreasing=FALSE),][1:20,]
knitr::kable(sig_up_20, caption = "20 most up-regulated genes")
Click here for output
Table: 20 most down-regulated genes
| | baseMean| log2FoldChange| lfcSE| stat| pvalue| padj|
|:----------|-----------:|--------------:|---------:|---------:|--------:|---------:|
|YWHAEP1 | 64.789310| -6.933619| 1.0337339| -6.707354| 0.00e+00| 0.0000004|
|SALL1 | 12.191405| -6.595382| 1.5004725| -4.395537| 1.10e-05| 0.0042937|
|NECAB1 | 32.799661| -6.182884| 1.4830678| -4.168983| 3.06e-05| 0.0085288|
|ISLR | 17.636786| -6.151244| 1.3162007| -4.673485| 3.00e-06| 0.0023602|
|DPYS | 197.604459| -4.962000| 1.0844052| -4.575780| 4.70e-06| 0.0025901|
|CCBE1 | 46.698100| -4.723292| 1.1066847| -4.267965| 1.97e-05| 0.0064431|
|HBD | 1134.425336| -4.418546| 1.0123354| -4.364706| 1.27e-05| 0.0048624|
|ELN | 218.650096| -4.362751| 0.9359454| -4.661331| 3.10e-06| 0.0023602|
|XKR4 | 8.380051| -4.181767| 0.9979247| -4.190463| 2.78e-05| 0.0079523|
|HBE1 | 891.748251| -4.162405| 0.9042269| -4.603275| 4.20e-06| 0.0024022|
|CALML3 | 32.823929| -3.953432| 0.9141661| -4.324632| 1.53e-05| 0.0053802|
|HBM | 201.841670| -3.877859| 0.8307647| -4.667819| 3.00e-06| 0.0023602|
|HBA1 | 55.400078| -3.816849| 0.7577836| -5.036859| 5.00e-07| 0.0007110|
|STAB2 | 13.485268| -3.667703| 0.8082696| -4.537722| 5.70e-06| 0.0028479|
|GAS1 | 83.360828| -3.445721| 0.6649050| -5.182276| 2.00e-07| 0.0004864|
|PKHD1L1 | 56.813516| -3.445545| 0.7705396| -4.471601| 7.80e-06| 0.0034526|
|HBQ1 | 230.539845| -3.417155| 0.8259763| -4.137111| 3.52e-05| 0.0089061|
|CLCNKB | 238.885618| -3.256515| 0.6170018| -5.277966| 1.00e-07| 0.0004864|
|AC092691.1 | 77.660374| -2.975431| 0.6924747| -4.296809| 1.73e-05| 0.0059168|
|HBZ | 57.464223| -2.968994| 0.6451887| -4.601745| 4.20e-06| 0.0024022|
To see the top 20 most up-regulated genes, run the same code replacing decreasing=FALse
with decreasing=TRUE
Click here for output
sig_down_20 <- res_sig[ order(res_sig$log2FoldChange, decreasing=TRUE),][1:20,]
knitr::kable(sig_down_20, caption = "20 most down-regulated genes")
Table: 20 most up-regulated genes
| | baseMean| log2FoldChange| lfcSE| stat| pvalue| padj|
|:----------|-----------:|--------------:|---------:|--------:|--------:|---------:|
|TNIP3 | 123.970507| 3.624368| 0.6859315| 5.283863| 1.00e-07| 0.0004864|
|TRIM31 | 894.893431| 3.421406| 0.5870974| 5.827662| 0.00e+00| 0.0000422|
|KDR | 48.324922| 3.416330| 0.5947744| 5.743910| 0.00e+00| 0.0000521|
|BATF3 | 40.239837| 3.260073| 0.6308975| 5.167358| 2.00e-07| 0.0004864|
|IDO1 | 1893.876694| 3.233857| 0.6187098| 5.226774| 2.00e-07| 0.0004864|
|TNFRSF9 | 87.268403| 3.174813| 0.7325933| 4.333664| 1.47e-05| 0.0053306|
|MIR3142HG | 47.735496| 2.978833| 0.5863762| 5.080071| 4.00e-07| 0.0006074|
|MTNR1A | 36.191006| 2.973339| 0.6439176| 4.617576| 3.90e-06| 0.0024022|
|AC007991.2 | 59.982815| 2.965510| 0.7135022| 4.156273| 3.23e-05| 0.0086789|
|MYH11 | 6.553078| 2.850903| 0.6383442| 4.466091| 8.00e-06| 0.0034526|
|SOCS2-AS1 | 12.479696| 2.806074| 0.6833094| 4.106593| 4.02e-05| 0.0096537|
|MUC13 | 1707.559210| 2.722444| 0.5505521| 4.944934| 8.00e-07| 0.0009537|
|ADAM19 | 861.776573| 2.576332| 0.5636395| 4.570886| 4.90e-06| 0.0025901|
|HAPLN3 | 435.013856| 2.427905| 0.4886612| 4.968483| 7.00e-07| 0.0008946|
|SAA2 | 1532.460356| 2.352015| 0.5149745| 4.567245| 4.90e-06| 0.0025901|
|CACNA1I | 52.417348| 2.296844| 0.5220923| 4.399306| 1.09e-05| 0.0042937|
|HLA-DOA | 352.389635| 2.286316| 0.5261862| 4.345071| 1.39e-05| 0.0052297|
|SAA4 | 47.574509| 2.277055| 0.4899203| 4.647807| 3.40e-06| 0.0023813|
|SH3TC2 | 89.780154| 2.126087| 0.4260279| 4.990488| 6.00e-07| 0.0008483|
|GBP1 | 3033.974711| 1.995022| 0.4761204| 4.190162| 2.79e-05| 0.0079523|
Let's create a heatmap of the significantly DE genes:
# Selecting the rlog-transformed counts of the DE genes
sig2heatmap <- assay(rld)[rownames(assay(rld))%in%rownames(res_sig),]
# Calculating the difference of the counts to the mean value
sig2heatmap_Mean <- sig2heatmap - rowMeans(sig2heatmap)
# Plotting
pheatmap(sig2heatmap_Mean,
main="Child vs Adult significant DE genes (padj 0.01)",
annotation_col=annotation_col,
annotation_colors=annotation_colors,
show_rownames=FALSE)
Click here for output
And if we focus on the top 20 most significant genes, we can better pinpoint the important genes:
# Selecting the top 20 genes based on adjusted pvalue
res_top <-res_sig[order(res_sig$padj),][1:20,]
# Selecting the rlog-transformed counts of the DE genes
top2heatmap <- assay(rld)[rownames(assay(rld))%in%rownames(res_top),]
# Calculating the difference of the counts to the mean value
top2heatmap_Mean <- top2heatmap - rowMeans(top2heatmap)
# Plotting
pheatmap(top2heatmap_Mean,
main="Child vs Adult 20 top significant DE genes (padj 0.01)",
annotation_col=annotation_col,
annotation_colors=annotation_colors)
Click here for output
Another way to visualize the data is through a Volcano plot, this will plot all genes. Here we will use the EnhancedVolcano()
function. In the code below, select a pvalue and fold change threshold, as well as a title to be displayed.:
EnhancedVolcano(res, # data to plot
lab = rownames(res), # labels (protein name)
x = "log2FoldChange", # value to plot in the X-axis
y = "pvalue", # value to plot in the Y-axis
pCutoff = ADD_YOUR_THRESHOLD_HERE, # cut off for the pvalue
FCcutoff = ADD_YOUR_THRESHOLD_HERE, # cut off for the fold change
title = "ADD_YOUR_TITLE_HERE", # Title of the plot
subtitle = NULL, # Remove subtitle to create more space for the plot
caption = NULL, # Remove caption to create more space for the plot
# (if you remove this line you will get the number of proteins plotted)
legendPosition = "top", # Position the legend on top of the plot
axisLabSize = 11, # Set font size for axis labels
labSize = 2.8, # Set font size for protein labels
xlim = c(-3,3), # Set x-axis limits to -3 and 3 so the plot is symmetric
ylim = c(0, 7))
Click here for output
EnhancedVolcano(res, # data to plot
lab = rownames(res), # labels (protein name)
x = "log2FoldChange", # value to plot in the X-axis
y = "pvalue", # value to plot in the Y-axis
pCutoff = 0.01, # cut off for the pvalue
FCcutoff = 2, # cut off for the fold change
title = "DE expression: early mucosal immune response in children compared with adults", # Title of the plot
subtitle = NULL, # Remove subtitle to create more space for the plot
caption = NULL, # Remove caption to create more space for the plot
# (if you remove this line you will get the number of proteins plotted)
legendPosition = "top", # Position the legend on top of the plot
axisLabSize = 11, # Set font size for axis labels
labSize = 2.8) # Set font size for protein labels
Well done! now you have the general workflow to identify differentially expressed genes!
Developed by Marcela Dávila, 2021. Modified by Marcela Dávila, 2022. Updated by Marcela Dávila, 2023. Updated by Marcela Dávila, 2024.