sample_and_replicate_QC - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki
Expression analysis and functional annotation
A plethora of tools are currently available for identifying differentially expressed transcripts based on RNA-Seq data, and of these, edgeR and DESeq2 are very popular and highly accurate. The edgeR software is part of the R Bioconductor package, and we provide support for using it in the Trinity package.
Having biological replicates for each of your samples is crucial for accurate detection of differentially expressed transcripts. In our data set, we have three biological replicates for each of our conditions, and in general, having three or more replicates for each experimental condition is highly recommended.
Create a samples.txt file containing the contents below (tab-delimited), indicating the name of the condition followed by the name of the biological replicate. The replicate names must match up with the column headings of your counts matrix:
% head -n1 Trinity_trans.isoform.counts.matrix | tee samples.txt
.
wt37_rep1_kallisto wt37_rep2_kallisto wt37_rep3_kallisto GSNO_rep1_kallisto GSNO_rep2_kallisto GSNO_rep3_kallisto ph8_rep1_kallisto ph8_rep2_kallisto ph8_rep3_kallisto
Now edit file 'samples.txt' to contain the tab-delimited 2-column format:
sample_name unique_replicate_name
Note, there are several text editors installed for use including nano, vim, and emacs. The 'nano' utility is the simplest, while the others are more expert-friendly.
and it should look like so:
% cat samples.txt
.
wt37 wt37_rep1_salmon
wt37 wt37_rep2_salmon
wt37 wt37_rep3_salmon
GSNO GSNO_rep1_salmon
GSNO GSNO_rep2_salmon
GSNO GSNO_rep3_salmon
ph8 ph8_rep1_salmon
ph8 ph8_rep2_salmon
ph8 ph8_rep3_salmon
Another thing to validate would be to ensure that there are physical tab characters that delimit the columns. You can do this using 'cat -te filename' like so:
% cat -te samples.txt
.
wt37^Iwt37_rep1_salmon$
wt37^Iwt37_rep2_salmon$
wt37^Iwt37_rep3_salmon$
GSNO^IGSNO_rep1_salmon$
GSNO^IGSNO_rep2_salmon$
GSNO^IGSNO_rep3_salmon$
ph8^Iph8_rep1_salmon$
ph8^Iph8_rep2_salmon$
ph8^Iph8_rep3_salmon$
Here, you should see '^I' at positions of tab characters, and '$' at newline characters. This is invaluable for verifying contents of text files when the formatting has to be very precise. If your view doesn't look exactly as above, then you'll need to re-edit your file and be sure to replace any space characters you see with single tab characters, then re-examine it using the 'cat -te' approach.
The condition name (left column) can be named more or less arbitrarily but should reflect your experimental condition. Importantly, again, the replicate names (right column) need to match up exactly with the column headers of your RNA-Seq counts matrix.
Unsupervised Data Exploration
Before delving into differential expression analysis, you should examine your data to determine if there are any confounding issues. For example, your experimental replicates should look highly similar, and they should ideally be more similar to each other than to different sample types, assuming that the signal in the data is stronger than noise.
Trinity comes with a 'PtR' script that we use to simplify making various charts and plots based on a matrix input file. We'll use PtR for our plotting below. Note, PtR uses the 'R' software and environment, essentially generating an R script to generate an image file in pdf format, which you can view using the 'xpdf' utility.
Compare your replicates
Run the following to compare your biological replicates:
% $TRINITY_HOME/Analysis/DifferentialExpression/PtR \
-m Trinity_trans.isoform.counts.matrix -s samples.txt \
--log2 --compare_replicates
This should have generated pdf files, with one pdf report for each of your sample types. See what files were most recently created (via. 'ls -ltr') and you should find the following:
-rw-r--r-- 1 brian brian 480868 Jun 1 19:26 wt37.rep_compare.pdf
-rw-r--r-- 1 brian brian 474434 Jun 1 19:26 GSNO.rep_compare.pdf
-rw-r--r-- 1 brian brian 479867 Jun 1 19:26 ph8.rep_compare.pdf
View each of these pdf-formatted reports in your browser.
You should be viewing a multi-page pdf report file, where the views are in the following order:
Counts of reads mapping to transcripts (page 1/4):
Pairwise scatter plot of log2(read counts) per transcript (page 2/4):
Pairwise MA plots (M = log2(avg. read counts), A = log2(difference in read counts)) (page 3/4):
Pearson correlation between log2(replicate counts) (page 4/4):
Compare your replicates across all samples
One common way to compare the sample replicates in the context of all samples is to generate a replicate correlation plot containing all replicates across all samples:
% $TRINITY_HOME/Analysis/DifferentialExpression/PtR \
-m Trinity_trans.isoform.TMM.EXPR.matrix -s samples.txt \
--log2 --sample_cor_matrix
This should generate a heatmap. (use 'ls -ltr' to find the name of the file, which should be 'Trinity_trans.counts.matrix.log2.sample_cor_matrix.pdf'). View it within the browser.
Ideally, you should see that your replicates are well correlated and cluster together according to sample type, being more similar to each other than to the different samples. This will generally be the case when the signal in the data is strong.
Another powerful technique for comparing samples and replicates is to use Principal Components Analysis (PCA). Run PCA on your data like so:
% $TRINITY_HOME/Analysis/DifferentialExpression/PtR\
-m Trinity_trans.isoform.TMM.EXPR.matrix -s samples.txt\
--log2 --CPM --prin_comp 3
This generates a pdf file containing the PCA plots. Find it by 'ls -ltr' and view it using 'xpdf':
PCA is one of the best ways to identify potentially confounding effects in your data. If your replicates do not cluster according to sample and instead cluster according to some other aspect such as by the person who processed the replicates, the machine the replicates were sequenced on, etc., then you have batch effects that will need to be accounted for. It's best to avoid potential batch effects whenever and wherever possible.