sample_and_replicate_QC - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki

Expression analysis and functional annotation

A plethora of tools are currently available for identifying differentially expressed transcripts based on RNA-Seq data, and of these, edgeR and DESeq2 are very popular and highly accurate. The edgeR software is part of the R Bioconductor package, and we provide support for using it in the Trinity package.

Having biological replicates for each of your samples is crucial for accurate detection of differentially expressed transcripts. In our data set, we have three biological replicates for each of our conditions, and in general, having three or more replicates for each experimental condition is highly recommended.

Create a samples.txt file containing the contents below (tab-delimited), indicating the name of the condition followed by the name of the biological replicate. The replicate names must match up with the column headings of your counts matrix:

%  head -n1 Trinity_trans.isoform.counts.matrix | tee samples.txt

wt37_rep1_kallisto	wt37_rep2_kallisto	wt37_rep3_kallisto	GSNO_rep1_kallisto	GSNO_rep2_kallisto	GSNO_rep3_kallisto	ph8_rep1_kallisto	ph8_rep2_kallisto	ph8_rep3_kallisto

Now edit file 'samples.txt' to contain the tab-delimited 2-column format:

sample_name    unique_replicate_name

Note, there are several text editors installed for use including nano, vim, and emacs. The 'nano' utility is the simplest, while the others are more expert-friendly.

and it should look like so:

% cat samples.txt

wt37	wt37_rep1_salmon
wt37	wt37_rep2_salmon
wt37	wt37_rep3_salmon
GSNO	GSNO_rep1_salmon
GSNO	GSNO_rep2_salmon
GSNO	GSNO_rep3_salmon
ph8	ph8_rep1_salmon
ph8	ph8_rep2_salmon
ph8	ph8_rep3_salmon

Another thing to validate would be to ensure that there are physical tab characters that delimit the columns. You can do this using 'cat -te filename' like so:

% cat -te samples.txt

wt37^Iwt37_rep1_salmon$
wt37^Iwt37_rep2_salmon$
wt37^Iwt37_rep3_salmon$
GSNO^IGSNO_rep1_salmon$
GSNO^IGSNO_rep2_salmon$
GSNO^IGSNO_rep3_salmon$
ph8^Iph8_rep1_salmon$
ph8^Iph8_rep2_salmon$
ph8^Iph8_rep3_salmon$

Here, you should see '^I' at positions of tab characters, and '$' at newline characters. This is invaluable for verifying contents of text files when the formatting has to be very precise. If your view doesn't look exactly as above, then you'll need to re-edit your file and be sure to replace any space characters you see with single tab characters, then re-examine it using the 'cat -te' approach.

The condition name (left column) can be named more or less arbitrarily but should reflect your experimental condition. Importantly, again, the replicate names (right column) need to match up exactly with the column headers of your RNA-Seq counts matrix.

Unsupervised Data Exploration

Before delving into differential expression analysis, you should examine your data to determine if there are any confounding issues. For example, your experimental replicates should look highly similar, and they should ideally be more similar to each other than to different sample types, assuming that the signal in the data is stronger than noise.

Trinity comes with a 'PtR' script that we use to simplify making various charts and plots based on a matrix input file. We'll use PtR for our plotting below. Note, PtR uses the 'R' software and environment, essentially generating an R script to generate an image file in pdf format, which you can view using the 'xpdf' utility.

Compare your replicates

Run the following to compare your biological replicates:

% $TRINITY_HOME/Analysis/DifferentialExpression/PtR \
 -m Trinity_trans.isoform.counts.matrix -s samples.txt \
 --log2 --compare_replicates

This should have generated pdf files, with one pdf report for each of your sample types. See what files were most recently created (via. 'ls -ltr') and you should find the following:

-rw-r--r-- 1 brian brian    480868 Jun  1 19:26 wt37.rep_compare.pdf
-rw-r--r-- 1 brian brian    474434 Jun  1 19:26 GSNO.rep_compare.pdf
-rw-r--r-- 1 brian brian    479867 Jun  1 19:26 ph8.rep_compare.pdf

View each of these pdf-formatted reports in your browser.

You should be viewing a multi-page pdf report file, where the views are in the following order:

Counts of reads mapping to transcripts (page 1/4):

Pairwise scatter plot of log2(read counts) per transcript (page 2/4):

Pairwise MA plots (M = log2(avg. read counts), A = log2(difference in read counts)) (page 3/4):

Pearson correlation between log2(replicate counts) (page 4/4):

Compare your replicates across all samples

One common way to compare the sample replicates in the context of all samples is to generate a replicate correlation plot containing all replicates across all samples:

% $TRINITY_HOME/Analysis/DifferentialExpression/PtR \
      -m Trinity_trans.isoform.TMM.EXPR.matrix -s samples.txt \
      --log2 --sample_cor_matrix

This should generate a heatmap. (use 'ls -ltr' to find the name of the file, which should be 'Trinity_trans.counts.matrix.log2.sample_cor_matrix.pdf'). View it within the browser.

Ideally, you should see that your replicates are well correlated and cluster together according to sample type, being more similar to each other than to the different samples. This will generally be the case when the signal in the data is strong.

Another powerful technique for comparing samples and replicates is to use Principal Components Analysis (PCA). Run PCA on your data like so:

% $TRINITY_HOME/Analysis/DifferentialExpression/PtR\
       -m Trinity_trans.isoform.TMM.EXPR.matrix -s samples.txt\
       --log2 --CPM --prin_comp 3

This generates a pdf file containing the PCA plots. Find it by 'ls -ltr' and view it using 'xpdf':

PCA is one of the best ways to identify potentially confounding effects in your data. If your replicates do not cluster according to sample and instead cluster according to some other aspect such as by the person who processed the replicates, the machine the replicates were sequenced on, etc., then you have batch effects that will need to be accounted for. It's best to avoid potential batch effects whenever and wherever possible.