expression_quantification - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki
Transcript expression quantitation
To estimate the expression levels of the Trinity-reconstructed transcripts, we use the strategy supported by the RSEM software involving read alignment followed by expectation maximization to assign reads according to maximum likelihood estimates. In essence, we first align the original rna-seq reads back against the Trinity transcripts, then run RSEM to estimate the number of rna-seq fragments that map to each contig. Because the abundance of individual transcripts may significantly differ between samples, the reads from each sample (and each biological replicate) must be examined separately, obtaining sample-specific abundance values.
Abundance estimation using RSEM
We include a script to facilitate running of RSEM on Trinity transcript assemblies. The script we execute below will run the Bowtie aligner to align reads to the Trinity transcripts, and RSEM will then evaluate those alignments to estimate expression values. Again, we need to run this separately for each sample and biological replicate (ie. each pair of fastq files).
Let's start with one of the wt37 fastq pairs like so:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left wt37_rep1_1.fastq \
--right wt37_rep1_2.fastq \
--transcripts Trinity.fasta \
--est_method RSEM --aln_method bowtie \
--trinity_mode \
--output_dir wt37_rep1_rsem \
--prep_reference
The outputs generated from running the command above will exist in the wt37_rep1_rsem/ directory, as we indicate with the --output_dir parameter above. Since this is the first time we're running RSEM, we need to prepare the Trinity.fasta file for alignment with the bowtie software, and so we'll simply include the --prep_reference parameter to invoke that functionality.
The primary output generated by RSEM is the file containing the expression values for each of the transcripts. Examine this file like so:
% head wt37_rep1_rsem/RSEM.isoforms.results | column -t
and you should see the top of a tab-delimited file:
transcript_id gene_id length effective_length expected_count TPM FPKM IsoPct
TRINITY_DN0_c0_g1_i1 TRINITY_DN0_c0_g1 2010 1881.38 225.00 72.01 65.65 100.00
TRINITY_DN0_c0_g2_i1 TRINITY_DN0_c0_g2 2011 1882.38 0.00 0.00 0.00 100.00
TRINITY_DN1000_c0_g1_i1 TRINITY_DN1000_c0_g1 1664 1535.38 208.00 81.57 74.37 100.00
TRINITY_DN1001_c0_g1_i1 TRINITY_DN1001_c0_g1 2948 2819.38 8.00 1.71 1.56 100.00
TRINITY_DN1002_c0_g1_i1 TRINITY_DN1002_c0_g1 351 222.54 1.00 2.71 2.47 100.00
TRINITY_DN1003_c0_g1_i1 TRINITY_DN1003_c0_g1 1913 1784.38 91.00 30.71 28.00 100.00
TRINITY_DN1004_c0_g1_i1 TRINITY_DN1004_c0_g1 2395 2266.38 167.00 44.37 40.45 100.00
TRINITY_DN1005_c0_g1_i1 TRINITY_DN1005_c0_g1 3076 2947.38 425.00 86.82 79.16 100.00
TRINITY_DN1006_c0_g1_i1 TRINITY_DN1006_c0_g1 1141 1012.38 253.00 150.47 137.19 100.00
The key columns in the above RSEM output are the transcript identifier, the 'expected_count' corresponding to the number of RNA-Seq fragments predicted to be derived from that transcript, and the 'TPM' or 'FPKM' columns, which provide normalized expression values for the expression of that transcript in the sample.
The FPKM expression measurement normalizes read counts according to the length of transcripts from which they are derived (as longer transcripts generate more reads at the same expression level), and normalized according to sequencing depth. The FPKM acronym stand for 'fragments per kilobase of cDNA per million fragments mapped'.
TPM 'transcripts per million' is generally the favored metric, as all TPM values should sum to 1 million, and TPM nicely reflects the relative molar concentration of that transcript in the sample. FPKM values, on the other hand, do not always sum to the same value, and do not have the similar property of inherrently representing a proportion within a sample, making comparisons between samples less straightforward. TPM values can be easily computed from FPKM values like so: TPMi = FPKMi / (sum all FPKM values) * 1 million.
Note, a similar RSEM outputted gene expression file exists for the 'gene' level expression data. For example:
% head wt37_rep1_rsem/RSEM.genes.results | column -t
.
gene_id transcript_id(s) length effective_length expected_count TPM FPKM
TRINITY_DN0_c0_g1 TRINITY_DN0_c0_g1_i1 2010.00 1881.38 225.00 72.01 65.65
TRINITY_DN0_c0_g2 TRINITY_DN0_c0_g2_i1 2011.00 1882.38 0.00 0.00 0.00
TRINITY_DN1000_c0_g1 TRINITY_DN1000_c0_g1_i1 1664.00 1535.38 208.00 81.57 74.37
TRINITY_DN1001_c0_g1 TRINITY_DN1001_c0_g1_i1 2948.00 2819.38 8.00 1.71 1.56
TRINITY_DN1002_c0_g1 TRINITY_DN1002_c0_g1_i1 351.00 222.54 1.00 2.71 2.47
TRINITY_DN1003_c0_g1 TRINITY_DN1003_c0_g1_i1 1913.00 1784.38 91.00 30.71 28.00
TRINITY_DN1004_c0_g1 TRINITY_DN1004_c0_g1_i1 2395.00 2266.38 167.00 44.37 40.45
TRINITY_DN1005_c0_g1 TRINITY_DN1005_c0_g1_i1 3076.00 2947.38 425.00 86.82 79.16
Transcript expression quantitation using Salmon
Salmon is a newly developed tool for doing rna-seq based quantitation of transcript expression. It is both very accurate and very fast. Try running salmon on this same set of reads for the sake of comparison.
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left wt37_rep1_1.fastq \
--right wt37_rep1_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir wt37_rep1_salmon \
--prep_reference
In the case of salmon, you'll find the transcript expression results in the file 'wt37_rep1_salmon/quant.sf'. View it like so:
% head wt37_rep1_salmon/quant.sf | column -t
.
Name Length EffectiveLength TPM NumReads
TRINITY_DN144_c0_g1_i1 4833 4705.077 16.169313 137.999745
TRINITY_DN144_c0_g2_i1 2228 2100.077 0.000067 0.000255
TRINITY_DN179_c0_g1_i1 1524 1396.077 89.638905 227.000000
TRINITY_DN159_c0_g1_i1 659 531.133 8.059984 7.765282
TRINITY_DN159_c0_g2_i1 247 120.742 1.071681 0.234718
TRINITY_DN153_c0_g1_i1 2378 2250.077 3.920150 16.000000
TRINITY_DN130_c0_g1_i1 215 89.925 4757.290460 776.000000
TRINITY_DN130_c1_g1_i1 295 168.036 708.651159 216.000000
TRINITY_DN106_c0_g1_i1 4442 4314.077 49.837557 390.000000
and similarly, we have a file containing the 'gene' level expression estimates:
% head wt37_rep1_salmon/quant.sf.genes | column -t
.
Name Length EffectiveLength TPM NumReads
TRINITY_DN4769_c0_g1 621.00 493.14 4.47 4.00
TRINITY_DN4325_c0_g1 1158.00 1030.08 106.50 199.00
TRINITY_DN353_c0_g1 533.00 405.18 10.88 8.00
TRINITY_DN6103_c0_g1 469.00 341.24 6.46 4.00
TRINITY_DN5834_c0_g1 221.00 95.62 28.83 5.00
TRINITY_DN5012_c0_g1 1090.00 962.08 12.61 22.00
TRINITY_DN6811_c0_g1 352.00 224.63 4.91 2.00
TRINITY_DN2924_c0_g1 956.00 828.10 21.30 32.00
TRINITY_DN5980_c0_g1 220.00 94.67 0.00 0.00
Run salmon on each of the remaining pairs of samples.
Running this on all the samples can be monotonous, and with many more samples, advanced users would generally write a short script to fully automate this process. We won't be scripting here, but instead just directly execute abundance estimation just as we did above but on the other five pairs of fastq files. We'll examine the outputs of each in turn, as a sanity check and also just for fun.
Abundance estimates for wt37_rep2:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left wt37_rep2_1.fastq \
--right wt37_rep2_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir wt37_rep2_salmon
Abundance estimates for wt37_rep3:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left wt37_rep3_1.fastq \
--right wt37_rep3_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir wt37_rep3_salmon
Now we're done with processing the wt37 biological replicates, and we'll proceed to now run abundance estimations for the GSNO samples.
Abundance estimates for GSNO_rep1:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left GSNO_rep1_1.fastq \
--right GSNO_rep1_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir GSNO_rep1_salmon
Abundance estimates for GSNO_rep2:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left GSNO_rep2_1.fastq \
--right GSNO_rep2_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir GSNO_rep2_salmon
Abundance estimates for GSNO_rep3:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left GSNO_rep3_1.fastq \
--right GSNO_rep3_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir GSNO_rep3_salmon
And now let's tackle abundance estimation for each of the ph8 sample replicates.
Abundance estimates for ph8_rep1:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left ph8_rep1_1.fastq \
--right ph8_rep1_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir ph8_rep1_salmon
Abundance estimates for ph8_rep2:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left ph8_rep2_1.fastq \
--right ph8_rep2_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir ph8_rep2_salmon
Abundance estimates for ph8_rep3:
% $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
--left ph8_rep3_1.fastq \
--right ph8_rep3_2.fastq \
--transcripts Trinity.fasta \
--est_method salmon \
--trinity_mode \
--output_dir ph8_rep3_salmon
For each of these abundance estimation runs, you should find the corresponding output directories and abundance estimation tables within them.
Construct an expression matrix
Now, given the expression estimates for each of the transcripts in each of the samples, we're going to pull together expression values into matrices containing transcript IDs in the rows, and sample replicate names in the columns. We'll make two matrices, one containing the estimated counts, and another containing the TPM expression values that are cross-sample normalized using the TMM method. This is all done for you by the following script in Trinity, indicating the method we used for expression estimation and providing the list of individual sample abundance estimate files:
% $TRINITY_HOME/util/abundance_estimates_to_matrix.pl \
--est_method salmon --out_prefix Trinity_trans \
--name_sample_by_basedir \
wt37_rep1_salmon/quant.sf \
wt37_rep2_salmon/quant.sf \
wt37_rep3_salmon/quant.sf \
GSNO_rep1_salmon/quant.sf \
GSNO_rep2_salmon/quant.sf \
GSNO_rep3_salmon/quant.sf \
ph8_rep1_salmon/quant.sf \
ph8_rep2_salmon/quant.sf \
ph8_rep3_salmon/quant.sf \
--gene_trans_map Trinity.fasta.gene_trans_map
You should find a matrix file called 'Trinity_trans.counts.matrix', which contains the counts of RNA-Seq fragments mapped to each transcript. Examine the first few lines of the counts matrix:
% head -n20 Trinity_trans.isoform.counts.matrix | column -t
.
wt37_rep1_salmon wt37_rep2_salmon wt37_rep3_salmon GSNO_rep1_salmon GSNO_rep2_salmon GSNO_rep3_salmon ph8_rep1_salmon ph8_rep2_salmon ph8_rep3_salmon
TRINITY_DN89_c0_g2_i1 35.000000 52.000000 42.000000 61.250532 52.762571 55.000000 171.000000 170.000000 176.000000
TRINITY_DN786_c0_g1_i1 1.000000 0.000000 1.000000 2.000000 10.000000 2.000000 7.000000 5.000000 4.000000
TRINITY_DN1457_c0_g1_i1 1.000000 1.000000 2.000000 10.000000 5.000000 4.000000 9.000000 10.000000 13.000000
TRINITY_DN338_c1_g2_i1 11.248017 21.472882 12.603456 10.179001 12.602475 2.404524 3.777531 8.650411 9.468705
TRINITY_DN6706_c0_g1_i1 121.000000 123.000000 142.000000 443.000000 422.000000 343.000000 287.000000 262.000000 259.000000
TRINITY_DN317_c0_g2_i1 51.820365 44.761796 47.121426 66.876764 68.170014 56.500275 194.319409 227.569040 213.340652
TRINITY_DN2912_c0_g1_i1 121.000000 125.000000 145.000000 73.000000 93.000000 58.000000 92.000000 91.000000 78.000000
TRINITY_DN750_c0_g1_i1 0.000000 0.000000 0.000000 2.000000 4.000000 2.000000 5.000000 4.000000 9.000000
TRINITY_DN3879_c0_g1_i1 273.000000 258.000000 333.000000 336.000000 378.000000 278.000000 282.000000 244.000000 257.000000
TRINITY_DN6580_c0_g1_i1 58.000000 46.000000 62.000000 29.000000 38.000000 25.000000 86.000000 86.000000 76.000000
TRINITY_DN5396_c0_g1_i1 7.000000 9.000000 12.000000 8.000000 1.000000 5.000000 35.000000 32.000000 41.000000
TRINITY_DN96_c0_g1_i1 0.000000 42.984681 52.016603 0.000004 0.000000 0.000259 0.000000 0.000000 22.004185
TRINITY_DN416_c0_g1_i5 170.954401 195.661458 70.769534 327.536098 581.461305 1070.337243 76.769551 136.488727 250.761484
TRINITY_DN41_c0_g2_i1 75.134168 68.055046 138.071920 166.617039 86.588482 0.031573 138.550457 208.605443 96.913305
TRINITY_DN4175_c0_g1_i1 7.000000 11.000000 17.000000 242.000000 238.000000 191.000000 256.000000 223.000000 220.000000
TRINITY_DN2677_c0_g1_i1 3.000000 3.000000 4.000000 2.000000 2.000000 0.000000 2.000000 3.000000 1.000000
TRINITY_DN6491_c0_g1_i1 31.000000 39.000000 45.000000 38.000000 44.000000 29.000000 14.000000 11.000000 20.000000
TRINITY_DN631_c0_g1_i1 363.000000 338.000000 347.000000 191.000000 206.000000 148.000000 342.000000 334.000000 311.000000
TRINITY_DN4933_c0_g1_i1 10.000000 6.000000 6.000000 4.000000 8.000000 6.000000 6.000000 4.000000 6.000000
You'll see that the above matrix has integer values representing the number of RNA-Seq paired-end fragments that are estimated to have been derived from that corresponding transcript in each of the samples. Don't be surprised if you see some values that are not exact integers but rather fractional read counts. This happens if there are multiply-mapped reads (such as to common sequence regions of different isoforms), in which case the multiply-mapped reads are fractionally assigned to the corresponding transcripts according to their maximum likelihood.
The counts matrix will be used by edgeR (or other tools in Bioconductor, eg. DESeq2) for statistical analysis and identifying significantly differentially expressed transcripts.
Now take a look at the first few lines of the normalized expression matrix:
% head -n20 Trinity_trans.isoform.TMM.EXPR.matrix | column -t
.
wt37_rep1_salmon wt37_rep2_salmon wt37_rep3_salmon GSNO_rep1_salmon GSNO_rep2_salmon GSNO_rep3_salmon ph8_rep1_salmon ph8_rep2_salmon ph8_rep3_salmon
TRINITY_DN89_c0_g2_i1 6.669 10.264 7.305 12.452 10.531 13.829 27.449 28.930 30.136
TRINITY_DN786_c0_g1_i1 1.537 0.000 1.522 3.335 16.241 4.067 10.259 7.443 6.056
TRINITY_DN1457_c0_g1_i1 0.937 0.973 1.790 10.091 4.932 4.954 7.632 8.750 11.517
TRINITY_DN338_c1_g2_i1 11.499 22.784 12.368 11.215 13.565 3.248 3.522 8.298 9.203
TRINITY_DN6706_c0_g1_i1 31.040 32.694 33.384 121.349 113.442 116.128 62.407 60.259 59.969
TRINITY_DN317_c0_g2_i1 16.058 14.375 13.425 22.144 22.144 23.110 51.298 63.423 59.883
TRINITY_DN2912_c0_g1_i1 31.024 33.210 34.072 19.987 24.988 19.627 19.995 20.919 18.051
TRINITY_DN750_c0_g1_i1 0.000 0.000 0.000 3.703 7.208 4.508 8.221 6.665 15.264
TRINITY_DN3879_c0_g1_i1 111.705 109.433 126.016 147.076 162.230 150.165 99.222 90.313 95.884
TRINITY_DN6580_c0_g1_i1 11.881 9.762 11.604 6.339 8.154 6.758 14.861 15.747 14.004
TRINITY_DN5396_c0_g1_i1 4.458 5.945 7.165 5.465 0.669 4.205 19.589 18.684 24.176
TRINITY_DN96_c0_g1_i1 0.000 12.244 13.119 0.000 0.000 0.000 0.000 0.000 5.466
TRINITY_DN416_c0_g1_i5 180.831 214.813 72.013 373.536 647.696 1496.133 74.311 135.756 252.805
TRINITY_DN41_c0_g2_i1 9.261 8.689 15.475 21.896 11.176 0.005 14.299 22.878 10.688
TRINITY_DN4175_c0_g1_i1 2.452 3.994 5.488 90.622 87.414 88.316 76.684 70.419 69.992
TRINITY_DN2677_c0_g1_i1 9.775 10.047 13.829 7.225 6.977 0.000 6.739 10.237 3.479
TRINITY_DN6491_c0_g1_i1 44.495 58.088 63.561 59.080 66.670 55.042 19.004 15.194 28.080
TRINITY_DN631_c0_g1_i1 56.482 54.479 49.188 31.697 33.569 30.388 44.686 46.323 43.388
TRINITY_DN4933_c0_g1_i1 16.602 10.330 9.927 7.214 14.042 13.177 9.584 6.478 9.888
These are the normalized expression values, which have been further cross-sample normalized using TMM normalization to adjust for any differences in sample composition. TMM normalization assumes that most transcripts are not differentially expressed, and linearly scales the expression values of samples to better enforce this property. TMM normalization is described in A scaling normalization method for differential expression analysis of RNA-Seq data, Robinson and Oshlack, Genome Biology 2010.
We use the TMM-normalized expression matrix when plotting expression values in heatmaps and other expression analyses.
In addition to the transcript count matrix we created earlier, we also have 'gene' level counts and expression matrices:
Trinity_trans.gene.counts.matrix
Trinity_trans.gene.TMM.EXPR.matrix