expression_quantification - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki

Transcript expression quantitation

To estimate the expression levels of the Trinity-reconstructed transcripts, we use the strategy supported by the RSEM software involving read alignment followed by expectation maximization to assign reads according to maximum likelihood estimates. In essence, we first align the original rna-seq reads back against the Trinity transcripts, then run RSEM to estimate the number of rna-seq fragments that map to each contig. Because the abundance of individual transcripts may significantly differ between samples, the reads from each sample (and each biological replicate) must be examined separately, obtaining sample-specific abundance values.

Abundance estimation using RSEM

We include a script to facilitate running of RSEM on Trinity transcript assemblies. The script we execute below will run the Bowtie aligner to align reads to the Trinity transcripts, and RSEM will then evaluate those alignments to estimate expression values. Again, we need to run this separately for each sample and biological replicate (ie. each pair of fastq files).

Let's start with one of the wt37 fastq pairs like so:

 %  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left wt37_rep1_1.fastq  \
         --right wt37_rep1_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method RSEM  --aln_method bowtie \
         --trinity_mode \
         --output_dir wt37_rep1_rsem \
         --prep_reference

The outputs generated from running the command above will exist in the wt37_rep1_rsem/ directory, as we indicate with the --output_dir parameter above. Since this is the first time we're running RSEM, we need to prepare the Trinity.fasta file for alignment with the bowtie software, and so we'll simply include the --prep_reference parameter to invoke that functionality.

The primary output generated by RSEM is the file containing the expression values for each of the transcripts. Examine this file like so:

 % head wt37_rep1_rsem/RSEM.isoforms.results | column -t

and you should see the top of a tab-delimited file:

transcript_id            gene_id               length  effective_length  expected_count  TPM     FPKM    IsoPct
TRINITY_DN0_c0_g1_i1     TRINITY_DN0_c0_g1     2010    1881.38           225.00          72.01   65.65   100.00
TRINITY_DN0_c0_g2_i1     TRINITY_DN0_c0_g2     2011    1882.38           0.00            0.00    0.00    100.00
TRINITY_DN1000_c0_g1_i1  TRINITY_DN1000_c0_g1  1664    1535.38           208.00          81.57   74.37   100.00
TRINITY_DN1001_c0_g1_i1  TRINITY_DN1001_c0_g1  2948    2819.38           8.00            1.71    1.56    100.00
TRINITY_DN1002_c0_g1_i1  TRINITY_DN1002_c0_g1  351     222.54            1.00            2.71    2.47    100.00
TRINITY_DN1003_c0_g1_i1  TRINITY_DN1003_c0_g1  1913    1784.38           91.00           30.71   28.00   100.00
TRINITY_DN1004_c0_g1_i1  TRINITY_DN1004_c0_g1  2395    2266.38           167.00          44.37   40.45   100.00
TRINITY_DN1005_c0_g1_i1  TRINITY_DN1005_c0_g1  3076    2947.38           425.00          86.82   79.16   100.00
TRINITY_DN1006_c0_g1_i1  TRINITY_DN1006_c0_g1  1141    1012.38           253.00          150.47  137.19  100.00

The key columns in the above RSEM output are the transcript identifier, the 'expected_count' corresponding to the number of RNA-Seq fragments predicted to be derived from that transcript, and the 'TPM' or 'FPKM' columns, which provide normalized expression values for the expression of that transcript in the sample.

The FPKM expression measurement normalizes read counts according to the length of transcripts from which they are derived (as longer transcripts generate more reads at the same expression level), and normalized according to sequencing depth. The FPKM acronym stand for 'fragments per kilobase of cDNA per million fragments mapped'.

TPM 'transcripts per million' is generally the favored metric, as all TPM values should sum to 1 million, and TPM nicely reflects the relative molar concentration of that transcript in the sample. FPKM values, on the other hand, do not always sum to the same value, and do not have the similar property of inherrently representing a proportion within a sample, making comparisons between samples less straightforward. TPM values can be easily computed from FPKM values like so: TPMi = FPKMi / (sum all FPKM values) * 1 million.

Note, a similar RSEM outputted gene expression file exists for the 'gene' level expression data. For example:

%  head wt37_rep1_rsem/RSEM.genes.results | column -t

gene_id               transcript_id(s)         length   effective_length  expected_count  TPM     FPKM
TRINITY_DN0_c0_g1     TRINITY_DN0_c0_g1_i1     2010.00  1881.38           225.00          72.01   65.65
TRINITY_DN0_c0_g2     TRINITY_DN0_c0_g2_i1     2011.00  1882.38           0.00            0.00    0.00
TRINITY_DN1000_c0_g1  TRINITY_DN1000_c0_g1_i1  1664.00  1535.38           208.00          81.57   74.37
TRINITY_DN1001_c0_g1  TRINITY_DN1001_c0_g1_i1  2948.00  2819.38           8.00            1.71    1.56
TRINITY_DN1002_c0_g1  TRINITY_DN1002_c0_g1_i1  351.00   222.54            1.00            2.71    2.47
TRINITY_DN1003_c0_g1  TRINITY_DN1003_c0_g1_i1  1913.00  1784.38           91.00           30.71   28.00
TRINITY_DN1004_c0_g1  TRINITY_DN1004_c0_g1_i1  2395.00  2266.38           167.00          44.37   40.45
TRINITY_DN1005_c0_g1  TRINITY_DN1005_c0_g1_i1  3076.00  2947.38           425.00          86.82   79.16

Transcript expression quantitation using Salmon

Salmon is a newly developed tool for doing rna-seq based quantitation of transcript expression. It is both very accurate and very fast. Try running salmon on this same set of reads for the sake of comparison.

 %  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left wt37_rep1_1.fastq  \
         --right wt37_rep1_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir wt37_rep1_salmon \
         --prep_reference

In the case of salmon, you'll find the transcript expression results in the file 'wt37_rep1_salmon/quant.sf'. View it like so:

% head wt37_rep1_salmon/quant.sf | column -t

Name                    Length  EffectiveLength  TPM          NumReads
TRINITY_DN144_c0_g1_i1  4833    4705.077         16.169313    137.999745
TRINITY_DN144_c0_g2_i1  2228    2100.077         0.000067     0.000255
TRINITY_DN179_c0_g1_i1  1524    1396.077         89.638905    227.000000
TRINITY_DN159_c0_g1_i1  659     531.133          8.059984     7.765282
TRINITY_DN159_c0_g2_i1  247     120.742          1.071681     0.234718
TRINITY_DN153_c0_g1_i1  2378    2250.077         3.920150     16.000000 
TRINITY_DN130_c0_g1_i1  215     89.925           4757.290460  776.000000
TRINITY_DN130_c1_g1_i1  295     168.036          708.651159   216.000000
TRINITY_DN106_c0_g1_i1  4442    4314.077         49.837557    390.000000

and similarly, we have a file containing the 'gene' level expression estimates:

% head wt37_rep1_salmon/quant.sf.genes | column -t

Name                  Length   EffectiveLength  TPM     NumReads
TRINITY_DN4769_c0_g1  621.00   493.14           4.47    4.00
TRINITY_DN4325_c0_g1  1158.00  1030.08          106.50  199.00
TRINITY_DN353_c0_g1   533.00   405.18           10.88   8.00
TRINITY_DN6103_c0_g1  469.00   341.24           6.46    4.00
TRINITY_DN5834_c0_g1  221.00   95.62            28.83   5.00
TRINITY_DN5012_c0_g1  1090.00  962.08           12.61   22.00
TRINITY_DN6811_c0_g1  352.00   224.63           4.91    2.00
TRINITY_DN2924_c0_g1  956.00   828.10           21.30   32.00
TRINITY_DN5980_c0_g1  220.00   94.67            0.00    0.00

Run salmon on each of the remaining pairs of samples.

Running this on all the samples can be monotonous, and with many more samples, advanced users would generally write a short script to fully automate this process. We won't be scripting here, but instead just directly execute abundance estimation just as we did above but on the other five pairs of fastq files. We'll examine the outputs of each in turn, as a sanity check and also just for fun.

Abundance estimates for wt37_rep2:

  %  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left wt37_rep2_1.fastq  \
         --right wt37_rep2_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir wt37_rep2_salmon

Abundance estimates for wt37_rep3:

 %  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left wt37_rep3_1.fastq  \
         --right wt37_rep3_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir wt37_rep3_salmon

Now we're done with processing the wt37 biological replicates, and we'll proceed to now run abundance estimations for the GSNO samples.

Abundance estimates for GSNO_rep1:

%  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left GSNO_rep1_1.fastq  \
         --right GSNO_rep1_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir GSNO_rep1_salmon

Abundance estimates for GSNO_rep2:

%  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left GSNO_rep2_1.fastq  \
         --right GSNO_rep2_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir GSNO_rep2_salmon

Abundance estimates for GSNO_rep3:

%  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left GSNO_rep3_1.fastq  \
         --right GSNO_rep3_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir GSNO_rep3_salmon

And now let's tackle abundance estimation for each of the ph8 sample replicates.

Abundance estimates for ph8_rep1:

%  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left ph8_rep1_1.fastq  \
         --right ph8_rep1_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir ph8_rep1_salmon

Abundance estimates for ph8_rep2:

%  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left ph8_rep2_1.fastq  \
         --right ph8_rep2_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir ph8_rep2_salmon

Abundance estimates for ph8_rep3:

%  $TRINITY_HOME/util/align_and_estimate_abundance.pl --seqType fq \
         --left ph8_rep3_1.fastq  \
         --right ph8_rep3_2.fastq  \
         --transcripts Trinity.fasta  \
         --est_method salmon \
         --trinity_mode  \
         --output_dir ph8_rep3_salmon

For each of these abundance estimation runs, you should find the corresponding output directories and abundance estimation tables within them.

Construct an expression matrix

Now, given the expression estimates for each of the transcripts in each of the samples, we're going to pull together expression values into matrices containing transcript IDs in the rows, and sample replicate names in the columns. We'll make two matrices, one containing the estimated counts, and another containing the TPM expression values that are cross-sample normalized using the TMM method. This is all done for you by the following script in Trinity, indicating the method we used for expression estimation and providing the list of individual sample abundance estimate files:

% $TRINITY_HOME/util/abundance_estimates_to_matrix.pl \
    --est_method salmon  --out_prefix Trinity_trans \
    --name_sample_by_basedir \
    wt37_rep1_salmon/quant.sf \
    wt37_rep2_salmon/quant.sf \
    wt37_rep3_salmon/quant.sf \
    GSNO_rep1_salmon/quant.sf \
    GSNO_rep2_salmon/quant.sf \
    GSNO_rep3_salmon/quant.sf \
    ph8_rep1_salmon/quant.sf \
    ph8_rep2_salmon/quant.sf \
    ph8_rep3_salmon/quant.sf \
    --gene_trans_map Trinity.fasta.gene_trans_map

You should find a matrix file called 'Trinity_trans.counts.matrix', which contains the counts of RNA-Seq fragments mapped to each transcript. Examine the first few lines of the counts matrix:

% head -n20 Trinity_trans.isoform.counts.matrix | column -t

wt37_rep1_salmon         wt37_rep2_salmon  wt37_rep3_salmon  GSNO_rep1_salmon  GSNO_rep2_salmon  GSNO_rep3_salmon  ph8_rep1_salmon  ph8_rep2_salmon  ph8_rep3_salmon
TRINITY_DN89_c0_g2_i1    35.000000         52.000000         42.000000         61.250532         52.762571         55.000000        171.000000       170.000000       176.000000
TRINITY_DN786_c0_g1_i1   1.000000          0.000000          1.000000          2.000000          10.000000         2.000000         7.000000         5.000000         4.000000
TRINITY_DN1457_c0_g1_i1  1.000000          1.000000          2.000000          10.000000         5.000000          4.000000         9.000000         10.000000        13.000000
TRINITY_DN338_c1_g2_i1   11.248017         21.472882         12.603456         10.179001         12.602475         2.404524         3.777531         8.650411         9.468705
TRINITY_DN6706_c0_g1_i1  121.000000        123.000000        142.000000        443.000000        422.000000        343.000000       287.000000       262.000000       259.000000
TRINITY_DN317_c0_g2_i1   51.820365         44.761796         47.121426         66.876764         68.170014         56.500275        194.319409       227.569040       213.340652
TRINITY_DN2912_c0_g1_i1  121.000000        125.000000        145.000000        73.000000         93.000000         58.000000        92.000000        91.000000        78.000000
TRINITY_DN750_c0_g1_i1   0.000000          0.000000          0.000000          2.000000          4.000000          2.000000         5.000000         4.000000         9.000000
TRINITY_DN3879_c0_g1_i1  273.000000        258.000000        333.000000        336.000000        378.000000        278.000000       282.000000       244.000000       257.000000
TRINITY_DN6580_c0_g1_i1  58.000000         46.000000         62.000000         29.000000         38.000000         25.000000        86.000000        86.000000        76.000000
TRINITY_DN5396_c0_g1_i1  7.000000          9.000000          12.000000         8.000000          1.000000          5.000000         35.000000        32.000000        41.000000
TRINITY_DN96_c0_g1_i1    0.000000          42.984681         52.016603         0.000004          0.000000          0.000259         0.000000         0.000000         22.004185
TRINITY_DN416_c0_g1_i5   170.954401        195.661458        70.769534         327.536098        581.461305        1070.337243      76.769551        136.488727       250.761484
TRINITY_DN41_c0_g2_i1    75.134168         68.055046         138.071920        166.617039        86.588482         0.031573         138.550457       208.605443       96.913305
TRINITY_DN4175_c0_g1_i1  7.000000          11.000000         17.000000         242.000000        238.000000        191.000000       256.000000       223.000000       220.000000
TRINITY_DN2677_c0_g1_i1  3.000000          3.000000          4.000000          2.000000          2.000000          0.000000         2.000000         3.000000         1.000000
TRINITY_DN6491_c0_g1_i1  31.000000         39.000000         45.000000         38.000000         44.000000         29.000000        14.000000        11.000000        20.000000
TRINITY_DN631_c0_g1_i1   363.000000        338.000000        347.000000        191.000000        206.000000        148.000000       342.000000       334.000000       311.000000
TRINITY_DN4933_c0_g1_i1  10.000000         6.000000          6.000000          4.000000          8.000000          6.000000         6.000000         4.000000         6.000000

You'll see that the above matrix has integer values representing the number of RNA-Seq paired-end fragments that are estimated to have been derived from that corresponding transcript in each of the samples. Don't be surprised if you see some values that are not exact integers but rather fractional read counts. This happens if there are multiply-mapped reads (such as to common sequence regions of different isoforms), in which case the multiply-mapped reads are fractionally assigned to the corresponding transcripts according to their maximum likelihood.

The counts matrix will be used by edgeR (or other tools in Bioconductor, eg. DESeq2) for statistical analysis and identifying significantly differentially expressed transcripts.

Now take a look at the first few lines of the normalized expression matrix:

%  head -n20 Trinity_trans.isoform.TMM.EXPR.matrix | column -t

wt37_rep1_salmon         wt37_rep2_salmon  wt37_rep3_salmon  GSNO_rep1_salmon  GSNO_rep2_salmon  GSNO_rep3_salmon  ph8_rep1_salmon  ph8_rep2_salmon  ph8_rep3_salmon
TRINITY_DN89_c0_g2_i1    6.669             10.264            7.305             12.452            10.531            13.829           27.449           28.930           30.136
TRINITY_DN786_c0_g1_i1   1.537             0.000             1.522             3.335             16.241            4.067            10.259           7.443            6.056
TRINITY_DN1457_c0_g1_i1  0.937             0.973             1.790             10.091            4.932             4.954            7.632            8.750            11.517
TRINITY_DN338_c1_g2_i1   11.499            22.784            12.368            11.215            13.565            3.248            3.522            8.298            9.203
TRINITY_DN6706_c0_g1_i1  31.040            32.694            33.384            121.349           113.442           116.128          62.407           60.259           59.969
TRINITY_DN317_c0_g2_i1   16.058            14.375            13.425            22.144            22.144            23.110           51.298           63.423           59.883
TRINITY_DN2912_c0_g1_i1  31.024            33.210            34.072            19.987            24.988            19.627           19.995           20.919           18.051
TRINITY_DN750_c0_g1_i1   0.000             0.000             0.000             3.703             7.208             4.508            8.221            6.665            15.264
TRINITY_DN3879_c0_g1_i1  111.705           109.433           126.016           147.076           162.230           150.165          99.222           90.313           95.884
TRINITY_DN6580_c0_g1_i1  11.881            9.762             11.604            6.339             8.154             6.758            14.861           15.747           14.004
TRINITY_DN5396_c0_g1_i1  4.458             5.945             7.165             5.465             0.669             4.205            19.589           18.684           24.176
TRINITY_DN96_c0_g1_i1    0.000             12.244            13.119            0.000             0.000             0.000            0.000            0.000            5.466
TRINITY_DN416_c0_g1_i5   180.831           214.813           72.013            373.536           647.696           1496.133         74.311           135.756          252.805
TRINITY_DN41_c0_g2_i1    9.261             8.689             15.475            21.896            11.176            0.005            14.299           22.878           10.688
TRINITY_DN4175_c0_g1_i1  2.452             3.994             5.488             90.622            87.414            88.316           76.684           70.419           69.992
TRINITY_DN2677_c0_g1_i1  9.775             10.047            13.829            7.225             6.977             0.000            6.739            10.237           3.479
TRINITY_DN6491_c0_g1_i1  44.495            58.088            63.561            59.080            66.670            55.042           19.004           15.194           28.080
TRINITY_DN631_c0_g1_i1   56.482            54.479            49.188            31.697            33.569            30.388           44.686           46.323           43.388
TRINITY_DN4933_c0_g1_i1  16.602            10.330            9.927             7.214             14.042            13.177           9.584            6.478            9.888

These are the normalized expression values, which have been further cross-sample normalized using TMM normalization to adjust for any differences in sample composition. TMM normalization assumes that most transcripts are not differentially expressed, and linearly scales the expression values of samples to better enforce this property. TMM normalization is described in A scaling normalization method for differential expression analysis of RNA-Seq data, Robinson and Oshlack, Genome Biology 2010.

We use the TMM-normalized expression matrix when plotting expression values in heatmaps and other expression analyses.

In addition to the transcript count matrix we created earlier, we also have 'gene' level counts and expression matrices:

Trinity_trans.gene.counts.matrix
Trinity_trans.gene.TMM.EXPR.matrix