assembly_quality_assessment - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki

Evaluating the assembly

There are several ways to quantitatively as well as qualitatively assess the overall quality of the assembly, and we outline many of these methods at our Trinity wiki.

Assembly Statistics that are NOT very useful

You can count the number of assembled transcripts by using 'grep' to retrieve only the FASTA header lines and piping that output into 'wc' (word count utility) with the '-l' parameter to just count the number of lines.

% grep '>' Trinity.fasta | wc -l

How many were assembled?

It's useful to know how many transcript contigs were assembled, but it's not very informative. The deeper you sequence, the more transcript contigs you will be able to reconstruct. It's not unusual to assemble over a million transcript contigs with very deep data sets and complex transcriptomes, but as you'll see below (in the section containing the more informative guide to assembly assessment) a fraction of the transcripts generally best represent the input RNA-Seq reads.

Examine assembly stats

Capture some basic statistics about the Trinity assembly:

 % $TRINITY_HOME/util/TrinityStats.pl Trinity.fasta

################################
## Counts of transcripts, etc.
################################
Total trinity 'genes':	7648
Total trinity transcripts:	7719
Percent GC: 38.88

########################################
Stats based on ALL transcript contigs:
########################################

Contig N10: 4318
Contig N20: 3395
Contig N30: 2863
Contig N40: 2466
Contig N50: 2065

Median contig length: 1038
Average contig: 1354.26
Total assembled bases: 10453524


#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################

Contig N10: 4317
Contig N20: 3375
Contig N30: 2850
Contig N40: 2458
Contig N50: 2060

Median contig length: 1044
Average contig: 1354.49
Total assembled bases: 10359175

The total number of reconstructed transcripts should match up identically to what we counted earlier with our simple 'grep | wc' command. The total number of 'genes' is also reported - and simply involves counting up the number of unique transcript identifier prefixes (without the _i isoform numbers). When the 'gene' and 'transcript' identifiers differ, it's due to transcripts being reported as alternative isoforms for the same gene. In our tiny example data set, we unfortunately do not reconstruct any alternative isoforms, and note that alternative splicing in this yeast species may be fairly rare. Tackling an insect or mammal transcriptome would be expected to yield many alternative isoforms.

You'll also see 'Contig N50' values reported. You'll remmeber from the earlier lectures on genome assembly that the 'N50 statistic indica tes that at least half of the assembled bases are in contigs of at least that contig length'. We extend the N50 statistic to provide N40, N30, etc. statistics with similar meaning. As the N-value is decreased, the corresponding length will increase.

Most of this is not quantitatively useful, and the values are only reported for historical reasons - it's simply what everyone used to do in the early days of transcriptome assembly. The N50 statistic in RNA-Seq assembly can be easily biased in the following ways:

Overzealous reconstruction of long alternatively spliced isoforms: If an assembler tends to generate many different 'versions' of splicing for a gene, such as in a combinatorial way, and those isoforms tend to have long sequence lengths, the N50 value will be skewed towards a higher value.
Highly sensitive reconstruction of lowly expressed isoforms: If an assembler is able to reconstruct transcript contigs for those transcirpts that are very lowly expressed, these contigs will tend to be short and numerous, biasing the N50 value towards lower values. As one sequences deeper, there will be more evidence (reads) available to enable reconstruction of these short lowly expressed transcripts, and so deeper sequencing can also provide a downward skew of the N50 value.

Assembly statistics that are MORE useful

We now move into the section containing more meaningful metrics for evaluating your transcriptome assembly.

Representation of reads

A high quality transcriptome assembly is expected to have strong representation of the reads input to the assembler. By aligning the RNA-Seq reads back to the transcriptome assembly, we can quantify read representation. In Trinity, we have a script that will count the number of reads that are found as properly paired alignments in addition to those that align to separate contigs (improperly paired), in addition to those cases where only one of the paired reads (the left or the right read of the pair) aligns to a contig. Run the following, first using bowtie2 to generate alignments and then counting the number of reads according to alignment configurations:

Note, for the sake of time and compute resources, we'll run this on the small set of reads we extracted earlier, but normally you would run this on each of your sets of reads.

First build a bowtie2 index for the Trinity assembly:

% bowtie2-build Trinity.fasta Trinity.fasta

Next align the reads to the Trinity assembly and generate a readname-sorted bam file:

% bowtie2 --local --no-unal -x Trinity.fasta -q -1 left.100k.fastq -2 right.100k.fastq \
        | samtools view -Sb - | samtools sort -o bowtie2.coordSorted.bam

[samopen] SAM header is present: 7719 sequences.
100000 reads; of these:
 100000 (100.00%) were paired; of these:
   1396 (1.40%) aligned concordantly 0 times
  92183 (92.18%) aligned concordantly exactly 1 time
  6421 (6.42%) aligned concordantly >1 times
  ----
  1396 pairs aligned concordantly 0 times; of these:
   372 (26.65%) aligned discordantly 1 time
  ----
  1024 pairs aligned 0 times concordantly or discordantly; of these:
  2048 mates make up the pairs; of these:
    1483 (72.41%) aligned 0 times
    314 (15.33%) aligned exactly 1 time
    251 (12.26%) aligned >1 times
99.26% overall alignment rate

The output file generated above will be a single bam file 'bowtie2.nameSorted.bam', which has the reads sorted according to the read name. Using this file and another script in the Trinity software suite, we can count the number of reads that correspond to the different alignment categories.

Generally, in a high quality assembly, you would expect to see at least ~70% and at least ~70% of the reads to exist as proper pairs. If you find a large number of 'improper_pairs', it generally indicates that both paired reads align to the assembly, but they align to different transcript contigs - which is usually the sign of a fractured assembly. In that case, deeper sequencing and assembly of more reads would be expected to lead to major improvements here.

Another look at assembly quality statistics: ExN50

Although we outline above several of the reasons for why the contig N50 statistic is not a useful metric of assembly quality, below we describe the use of an alternative statistic - the ExN50 value, which we assert is more useful in assessing the quality of the transcriptome assembly. The ExN50 indicates the N50 contig statistic (as earlier) but restricted to the top most highly expressed transcripts. Compute it like so:

% $TRINITY_HOME/util/misc/contig_ExN50_statistic.pl Trinity_trans.isoform.TMM.EXPR.matrix \
     Trinity.fasta > ExN50.stats

View the contents of the above output file:

% cat ExN50.stats  | column -t

Ex   ExN50  num_transcripts
3    290    1
5    427    2
6    290    3
7    415    4
8    427    5
9    610    6
10   801    7
11   610    9
12   610    10
...
84   1859   1583
85   1897   1708
86   1920   1846
87   1938   1997
88   1984   2163
89   2002   2347
90   2017   2550
91   2042   2779
92   2081   3037
93   2132   3335
94   2168   3680
95   2219   4091
96   2245   4599
97   2235   5261
98   2152   6173
99   2066   7705
100  2065   7719

The N50 based on all expressed transcript contigs (E100) is 2065, but we find the peak N50 at E96 of 2245. The peak ExN50 can vary considerably depending on the assembly, and it can often be an indicator of the quality of the assembly.

Try plotting the ExN50 statistics:

% $TRINITY_HOME/util/misc/plot_ExN50_statistic.Rscript ExN50.stats

The above command will generate a pdf file 'ExN50.stats.plot.pdf' containing the plot of N50 contig length vs. percent of expression data. View the pdf file within your browser via the apache link.

As you can see, the N50 value will tend to peak at a value higher than that computed using the entire data set. With a high quality transcriptome assembly, the N50 value tends to peak at ~90% or higher amount of the expression data for a high quality assembly. Reporting the ExN50 contig length and corresponding Ex transcript count are more meaningful than reporting statistics based on the entire set of assembled transcripts.

Other examples of ExN50 plots based on assemblies varying the number of input reads are available here:

https://github.com/trinityrnaseq/trinityrnaseq/wiki/Transcriptome%20Contig%20Nx%20and%20ExN50%20stats

You can see that as you sequence deeper, you'll end up with an assembly that has an ExN50 peak that approaches the use of ~90% or more of the expression data.

Using IGV to examine read support for assembled transcripts

Every assembled transcript is only as valid as the reads that support it. If you ever want to examine the read support for one of your favorite transcripts, you could do this using the IGV browser.

Try setting up IGV on your machine by visiting the IGV website and clicking on oen of the Java Web Start icons in the middle of the page. For most of you, the 'Launch with 2GB' webstart button should be appropriate.

Let's aim to view the earlier bowtie2-aligned reads to our Trinity assembly within IGV. IGV (like most other tools that view or otherwise analyze bam files) require a coordinate-sorted bam file, like we generated earlier as 'bowtie2.coordSorted.bam'.

First, index the bam file (generating a corresponding .bai file):

%  samtools index  bowtie2.coordSorted.bam

also index the fasta file (generating a corresponding .fai file) using a different command:

% samtools faidx Trinity.fasta

We'll access the following data files from your apache link for viewing in IGV according to their URLs:

Trinity.fasta
Trinity.fasta.fai
bowtie2.coordSorted.bam
bowtie2.coordSorted.bam.bai

Note, if you do not find each of the files listed in your apache browser, try refreshing the page to get a fresh view.

Load in the 'Trinity.fasta' file as a 'genome' via the IGV 'Genomes'->'Load Genome from URL' menu. Right-click the name of the file from within your Apache view in order to 'Copy Link Address' and paste it in to the IGV dialog.

Load in the 'bowtie2.coordSorted.bam' file via the IGV 'File'->'Load from URL' menu.

Take some time to familiarize yourself with IGV. Look at a few transcripts and consider the read support. View the reads as pairs to examine the paired-read linkages.

From your earlier expression data (counts matrix), try finding a highly expressed transcript and view alignments to it within IGV.

Assess number of full-length coding transcripts

Another very useful metric in evaluating your assembly is to assess the number of fully reconstructed coding transcripts. This can be done by performing a BLASTX search of your assembled transcript sequences to a high quality database of protein sequences, such as provided by SWISSPROT. Searching a large protein database using BLASTX can take a while - longer than we want during this workshop, so instead, we'll search the mini-version of SWISSPROT that comes installed in our data/ directory:

% blastx -query Trinity.fasta \
         -db ~/shared_ro/dbs/sprot.mini.pep -out blastx.outfmt6 \
         -evalue 1e-20 -num_threads 2 -max_target_seqs 1 -outfmt 6

The above blastx command will have generated an output file 'blastx.outfmt6', storing only the single best matching protein given the E-value threshold of 1e-20. By running another script in the Trinity suite, we can compute the length representation of best matching SWISSPROT matches like so:

% $TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl \
             blastx.outfmt6 Trinity.fasta \
             ~/shared_ro/dbs/sprot.mini.pep | column -t

#hit_pct_cov_bin  count_in_bin  >bin_below
100               2460          2460
90                369           2829
80                224           3053
70                203           3256
60                194           3450
50                149           3599
40                121           3720
30                95            3815
20                81            3896
10                16            3912

The above table lists bins of percent length coverage of the best matching protein sequence along with counts of proteins found within that bin. For example, 2460 proteins are matched by more than 90% of their length up to 100% of their length. There are 369 matched by more than 80% and up to 90% of their length. The third column provides a running total, indicating that 2829 transcripts match more than 80% of their length, and 3053 transcripts match more than 70% of their length, etc.

The count of full-length transcripts is going to be dependent on how good the assembly is in addition to the depth of sequencing, but should saturate at higher levels of sequencing. Performing this full-length transcript analysis using assemblies at different read depths and plotting the number of full-length transcripts as a function of sequencing depth will give you an idea of whether or not you've sequenced deeply enough or you should consider doing more RNA-Seq to capture more transcripts and obtain a better (more complete) assembly.

We'll explore some additional metrics that are useful in assessing the assembly quality below, but they require that we estimate expression values for our transcripts, so we'll tackle that first.