assembly_quality_assessment - trinityrnaseq/BerlinTrinityWorkshop2018 GitHub Wiki
Evaluating the assembly
There are several ways to quantitatively as well as qualitatively assess the overall quality of the assembly, and we outline many of these methods at our Trinity wiki.
Assembly Statistics that are NOT very useful
You can count the number of assembled transcripts by using 'grep' to retrieve only the FASTA header lines and piping that output into 'wc' (word count utility) with the '-l' parameter to just count the number of lines.
% grep '>' Trinity.fasta | wc -l
How many were assembled?
It's useful to know how many transcript contigs were assembled, but it's not very informative. The deeper you sequence, the more transcript contigs you will be able to reconstruct. It's not unusual to assemble over a million transcript contigs with very deep data sets and complex transcriptomes, but as you'll see below (in the section containing the more informative guide to assembly assessment) a fraction of the transcripts generally best represent the input RNA-Seq reads.
Examine assembly stats
Capture some basic statistics about the Trinity assembly:
% $TRINITY_HOME/util/TrinityStats.pl Trinity.fasta
.
################################
## Counts of transcripts, etc.
################################
Total trinity 'genes': 7648
Total trinity transcripts: 7719
Percent GC: 38.88
########################################
Stats based on ALL transcript contigs:
########################################
Contig N10: 4318
Contig N20: 3395
Contig N30: 2863
Contig N40: 2466
Contig N50: 2065
Median contig length: 1038
Average contig: 1354.26
Total assembled bases: 10453524
#####################################################
## Stats based on ONLY LONGEST ISOFORM per 'GENE':
#####################################################
Contig N10: 4317
Contig N20: 3375
Contig N30: 2850
Contig N40: 2458
Contig N50: 2060
Median contig length: 1044
Average contig: 1354.49
Total assembled bases: 10359175
The total number of reconstructed transcripts should match up identically to what we counted earlier with our simple 'grep | wc' command. The total number of 'genes' is also reported - and simply involves counting up the number of unique transcript identifier prefixes (without the _i isoform numbers). When the 'gene' and 'transcript' identifiers differ, it's due to transcripts being reported as alternative isoforms for the same gene. In our tiny example data set, we unfortunately do not reconstruct any alternative isoforms, and note that alternative splicing in this yeast species may be fairly rare. Tackling an insect or mammal transcriptome would be expected to yield many alternative isoforms.
You'll also see 'Contig N50' values reported. You'll remmeber from the earlier lectures on genome assembly that the 'N50 statistic indica tes that at least half of the assembled bases are in contigs of at least that contig length'. We extend the N50 statistic to provide N40, N30, etc. statistics with similar meaning. As the N-value is decreased, the corresponding length will increase.
Most of this is not quantitatively useful, and the values are only reported for historical reasons - it's simply what everyone used to do in the early days of transcriptome assembly. The N50 statistic in RNA-Seq assembly can be easily biased in the following ways:
-
Overzealous reconstruction of long alternatively spliced isoforms: If an assembler tends to generate many different 'versions' of splicing for a gene, such as in a combinatorial way, and those isoforms tend to have long sequence lengths, the N50 value will be skewed towards a higher value.
-
Highly sensitive reconstruction of lowly expressed isoforms: If an assembler is able to reconstruct transcript contigs for those transcirpts that are very lowly expressed, these contigs will tend to be short and numerous, biasing the N50 value towards lower values. As one sequences deeper, there will be more evidence (reads) available to enable reconstruction of these short lowly expressed transcripts, and so deeper sequencing can also provide a downward skew of the N50 value.
Assembly statistics that are MORE useful
We now move into the section containing more meaningful metrics for evaluating your transcriptome assembly.
Representation of reads
A high quality transcriptome assembly is expected to have strong representation of the reads input to the assembler. By aligning the RNA-Seq reads back to the transcriptome assembly, we can quantify read representation. In Trinity, we have a script that will count the number of reads that are found as properly paired alignments in addition to those that align to separate contigs (improperly paired), in addition to those cases where only one of the paired reads (the left or the right read of the pair) aligns to a contig. Run the following, first using bowtie2 to generate alignments and then counting the number of reads according to alignment configurations:
Note, for the sake of time and compute resources, we'll run this on the small set of reads we extracted earlier, but normally you would run this on each of your sets of reads.
First build a bowtie2 index for the Trinity assembly:
% bowtie2-build Trinity.fasta Trinity.fasta
Next align the reads to the Trinity assembly and generate a readname-sorted bam file:
% bowtie2 --local --no-unal -x Trinity.fasta -q -1 left.100k.fastq -2 right.100k.fastq \
| samtools view -Sb - | samtools sort -o bowtie2.coordSorted.bam
.
[samopen] SAM header is present: 7719 sequences.
100000 reads; of these:
100000 (100.00%) were paired; of these:
1396 (1.40%) aligned concordantly 0 times
92183 (92.18%) aligned concordantly exactly 1 time
6421 (6.42%) aligned concordantly >1 times
----
1396 pairs aligned concordantly 0 times; of these:
372 (26.65%) aligned discordantly 1 time
----
1024 pairs aligned 0 times concordantly or discordantly; of these:
2048 mates make up the pairs; of these:
1483 (72.41%) aligned 0 times
314 (15.33%) aligned exactly 1 time
251 (12.26%) aligned >1 times
99.26% overall alignment rate
The output file generated above will be a single bam file 'bowtie2.nameSorted.bam', which has the reads sorted according to the read name. Using this file and another script in the Trinity software suite, we can count the number of reads that correspond to the different alignment categories.
Generally, in a high quality assembly, you would expect to see at least ~70% and at least ~70% of the reads to exist as proper pairs. If you find a large number of 'improper_pairs', it generally indicates that both paired reads align to the assembly, but they align to different transcript contigs - which is usually the sign of a fractured assembly. In that case, deeper sequencing and assembly of more reads would be expected to lead to major improvements here.
Another look at assembly quality statistics: ExN50
Although we outline above several of the reasons for why the contig N50 statistic is not a useful metric of assembly quality, below we describe the use of an alternative statistic - the ExN50 value, which we assert is more useful in assessing the quality of the transcriptome assembly. The ExN50 indicates the N50 contig statistic (as earlier) but restricted to the top most highly expressed transcripts. Compute it like so:
% $TRINITY_HOME/util/misc/contig_ExN50_statistic.pl Trinity_trans.isoform.TMM.EXPR.matrix \
Trinity.fasta > ExN50.stats
View the contents of the above output file:
% cat ExN50.stats | column -t
.
Ex ExN50 num_transcripts
3 290 1
5 427 2
6 290 3
7 415 4
8 427 5
9 610 6
10 801 7
11 610 9
12 610 10
...
84 1859 1583
85 1897 1708
86 1920 1846
87 1938 1997
88 1984 2163
89 2002 2347
90 2017 2550
91 2042 2779
92 2081 3037
93 2132 3335
94 2168 3680
95 2219 4091
96 2245 4599
97 2235 5261
98 2152 6173
99 2066 7705
100 2065 7719
The N50 based on all expressed transcript contigs (E100) is 2065, but we find the peak N50 at E96 of 2245. The peak ExN50 can vary considerably depending on the assembly, and it can often be an indicator of the quality of the assembly.
Try plotting the ExN50 statistics:
% $TRINITY_HOME/util/misc/plot_ExN50_statistic.Rscript ExN50.stats
The above command will generate a pdf file 'ExN50.stats.plot.pdf' containing the plot of N50 contig length vs. percent of expression data. View the pdf file within your browser via the apache link.
As you can see, the N50 value will tend to peak at a value higher than that computed using the entire data set. With a high quality transcriptome assembly, the N50 value tends to peak at ~90% or higher amount of the expression data for a high quality assembly. Reporting the ExN50 contig length and corresponding Ex transcript count are more meaningful than reporting statistics based on the entire set of assembled transcripts.
Other examples of ExN50 plots based on assemblies varying the number of input reads are available here:
You can see that as you sequence deeper, you'll end up with an assembly that has an ExN50 peak that approaches the use of ~90% or more of the expression data.
Using IGV to examine read support for assembled transcripts
Every assembled transcript is only as valid as the reads that support it. If you ever want to examine the read support for one of your favorite transcripts, you could do this using the IGV browser.
Try setting up IGV on your machine by visiting the IGV website and clicking on oen of the Java Web Start icons in the middle of the page. For most of you, the 'Launch with 2GB' webstart button should be appropriate.
Let's aim to view the earlier bowtie2-aligned reads to our Trinity assembly within IGV. IGV (like most other tools that view or otherwise analyze bam files) require a coordinate-sorted bam file, like we generated earlier as 'bowtie2.coordSorted.bam'.
First, index the bam file (generating a corresponding .bai file):
% samtools index bowtie2.coordSorted.bam
also index the fasta file (generating a corresponding .fai file) using a different command:
% samtools faidx Trinity.fasta
We'll access the following data files from your apache link for viewing in IGV according to their URLs:
- Trinity.fasta
- Trinity.fasta.fai
- bowtie2.coordSorted.bam
- bowtie2.coordSorted.bam.bai
Note, if you do not find each of the files listed in your apache browser, try refreshing the page to get a fresh view.
Load in the 'Trinity.fasta' file as a 'genome' via the IGV 'Genomes'->'Load Genome from URL' menu. Right-click the name of the file from within your Apache view in order to 'Copy Link Address' and paste it in to the IGV dialog.
Load in the 'bowtie2.coordSorted.bam' file via the IGV 'File'->'Load from URL' menu.
Take some time to familiarize yourself with IGV. Look at a few transcripts and consider the read support. View the reads as pairs to examine the paired-read linkages.
From your earlier expression data (counts matrix), try finding a highly expressed transcript and view alignments to it within IGV.
Assess number of full-length coding transcripts
Another very useful metric in evaluating your assembly is to assess the number of fully reconstructed coding transcripts. This can be done by performing a BLASTX search of your assembled transcript sequences to a high quality database of protein sequences, such as provided by SWISSPROT. Searching a large protein database using BLASTX can take a while - longer than we want during this workshop, so instead, we'll search the mini-version of SWISSPROT that comes installed in our data/ directory:
% blastx -query Trinity.fasta \
-db ~/shared_ro/dbs/sprot.mini.pep -out blastx.outfmt6 \
-evalue 1e-20 -num_threads 2 -max_target_seqs 1 -outfmt 6
The above blastx command will have generated an output file 'blastx.outfmt6', storing only the single best matching protein given the E-value threshold of 1e-20. By running another script in the Trinity suite, we can compute the length representation of best matching SWISSPROT matches like so:
% $TRINITY_HOME/util/analyze_blastPlus_topHit_coverage.pl \
blastx.outfmt6 Trinity.fasta \
~/shared_ro/dbs/sprot.mini.pep | column -t
.
#hit_pct_cov_bin count_in_bin >bin_below
100 2460 2460
90 369 2829
80 224 3053
70 203 3256
60 194 3450
50 149 3599
40 121 3720
30 95 3815
20 81 3896
10 16 3912
The above table lists bins of percent length coverage of the best matching protein sequence along with counts of proteins found within that bin. For example, 2460 proteins are matched by more than 90% of their length up to 100% of their length. There are 369 matched by more than 80% and up to 90% of their length. The third column provides a running total, indicating that 2829 transcripts match more than 80% of their length, and 3053 transcripts match more than 70% of their length, etc.
The count of full-length transcripts is going to be dependent on how good the assembly is in addition to the depth of sequencing, but should saturate at higher levels of sequencing. Performing this full-length transcript analysis using assemblies at different read depths and plotting the number of full-length transcripts as a function of sequencing depth will give you an idea of whether or not you've sequenced deeply enough or you should consider doing more RNA-Seq to capture more transcripts and obtain a better (more complete) assembly.
We'll explore some additional metrics that are useful in assessing the assembly quality below, but they require that we estimate expression values for our transcripts, so we'll tackle that first.