Comparison to the reference transcriptome - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki

Introduction

Now that you have finalized your assemblies for PacBio and MinION, we can go check the actual content of these files and get an overview of what are the transcripts that we have assembled. For this, we propose two methods here: (1) aligning the assemblies to the reference genome in splice-aware mode and (2) compare these alignments to the reference transcriptome.

As for genome assembly, there are some metrics allowing to assess the quality of your transcriptome, but we're not going in the details here today. As for genome assembly, these metrics can be partitioned into two categories: (1) continuity metrics (e.g., the E90N50 transcript contig length, aka the contig N50 value based on the set of transcripts representing 90% of the expression data) and (2) completeness metrics (e.g., BUSCO to explore completeness according to conserved ortholog content).

Align the transcriptome to the reference genome

We will use GMAP as splice-aware aligner suitable for long reads to align the experimental datasets to the reference genome. A splice-aware aligner allows to see the positions to the reference genome directly by mapping RNA-seq reads to the reference genome, not only the transcriptome. This allows to, for example, detect new isoforms other than the annotated (constituting the transcriptome).

The reason why we chose this aligner are mainly two: (1) GMAP appears to be one of the best according to our benchmark work taking as parameters the accuracy and the speed, and (2) allows to directly save the output in GFF format, which is very handy for comparison to the reference annotation (next section).

First, we need to download the reference genome from Ensembl (you can also use the assembly you built in the genome assembly part):

cd $reference
wget ftp://ftp.ensemblgenomes.org/pub/metazoa/release-36/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa.gz
gunzip -d Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa.gz
mv Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa Dmel_chr4.fasta

Then, we need to build the GMAP hash index on the reference genome. This will create a subfolder containing the index for your genome within the folder from which you execute the commands:

gmap_build -d <gmap_index_name> -D <path/to/genome/> Dmel_chr4.fasta
# -d: string for the index name
# -D: path to the genome to be indexed

Finally, we can run the alignment:

gmap -d <gmap_index_name> -D </path/to/genome/> <input_reads> -f gff3_match_cdna > <gmap_output>.gff3
# -d: string for the index name
# -D: path to the genome to be indexed
# -f: specifies the output format, in our case gff3

You shall repeat this on both the fastq files from PacBio Iso-Seq and the MinION corrected with Canu.

Compare the experiemental transcripts to the reference annotation

We will use Cuffcompare to compare our set of experimental transcripts to the reference annotation. This is one software among others to compare transcript annotation files. Other possibilities include: PASA pipeline (Program to Assemble Spliced Alignments) and TACO (Multi-sample transcriptome assembly from RNA-Seq).

If you would like to know more about this topic, you can check this Bioinformatics SE thread. You are very welcome to contribute if any other software comes to your mind.

First, we need to download the reference annotation from Ensembl:

wget ftp://ftp.ensemblgenomes.org/pub/metazoa/release-36/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.36.chromosome.4.gff3.gz
gunzip -d Drosophila_melanogaster.BDGP6.36.chromosome.4.gff3.gz
mv Drosophila_melanogaster.BDGP6.36.chromosome.4.gff3 Dmel_chr4.gff3

And convert it to GTF using gffread:

gffread Dmel_chr4.gff3 -T -o Dmel_chr4.gtf

Now, we have to convert the experimental GFF3 file we got in the previous section to a GTF format. To do this, we prepared a parsing script to make the annotation file compatible with the cuffcompare software:

wget https://drive.switch.ch/index.php/s/ySKNwPmD16GuOQ0/download -O gmap_gff2gtf.py

To run this script, simply make it executable and call it on your experimental annotation file:

chmod +x gmap_gff2gtf.py
./gmap_gff2gtf.py <gmap_output>.gff3 > <gmap_output>.gtf

And now you're ready to compare your annotation to the reference set:

cuffcompare -G -r <reference_gtf>.gtf <gmap_output>.gtf
# -G: tells cuffcompare that the input experimental GTF file might not come from Cufflinks
# -r: specifies the reference file

The files you're interested in are written to cuffcmp.*. Go check on them! This page might be useful to check and understand the output.