TranscriptomeRefBased - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Transcriptome assessment [3/4].
We will use Cuffcompare to compare our set of experimental transcripts to the reference annotation. This is one software among others to compare transcript annotation files.
Other possibilities include:
- PASA pipeline (Program to Assemble Spliced Alignments)
- TACO (Multi-sample transcriptome assembly from RNA-Seq).
If you would like to know more about this topic, you can check this Bioinformatics SE thread. You are very welcome to contribute if any other software comes to your mind.
First, we need to download the reference annotation from Ensembl:
wget ftp://ftp.ensemblgenomes.org/pub/metazoa/release-36/gff3/drosophila_melanogaster/Drosophila_melanogaster.BDGP6.36.chromosome.4.gff3.gz
gunzip -d Drosophila_melanogaster.BDGP6.36.chromosome.4.gff3.gz
mv Drosophila_melanogaster.BDGP6.36.chromosome.4.gff3 Dmel_chr4.gff3
And convert those gff to GTF
using gffread:
gffread Dmel_chr4.gff3 -T -o Dmel_chr4.gtf
Convert the experimental gff3 files too! (the ones you generated just before).
Now, we have to convert the experimental GFF3 file we got in the previous section to a GTF format. To do this, we prepared a parsing script to make the annotation file compatible with the cuffcompare
software (trickey trick!):
wget https://drive.switch.ch/index.php/s/ySKNwPmD16GuOQ0/download -O gmap_gff2gtf.py
To run this script, simply make it executable and call it on your experimental annotation file:
chmod +x gmap_gff2gtf.py
./gmap_gff2gtf.py <gmap_output>.gff3 > <gmap_output>.gtf
Do it for all the files for which you generated a gff3 output.
Here's backup of the conversion if you need that.
wget https://drive.switch.ch/index.php/s/BhI4u7C4RXhK5yl/download -O gtf.tar.gz
And now you're ready to compare your annotation to the reference set:
cuffcompare -o <prefix> -G -r <reference_gtf>.gtf <gmap_output>.gtf
# -o: prepends <prefix> to outfile name
# -G: tells cuffcompare that the input experimental GTF file might not come from Cufflinks (it did not, ha!)
# -r: specifies the reference file
The files you're interested in are written to <prefix>.*
. Example:
training@e4c716e7a973:~$ ls -lha cuff_isoseq.*
-rw-rw-r-- 1 training training 14K Nov 8 22:33 cuff_isoseq.combined.gtf
-rw-rw-r-- 1 training training 1.2K Nov 8 22:33 cuff_isoseq.isoseq_out.gtf.refmap
-rw-rw-r-- 1 training training 41K Nov 8 22:33 cuff_isoseq.isoseq_out.gtf.tmap
-rw-rw-r-- 1 training training 14K Nov 8 22:33 cuff_isoseq.loci
-rw-rw-r-- 1 training training 1.1K Nov 8 22:33 cuff_isoseq.stats
-rw-rw-r-- 1 training training 41K Nov 8 22:33 cuff_isoseq.tracking
Go check on them! This page might be useful to check and understand the output.