TranscriptomeAln - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Transcriptome assessment [2/4].
We will use GMAP as splice-aware aligner suitable for long reads to align the experimental datasets to the reference genome. A splice-aware aligner allows to see the positions to the reference genome directly by mapping RNA-seq reads to the reference genome, not only the transcriptome. This allows to, for example, detect new isoforms other than the annotated (constituting the transcriptome).
The reason why we chose this aligner are mainly two: (1) GMAP appears to be one of the best according to our benchmark work taking as parameters the accuracy and the speed, and (2) allows to directly save the output in GFF format, which is very handy for comparison to the reference annotation (next section).
First, we need to download the reference genome from Ensembl. Alternatively, you can also use the assembly you built in the genome assembly part :)
wget ftp://ftp.ensemblgenomes.org/pub/metazoa/release-36/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa.gz
gunzip -d Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa.gz
mv Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa Dmel_chr4.fasta
Then, we need to build the GMAP hash index on the reference genome. This will create a subfolder containing the index for your genome within the folder from which you execute the commands:
gmap_build -d <gmap_index_name> -D <path/to/index/> Dmel_chr4.fasta
# -d: string for the index name
# -D: path to the genome to be indexed
Finally, we can run the alignment:
gmap -d <gmap_index_name> -D </path/to/index/> <input_reads> -f gff3_match_cdna > <gmap_output>.gff3
# -d: string for the index name
# -D: path to the genome to be indexed
# -f: specifies the output format, in our case gff3
You may repeat the mapping (no need to repeat the indexing!) for fasta/q files of:
- PacBio IsoSeq
- ONT 2D
- ONT 1D2
So we can then align them all to the reference and see how they compare.
Note: ignore warnings concerning "no paths found". This was reported in an issue and actually means that the read is of "too low quality" to be taken into account. Just let it go.
In case you had any issues computing the previous steps, we provide an archive with the output of the 3 alignments:
wget https://drive.switch.ch/index.php/s/FAOUQXBoztH2tDP/download -O gff3.tar.gz
Fro visualization in IGV, we also need an alignment file. We can quickly generate one (actually, three!) with minimap2:
minimap2 -ax splice <ref.fa> <rna-reads.fa> > <file.sam>
You will then need samtools
for conversion to bam, indexing and eventually get some stats:
Get & install samtools
wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2
tar jxf samtools-1.9.tar.bz2
cd samtools-1.9
./configure --prefix=</path/to/location> # e.g /home/training
make; sudo make install
You'll find utilities in the install location, e.g. /home/training/samtools-1.9
.
Don't forget, as usual, to pre-pend this path (or export to path!) for the further commands.
Convert to bam:
samtools view -Sb <file.sam> > <file.bam>
Sort bam file:
samtools sort -T . -o <file.sorted.bam> <file.bam>
And generate an index on the sorted bam file:
samtools index <file.sorted.bam>
Reminder: only the samples of IsoSeq and ont2D come from the same species (D. melanogaster).
We prepared here a file with the necessary files to be imported in IGV, in case anything went wrong here above:
wget https://drive.switch.ch/index.php/s/drVrFzg2m3UVYoS/download -O igv.dmel.tar.gz
You can visualize your aligned data with IGV. This is especially meaningful if you have high-coverage of a gene/fragment on interest and you want to explore.
You'll have to:
- download the executable suitable for your machine
- import an sorted alignment file with index in the same folder (you got backup!)
- the corresponding genome of interest (fasta, index & annotation)
Note: to index the reference, you can use samtools too!
samtools faidx <ref.fasta>
How does it look like? Browse a bit.