TranscriptomeAln - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

Alignment

Section: Transcriptome assessment [2/4].

We will use GMAP as splice-aware aligner suitable for long reads to align the experimental datasets to the reference genome. A splice-aware aligner allows to see the positions to the reference genome directly by mapping RNA-seq reads to the reference genome, not only the transcriptome. This allows to, for example, detect new isoforms other than the annotated (constituting the transcriptome).

The reason why we chose this aligner are mainly two: (1) GMAP appears to be one of the best according to our benchmark work taking as parameters the accuracy and the speed, and (2) allows to directly save the output in GFF format, which is very handy for comparison to the reference annotation (next section).

Get the reference

First, we need to download the reference genome from Ensembl. Alternatively, you can also use the assembly you built in the genome assembly part :)

wget ftp://ftp.ensemblgenomes.org/pub/metazoa/release-36/fasta/drosophila_melanogaster/dna/Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa.gz
gunzip -d Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa.gz
mv Drosophila_melanogaster.BDGP6.dna.chromosome.4.fa Dmel_chr4.fasta

GMAP: map transcripts

Then, we need to build the GMAP hash index on the reference genome. This will create a subfolder containing the index for your genome within the folder from which you execute the commands:

gmap_build -d <gmap_index_name> -D <path/to/index/> Dmel_chr4.fasta
# -d: string for the index name
# -D: path to the genome to be indexed

Finally, we can run the alignment:

gmap -d <gmap_index_name> -D </path/to/index/> <input_reads> -f gff3_match_cdna > <gmap_output>.gff3
# -d: string for the index name
# -D: path to the genome to be indexed
# -f: specifies the output format, in our case gff3

You may repeat the mapping (no need to repeat the indexing!) for fasta/q files of:

  • PacBio IsoSeq
  • ONT 2D
  • ONT 1D2

So we can then align them all to the reference and see how they compare.

Note: ignore warnings concerning "no paths found". This was reported in an issue and actually means that the read is of "too low quality" to be taken into account. Just let it go.

Backup

In case you had any issues computing the previous steps, we provide an archive with the output of the 3 alignments:

wget https://drive.switch.ch/index.php/s/FAOUQXBoztH2tDP/download -O gff3.tar.gz

Minimap2: splice mode

Fro visualization in IGV, we also need an alignment file. We can quickly generate one (actually, three!) with minimap2:

minimap2 -ax splice <ref.fa> <rna-reads.fa> > <file.sam> 

You will then need samtools for conversion to bam, indexing and eventually get some stats:

Samtools

Get & install samtools

wget https://github.com/samtools/samtools/releases/download/1.9/samtools-1.9.tar.bz2 
tar jxf samtools-1.9.tar.bz2 
cd samtools-1.9
./configure --prefix=</path/to/location> # e.g /home/training
make; sudo make install

You'll find utilities in the install location, e.g. /home/training/samtools-1.9.

Don't forget, as usual, to pre-pend this path (or export to path!) for the further commands.

Convert to bam:

samtools view -Sb  <file.sam>  >  <file.bam>

Sort bam file:

samtools sort -T . -o <file.sorted.bam> <file.bam> 

And generate an index on the sorted bam file:

samtools index <file.sorted.bam>

IGV for visualization

Reminder: only the samples of IsoSeq and ont2D come from the same species (D. melanogaster).

Backup!

We prepared here a file with the necessary files to be imported in IGV, in case anything went wrong here above:

wget https://drive.switch.ch/index.php/s/drVrFzg2m3UVYoS/download -O igv.dmel.tar.gz

You can visualize your aligned data with IGV. This is especially meaningful if you have high-coverage of a gene/fragment on interest and you want to explore.

You'll have to:

  • download the executable suitable for your machine
  • import an sorted alignment file with index in the same folder (you got backup!)
  • the corresponding genome of interest (fasta, index & annotation)

Note: to index the reference, you can use samtools too!

samtools faidx <ref.fasta>

How does it look like? Browse a bit.

viz_dmel

Next

Back

⚠️ **GitHub.com Fallback** ⚠️