Example: Long Read superTranscriptome Construction and Visualisation - Oshlack/Lace GitHub Wiki


In this tutorial I will go through the steps of constructing superTranscripts from long read RNA-seq data (for example from Oxford Nanopore or PacBio).

If you don't want to read the details but just want the recipe for making a tasty long read superTranscriptome, read the following summary:

Quick Overview

  1. De novo assemble reads into contigs (for example with Trinity or Canu)

  2. Cluster contigs (for example with RapClust or [Corset](https://github.com/Oshlack/Corset/wiki))

  3. Weave together superTranscripts with Lace

  4. Visualise clusters with STViewer (part of the Lace suite)

Step 1 - De Novo Assembly

Here we use Canu, which specialises in long read genome assembly. We could just as easily have used Trinity, but we already go through how to use Trinity in another tutorial, and it is good to promote other excellent tools.

N.B. You can download this play dataset here

 canu \
 -p MCF7 -d MCF7-pac \
 genomeSize=3000m \
 -pacbio-raw IsoSeq_MCF7_2015edition_polished.unimapped.fasta

Step 2 - Cluster

Here you can use either RapClust or Corset (in fact they are very similar in approach). The main difference (apart from speed) is that RapClust clusters using MCL on equivalence classes from reads pseudo-aligned to the transcriptome, whereas Corset groups reads aligned to the transcriptome using a hierarchical clustering method. Here I will show how to use RapClust (since I highlighted how to use Corset in another tutorial).
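To get an intuition for what the clustering is doing, here is a rough sketch (not RapClust's actual algorithm): contigs that share pseudo-aligned reads get linked, and the connected components of that graph form candidate clusters. RapClust refines this idea with MCL; the contig names below are made up.

```python
from collections import defaultdict

def cluster_contigs(equivalence_classes):
    """equivalence_classes: iterable of contig-name tuples, one per
    equivalence class (the set of contigs a read maps equally well to).
    Returns sorted clusters of contigs connected by shared reads."""
    # Link contigs that co-occur in an equivalence class.
    adj = defaultdict(set)
    for eq in equivalence_classes:
        for c in eq:
            adj[c].update(x for x in eq if x != c)
    # Plain depth-first search for connected components.
    seen, clusters = set(), []
    for start in adj:
        if start in seen:
            continue
        stack, comp = [start], set()
        while stack:
            c = stack.pop()
            if c in seen:
                continue
            seen.add(c)
            comp.add(c)
            stack.extend(adj[c] - seen)
        clusters.append(sorted(comp))
    return sorted(clusters)

# tig1 and tig2 share reads, tig3 stands alone.
print(cluster_contigs([("tig1", "tig2"), ("tig2",), ("tig3",)]))
# → [['tig1', 'tig2'], ['tig3']]
```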

Step 2a - Pseudo-Align

Before clustering we need to pseudo-align the reads (or align them with STAR if we wanted to use Corset). Using Salmon or Sailfish one can index the transcriptome and pseudo-align, passing the --dumpEq flag to print the equivalence classes needed by RapClust.

cd MCF7-pac
salmon index -t MCF7.contigs.fasta -i transcripts_index --type quasi -k 31
salmon quant -i transcripts_index -l A -r ../IsoSeq_MCF7_2015edition_polished.unimapped.fasta -o transcripts_quant --dumpEq    

Step 2b - RapClust Time

Now, because (at the current time of writing) the latest binary version of Salmon is not 100% compatible with RapClust, we need to rename the folder aux_info -> aux. Furthermore, we need to ensure we are using Python 2.7, since that is what RapClust was written in.

mv transcripts_quant/aux_info/ transcripts_quant/aux
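If you are scripting the pipeline, the rename can also be done defensively from Python, so it is a no-op when the folder has already been renamed (the helper name here is my own, not part of Salmon or RapClust):

```python
import os

def fix_aux_dir(quant_dir):
    """Rename Salmon's aux_info/ directory to aux/ so that RapClust can
    find it; does nothing if aux/ already exists."""
    old = os.path.join(quant_dir, "aux_info")
    new = os.path.join(quant_dir, "aux")
    if os.path.isdir(old) and not os.path.isdir(new):
        os.rename(old, new)
    return os.path.isdir(new)
```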

In order to run RapClust one should provide a config.yaml file with details on the samples, conditions and output directory; for a single sample, as we have here, this is easy (the condition name is just a placeholder). config.yaml:

conditions:
    - A
samples:
    A:
        - transcripts_quant
outdir: canu_rapclust
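If you end up running this over many quantification directories, the config can also be generated as plain text from a short script. This is a minimal sketch following RapClust's documented conditions/samples/outdir layout; the function name and the condition placeholder are my own.

```python
def write_rapclust_config(path, quant_dirs, outdir, condition="A"):
    """Write a minimal RapClust config.yaml listing one condition whose
    samples are the given Salmon quantification directories."""
    lines = ["conditions:", "    - " + condition, "samples:", "    " + condition + ":"]
    lines += ["        - " + d for d in quant_dirs]
    lines.append("outdir: " + outdir)
    with open(path, "w") as f:
        f.write("\n".join(lines) + "\n")

write_rapclust_config("config.yaml", ["transcripts_quant"], "canu_rapclust")
```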

Now we can cluster! Hooray! (In a python 2.7 environment)

RapClust --config config.yaml    

Step 3 - Construct superTranscriptome

Now we can use our trusty Lace to construct a superTranscript per cluster. But first we need to collate the cluster information from RapClust into the right format for Lace. I made a quick script to do that:

python CollateClusters.py   canu_rapclust/mag.clust 

The script contains:

import sys

if len(sys.argv) < 2:
    print("Require cluster output file from RapClust")
    sys.exit(1)

# mag.clust has one cluster per line (whitespace-separated contig names);
# Lace wants one "contig<TAB>clusterID" pair per line.
f = open(sys.argv[1], 'r')
fout = open('clusters.txt', 'w')
count = 1
for line in f:
    contigs = line.split()
    for contig in contigs:
        fout.write(contig + "\t" + "Cluster" + str(count) + "\n")
    count = count + 1
f.close()
fout.close()
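To sanity-check the collation, here is the same transformation as a standalone function applied to a toy input (the contig names are made up):

```python
def collate(cluster_lines):
    """Turn lines of whitespace-separated contig names (one cluster per
    line) into "contig<TAB>clusterID" rows, as Lace expects."""
    rows = []
    for count, line in enumerate(cluster_lines, start=1):
        for contig in line.split():
            rows.append(contig + "\tCluster" + str(count))
    return rows

print(collate(["tigA tigB", "tigC"]))
# → ['tigA\tCluster1', 'tigB\tCluster1', 'tigC\tCluster2']
```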


Now run Lace. (Lace requires Python 3, so you will have to switch environments again; this is easy with something like conda.)

mkdir Laced
python Lace.py MCF7.contigs.fasta clusters.txt -o Laced

Step 4 - Visualise a cluster [Optional]

Now say we want to visualise how the contigs build up the superTranscripts; we can use STViewer.py to take a peek at Cluster2:

cd Laced
python STViewer.py Cluster2.fasta