TranscriptomeAsmONT - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

ONT Transcriptome

Section: Long-reads Transcriptome [3/4].

The Iso-Seq pipeline cannot be used as it is for MinION raw reads, mainly because of two reasons. First, the raw files are organized in different HDF raw data format (as you previously saw in the read extraction section), and second because the error profile of the consensus reads is quite different. The error rate is also evolving very fast (lowering) with the improvement of the chemistry and/or the model for basecalling. You could see that already comparing the 2D and the 1D2 datasets.

Keeping it at this pace, error correction will soon this will not even be a necessary step!

Actually... you have the isoforms already!

Just need some subsetting for the 2D dataset

At this stage, we can't do better than consider as "transcriptome" the content of the consensus file. This is still the best (and unique) way to get the best transcriptome-wide long-read dataset so far!

We will just have to discard the non-full 2D/reads (e.g. those "complement/template" reads). You can do this simply using bash tools.

# in case you need backup to start from
wget https://drive.switch.ch/index.php/s/aNgUMK2k74XeFHs/download -O poretools_out.fastq.gz    # don't forget to unzip!

# simplification: convert to fasta
cat <file.fastq> | awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' > <file.fasta>

# select fasta (id/reads) by pattern
cat <file.fasta> | awk '/_Basecall_2D_2d/{nr[NR]; nr[NR+1]}; NR in nr' > <only_2D.fasta>

You have them already in a separate file for 1D2

For the new 1D2 technology, you have already your file ready. Remember! you downloaded 2 files: "_ncorr" (which contains the full run) and another one (which contains the "consensus" only).

# in case you need backup:
wget https://drive.switch.ch/index.php/s/af5fz2MwzeGpHBF/download -O ONT_RNA_1D2.fastq.gz

There you go!

With the very best of your long-reads transcriptome only!

Non-hybrid correction

NOTE: this is NOT advised if your dataset contains data from genes with multiple transcript isoforms. You can anyways test it on your subset (won't cause any harm, just collpase eventually close isoforms).

We will present here a method allowing to correct the MinION reads using the first step of Canu, a software that is mainly used for genome assembly (as you previously saw in the genome assembly-Canu session). You can use this approach to correct DNA reads too, with some parameter tuning.

If you are interested in aspects of error correction, we suggest you to take a look on this report, listing and evaluating the most common error correction software. Here, the authors presented both hybrid and non-hybrid correctors. Hybrid methods use complementary short-reads data, while non-hybrid approaches self-correct reads by exploiting overlap of high-coverage data.

Data are the same as here before.

We will achieve error correction through three steps: (1) selecting the best reads overlap to use for correction, (2) estimating corrected read length, and (3) generating corrected reads. All this is provided when canu is run with the -correct parameter.

canu -correct -p <correct> -d . genomeSize=61000000 -nanopore-raw <file.fasta/q> useGrid=false stopOnReadQuality=false

# -correct: to compute only read correction, no trimming or assembly
# -p: prefix of canu output
# -d: directory to canu output
# genomeSize: estimated assembly size, in our case, transcriptome size
# -nanopore-raw: error profile tuning for minion raw reads
# useGrid=False: disable default LSF
# stopOnReadQuality=false: avoid gatekeeper module to halt as too many of the input reads have been
discarded for the correction - suggested as parameter to avoid program crash as suggested in log, might not be essential for the subset

Backup

In case correction went wrong, here is a backup:

wget https://drive.switch.ch/index.php/s/15CrVZKlo2ZOHaN/download -O corrected.fasta.gz

How does the content of this directory compare with the Canu directory that we had for genome assembly? Why are they different?

Next

Back

⚠️ **GitHub.com Fallback** ⚠️