TranscriptomeAsmONT
Section: Long-reads Transcriptome [3/4].
The Iso-Seq pipeline cannot be used as-is for MinION raw reads, for two main reasons: first, the raw files are stored in a different HDF data format (as you saw in the read extraction section); second, the error profile of the consensus reads is quite different. The error rate is also dropping very fast as the chemistry and/or the basecalling models improve, as you could already see by comparing the 2D and 1D2 datasets.
At this pace, error correction will soon not even be a necessary step!
At this stage, we can't do better than to treat the content of the consensus file as our "transcriptome". This is still the best (and only) way to get a transcriptome-wide long-read dataset so far!
We just have to discard the non-full-2D reads (i.e., the "complement" and "template" reads). You can do this with simple bash tools.
# in case you need a backup to start from
wget https://drive.switch.ch/index.php/s/aNgUMK2k74XeFHs/download -O poretools_out.fastq.gz
gunzip poretools_out.fastq.gz # don't forget to unzip!
# simplification: convert fastq to fasta
awk '{if(NR%4==1) {printf(">%s\n",substr($0,2));} else if(NR%4==2) print;}' <file.fastq> > <file.fasta>
# select fasta records (id + sequence) matching the 2D pattern
awk '/_Basecall_2D_2d/{nr[NR]; nr[NR+1]}; NR in nr' <file.fasta> > <only_2D.fasta>
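To sanity-check the filtering, you can count the records before and after; a minimal sketch, using the placeholder file names from above:
# count fasta records before and after 2D selection (placeholder names from above)
grep -c '^>' <file.fasta>
grep -c '^>' <only_2D.fasta>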
For the new 1D2 technology, your file is already prepared. Remember: you downloaded two files, one ending in "_ncorr" (which contains the full run) and another one (which contains the "consensus" reads only).
# in case you need a backup:
wget https://drive.switch.ch/index.php/s/af5fz2MwzeGpHBF/download -O ONT_RNA_1D2.fastq.gz
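If you are unsure which of the two files is which, a read count usually tells them apart, since the consensus file contains far fewer reads than the full run. A quick sketch, assuming a gzipped fastq (4 lines per record); the file name is the one from the backup download above:
# count reads in a gzipped fastq (4 lines per record)
echo $(( $(zcat ONT_RNA_1D2.fastq.gz | wc -l) / 4 ))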
You can now proceed with only the very best of your long-read transcriptome!
NOTE: this is NOT advised if your dataset contains genes with multiple transcript isoforms. You can still test it on your subset (it won't cause any harm, it will just possibly collapse close isoforms).
Here we present a method to correct the MinION reads using the first step of Canu, a software mainly used for genome assembly (as you saw in the Canu genome assembly session). You can use this approach to correct DNA reads too, with some parameter tuning.
If you are interested in error correction, we suggest you take a look at this report, which lists and evaluates the most common error correction software. The authors present both hybrid and non-hybrid correctors: hybrid methods use complementary short-read data, while non-hybrid approaches self-correct reads by exploiting overlaps within high-coverage data.
The data are the same as before.
We will achieve error correction in three steps: (1) selecting the best read overlaps to use for correction, (2) estimating corrected read lengths, and (3) generating the corrected reads. All of this happens when canu is run with the -correct option.
canu -correct -p <correct> -d . genomeSize=61000000 -nanopore-raw <file.fasta/q> useGrid=false stopOnReadQuality=false
# -correct: compute only read correction, no trimming or assembly
# -p: prefix for the canu output files
# -d: directory for the canu output
# genomeSize: estimated assembly size; in our case, the transcriptome size
# -nanopore-raw: error profile tuning for MinION raw reads
# useGrid=false: disable the default grid (LSF) submission
# stopOnReadQuality=false: prevent the gatekeeper module from halting when too many input reads are discarded for correction - a parameter suggested in the log to avoid a program crash; it might not be essential for the subset
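Once canu finishes, the corrected reads should appear in the output directory as <prefix>.correctedReads.fasta.gz (here, correct.correctedReads.fasta.gz, assuming canu's default output naming). A quick way to inspect them:
# count the corrected reads (file name assumes canu's default naming and the prefix used above)
zcat correct.correctedReads.fasta.gz | grep -c '^>'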
In case correction went wrong, here is a backup:
wget https://drive.switch.ch/index.php/s/15CrVZKlo2ZOHaN/download -O corrected.fasta.gz
How does the content of this directory compare with the Canu directory that we had for genome assembly? Why are they different?
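To explore this, you can list both output directories side by side; a minimal sketch, where <canu_correct_dir> and <canu_assembly_dir> are placeholders for the paths of your correction and assembly runs:
# compare the top-level contents of the two canu output directories (placeholder paths)
diff <(ls <canu_correct_dir>) <(ls <canu_assembly_dir>)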