MinION: error correcting 2D reads with Canu - aechchiki/SIB_LongReadsWorkshop_Zurich17 GitHub Wiki
Introduction
The Iso-Seq pipeline cannot be used as is on MinION raw reads, for two main reasons. First, the raw files are stored in a different HDF raw data format (as you previously saw in the read extraction section). Second, the error profile of the consensus reads is quite different. The error rate is also dropping quickly as the chemistry and the basecalling models improve. At this pace, correction may soon no longer be a necessary step, at least not for RNA-seq reads.
We will thus present here a method for correcting MinION reads using the first step of Canu, a tool that is mainly used for genome assembly (as you previously saw in the genome assembly session). You can use this approach to correct DNA reads too, with some parameter tuning.
If you are interested in the details of error correction, we suggest taking a look at this report, which lists and evaluates the most common error correction software. The authors cover both hybrid and non-hybrid correctors. Hybrid methods use complementary short-read data, while non-hybrid approaches self-correct reads by exploiting the overlaps between reads in high-coverage data.
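To make the idea of self-correction more concrete, here is a toy sketch of a column-wise majority vote over reads that are assumed to be already aligned and of comparable length. It is only an illustration of why high coverage helps: it is not how canu works internally, and the function name `majority_consensus` is made up for this example.

```shell
# Toy majority-vote consensus over pre-aligned reads, one read per line on stdin.
# Illustration only: real correctors compute overlaps and alignments first.
majority_consensus() {
    awk '
        {
            # remember the longest read and count each base per column
            if (length($0) > width) width = length($0)
            for (i = 1; i <= length($0); i++) count[i, substr($0, i, 1)]++
        }
        END {
            split("A C G T", bases, " ")
            for (i = 1; i <= width; i++) {
                best = "N"; max = 0
                # pick the most frequent base; ties go to the first in A, C, G, T order
                for (b = 1; b <= 4; b++)
                    if (count[i, bases[b]] > max) { max = count[i, bases[b]]; best = bases[b] }
                out = out best
            }
            print out
        }'
}
```

For example, piping four overlapping copies of the same region into `majority_consensus` returns the base supported by most reads at each position, so an error present in a single read is voted out.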
Non-hybrid correction
To get started with read correction, we can use the data you previously extracted. If you had any trouble with the extraction, do not worry, we provide a backup:
cd $minion_rna
wget https://drive.switch.ch/index.php/s/aNgUMK2k74XeFHs/download -O poretools_out.fastq.gz
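Before moving on, it can be useful to check that the FASTQ file is intact and to see how many reads it holds. The helper below is a minimal sketch that assumes the standard four-line FASTQ record layout; the function name `count_fastq_reads` is our own and not part of any tool.

```shell
# Count reads in a gzipped FASTQ file, assuming 4 lines per record.
# A non-integer result hints at a truncated or malformed file.
count_fastq_reads() {
    gzip -cd "$1" | awk 'END { print NR / 4 }'
}
```

You would call it as `count_fastq_reads poretools_out.fastq.gz`.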
We will achieve error correction in three steps: (1) selecting the best read overlaps to use for correction, (2) estimating the corrected read lengths, and (3) generating the corrected reads. All of this is done when canu is run with the -correct parameter.
canu -correct -p correct -d . genomeSize=61000000 -nanopore-raw <poretools_out>.fastq.gz useGrid=false stopOnReadQuality=false
# -correct: compute only the read correction, no trimming or assembly
# -p: prefix for the canu output files
# -d: directory for the canu output
# genomeSize: estimated assembly size; in our case, the transcriptome size
# -nanopore-raw: error profile tuning for MinION raw reads
# useGrid=false: disable the default LSF job submission
# stopOnReadQuality=false: keep the gatekeeper module from halting when too many of the input reads have been discarded for the correction; the canu log suggests this parameter to avoid a crash, though it might not be essential for the subset
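If you plan to rerun the correction on different inputs, the command can be wrapped in a small function that checks its prerequisites first. This is only a sketch: the function name `run_canu_correct` and its return codes are our own choices, and the canu options are exactly the ones used in the command from this section.

```shell
# Run the canu correction step on a given reads file, with basic pre-flight checks.
# $1: gzipped FASTQ with raw reads; $2: genomeSize (defaults to the value used in this section).
run_canu_correct() {
    canu_reads="$1"
    canu_size="${2:-61000000}"
    if [ ! -s "$canu_reads" ]; then
        echo "input reads not found or empty: $canu_reads" >&2
        return 1
    fi
    if ! command -v canu >/dev/null 2>&1; then
        echo "canu not found on PATH" >&2
        return 2
    fi
    canu -correct -p correct -d . genomeSize="$canu_size" \
        -nanopore-raw "$canu_reads" useGrid=false stopOnReadQuality=false
}
```

For the backup data this would be `run_canu_correct poretools_out.fastq.gz`. Failing early on a missing input avoids waiting for canu to set up its working directory before reporting the problem.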
In case you had any trouble with the error correction step, don't panic - we provide a backup:
wget https://drive.switch.ch/index.php/s/15CrVZKlo2ZOHaN/download -O corrected.fasta.gz
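To get a first impression of the corrected reads, you can summarize the FASTA file with a few lines of awk. The helper below is a minimal sketch; the name `fasta_stats` and the output format are our own, and it works on any gzipped FASTA, including sequences wrapped over several lines.

```shell
# Print sequence count, total bases, and mean length for a gzipped FASTA file.
fasta_stats() {
    gzip -cd "$1" | awk '
        /^>/ { n++; next }          # header line: one more sequence
        { bases += length($0) }     # sequence line: add its length
        END {
            mean = (n > 0) ? bases / n : 0
            printf "sequences=%d bases=%d mean_length=%.1f\n", n, bases, mean
        }'
}
```

Running `fasta_stats corrected.fasta.gz` (or the same on whichever corrected-reads file canu wrote to your output directory) lets you compare the number and length of corrected reads against the raw input.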
Next
Go to Checkpoint.
Go back to Table of contents.