GenomePolish - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki
Section: Genome Assembly [4/5].
In a real application (eukaryotic genome), we still had a rather high error rate (of > 10% with Miniasm) which we needed to correct for (this rate was much lower with Canu <1% - but it still benefits from polishing). In the real world, polishing (= consensus) might be the (only?) solution for enormous plant genomes, or project where a huge amount of genomes are sequenced.
The actual sequence of genome assembly is not perfect, the calculation of consensus is limited by the quality scores of each base, however the signal is more complex than that. Both PacBio and MinION have tools that exploit detailed information stored in raw sequencing data.
The MinION consensus program is called Nanopolish and the consensus programs for PacBio are called quiver
(for RSII) and arrow
(for Sequel), delivered together in a package called GenomicConsensus. Note that polishing is suited only for haploid sequences, here is a quote from GenomicConsensus FAQ : "At present, quiver/arrow assume a haploid sample, and the behavior of on sample mixtures or diploid/polyploid samples is undefined."
If ever you'll need to proceed to polishing yourself, we suggest you to take a look here, where we explored (in the context of a 2-day tutorial on genome assembly alone), three of the main polishing software:
- non-specific: Racon
- PacBio-specific: Quiver
- Nanopore-specific: Nanopolish
You can try it out with our "subset" data, but for us it will not make a huge difference.
Racon is a fast, standalone module (and platform-agnostic!) for computing consensus step of genome assembly for assemblers that don't do that (e.g. Miniasm!). Assuming no compilation issues (pretty optimistic huh!), getting a polished sequence with Racon is pretty straightforward.
You can install Racon from Github or bioconda, as did before for some PacBio tools.
(if didn't do it before) You need to:
-
install
conda
(3.7):- get the installer:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
-
follow instructions:
- launch it!
bash Miniconda3-latest-Linux-x86_64.sh -f
- press "enter" until end of License terms
- enter "yes" (to accept the License terms)
- enter a non-existing location for installation (you can keep the suggested one - for example:
/home/training/miniconda3/
) - enter "yes" (to prepend the install location to PATH in the .bashrc)
- open a new terminal -> tadah!
- check with:
which conda
- this should point to your installation location (following the example:
/home/training/miniconda3/bin
) - if not, prepend the location to EVERY conda command here below! (e.g.
conda
->/home/training/miniconda3/bin/conda
)
- this should point to your installation location (following the example:
- launch it!
-
setup the necessary channels:
conda config --add channels defaults
conda config --add channels bioconda
conda config --add channels conda-forge
! NEW STUFF
- install your racon:
-
conda install racon
- enter "yes" (when asked to proceed to installation)
- check with:
which racon
- this should point to your installation location (following the example:
/home/training/miniconda3/bin
) - if not, prepend the location to EVERY conda command here below! (e.g.
racon
->/home/training/miniconda3/bin/racon
)
- this should point to your installation location (following the example:
-
Test your installation with:
racon -h # or: /home/training/miniconda3/bin/racon -h
You'll need three files to run Racon:
- the reads
- the assembly (e.g. generated with whatever software, e.g. Miniasm or Canu)
- the aligned reads to the assembly (e.g. generated with Minimap2)
Here are the backups files if anything went wrong:
# miniasm assembly backup
wget https://drive.switch.ch/index.php/s/TiTZjzUgMZzRNYo/download -O miniasm.tar.gz
# dna reads
wget https://drive.switch.ch/index.php/s/3yei57kRvBzOOPK/download -O dmel_ch4_reads.fastq.gz
With this backup (or the files you generate!), build first the alignment file:
(rem: minimap2
is installed already! minimap2 -h
if any help is required)
minimap2 -ax map-pb <contigs.fa> <reads.fastq.gz> > <mapped_reads.sam>
Assuming everything went smoothly:
racon <reads.fastq.gz> <mapped_reads.sam> <contigs.fa> > <polished_racon.fasta>
Yes, as easy as this!
In case anything went wrong, here we provide polishing backup:
# racon, pb RNA reads
wget https://drive.switch.ch/index.php/s/uo0c5HQLaMtV32W/download -O racon.fasta.gz
Racon does not provide (yet) a gfa output, that you could visualize with bandage (and visually compare to the draft assembly). However, we can do that with a little awk magic (and the seqtk toolkit).
# get & compile seqtk
git clone https://github.com/lh3/seqtk.git
cd seqtk; make
# convert fasta (from racon) to gfa, using the draft assembly (gfa) as guide
seqtk seq <racon.fasta> | gawk -vOFS='\t' 'ARGIND == 1 { id = substr($1, 2); getline; x[id] = $1; next } $1 == "S" && x[$2] { $3 = x[$2] } 1' - <contigs.gfa> > <racon.gfa>
In case anything went wrong, or didn't have time to convert, here is your gfa file from racon:
wget https://drive.switch.ch/index.php/s/eDDymAa4ce4atSw/download -O racon.gfa.gz
That you can unzip and import to bandage.
Miniasm output:
Racon post-Miniasm:
For Miniasm:
Racon post-Miniasm:
Unicycler is a nice pipeline combining the steps from assembly generation to polishing.