GenomePolish - aechchiki/SIB_LongReadsWorkshop_Zurich18 GitHub Wiki

Polishing

Section: Genome Assembly [4/5].

Why need polishing?

In a real application (eukaryotic genome), we still had a rather high error rate (of > 10% with Miniasm) which we needed to correct for (this rate was much lower with Canu <1% - but it still benefits from polishing). In the real world, polishing (= consensus) might be the (only?) solution for enormous plant genomes, or project where a huge amount of genomes are sequenced.

The actual sequence of genome assembly is not perfect, the calculation of consensus is limited by the quality scores of each base, however the signal is more complex than that. Both PacBio and MinION have tools that exploit detailed information stored in raw sequencing data.

The MinION consensus program is called Nanopolish and the consensus programs for PacBio are called quiver (for RSII) and arrow (for Sequel), delivered together in a package called GenomicConsensus. Note that polishing is suited only for haploid sequences, here is a quote from GenomicConsensus FAQ : "At present, quiver/arrow assume a haploid sample, and the behavior of on sample mixtures or diploid/polyploid samples is undefined."

If ever you'll need to proceed to polishing yourself, we suggest you to take a look here, where we explored (in the context of a 2-day tutorial on genome assembly alone), three of the main polishing software:

  • non-specific: Racon
  • PacBio-specific: Quiver
  • Nanopore-specific: Nanopolish

You can try it out with our "subset" data, but for us it will not make a huge difference.

Consensus: Racon

Racon is a fast, standalone module (and platform-agnostic!) for computing consensus step of genome assembly for assemblers that don't do that (e.g. Miniasm!). Assuming no compilation issues (pretty optimistic huh!), getting a polished sequence with Racon is pretty straightforward.

Bioconda recipe: racon

You can install Racon from Github or bioconda, as did before for some PacBio tools.

(if didn't do it before) You need to:

  • install conda (3.7):

    • get the installer:
    • wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
  • follow instructions:

    • launch it! bash Miniconda3-latest-Linux-x86_64.sh -f
    • press "enter" until end of License terms
    • enter "yes" (to accept the License terms)
    • enter a non-existing location for installation (you can keep the suggested one - for example: /home/training/miniconda3/)
    • enter "yes" (to prepend the install location to PATH in the .bashrc)
    • open a new terminal -> tadah!
    • check with: which conda
      • this should point to your installation location (following the example: /home/training/miniconda3/bin)
      • if not, prepend the location to EVERY conda command here below! (e.g. conda -> /home/training/miniconda3/bin/conda)
  • setup the necessary channels:

    • conda config --add channels defaults
    • conda config --add channels bioconda
    • conda config --add channels conda-forge

! NEW STUFF

  • install your racon:
    • conda install racon
      • enter "yes" (when asked to proceed to installation)
    • check with: which racon
      • this should point to your installation location (following the example: /home/training/miniconda3/bin)
      • if not, prepend the location to EVERY conda command here below! (e.g. racon -> /home/training/miniconda3/bin/racon)

Test your installation with:

racon -h      # or: /home/training/miniconda3/bin/racon -h

Data

You'll need three files to run Racon:

  1. the reads
  2. the assembly (e.g. generated with whatever software, e.g. Miniasm or Canu)
  3. the aligned reads to the assembly (e.g. generated with Minimap2)

Here are the backups files if anything went wrong:

# miniasm assembly backup
wget https://drive.switch.ch/index.php/s/TiTZjzUgMZzRNYo/download -O miniasm.tar.gz
# dna reads
wget https://drive.switch.ch/index.php/s/3yei57kRvBzOOPK/download -O dmel_ch4_reads.fastq.gz

With this backup (or the files you generate!), build first the alignment file:

(rem: minimap2 is installed already! minimap2 -h if any help is required)

minimap2 -ax map-pb <contigs.fa> <reads.fastq.gz> > <mapped_reads.sam>

Polish

Assuming everything went smoothly:

racon <reads.fastq.gz> <mapped_reads.sam> <contigs.fa> > <polished_racon.fasta>

Yes, as easy as this!

Backup

In case anything went wrong, here we provide polishing backup:

# racon, pb RNA reads
wget https://drive.switch.ch/index.php/s/uo0c5HQLaMtV32W/download -O racon.fasta.gz

But where is my assembly graph?

Racon does not provide (yet) a gfa output, that you could visualize with bandage (and visually compare to the draft assembly). However, we can do that with a little awk magic (and the seqtk toolkit).

# get & compile seqtk
git clone https://github.com/lh3/seqtk.git
cd seqtk; make

# convert fasta (from racon) to gfa, using the draft assembly (gfa) as guide
seqtk seq <racon.fasta> | gawk -vOFS='\t' 'ARGIND == 1 { id = substr($1, 2); getline; x[id] = $1; next } $1 == "S" && x[$2] { $3 = x[$2] } 1' - <contigs.gfa> > <racon.gfa>

Backup

In case anything went wrong, or didn't have time to convert, here is your gfa file from racon:

wget https://drive.switch.ch/index.php/s/eDDymAa4ce4atSw/download -O racon.gfa.gz

That you can unzip and import to bandage.

Did it improve the assembly?

Graphs

Miniasm output:

Visualization example

Racon post-Miniasm:

Visualization example

Stats

For Miniasm:

Visualization example

Racon post-Miniasm:

Visualization example

For bacterial genomes, FYI

Unicycler is a nice pipeline combining the steps from assembly generation to polishing.

Next

Back

⚠️ **GitHub.com Fallback** ⚠️