Arrow - NBISweden/workshop-genome_assembly GitHub Wiki

Arrow: Polishing assemblies

Notes:

  • Arrow needs the pulse field data in order to perform polishing.
    • Sequel data: Pulse field data is in the BAM files.
    • RSII data: Pulse field data is in the h5 files, which need to be converted to unaligned BAM format.
  • bax2bam creates an unaligned BAM, i.e. one that contains sequence and metadata but no alignment data. The reads must then be aligned with blasr or pbalign, both of which accept BAM files as input.
  • According to the SMRT Tools Reference Guide (https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v600.pdf), blasr does not place gaps consistently by default when making an alignment. The same document recommends consistent gap placement to improve consensus calling (polishing) and variant calling, yet the option doesn't appear to be used in any preconfigured PacBio pipeline (for smrtpipe). Use the flag --placeGapConsistently to place gaps consistently.
  • Word of mouth suggests that 100X coverage for PacBio is sufficient for polishing, and more may even be detrimental. blasr has a parameter --subsample that can be used to randomly subsample and align reads. The --subsample option can be passed to pbalign using the --algorithmOptions parameter. Alternatively, samtools view has an option -s to subsample as well.
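
As a rough illustration of the subsampling note above, the sketch below works out what fraction of reads to keep to reach a target coverage and passes it to samtools view -s. The helper names subsample_fraction and subsample_bam, the seed 42, and the example read totals are assumptions made for illustration; they are not part of any PacBio pipeline.

```shell
#!/usr/bin/env bash

# Hypothetical helper: fraction of reads to keep to reach a target coverage.
# Arguments: total sequenced bases, genome size in bases, target coverage.
subsample_fraction() {
    LC_ALL=C awk -v total="$1" -v genome="$2" -v target="$3" 'BEGIN {
        cov = total / genome
        frac = (cov > target) ? target / cov : 1
        printf "%.3f\n", frac
    }'
}

# Hypothetical helper: subsample a BAM with samtools view -s.
# The -s argument is SEED.FRACTION: the integer part seeds the sampler,
# the part after the decimal point is the fraction of reads kept.
subsample_bam() {
    local in_bam="$1" out_bam="$2" fraction="$3" seed="${4:-42}"
    case "$fraction" in
        1*) cp "$in_bam" "$out_bam" ;;   # coverage already at or below target
        *)  samtools view -b -s "${seed}${fraction#0}" -o "$out_bam" "$in_bam" ;;
    esac
}

# Example: 1.5 Gbp of reads over a 5 Mbp genome is 300X, so keep about a third.
FRACTION="$(subsample_fraction 1500000000 5000000 100)"
echo "$FRACTION"   # prints 0.333
```

A call such as subsample_bam reads.subreads.bam reads.sub.bam "$FRACTION" would then write the reduced BAM; the file names here are placeholders.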

Polishing using RSII data:

#! /usr/bin/env bash

module load bioinfo-tools SMRT/5.0.1 samtools/1.9
CPUS="${SLURM_NPROCS:-16}"

PROJ='/proj/uppstoreXXXX'
PACBIO_DATA_DIR="$PROJ/NGI_deliveryXXXXXX/pb_XXX/rawdata/pb_XXX_XXX"
ASSEMBLY="assembly_circularized.fasta"
PREFIX="${ASSEMBLY%.fasta}"
ALIGNMENT="${PREFIX}_alignment.bam"

# Index Assembly
samtools faidx "$ASSEMBLY"

# The old bax.h5 format needs to be converted to the current unaligned BAM format.
# bax2bam writes ${PREFIX}.subreads.bam (used below) alongside a scraps BAM.
bax2bam -o "${PREFIX}" $(find "$PACBIO_DATA_DIR" -name "*.bax.h5")

# Reads are aligned to assembly - default algorithm is blasr
pbalign --algorithmOptions='--placeGapConsistently' \
    --nproc "$CPUS" \
    --tmpDir "$SNIC_TMP" \
    --unaligned "${PREFIX}.unaligned.bam" \
    "${PREFIX}.subreads.bam" "$ASSEMBLY" "$ALIGNMENT"

# Assembly is polished using Arrow.
arrow --numWorkers "$CPUS" "$ALIGNMENT" -r "$ASSEMBLY" \
    -o "${PREFIX}_polished.fasta" \
    -o "${PREFIX}_polished.fastq" \
    -o "${PREFIX}_variants.gff" \
    -o "${PREFIX}_variants.vcf"
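
For Sequel data the notes above say the pulse field data is already in the BAM files, so the bax2bam step can presumably be dropped. The following is a minimal sketch under that assumption: the SUBREADS path, the run and polish_sequel helpers, and the DRY_RUN switch are hypothetical names introduced here, and the commands are only printed by default.

```shell
#!/usr/bin/env bash

CPUS="${SLURM_NPROCS:-16}"
SUBREADS="${SUBREADS:-movie.subreads.bam}"          # placeholder Sequel subreads BAM
ASSEMBLY="${ASSEMBLY:-assembly_circularized.fasta}"
PREFIX="${ASSEMBLY%.fasta}"
ALIGNMENT="${PREFIX}_alignment.bam"

# With DRY_RUN=1 (the default in this sketch) commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

polish_sequel() {
    # Index the assembly.
    run samtools faidx "$ASSEMBLY"
    # Align the Sequel subreads directly; no h5 conversion step.
    run pbalign --algorithmOptions='--placeGapConsistently' --nproc "$CPUS" \
        "$SUBREADS" "$ASSEMBLY" "$ALIGNMENT"
    # Polish the assembly with Arrow.
    run arrow --numWorkers "$CPUS" "$ALIGNMENT" -r "$ASSEMBLY" \
        -o "${PREFIX}_polished.fasta" -o "${PREFIX}_variants.gff"
}

polish_sequel
```

Setting DRY_RUN=0 executes the commands once the placeholder paths point at real data and the SMRT and samtools modules are loaded.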