Arrow - NBISweden/workshop-genome_assembly GitHub Wiki
Arrow: Polishing assemblies
Notes:
- Arrow needs the pulse field data in order to perform polishing.
- Sequel data: Pulse field data is in the BAM files.
- RSII data: Pulse field data is in the h5 files, and the data needs to be converted to unaligned BAM format. `bax2bam` creates an unaligned BAM, i.e. it contains sequence and metadata, but no alignment data. An alignment must be made using `blasr` or `pbalign`, which accept BAM files as input.
- When making an alignment, `blasr` places gaps inconsistently according to its documentation (https://www.pacb.com/wp-content/uploads/SMRT_Tools_Reference_Guide_v600.pdf). However, PacBio recommends in the same document placing gaps consistently to improve consensus calling (polishing) and variant calling. This parameter doesn't appear to be used in any preconfigured PacBio pipeline (for `smrtpipe`). One can use the flag `--placeGapConsistently` to consistently place gaps.
- Word of mouth suggests that 100X coverage for PacBio is sufficient for polishing, and more may even be detrimental. `blasr` has a parameter `--subsample` that can be used to randomly subsample and align reads. The `--subsample` option can be passed to `pbalign` using the `--algorithmOptions` parameter. Alternatively, `samtools view` has an option `-s` to subsample as well.
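The subsampling note above can be sketched as follows. This is a minimal illustration, not a tested pipeline step: `coverage_fraction` and `blasr_options` are hypothetical helper functions, the read and assembly sizes are made-up example numbers, and the 100X target is the word-of-mouth figure from the notes. The sketch only builds the option string; it does not run `pbalign` itself.

```bash
#!/usr/bin/env bash
# Sketch only: coverage_fraction and blasr_options are hypothetical helpers.

# Print the fraction of reads to keep so that coverage drops to the target.
# $1 = total bases in the reads, $2 = assembly size in bases,
# $3 = target coverage (defaults to 100).
coverage_fraction() {
    awk -v bases="$1" -v size="$2" -v target="${3:-100}" 'BEGIN {
        coverage = bases / size
        fraction = (coverage > target) ? target / coverage : 1
        printf "%.3f\n", fraction
    }'
}

# Build the option string handed to blasr through pbalign --algorithmOptions.
# --subsample is only added when reads actually need to be dropped.
blasr_options() {
    local fraction="$1"
    local opts="--placeGapConsistently"
    if [ "$fraction" != "1.000" ]; then
        opts="$opts --subsample $fraction"
    fi
    printf '%s\n' "$opts"
}

# Assumed example numbers: 1.5 Gbp of reads over a 5 Mbp assembly is 300X.
FRACTION="$(coverage_fraction 1500000000 5000000 100)"
ALGO_OPTS="$(blasr_options "$FRACTION")"
echo "$ALGO_OPTS"   # prints: --placeGapConsistently --subsample 0.333
```

The resulting string would then be passed as `pbalign --algorithmOptions="$ALGO_OPTS" ...`. For the `samtools view -s` route, the integer part of the argument seeds the random number generator and the part after the decimal point is the fraction kept, so `-s 42.333` would keep roughly a third of the reads with seed 42.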
Polishing using RSII data:
#! /usr/bin/env bash
module load bioinfo-tools SMRT/5.0.1 samtools/1.9
CPUS="${SLURM_NPROCS:-16}"
PROJ='/proj/uppstoreXXXX'
PACBIO_DATA_DIR="$PROJ/NGI_deliveryXXXXXX/pb_XXX/rawdata/pb_XXX_XXX"
ASSEMBLY="assembly_circularized.fasta"
PREFIX="${ASSEMBLY%.fasta}"
ALIGNMENT="${PREFIX}_alignment.bam"
# Index Assembly
samtools faidx "$ASSEMBLY"
# Old bax.h5 format needs to be converted to the current unaligned BAM format.
# This writes ${PREFIX}.subreads.bam (used below) and ${PREFIX}.scraps.bam.
bax2bam -o "${PREFIX}" $(find "$PACBIO_DATA_DIR" -name "*.bax.h5")
# Reads are aligned to assembly - default algorithm is blasr
pbalign --algorithmOptions='--placeGapConsistently' --nproc "$CPUS" --tmpDir "$SNIC_TMP" --unaligned "${PREFIX}.unaligned.bam" "${PREFIX}.subreads.bam" "$ASSEMBLY" "$ALIGNMENT"
# Assembly is polished using Arrow.
arrow --numWorkers "$CPUS" "$ALIGNMENT" -r "$ASSEMBLY" -o "${PREFIX}_polished.fasta" -o "${PREFIX}_polished.fastq" -o "${PREFIX}_variants.gff" -o "${PREFIX}_variants.vcf"
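Since Sequel data already carries the pulse field data in its BAM files, the same steps without the `bax2bam` conversion might look like the sketch below. This is an assumption-laden illustration rather than a tested script: `polish_sequel` is a hypothetical wrapper, the file names in the usage line are placeholders, a single subreads BAM is assumed, and the `--tmpDir` and `--unaligned` options from the RSII script are left out for brevity.

```bash
#!/usr/bin/env bash
# Sketch: Sequel variant of the polishing steps (no bax2bam conversion).
# polish_sequel is a hypothetical wrapper, not part of SMRT Tools.
# $1 = Sequel subreads BAM, $2 = assembly FASTA, $3 = CPUs (defaults to 16).
polish_sequel() {
    local subreads="$1" assembly="$2" cpus="${3:-16}"
    local prefix="${assembly%.fasta}"
    local alignment="${prefix}_alignment.bam"

    # Index the assembly.
    samtools faidx "$assembly" || return 1

    # Align the subreads to the assembly with consistent gap placement.
    pbalign --algorithmOptions='--placeGapConsistently' --nproc "$cpus" \
        "$subreads" "$assembly" "$alignment" || return 1

    # Polish the assembly using Arrow.
    arrow --numWorkers "$cpus" "$alignment" -r "$assembly" \
        -o "${prefix}_polished.fasta" -o "${prefix}_variants.gff"
}

# Usage (placeholder file names; load the SMRT and samtools modules first):
# polish_sequel movie.subreads.bam assembly_circularized.fasta 16
```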