Automated assembly - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki
This tutorial assumes that you are aiming for the best possible bacterial genome assembly, and that means some human judgment and intervention is needed. This is particularly true for the Trycycler assembly step and Checking differences in IGV.
However, in many contexts, an automated assembly workflow is needed, and this page gives my recommendations. This automated workflow differs from the main tutorial by using Flye instead of Trycycler for long-read assembly and limiting short-read polishing to the tools with the lowest rates of introduced errors (Polypolish and POLCA).
While this assembly method can often work very well, you should not assume the results are error-free. In particular, structural assembly errors (e.g. fragmented replicons, doubled plasmids) are possible, as these are what Trycycler aims to avoid.
Requirements
- Read alignment: minimap2, BWA and Samtools
- Read QC: Filtlong and fastp
- Long-read assembly: Flye and the
rotate_circular_gfa.py
script (which you can find in the scripts directory of this repo) - Long-read polishing: Medaka
- Short-read polishing: Polypolish and POLCA
- Sequence file manipulation: seqtk
Set variables
The commands below use these Bash variables. Set them as appropriate for your data/genome/system:
i1=reads/S_aureus_JKD6159_Illumina_1.fastq.gz
i2=reads/S_aureus_JKD6159_Illumina_2.fastq.gz
ont=reads/S_aureus_JKD6159_ONT_R10.4_guppy_v6.1.7.fastq.gz
medaka_model=r104_e81_sup_g610
genome_size=3000000
threads=16
Read QC
mkdir reads_qc
fastp --in1 "$i1" --in2 "$i2" --out1 reads_qc/illumina_1.fastq.gz --out2 reads_qc/illumina_2.fastq.gz
mv fastp.* reads_qc
filtlong --target_bases "$genome_size"00 "$ont" > reads_qc/ont.fastq # aim for 100x depth
Flye assembly
flye --nano-hq reads_qc/ont.fastq --threads "$threads" --out-dir flye
rotate_circular_gfa.py flye/assembly_graph.gfa > flye.fasta
rm -r flye # clean up
The rotate_circular_gfa.py
script converts the Flye GFA file to FASTA format, rotating any circular contigs by half their length. This serves to move any missing/duplicated bases at the start/end of contigs into the middle of the sequence where polishing can fix the error.
Medaka polishing
medaka_consensus -i reads_qc/ont.fastq -d flye.fasta -o medaka -m "$medaka_model"
seqtk seq medaka/consensus.fasta > flye_medaka.fasta
rm -r medaka *.fai *.mmi # clean up
Polypolish polishing
bwa index flye_medaka.fasta
bwa mem -t "$threads" -a flye_medaka.fasta reads_qc/illumina_1.fastq.gz > alignments_1.sam
bwa mem -t "$threads" -a flye_medaka.fasta reads_qc/illumina_2.fastq.gz > alignments_2.sam
polypolish_insert_filter.py --in1 alignments_1.sam --in2 alignments_2.sam --out1 filtered_1.sam --out2 filtered_2.sam
polypolish flye_medaka.fasta filtered_1.sam filtered_2.sam > flye_medaka_polypolish.fasta
rm *.bwt *.pac *.ann *.amb *.sa *.sam # clean up
POLCA polishing
polca.sh -a flye_medaka_polypolish.fasta -r reads_qc/illumina_1.fastq.gz" "reads_qc/illumina_2.fastq.gz -t "$threads"
seqtk seq flye_medaka_polypolish.fasta.PolcaCorrected.fa | paste - - | sort | tr '\t' '\n' > flye_medaka_polypolish_polca.fasta # restore contig order
rm *PolcaCorrected.fa *.amb *.ann *.bai *.bam *.batches *.bwt *.fai *.names *.pac *.report *.sa *.sam *.success *.vcf *.err # clean up
The resulting flye_medaka_polypolish_polca.fasta
file contains the final assembly.