Polishing with Clair3 - rrwick/Perfect-bacterial-genome-tutorial GitHub Wiki

This tutorial uses Medaka to do long-read polishing, but there are other options, most notably Clair3. I've encountered cases where Clair3 does better than Medaka, so it's worth trying both if you want to be thorough. You can then use ALE to assess the results and use the best one in the subsequent short-read polishing step.

However, Clair3 is a variant caller, not a polisher. To use it as a polisher, you first run Clair3 against the assembly to produce a VCF, then apply the changes in that VCF to the assembly.

Like Medaka, Clair3 has different models trained on different types of ONT reads, so see the Clair3 documentation to find the best model for your data. The model named in the variables below is only an example.

The commands below use these Bash variables. Set them as appropriate for your data/genome/system:

in=trycycler.fasta
out=trycycler_clair3.fasta
ont=../reads_qc/ont.fastq
threads=32
model=r104_e81_sup_g5015
model_url=https://nanoporetech.box.com/shared/static/q1j9htz8eynxcuwwcw860woqp8nhxsic.tgz
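
Before starting, it can save time to confirm that the tools are installed and the input files exist. The following is a minimal sketch, not part of the original workflow: `preflight` is a hypothetical helper name, and the tool list is simply the set of commands used further down this page.

```shell
# Hypothetical helper: report any missing tools or input files before starting.
# Usage: preflight "tool1 tool2 ..." file1 [file2 ...]
preflight() (
    tools=$1
    shift
    status=0
    # Check that each tool can be found on the PATH.
    for tool in $tools; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing tool: $tool" >&2
            status=1
        fi
    done
    # Check that each input file exists and is not empty.
    for file in "$@"; do
        if [ ! -s "$file" ]; then
            echo "missing or empty file: $file" >&2
            status=1
        fi
    done
    exit "$status"
)

# Example call with the variables defined above:
# preflight "wget minimap2 samtools run_clair3.sh bcftools seqtk" "$in" "$ont"
```

The function returns a non-zero status if anything is missing, so it can be chained with `&&` in front of the later commands.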

Download the model:

wget -O model.tar.gz "$model_url"
tar -xvf model.tar.gz
rm model.tar.gz
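
The later commands pass `$model` as the model path, so they assume the tarball unpacks into a directory with that name. A quick way to confirm this is sketched below. `check_model_dir` is a hypothetical helper, and it only checks that the directory exists and is not empty, not that its contents are a valid Clair3 model.

```shell
# Hypothetical helper: confirm the model directory exists and is not empty.
# Usage: check_model_dir "$model"
check_model_dir() (
    dir=$1
    if [ ! -d "$dir" ]; then
        echo "model directory not found: $dir" >&2
        exit 1
    fi
    # Count entries; an empty directory suggests a failed download or extraction.
    n=$(ls -A "$dir" | wc -l | tr -d ' ')
    if [ "$n" -eq 0 ]; then
        echo "model directory is empty: $dir" >&2
        exit 1
    fi
    echo "$dir: $n entries"
)
```

If the directory has a different name from `$model`, either rename it or update the `model` variable before running Clair3.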

Align the reads and run Clair3:

minimap2 -a -x map-ont -t "$threads" "$in" "$ont" | samtools sort > clair3.bam
samtools index clair3.bam
samtools faidx "$in"
run_clair3.sh --bam_fn=clair3.bam --ref_fn="$in" --threads="$threads" --platform="ont" --model_path="$model" --output=clair3 --include_all_ctgs --no_phasing_for_fa --haploid_sensitive
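
Once Clair3 has finished, it can be useful to see how many records it wrote and how they are distributed across the FILTER column before moving on to filtering. This sketch uses only `gzip` and `awk`. `vcf_filter_counts` is a hypothetical helper name, and which FILTER values appear depends on Clair3 and your data.

```shell
# Hypothetical helper: count VCF records per FILTER value (column 7).
# Usage: vcf_filter_counts clair3/merge_output.vcf.gz
vcf_filter_counts() {
    gzip -dc "$1" \
        | awk -F '\t' '!/^#/ { n[$7]++ } END { for (f in n) print f "\t" n[f] }' \
        | sort
}
```

Each output line is a FILTER value followed by the number of records carrying it. An empty output means the VCF contains no records at all, in which case the consensus step would leave the assembly unchanged.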

Filter the variants and create a consensus:

bcftools view -i '%QUAL>=10' -O z clair3/merge_output.vcf.gz > clair3/filtered.vcf.gz
bcftools index clair3/filtered.vcf.gz
bcftools consensus -f "$in" clair3/filtered.vcf.gz | seqtk seq > "$out"
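
Polishing should only make small changes, so the output would normally have the same number of contigs as the input and a very similar total length. A simple sanity check, sketched here with a hypothetical `fasta_stats` helper built on `awk`:

```shell
# Hypothetical helper: print contig count and total bases for a FASTA file.
# Usage: fasta_stats assembly.fasta
fasta_stats() {
    awk '
        /^>/ { contigs++; next }       # header line: one more contig
        { bases += length($0) }        # sequence line: add its length
        END { print contigs + 0, bases + 0 }
    ' "$1"
}

# Compare the assembly before and after polishing:
# fasta_stats "$in"
# fasta_stats "$out"
```

A large difference in total length, or a change in contig count, would be a reason to inspect the filtered VCF before trusting the result.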

Clean up:

rm -r clair3.bam* "$in".fai clair3 "$model"

The above commands use an arbitrary quality threshold of 10. You're welcome to try different thresholds and assess the results (with ALE) to find a sweet spot where Clair3 fixes the most errors while introducing the fewest.
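
To choose which thresholds are worth trying, it may help to see how many records would survive each cutoff. The sketch below does this with `gzip` and `awk` alone. `qual_sweep` is a hypothetical helper, the cutoffs in the usage line are arbitrary examples, and it needs to be run before the clean-up step removes the `clair3` directory.

```shell
# Hypothetical helper: for each QUAL cutoff, count records with QUAL >= cutoff.
# Usage: qual_sweep clair3/merge_output.vcf.gz "5 10 15 20"
qual_sweep() {
    gzip -dc "$1" | awk -F '\t' -v cutoffs="$2" '
        BEGIN { k = split(cutoffs, c, " ") }
        # Skip header lines and records with a missing QUAL value.
        !/^#/ && $6 != "." {
            for (i = 1; i <= k; i++)
                if ($6 + 0 >= c[i] + 0) n[i]++
        }
        END { for (i = 1; i <= k; i++) print c[i] "\t" (n[i] + 0) }
    '
}
```

Each output line is a cutoff followed by the number of records at or above it. Cutoffs where the count changes sharply are natural candidates: rerun the filter and consensus commands with each one, writing to a different output file, and compare the resulting assemblies.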
