Influencing Autocycler via contig headers - rrwick/Autocycler GitHub Wiki

Autocycler's behaviour can be influenced by special text in the headers of input contigs, which affects how those contigs are treated during the clustering and consensus steps. Three types of header hint are supported: trusted, cluster weight and consensus weight.

Trusted

To mark a contig as trusted, include Autocycler_trusted (case insensitive) in its FASTA header. This affects the Autocycler cluster step. Any cluster containing a trusted contig will automatically pass quality control (QC), even if it would normally fail.
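As a minimal sketch of how such a label could be added, the Python below appends the hint to every header line in a block of FASTA text. The function name `add_header_hint` and the example contig name are hypothetical, and the sketch assumes that appending the hint after a space at the end of the header is acceptable (the text above only requires that the hint appears somewhere in the header).

```python
def add_header_hint(fasta_text: str, hint: str = "Autocycler_trusted") -> str:
    """Append a hint to every FASTA header line that does not already contain it.

    Hypothetical helper for illustration; not part of Autocycler itself.
    """
    out_lines = []
    for line in fasta_text.splitlines():
        # Header lines start with '>'. The hint is case insensitive,
        # so compare in lower case to avoid adding it twice.
        if line.startswith(">") and hint.lower() not in line.lower():
            line = f"{line} {hint}"
        out_lines.append(line)
    return "\n".join(out_lines) + "\n"


# Made-up single-contig FASTA used only to demonstrate the helper.
example = ">plasmid_1 length=4200\nACGTACGT\n"
tagged = add_header_hint(example)
print(tagged, end="")
```

In practice the function would be applied to only the assembly file (or the single contig) that should be trusted, since every header it sees gets the label.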

Example 1: small plasmid recovery

  • 12 input assemblies were generated, each containing the chromosome. One assembly also includes a small plasmid.
  • When Autocycler cluster is run, two clusters are created: one for the chromosome (12 contigs) and one for the plasmid (1 contig).
  • By default, the plasmid cluster fails QC (i.e. is placed in clustering/qc_fail/), with the message: failed QC: present in too few assemblies
  • This is because the plasmid is present in only 1 of 12 assemblies (8.3%), below the 25% default threshold for --min_assemblies.
  • However, if the plasmid contig is labelled as trusted (Autocycler_trusted in its FASTA header), the cluster will pass QC (i.e. is placed in clustering/qc_pass/), ensuring it is retained for downstream steps.

Example 2: phage excision

  • The bacterial genome contains a prophage, and in some cells the phage has excised to form a separate circular replicon.
  • Some input assemblies contain just the chromosome (with integrated prophage) while others also include a separate phage contig.
  • Normally, the phage cluster fails QC with the message: failed QC: contained within cluster 1
  • This happens because Autocycler recognises the phage sequence as a subset of the chromosome.
  • If the phage contig is labelled as trusted, its cluster passes QC so the separate phage will be included in the final assembly.

Note that trusted contigs still undergo trimming and consensus resolution. Marking a contig as trusted ensures that its cluster passes QC, but it does not guarantee that the sequence will appear unchanged in the final output.

Cluster weight

To modify how a contig contributes to clustering, add Autocycler_cluster_weight= (case insensitive) followed by an integer. For example: Autocycler_cluster_weight=2.

This affects the Autocycler cluster step. Normally, clusters must include contigs from a minimum number of input assemblies (--min_assemblies) to pass QC. Increasing the cluster weight of a contig boosts the effective number of assemblies contributing to that cluster, making it more likely to pass QC.

Example: boosting plasmid support

  • 12 input assemblies were generated, each containing the chromosome. Two of them also contain a plasmid.
  • The plasmid cluster contains contigs from only 2 of 12 assemblies (16.7%), so it fails the default 25% --min_assemblies threshold.
  • If both plasmid contigs have Autocycler_cluster_weight=2, they are counted twice.
  • The adjusted percentage is then (2×2)/12 = 33.3%, so the plasmid cluster now passes QC.
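The arithmetic in this example can be sketched in a few lines of Python. This is a simplified model of the calculation described above, not Autocycler's actual implementation: the function name `weighted_assembly_fraction` is hypothetical, contigs without a hint are assumed to have a weight of 1, and the comparison against the threshold is assumed to be "greater than or equal to".

```python
def weighted_assembly_fraction(contig_weights, total_assemblies, min_fraction=0.25):
    """Return (fraction, passes) for a cluster under a simple weighted model.

    contig_weights: one cluster weight per contig in the cluster
                    (assumed to be 1 when no hint is given).
    total_assemblies: number of input assemblies.
    min_fraction: threshold, 0.25 to mirror the 25% default quoted above.
    """
    fraction = sum(contig_weights) / total_assemblies
    # Assumption: a cluster exactly at the threshold is treated as passing.
    return fraction, fraction >= min_fraction


# Two plasmid contigs with no hint: 2/12 = 16.7%, below 25%.
unweighted = weighted_assembly_fraction([1, 1], 12)

# The same two contigs with Autocycler_cluster_weight=2: (2*2)/12 = 33.3%.
weighted = weighted_assembly_fraction([2, 2], 12)

print(f"unweighted: {unweighted[0]:.1%}, passes={unweighted[1]}")
print(f"weighted:   {weighted[0]:.1%}, passes={weighted[1]}")
```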

This feature is used in my full Autocycler pipeline, which adds a cluster weight of 2 to circular contigs from Plassembler. Since Plassembler often recovers small plasmids missed by other assemblers, boosting their weight helps ensure those plasmids are retained in the final assembly.

Consensus weight

To modify how a contig contributes to consensus building, add Autocycler_consensus_weight= (case insensitive) followed by an integer. For example: Autocycler_consensus_weight=2.

This affects the Autocycler resolve step. During consensus generation, Autocycler considers the most common path through the graph across all contigs in a cluster. Increasing a contig's consensus weight means it will contribute more heavily to the decision at variable sites.

Example: homopolymer length

  • A genome has a homopolymer site where the input assemblies disagree: 7 contigs have A×13, 5 have A×14.
  • Normally, the consensus at that site is A×13 (since 7 > 5).
  • If one A×14 contig has Autocycler_consensus_weight=4, it now contributes 4 votes.
  • This raises the number of votes for A×14 to 8, causing the consensus to become A×14.
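The vote counting in this example can be illustrated with a small Python sketch. It treats the homopolymer site as a simple weighted vote between two alternatives, which is a deliberate simplification: Autocycler resolves paths through a graph rather than voting on isolated sites, and the function name `weighted_consensus` is hypothetical. Ties are not handled in any meaningful way here.

```python
from collections import Counter


def weighted_consensus(votes):
    """Pick the alternative with the highest total weight.

    votes: iterable of (sequence, weight) pairs, one per contig.
    Returns (winning_sequence, tally_as_dict).
    """
    tally = Counter()
    for sequence, weight in votes:
        tally[sequence] += weight
    winner = tally.most_common(1)[0][0]
    return winner, dict(tally)


a13 = "A" * 13
a14 = "A" * 14

# 7 contigs support A×13 and 5 support A×14, all with the assumed default weight of 1.
default_votes = [(a13, 1)] * 7 + [(a14, 1)] * 5
default_winner, default_tally = weighted_consensus(default_votes)

# One of the A×14 contigs now has Autocycler_consensus_weight=4.
boosted_votes = [(a13, 1)] * 7 + [(a14, 1)] * 4 + [(a14, 4)]
boosted_winner, boosted_tally = weighted_consensus(boosted_votes)

print(len(default_winner), default_tally[a13], default_tally[a14])
print(len(boosted_winner), boosted_tally[a13], boosted_tally[a14])
```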

This feature is also used in my full Autocycler pipeline, which assigns a consensus weight of 2 to contigs from Flye and Canu, as those tools tend to produce more accurate sequences than other long-read assemblers.
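To check which hints a given header carries, the three patterns described on this page can be matched with case-insensitive regular expressions. The sketch below is illustrative only and is not Autocycler's own parsing code: the function name `parse_hints` and the example header are hypothetical, and the default weight of 1 for contigs without a hint is an assumption.

```python
import re

# Case-insensitive patterns for the three header hints described on this page.
TRUSTED_RE = re.compile(r"autocycler_trusted", re.IGNORECASE)
CLUSTER_WEIGHT_RE = re.compile(r"autocycler_cluster_weight=(\d+)", re.IGNORECASE)
CONSENSUS_WEIGHT_RE = re.compile(r"autocycler_consensus_weight=(\d+)", re.IGNORECASE)


def parse_hints(header: str) -> dict:
    """Extract the trusted flag and both weights from one FASTA header line."""
    cluster = CLUSTER_WEIGHT_RE.search(header)
    consensus = CONSENSUS_WEIGHT_RE.search(header)
    return {
        "trusted": TRUSTED_RE.search(header) is not None,
        # Assumption: a missing weight hint is equivalent to a weight of 1.
        "cluster_weight": int(cluster.group(1)) if cluster else 1,
        "consensus_weight": int(consensus.group(1)) if consensus else 1,
    }


# Made-up header mixing upper and lower case to show case insensitivity.
hints = parse_hints(">contig_3 circular=true AUTOCYCLER_CONSENSUS_WEIGHT=2 autocycler_trusted")
print(hints)
```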