How it works - milnus/Corekaburra GitHub Wiki

How it works

Purpose

The aim of Corekaburra is to take information from a pangenome and process it into a form that can be more readily understood and compared. Typically a pangenome program such as roary will output the following information.

Whether a gene is "core" or "accessory" (or other type of group) based on percent presence in the set genomes examined.
The presence or absence of genes in each genome.
The order of genes in each genome (relies on locus tags being correctly numbered).

Whilst this information is useful in assessing individual genes it can fall flat when trying to determine:

Which genes are associated together and/or in proximity to one another (such as in mobile genetic elements or, operons or loci of accessory genes).
What is the order of the genes in every genome (particularly problematic when correcting annotations where non sequential locus tags are added, as is the case for Panaroo).
Which core genes consistently contain accessory elements between them (insertion regions).

Corekaburra seeks to address these limitations by integrating core gene synteny, allowing for a look at the pangenome in the context of shared elements and insertion regions rather than just shared gene content.

Core gene determination

The first step in corekaburra is defining core genes by presence and structure, using a threshold defined by the user. This generally is 100%, though can be dropped to account for sporadically missing or truncated genes due to misassembly, misannotation, or biological oddities.
For core genes consisting of smaller fragments, all fragments must be neighbouring one another on the same contig, with no other annotations between them. This decision is not perfect, but is the easiest in terms of parsing files and outputs. It will to some extent lower the number of core genes handled by Corekaburra. (If you have an idea of how to handle this problem let us know!)

Determine_core_gene

Core core pairs

The next step is to look at each genome and determine:

The order of genes (done by referring to the original gff files).
Which core genes are adjacent in the genome (or abutting a sequence break).
How many base pairs and what number of accessory genes are between each core-core pair.

Core_pairs_figure_4

In the above example the core-core pairs determined would be:
sb-a
a-b
b-c
c-d
d-sb

where sb is sequence break/ends of contigs

Accessory content

When core-core pairs are determined, we can begin to determine what accessory occurs between core genes. This can be summarised through both the size (in number of basepairs) and the number of genes annotated in this accessory segment.

Accessory_insertion_larger_larger

In the above example there are 3 genes inserted in between core-core pair y-b in genome 1. Whilst no genes are inserted in genome 2.

Looking for core gene segments

Another type of genomic change that can be difficult to detect in pangenome outputs are genomic rearrangements. Corekaburra allows users to look for these by identifying segments of core genes that are 'stable' across genomes. A segment is ≥2 core genes that are found to pair in all possible genomes where they co-occur.
to identify segments Corekaburra:

Builds a graph with core genes as nodes and edges between nodes weighted by the number of times core genes are found in a pair together.
Identifies all core genes that are connected to >2 other core genes (not counting sequence breaks) across all genomes (multi connected genes)
Searches all paths between multi connected genes that do not hold a third multi connected gene. (segment)

These segments are 'stable' (not broken by rearrangements or genes that move to other parts of the genome) and can therefore be used to examine rearrangements.

Corekaburra also identifies another type of segment (sub-segments), which are segments not interrupted by insertion of accessory genes. These are simply found by passing the 'stable' core segments and making sure no accessory genes are inserted between core pairs across all genomes.

Stable_core_segment_search_larger

In the above example we have three genomes with two core genes having ≥3 core gene neighbours. Core gene y neighbours x, c, and m and core gene n neighbours m, c and o. When searching for 'stable' core segments in the three genomes Corekaburra would identify: x-y, c-a-b-m, and n-o as segments. When looking for sub-segments c-a-b-m would be divided into c-a and b-m, due to the presence of the accessory gene i between a and b in genome 3.

Changes in core pairs like y-c and m-n in genome 1 and y-m and c-n in genome 2 are indicative of a rearrangement. These can clearly be seen when comparing a handful of genomes or the gff files but would be difficult to discern from the pangenome output alone, without core gene synteny.

Dealing with sequence breaks

Until every genome we sequence is a 'closed/complete' genome we have to deal with 'incomplete/broken' genomes. This results in some core genes being next to sequence breaks. When this happens, Corekaburra will report this. It will, as with core core pairs, report the number of basepairs and potential accessory genes until the break occurs.

This information can be useful in some cases, where sequence breaks are induced by specific events, such as insertion of mobile genetic elements. In the example figure genome 1 carries no prophage and assembles nicely. Genome 2 carries a prophage between core gene b and c. In genome 3 b neighbours a sequence break, but in between b and the break is accessory gene P indicating a potential insertion of some mobile genetic element, and with knowledge of genome 2 we could suspect it being a prophage of sorts. Our suspicion is increased by the presence of E in the region between the sequence break and c.

Sequence_break_use_larger

One thing to note about sequence breaks and their reporting is the ordering of the pair that includes a sequence break. Normally names of core genes in a pair are sorted to construct the name of the pair. This is not the case for sequence breaks. This makes it so that a core gene, X can be in pairs named sequence_break-X and X-sequence_break. This does not mean that these pairs are different, but only that X was found to be one either one end or the other in separate genomes. This decision makes it so that you can not have the 'same' core-pair twice in a genome, in the event of a single core gene on a contig, which would result in two pairs with the same name (sequence_break-X) in the same genome.

Reporting and outputs

For a full description of output files see the section on this.