Inputs and outputs - milnus/Corekaburra GitHub Wiki

Inputs and outputs

This page provides the information needed to format inputs for Corekaburra and to understand its outputs.

Inputs

To get your data into a Corekaburra-compatible state, construct your pan-genome with Panaroo or Roary. The two required inputs for Corekaburra are the pan-genome folder and the set of genomes used to build it. The following sections explain the required and optional inputs in more detail.

Genomes

Genomes should follow the 'Prokka style' Gff3 format, in which the ##FASTA flag is given after the annotation part of the Gff file. The genome fasta following the annotations is needed for Corekaburra to calculate the distance between core genes and sequence breaks.
Corekaburra uses the lines that have CDS in the feature column (third column) of the input Gff files. Genomes can be given as either uncompressed or gzipped files.
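As a rough illustration of the format described above (a minimal sketch, not Corekaburra's own parser), the following reads a plain or gzipped Gff file and keeps only the CDS lines that appear before the ##FASTA flag. The function name read_cds_features is hypothetical.

```python
import gzip


def read_cds_features(gff_path):
    """Return (contig, start, end, attributes) for each CDS line before ##FASTA."""
    # Assumption: gzipped files are recognised by their .gz extension.
    opener = gzip.open if str(gff_path).endswith(".gz") else open
    features = []
    with opener(gff_path, "rt") as handle:
        for line in handle:
            if line.startswith("##FASTA"):
                break  # the genome sequence follows; the annotation part is done
            if line.startswith("#") or not line.strip():
                continue  # skip directives, comments and blank lines
            columns = line.rstrip("\n").split("\t")
            # The feature type is in the third column (index 2).
            if len(columns) >= 9 and columns[2] == "CDS":
                features.append(
                    (columns[0], int(columns[3]), int(columns[4]), columns[8])
                )
    return features
```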

Pan-genome

This is the path to the output folder of your pan-genome tool. Corekaburra determines which tool was used and handles the input files accordingly. Currently only Panaroo and Roary are recognized, but in theory any tool can be used as long as its output can be converted into a gene_presence_absence.tsv file in a Roary-like format. To comply with Corekaburra's expectations of a pan-genome input:

  1. Columns 1, 4, and 5 of the Roary gene_presence_absence.tsv format must be given in these positions. The remaining columns up to column 14 can be left empty. Column 1 is the name of the gene cluster. Column 4 is the number of isolates represented in the gene cluster, and column 5 is the number of sequences in the gene cluster. Columns 15 and above, one per genome, must hold the locus_tag of each gene present in the cluster and be left blank for absence. Fragmented genes should be concatenated using a semicolon (;), and all values must be double-quoted (").
  2. The Roary-like file described above must be named gene_presence_absence.tsv and placed in the folder that is given as input to -ip.
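The column layout above can be sketched as follows. This is only an illustration under stated assumptions: build_row and format_row are hypothetical helpers, each locus_tag is counted as one sequence for column 5, and the field delimiter is left as a parameter because this page does not state it.

```python
import csv
import io

N_LEADING_COLUMNS = 14  # columns 1-14 precede the per-genome columns


def build_row(cluster, genome_names, locus_tags_by_genome):
    """Build one Roary-like row; locus_tags_by_genome maps genome -> list of locus_tags."""
    present = [g for g in genome_names if locus_tags_by_genome.get(g)]
    row = [""] * N_LEADING_COLUMNS
    row[0] = cluster  # column 1: gene cluster name
    row[3] = str(len(present))  # column 4: number of isolates
    # Column 5: number of sequences (assumption: one per locus_tag).
    row[4] = str(sum(len(locus_tags_by_genome[g]) for g in present))
    # Columns 15 and above: locus_tags joined by ';', blank when absent.
    row += [";".join(locus_tags_by_genome.get(g, [])) for g in genome_names]
    return row


def format_row(row, delimiter):
    """Serialise a row with every value double-quoted."""
    buffer = io.StringIO()
    writer = csv.writer(
        buffer, delimiter=delimiter, quoting=csv.QUOTE_ALL, lineterminator="\n"
    )
    writer.writerow(row)
    return buffer.getvalue()
```

A header line naming the genomes in columns 15 and above would be written the same way with format_row.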

Complete genomes

As Gff3 files do not always come with an is_circular indication in the file, Corekaburra users have to explicitly specify circular/complete genomes. All, none, or some genomes can be given as circular using the -cg flag. The input file for -cg is a plain text file with each genome on a separate line, given as the base filename, the full filename, or the filename with a path (full or relative does not matter). An example input file that Corekaburra can interpret:

/path/to/complete_genome_1.gff
/path/complete_genome_2.gff.gz
complete_genome_3.gff.gz
complete_genome_4.gff
complete_genome_5
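One way to see why all five forms above are equivalent is to reduce each entry to its base filename before comparing it with the input Gff files. The sketch below illustrates that idea only; base_name and find_complete_genomes are hypothetical names, and the handled suffixes (.gff and .gff.gz) are assumed from the example.

```python
import os

GFF_SUFFIXES = (".gff.gz", ".gff")  # assumption: only these two endings occur


def base_name(entry):
    """Strip any path and a trailing .gff or .gff.gz from a genome entry."""
    name = os.path.basename(entry.strip())
    for suffix in GFF_SUFFIXES:
        if name.endswith(suffix):
            return name[: -len(suffix)]
    return name


def find_complete_genomes(list_lines, input_gffs):
    """Return the input Gff paths whose base name is listed as complete."""
    wanted = {base_name(line) for line in list_lines if line.strip()}
    return [gff for gff in input_gffs if base_name(gff) in wanted]
```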

Core percentage and low-frequency percentage

The percentage of genomes in which a gene must be present to be called a core gene often varies from one analysis to another. Corekaburra therefore lets users change this parameter with -cc (core cutoff). The default is 1.0, meaning presence in 100% of genomes. The number of genomes required for a core gene is the number of genomes multiplied by the core cutoff, rounded down.

A secondary cutoff, -lc (low cutoff), influences the last column of the core_core_accessory_gene_content.tsv file. This value cannot be larger than the core cutoff. The corresponding number of genomes is the number of genomes multiplied by the low cutoff, rounded up.
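The rounding rules for the two cutoffs can be written out as a small worked example. This is a sketch of the arithmetic described above, not code taken from Corekaburra; presence_thresholds is a hypothetical name, and how genes are compared against the resulting counts is not covered here.

```python
import math


def presence_thresholds(n_genomes, core_cutoff, low_cutoff):
    """Return (core genome count, low-frequency genome count) for the given cutoffs."""
    if low_cutoff > core_cutoff:
        raise ValueError("the low cutoff cannot be larger than the core cutoff")
    core_count = math.floor(n_genomes * core_cutoff)  # rounded down
    low_count = math.ceil(n_genomes * low_cutoff)  # rounded up
    return core_count, low_count
```

For instance, 10 genomes with a core cutoff of 0.95 and a low cutoff of 0.05 give 9.5 rounded down to 9 genomes for the core cutoff, and 0.5 rounded up to 1 genome for the low cutoff.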

Outputs

An important thing to note about the output of Corekaburra is the ordering of gene names in a core pair. The two core genes in a pair are ordered alphabetically, provided neither of the 'genes' is a sequence break. If a sequence break is part of a pair, the pair is not sorted. Sorting could result in the same pair name occurring multiple times in the same genome (imagine a core gene being the only one on a contig), which would make these pairs non-unique in their naming. Be mindful that sequence-break pairs are not sorted; depending on your analysis you can ignore this or sort them yourself afterwards.
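The naming rule can be illustrated with a short sketch. The label used for a sequence break ("Sequence_break") and the separator between gene names ("--") are assumptions made for this example; check your own output files for the actual strings.

```python
SEQUENCE_BREAK = "Sequence_break"  # hypothetical label for a sequence break


def core_pair_name(first, second, separator="--"):
    """Name a core pair: alphabetical order unless a sequence break is involved."""
    if SEQUENCE_BREAK in (first, second):
        # Pairs involving a sequence break keep the order in which they were found.
        return f"{first}{separator}{second}"
    return separator.join(sorted((first, second)))
```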

core_pair_summary.tsv

This file is the go-to file for a quick glance at the results. It gives an overview of all core pairs found across the pan-genome. It reports how often a pair occurs, how often each core gene occurs, and how often the two core genes co-occur. Base pairs and accessory genes between core pairs are reported with minimum, maximum, mean and median values.

This output lets the user quickly sort core pairs by a feature of interest, and is often a good starting point for downstream analysis.

low_frequency_gene_placement.tsv

A record of all core pairs in each genome and the number of nucleotides and accessory genes between them. This file is often too large to manually inspect but is useful for querying during analysis.

core_core_accessory_gene_content.tsv

File listing, per genome, the core pairs that have accessory genes between them. The last column provides a percent presence category of low_frequency or intermediate_frequency. By default the cut-off is at 0.01%, but it can be user-defined with -lc. Important to note: not all core pairs appear in this file, only the ones with accessory genes between them. This also means that the same core pair can appear multiple times, if more than one accessory gene is found between a core pair.

core_gene_graph.gml

This is a graph with core genes as nodes and edges symbolising connections between them across genomes. The weight of an edge is the number of times the two core genes are connected. This graph can be loaded into Cytoscape in the same way as a pan-genome graph from Panaroo. It can be useful when searching for genomic inversions across genomes.

core_segments.tsv

Contains the segments of 'stable' core gene pairs identified by Corekaburra. A segment starts and ends with a core gene that is connected to more than two other core genes across all input Gffs. A core gene being connected to more than two other core genes indicates that some type of rearrangement has occurred. Segments are named by the core genes at each end of the segment, and each core gene is given an index within its segment.
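The rule for where segments start and end can be sketched with a plain adjacency map. This is a simplified illustration, not Corekaburra's implementation; segment_end_genes is a hypothetical name, and the input is assumed to be the core pairs observed across all genomes.

```python
from collections import defaultdict


def segment_end_genes(core_pairs):
    """Return core genes connected to more than two other core genes."""
    neighbours = defaultdict(set)
    for gene_a, gene_b in core_pairs:
        neighbours[gene_a].add(gene_b)
        neighbours[gene_b].add(gene_a)
    # More than two distinct neighbours suggests a rearrangement in some genome.
    return sorted(gene for gene, linked in neighbours.items() if len(linked) > 2)
```

In a set of genomes with identical gene order, every core gene has at most two neighbours and the function returns an empty list.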

no_accessory_core_segments.tsv

This file is linked to core_segments.tsv. Here the segments from core_segments.tsv are divided into sub-segments. The rule used to break down the segments is that no accessory genes may be found between core pairs within a sub-segment. Therefore, each sub-segment starts and ends with a core gene that has accessory genes inserted next to it in at least one input Gff. The names of the segments in core_segments.tsv can be found in this file, as each segment is made up of sub-segments. Sub-segments are named by the core genes at both ends, and given a unique index within their parent 'stable' segment. Each core gene in the sub-segment is also given an index.
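The splitting rule can be sketched as follows, assuming a segment is given as an ordered list of core genes and that the core pairs with accessory genes between them are already known. The function name and input shapes are hypothetical, and details such as how single-gene sub-segments are handled may differ in Corekaburra itself.

```python
def split_into_sub_segments(segment, pairs_with_accessory):
    """Split an ordered segment wherever a core pair has accessory genes between it.

    pairs_with_accessory is a set of frozensets, each holding the two genes of a pair.
    """
    if not segment:
        return []
    sub_segments = [[segment[0]]]
    for previous, current in zip(segment, segment[1:]):
        if frozenset((previous, current)) in pairs_with_accessory:
            sub_segments.append([current])  # accessory genes found: start a new sub-segment
        else:
            sub_segments[-1].append(current)
    return sub_segments
```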

coreless_contig_accessory_gene_content.tsv

Any contig that appears in a genome but does not encode a core gene is reported here. The file is therefore only produced if at least one contig is without a core gene. This could happen with a plasmid that is not found in enough genomes, or with a prophage that was not assembled together with the rest of the genome. In some cases this file can be used to identify contigs of potential interest. Each entry is given by: contig name, Gff file name, total accessory genes, and accessory genes split into intermediate- and low-frequency categories.
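As an illustration of what qualifies a contig for this file, the sketch below reports contigs that carry no core gene and splits their genes by frequency category. The function name and input shapes are hypothetical; in particular, every non-core gene that is not low-frequency is assumed to be intermediate-frequency.

```python
def coreless_contigs(genes_by_contig, core_genes, low_frequency_genes):
    """Return (contig, total, intermediate genes, low-frequency genes) for coreless contigs."""
    report = []
    for contig, genes in genes_by_contig.items():
        if any(gene in core_genes for gene in genes):
            continue  # the contig encodes a core gene, so it is not reported
        low = [gene for gene in genes if gene in low_frequency_genes]
        intermediate = [gene for gene in genes if gene not in low_frequency_genes]
        report.append((contig, len(genes), intermediate, low))
    return report
```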