Quick start - milnus/Corekaburra GitHub Wiki

Overview

Corekaburra looks at gene synteny across genomes used to construct a pan-genome. Using syntenic information Corekaburra identifies regions between core genes. Regions are described based on their content of accessory genes and nucleotides between core genes. Information on neighbouring core genes is further used to identify segments of core genes throughout the pan-genome that appear in all genomes given as input. Corekaburra is compatible with outputs from standard pan-genome pipelines: Roary and Panaroo, but can be extended to other pipelines (see: inputs).

Why and When to use Corekaburra

Corekaburra fits into the existing frameworks of bioinformatics pipelines for pan-genomes. It does not reinvent the pan-genome pipeline, but leverages the existing ones. Because of this, Corekaburra is built as a natural extension of pan-genome analyses, summarising information and inferring relationships in the pan-genome that are otherwise not easily accessible via pan-genome graphs. Other tools, such as PPanGGOLiN and Panakeia, provide similar outputs or information, but only within their own standalone pan-genome analysis framework or pipeline. By building on top of existing tools, Corekaburra frees users from having to cross-reference between pan-genomes, which is in itself a challenging task. Corekaburra's workflow also allows it to be extended to any pan-genome tool with an output similar to the gene_presence_absence.csv produced by Roary, making Corekaburra versatile for future implementations.

Installation

Corekaburra is written in Python 3.9, and can be installed via pip and conda. A Docker container is also available.

Conda install

conda install -c bioconda corekaburra

pip

pip install corekaburra

Docker

See the [Docker tab](https://github.com/milnus/Corekaburra/wiki/Docker.md) for more information.

Dependencies:

python==3.9

Python packages:

networkx
gffutils
numpy

Help

usage: Corekaburra -ig file.gff [file.gff ...] -ip path/to/pan_genome [-cg complete_genomes.txt] [-cc 1.0] [-lc 0.05] [-o path/to/output] [-p OUTPUT_PREFIX] [-c int] [-l | -q] [-h] [-v]

Welcome to Corekaburra! An extension to pan-genome analyses that summarise genomic regions between core genes and segments of neighbouring core genes using gene synteny from a set of input genomes and a pan-genome folder.

Required arguments:
  -ig file.gff [file.gff ...], --input_gffs file.gff [file.gff ...]
                        Path to gff files used for pan-genome
  -ip path/to/pan_genome, --input_pangenome path/to/pan_genome
                        Path to the folder produced by Panaroo or Roary

Analysis modifiers:
  -cg complete_genomes.txt, --complete_genomes complete_genomes.txt
                        text file containing names of genomes that are to be handled as complete genomes
  -cc 1.0, --core_cutoff 1.0
                        Percentage of isolates in which a core gene must be present [default: 1.0]
  -lc 0.05, --low_cutoff 0.05
                        Percentage of isolates where genes found in less than these are seen as low-frequency genes [default: 0.05]

Output control:
  -o path/to/output, --output path/to/output
                        Path to where output files will be placed [default: current folder]
  -p OUTPUT_PREFIX, --prefix OUTPUT_PREFIX
                        Prefix for output files, if any is desired

Other arguments:
  -c int, --cpu int     Give max number of CPUs [default: 1]
  -l, --log             Record program progress in for debugging purpose
  -q, --quiet           Only print warnings
  -h, --help            Show help function
  -v, --version         show program's version number and exit
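
As an illustration of how the flags above fit together, the following sketch assembles a Corekaburra call in Python using only options listed in the help text. The helper name `build_corekaburra_command` and all file and folder names are hypothetical and exist only for this example.

```python
import shlex


def build_corekaburra_command(gffs, pan_genome_dir, output_dir=None,
                              core_cutoff=None, low_cutoff=None, cpus=None):
    """Assemble a Corekaburra call from flags listed in the help text."""
    # -ig and -ip are the two required arguments.
    cmd = ["Corekaburra", "-ig", *gffs, "-ip", pan_genome_dir]
    # Optional flags are only added when a value is given,
    # so the program's own defaults apply otherwise.
    if core_cutoff is not None:
        cmd += ["-cc", str(core_cutoff)]
    if low_cutoff is not None:
        cmd += ["-lc", str(low_cutoff)]
    if output_dir is not None:
        cmd += ["-o", output_dir]
    if cpus is not None:
        cmd += ["-c", str(cpus)]
    return cmd


command = build_corekaburra_command(
    gffs=["genome_1.gff", "genome_2.gff"],  # hypothetical input files
    pan_genome_dir="panaroo_output",        # hypothetical pan-genome folder
    output_dir="corekaburra_out",           # hypothetical output folder
    core_cutoff=0.99,
    cpus=4,
)
# Print the command as it would be typed in a shell.
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` once Corekaburra is installed.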

Inputs

Gff files

Every input Gff file must be represented in the pan-genome's gene_presence_absence.csv-style file. Files are matched by name: the genome name used in the gene_presence_absence.csv file must correspond to the name of the Gff file.
Gffs are also required to contain a ##FASTA line dividing the file into annotations at the top and the genome in Fasta format at the bottom.
All coding sequences (CDS) annotated in the GFF must also carry an ID and a locus_tag.
Input Gff files can be in gzipped format, if desired.
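
The requirements above can be checked before a run. The sketch below is not part of Corekaburra; it is a minimal pre-flight check, assuming standard tab-separated GFF3 with nine columns and `key=value` attributes separated by semicolons. The function names `check_gff` and `open_gff` and the example records are hypothetical.

```python
import gzip
import io


def check_gff(handle):
    """Return a list of problems found in a GFF3 text stream."""
    problems = []
    has_fasta = False
    for line_number, line in enumerate(handle, start=1):
        line = line.rstrip("\n")
        if line.startswith("##FASTA"):
            has_fasta = True
            break  # everything below this line is sequence, not annotation
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Only CDS features need to carry both an ID and a locus_tag.
        if len(fields) != 9 or fields[2] != "CDS":
            continue
        attributes = dict(
            pair.split("=", 1) for pair in fields[8].split(";") if "=" in pair
        )
        for key in ("ID", "locus_tag"):
            if key not in attributes:
                problems.append(f"line {line_number}: CDS lacks {key}")
    if not has_fasta:
        problems.append("no ##FASTA section found")
    return problems


def open_gff(path):
    """Open a plain or gzipped Gff file as text."""
    return gzip.open(path, "rt") if str(path).endswith(".gz") else open(path)


# Hypothetical two-gene annotation; the second CDS has no locus_tag.
example = io.StringIO(
    "##gff-version 3\n"
    "contig_1\tProkka\tCDS\t1\t90\t.\t+\t0\tID=gene_1;locus_tag=gene_1\n"
    "contig_1\tProkka\tCDS\t100\t190\t.\t+\t0\tID=gene_2\n"
    "##FASTA\n"
    ">contig_1\nATGC\n"
)
print(check_gff(example))  # ['line 3: CDS lacks locus_tag']
```

For files on disk, `check_gff(open_gff("genome_1.gff.gz"))` would apply the same check to a gzipped input.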

Pan-genome folder

This is the output folder from a Roary or Panaroo run, or a folder that at minimum holds the gene_presence_absence.csv from Roary or the gene_presence_absence_roary.csv from Panaroo. (For more details, and for making the output of your own pan-genome pipeline work as input, see: inputs)

Complete genomes

If some input Gffs are to be processed as complete or closed genomes, a plain text file listing their filenames can be provided.
Example:

complete_genome.gff
complete_genome.gff.gz
/paths/are/allowed/complete_genome.gff
complete_genome

All files given in the plain text file must be present in a given gene presence/absence file, but are not required to be among the input gffs. This allows users to have a single plain text file of complete genomes and use it for analysing subsets of genomes in the pan-genome.
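
The four example lines above all point to the same genome. One way to picture this is to reduce each entry to a bare genome name. The sketch below does so under the assumption that the path, a trailing `.gz`, and a trailing `.gff` are simply dropped; this is an illustration and not necessarily how Corekaburra resolves names internally. The function name `genome_name` is hypothetical.

```python
import os


def genome_name(entry):
    """Reduce one line of the complete-genomes file to a bare genome name."""
    name = os.path.basename(entry.strip())  # drop any leading path
    if name.endswith(".gz"):
        name = name[:-3]                    # drop gzip extension
    if name.endswith(".gff"):
        name = name[:-4]                    # drop Gff extension
    return name


lines = [
    "complete_genome.gff",
    "complete_genome.gff.gz",
    "/paths/are/allowed/complete_genome.gff",
    "complete_genome",
]
names = {genome_name(line) for line in lines if line.strip()}
print(names)  # {'complete_genome'}
```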

Adjusting cutoffs

To comply with common practice when handling pan-genomes, the cutoff at which a gene is considered core can be changed using the -cc argument, which takes the ratio of genomes in which a gene must be present. By default, this is set to a conservative 1.0 (100% presence).
A second argument dividing accessory genes into two groups (Low frequency and Intermediate frequency) can be controlled using the -lc argument. -lc takes a ratio indicating the maximum presence of a gene to be identified as having a low frequency in the pan-genome. This division of low- and intermediate frequency can be disabled by: -lc 0, resulting in all genes being considered as intermediate.
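
The two cutoffs can be read as a simple three-way classification. The sketch below assumes that a gene is core when its presence ratio is at least -cc, and low frequency when its presence ratio is strictly below -lc, following the wording "found in less than" in the help text. The exact handling of genes that sit precisely on a cutoff is an assumption here, and the function name `classify_gene` is hypothetical.

```python
def classify_gene(n_present, n_genomes, core_cutoff=1.0, low_cutoff=0.05):
    """Classify a gene by the ratio of genomes in which it is present."""
    presence = n_present / n_genomes
    if presence >= core_cutoff:
        return "core"
    # Assumption: strictly below the cutoff counts as low frequency,
    # so -lc 0 leaves every accessory gene as intermediate.
    if presence < low_cutoff:
        return "low frequency"
    return "intermediate frequency"


print(classify_gene(100, 100))                   # core
print(classify_gene(99, 100))                    # intermediate frequency
print(classify_gene(99, 100, core_cutoff=0.99))  # core
print(classify_gene(3, 100))                     # low frequency
print(classify_gene(3, 100, low_cutoff=0))       # intermediate frequency
```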

Outputs

Corekaburra outputs multiple files ranging from summaries to more fine grained outputs. This is aimed at giving the user easy access to information, but still allowing for tailored or deep exploration.

Core regions

A Core region is defined by two core genes flanking a stretch of the genome in at least one input genome (Gff). A core region can be described by a distance between the flanking core genes, positive if nucleotides can be found between them, and negative if the two genes overlap. Regions can also be described by the number of encoded accessory genes, low- and intermediate frequency. Using core genes as reference for regions it is possible to compare the same region across genomes in the framework of a pan-genome. Additionally, with either or both the distance and number of encoded accessory genes in a region it is possible to identify regions of variability.
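
The sign convention for the distance can be made concrete with a small calculation. The sketch below assumes 1-based, inclusive coordinates as used in Gff files and two genes on the same linear contig; it does not cover complete (circular) genomes, and it is not taken from Corekaburra's source. The function name `core_pair_distance` is hypothetical.

```python
def core_pair_distance(first_gene, second_gene):
    """Nucleotides between two genes given as 1-based inclusive (start, end).

    Positive when nucleotides lie between the genes, zero when they are
    directly adjacent, negative by the number of shared positions on overlap.
    """
    upstream, downstream = sorted([first_gene, second_gene])
    return downstream[0] - upstream[1] - 1


print(core_pair_distance((1, 90), (101, 200)))  # 10 (positions 91 to 100)
print(core_pair_distance((1, 100), (98, 200)))  # -3 (positions 98 to 100 shared)
```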

core_pair_summary.csv summarises the identified core regions across input Gffs. It reports the occurrence and co-occurrence of each core gene pair, as well as the occurrence of each individual core gene. Summary statistics (minimum, maximum, mean, and median) for distance and accessory gene content are given for each core pair.
This file is a good entry point to the results in most analyses, and should give a good indication of which core regions could be of interest.

core_core_accessory_gene_content.tsv gives the placement of each accessory gene identified in core regions across all Gffs. It also states whether each accessory gene is classified as a low- or intermediate-frequency gene.

low_frequency_gene_placement.tsv holds one line for each core region across all Gffs with the nucleotide distance, and number of accessory genes found between core genes.

Core segments

The following two files are only given if any core gene is found to have more than two unique core genes as neighbours across all input Gffs.

core_segments.csv contains all segments of at least two core genes identified in the pan-genome. The start and end of a segment are defined by core genes with more than two unique neighbours, meaning they could mark a potential breakpoint of a genomic inversion in at least one input Gff, or a misassembly.
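
The idea of "more than two unique neighbours" can be illustrated with a toy example. The sketch below is an assumption-laden simplification, not Corekaburra's implementation: each genome is a single linear contig given as an ordered list of core gene names, accessory genes are ignored, and circular genomes are not handled. The function names and gene names are hypothetical.

```python
from collections import defaultdict


def core_neighbours(genomes):
    """Map each core gene to the set of core genes found next to it."""
    neighbours = defaultdict(set)
    for contig in genomes:
        # Walk along consecutive pairs of core genes on the contig.
        for left, right in zip(contig, contig[1:]):
            neighbours[left].add(right)
            neighbours[right].add(left)
    return neighbours


def potential_breakpoints(genomes):
    """Core genes with more than two unique core gene neighbours."""
    return sorted(
        gene for gene, adjacent in core_neighbours(genomes).items()
        if len(adjacent) > 2
    )


genomes = [
    ["A", "B", "C", "D", "E", "F"],  # one gene order
    ["A", "B", "E", "D", "C", "F"],  # same genes with C-D-E inverted
]
print(potential_breakpoints(genomes))  # ['B', 'C', 'E']
```

In this toy case B, C, and E each see three different neighbours across the two genomes, so they would delimit segments, while D keeps the same two neighbours in both orders.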

no_accessory_core_segments.csv divides segments given in core_segments.csv into smaller segments where core genes must form regions with no accessory genes between them across all Gffs. These segments could indicate potential operon structures, or other stable genomic features.

Core-less contigs

coreless_contig_accessory_gene_content.tsv gives all contigs identified in Gffs that contain only accessory genes and no core gene. Each contig is listed with its contig name, its Gff file name, and the number of low- and intermediate-frequency genes found on the contig.

Post processing

Post-processing depends heavily on the question you are asking of your data. Some general ideas, tips, and tricks for downstream analyses have been collated in this wiki. If you find new things to look at, or new ways of doing things, let us know and we will look at incorporating them here.

Bug reporting and feature requests

Please submit bug reports and feature requests to the issue tracker on GitHub: Corekaburra issue tracker

Licence

This program is released as open source software under the terms of the MIT License.