Databases Conformation - nselem/evomining GitHub Wiki

EvoMining databases conformation

This tutorial explains how to build the main EvoMining databases.

EvoMining takes three inputs: (1) a custom genomic database (genomic-DB), (2) a central pathways database (central-DB), and (3) a natural products database (natural-DB) composed of genes that belong to experimentally tested BGCs. All three databases are provided and can be modified, replaced, or expanded by the user. The genomic-DB is a collection of genomes in RAST format from taxonomically related organisms. The current central-DB contains previously curated central pathways from Actinobacteria (Barona-Gómez et al. 2012; Cruz-Morales et al. 2016). The current natural-DB comprises all sequences that belong to BGCs from the Minimum Information about a Biosynthetic Gene cluster (MIBiG) repository (Medema et al. 2015).

Central DB

Central paths headers:

SUBSYSTEM|Family number|Function_querynumber|Organism or comment.

SUBSYSTEM

Subsystem refers to the metabolic subsystem. Each subsystem may contain many family numbers.

Family number

Defines the columns of the heatplot. Each family may have many queries, but only one hit-query relationship is shown in the heatplot, i.e. only non-redundant hits are considered.

Function_querynumber

Function describes the function of the family.
querynumber is a unique consecutive identifier.

Organism or comment

Organism name or comment.

Example

3PGA_AMINOACIDS|1|Phosphoglyceratedehydrogenase_1|Cglu
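
To make the four header fields concrete, here is a minimal Python sketch that splits a central-DB header into its parts. The names `CentralHeader` and `parse_central_header` are hypothetical helpers, not part of EvoMining, and the sketch assumes that the query number is whatever follows the last underscore of the third field.

```python
from typing import NamedTuple


class CentralHeader(NamedTuple):
    subsystem: str
    family: int
    function: str
    query_number: int
    organism: str


def parse_central_header(header: str) -> CentralHeader:
    """Split a central-DB header into its pipe-separated fields."""
    # Tolerate a leading '>' in case the header is read straight from a FASTA file.
    fields = header.strip().lstrip(">").split("|", 3)
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields in {header!r}")
    subsystem, family, function_query, organism = fields
    # Assumption: the query number follows the last underscore of Function_querynumber.
    function, _, query_number = function_query.rpartition("_")
    return CentralHeader(subsystem, int(family), function, int(query_number), organism)


print(parse_central_header("3PGA_AMINOACIDS|1|Phosphoglyceratedehydrogenase_1|Cglu"))
```

Splitting with a maximum of three splits keeps any further `|` characters inside the organism or comment field.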

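EvoMining's curated Actinobacteria central-DB is available. This central-DB contains enzymes from 9 metabolic subsystems.
Get the Actinobacteria central-DB with:
wget https://github.com/nselem/EvoMining/blob/master/databases/central-DB-Actinobacteria
Note that this is a GitHub blob URL, so wget may save the HTML page rather than the file itself; if that happens, download the file through the Raw link on that page.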

Natural Products

The natural-DB currently uses the MIBiG database: http://mibig.secondarymetabolites.org/
natural-DB

The natural products DB contains enzymes that belong to any biosynthetic pathway of interest. It must be a FASTA file in which each record has a special header and a one-line protein sequence, as in the example below:

>BGC0000001_AADALKB

MNAPVHVDQNFEEVINAARSMREIDRKRYLWMISPALPVIGIGILAGYQFSPRPIKKIFALGGPIVLHIIIPVIDTIIGKDASNPTSEEIKQLENDPY
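
Before running EvoMining, it can help to check that a natural-DB file really alternates one header line with one sequence line. The following Python sketch does only that structural check; `find_format_problems` is a hypothetical helper, the sequence in the demo is a shortened copy of the example above, and the naming rules of the header itself are not verified.

```python
from typing import Iterable, List


def find_format_problems(lines: Iterable[str]) -> List[str]:
    """Return one message per structural problem; an empty list means the file passes."""
    problems = []
    expect_header = True  # a valid file alternates header, sequence, header, ...
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line:
            continue  # blank lines are skipped in this sketch
        is_header = line.startswith(">")
        if expect_header and not is_header:
            problems.append(f"line {number}: expected a header starting with '>'")
        elif is_header and not expect_header:
            problems.append(f"line {number}: previous header has no sequence line")
        else:
            expect_header = not expect_header
    if not expect_header:
        problems.append("end of file: last header has no sequence line")
    return problems


records = [">BGC0000001_AADALKB", "MNAPVHVDQNFEEVINAARSMREIDRKRYLW"]
print(find_format_problems(records))
```

An open file handle can be passed instead of a list, since the function only iterates over lines.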

The following command converts a multi-line FASTA file to single-line FASTA. It breaks the text only after the first line, so it assumes the file contains a single sequence:

sed ':a;N;$!ba;s/\n/\t/g' file_to_correct | sed 's/\t/\n/' | sed ':a;N;$!ba;s/\t//g' > corrected_file
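
For a file holding several records, where every header has to start a new line, the same conversion can be sketched in Python. This is an illustrative alternative rather than part of EvoMining; `linearize_fasta` is a hypothetical name and the sequences in the demo are made up.

```python
from typing import Iterable, Iterator


def linearize_fasta(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and each sequence joined onto a single line."""
    chunks = []  # pieces of the sequence currently being collected
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # drop blank lines
        if line.startswith(">"):
            if chunks:
                yield "".join(chunks)  # flush the previous record's sequence
                chunks = []
            yield line
        else:
            chunks.append(line)
    if chunks:
        yield "".join(chunks)  # flush the last record


multi_line = [">seq1", "MNAP", "VHVD", ">seq2", "MKTL"]
print("\n".join(linearize_fasta(multi_line)))
```

To convert a file, pass an open handle for `file_to_correct` to `linearize_fasta` and write each yielded line, followed by a newline, to `corrected_file`.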

Genomic DB

The Actinobacteria database and RastIds files are available at Zenodo: DOI

Rast Ids
Download the full Actinobacteria genomic DB and add your genome at the end.



antiSMASH DB (optional)

The antiSMASH database contains information about the genes related to secondary metabolism, according to the annotation of the program antiSMASH (https://antismash.secondarymetabolites.org/#!/start). Using this database may help your analyses, since BGCs predicted by this tool will be colored in the final tree of the EvoMining pipeline.

To construct this database, first annotate your genomes with antiSMASH. Either the web or the local version of the program can be used. The following steps have been tested with the output of antiSMASH version 5; the output of earlier antiSMASH versions may need some extra steps.

The input for antiSMASH should be the genome in .gbk format, with the file named after its Job ID (e.g. 81925.gbk). antiSMASH generates an output folder for each genome containing all the predicted BGCs with their annotations. In the web version, you can download this folder with the “Download” option at the top of the webpage.

Once all the output folders are in the same directory, run the script antiSMASH_DB.pl as follows:

perl antiSMASH_DB.pl > antiSMASH_DB

This generates a file called antiSMASH_DB, a tab-delimited file with four columns. The first column contains the RAST Job IDs of the genomes. The second column contains the genome ID followed by the gene (peg) ID (e.g. 6666666.285189.145, where “145” is the ID of the gene). The third column gives the type of BGC predicted. The fourth column contains the names of the BGC .gbk files produced by antiSMASH.
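
The four-column layout described above can be read back with a short Python sketch, for example to inspect the file before passing it to EvoMining. `AntismashRow` and `read_antismash_db` are hypothetical helpers; in the sample row, the BGC type `NRPS` and the file name `region001.gbk` are invented for illustration, while the Job ID and gene ID come from the examples in the text.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class AntismashRow(NamedTuple):
    job_id: str
    genome_id: str
    peg_id: str
    bgc_type: str
    bgc_file: str


def read_antismash_db(handle: TextIO) -> Iterator[AntismashRow]:
    """Parse the tab-delimited antiSMASH_DB file, one row per gene."""
    for number, raw in enumerate(handle, start=1):
        line = raw.rstrip("\n")
        if not line:
            continue  # skip empty lines
        fields = line.split("\t")
        if len(fields) != 4:
            raise ValueError(f"line {number}: expected 4 columns, got {len(fields)}")
        job_id, gene_id, bgc_type, bgc_file = fields
        # Column 2 is the genome ID followed by the peg ID, separated by the last dot.
        genome_id, _, peg_id = gene_id.rpartition(".")
        yield AntismashRow(job_id, genome_id, peg_id, bgc_type, bgc_file)


# Hypothetical sample row; real rows come from the antiSMASH_DB file.
sample = io.StringIO("81925\t6666666.285189.145\tNRPS\tregion001.gbk\n")
for row in read_antismash_db(sample):
    print(row)
```

With a real file, replace the `io.StringIO` sample with `open("antiSMASH_DB")`.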

antiSMASH-DB

Place this file in the working directory and execute EvoMining with the flag -a antiSMASH_DB