Databases Conformation - nselem/evomining GitHub Wiki
EvoMining databases conformation
This tutorial explains how to build the main EvoMining databases:
- Central DB
- Natural products DB
- Genomic DB
- AntiSMASH DB (optional)
EvoMining takes three inputs: (1) a custom genomic database (genomic-DB), (2) a central pathways database (central-DB), and (3) a natural products database (natural-DB) composed of genes that belong to experimentally tested biosynthetic gene clusters (BGCs). All three databases are provided and can be modified, replaced, or expanded by the user. The genomic-DB is a collection of genomes in RAST format from taxonomically related organisms. The current central-DB contains previously curated central pathways from Actinobacteria (Barona-Gómez et al. 2012; Cruz-Morales et al. 2016). The current natural-DB comprises all sequences that belong to BGCs in the Minimum Information about a Biosynthetic Gene cluster (MIBiG) repository (Medema et al. 2015).
Central DB
Headers in the central-DB have the following form:
SUBSYSTEM|Family number|Function_querynumber|Organism or comment
SUBSYSTEM
Subsystem refers to the metabolic subsystem. Each subsystem may contain many family numbers.
Family number
Defines the columns of the heatplot. Each family may have many queries, but only one hit-query relationship is shown on the heatplot, i.e. only non-redundant hits are considered.
Function_querynumber
Function describes the function of the family.
querynumber is a unique consecutive identifier.
Organism or comment
Organism name or comment.
Example
3PGA_AMINOACIDS|1|Phosphoglyceratedehydrogenase_1|Cglu
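As an illustration of the header layout, the following shell sketch splits a header into its fields. The function name `parse_central_header` and the output labels are hypothetical and not part of EvoMining. The sketch also assumes that the query number is whatever follows the last underscore of the third field.

```shell
# Hypothetical helper (not part of EvoMining): split a central-DB header
# of the form SUBSYSTEM|Family number|Function_querynumber|Organism or comment.
parse_central_header() {
  printf '%s\n' "$1" | awk -F'|' '
    NF != 4 {
      printf("expected 4 fields, found %d\n", NF) > "/dev/stderr"
      exit 1
    }
    {
      sub(/^>/, "", $1)              # tolerate a leading FASTA ">"
      # Assumption: the query number follows the last underscore.
      n = split($3, parts, "_")
      query = parts[n]
      fname = substr($3, 1, length($3) - length(query) - 1)
      printf "subsystem=%s\nfamily=%s\nfunction=%s\nquery=%s\norganism=%s\n", $1, $2, fname, query, $4
    }'
}

# Usage:
#   parse_central_header '3PGA_AMINOACIDS|1|Phosphoglyceratedehydrogenase_1|Cglu'
```

The function returns a non-zero status when the header does not have exactly four pipe-separated fields, which makes it easy to screen a list of headers before adding them to a central-DB.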
A curated Actinobacteria central-DB is available with EvoMining. It contains enzymes from 9 metabolic subsystems.
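Get the Actinobacteria central-DB with:
wget https://github.com/nselem/EvoMining/blob/master/databases/central-DB-Actinobacteria
Note that this is a GitHub blob URL, so wget may save the HTML page for that path rather than the database itself. In that case, download the database from the repository page or clone the repository.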
Natural Products
The natural-DB currently uses the MIBiG database: http://mibig.secondarymetabolites.org/
natural-DB
The natural-DB contains enzymes that belong to the biosynthetic pathways of interest. It must be a FASTA file in which each record has a special header followed by the protein sequence on a single line, as in the example below:
>BGC0000001_AADALKB
MNAPVHVDQNFEEVINAARSMREIDRKRYLWMISPALPVIGIGILAGYQFSPRPIKKIFALGGPIVLHIIIPVIDTIIGKDASNPTSEEIKQLENDPY
A multi-line FASTA file, with any number of sequences, can be converted to single-line FASTA with the following command:
awk '/^>/ { if (seq != "") print seq; print; seq = ""; next } { seq = seq $0 } END { if (seq != "") print seq }' file_to_correct > corrected_file
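Before running EvoMining, it can be useful to confirm that a natural-DB file follows the layout of the example above, with each header followed by exactly one sequence line. The sketch below is one way to check this. The function name `check_single_line_fasta` is hypothetical, and the check covers only the line layout, not the header naming scheme or the validity of the amino acid sequence.

```shell
# Hypothetical helper (not part of EvoMining): report FASTA records that
# are not laid out as one header line followed by one sequence line.
check_single_line_fasta() {
  awk '
    function report() {
      # Flag the previous record if it did not have exactly one sequence line.
      if (header != "" && nseq != 1) {
        printf "%s: %d sequence lines\n", header, nseq
        bad++
      }
    }
    /^>/ { report(); header = $0; nseq = 0; next }
    NF {
      # Non-empty line that is not a header: count it as sequence.
      if (header == "") {
        printf "line %d: sequence before first header\n", NR
        bad++
      } else nseq++
    }
    END { report(); exit (bad > 0) }
  ' "$1"
}

# Usage:
#   check_single_line_fasta corrected_file && echo "layout OK"
```

The function prints one line per offending record and returns a non-zero status if any are found, so it can be used as a guard in a larger script.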
Genomic DB
The Actinobacteria database and RastIds files are available at Zenodo:
Rast Ids
Download the full Actinobacteria genomic DB and add your genome at the end.
antiSMASH DB (optional)
The antiSMASH database contains information about genes related to secondary metabolism, as annotated by the program antiSMASH (https://antismash.secondarymetabolites.org/#!/start). Using this database can help your analyses, because BGCs predicted by antiSMASH are colored in the final tree of the EvoMining pipeline.
To build this database, first annotate your genomes with antiSMASH. Either the web version or the local version of the program can be used. The following steps have been tested with the output of antiSMASH version 5. The output of earlier antiSMASH versions may need some extra steps.
The input for antiSMASH should be the genome in .gbk format, with the file named after its Job ID (e.g. 81925.gbk). antiSMASH generates an output folder for each genome containing all the predicted BGCs with their annotations. In the web version, you can download this folder with the “Download” option at the top of the page.
Once you have all the output folders within the same directory, run the script antiSMASH_DB.pl as follows:
perl antiSMASH_DB.pl > antiSMASH_DB
This generates a file called antiSMASH_DB, a tab-delimited file with four columns:
- Column 1: the RAST Job ID of the genome.
- Column 2: the genome ID followed by the gene (peg) ID (e.g. 6666666.285189.145, where “145” is the ID of the gene).
- Column 3: the type of BGC predicted.
- Column 4: the name of the BGC .gbk file produced by antiSMASH.
Place this file in the working directory and run EvoMining with the flag -a antiSMASH_DB.
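A quick sanity check of the resulting file is to count how many genes were assigned to each BGC type per genome. The sketch below assumes the four-column, tab-delimited layout described above. The function name `summarize_antismash_db` is hypothetical and not part of EvoMining.

```shell
# Hypothetical helper (not part of EvoMining): count genes per
# RAST Job ID (column 1) and BGC type (column 3) in an antiSMASH_DB file.
summarize_antismash_db() {
  awk -F'\t' '
    NF != 4 {
      # Warn about malformed lines and skip them.
      printf("line %d: expected 4 columns, found %d\n", NR, NF) > "/dev/stderr"
      next
    }
    { count[$1 "\t" $3]++ }
    END {
      for (key in count) printf "%s\t%d\n", key, count[key]
    }
  ' "$1" | LC_ALL=C sort
}

# Usage:
#   summarize_antismash_db antiSMASH_DB
```

Each output line holds the Job ID, the BGC type, and the number of genes, separated by tabs. Lines that do not have four columns are reported on standard error and left out of the counts.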