FunOrder 2 - DB DIY
How to choose proteomes
The choice of proteomes depends on the desired phylogenetic depth. If you are interested in co-evolution within a specific phylum, keep the number of proteomes representing any single genus below 20 (preferably 5-10). The reason is that at most 20 homologous sequences are used to infer the phylogenetic tree, so too many closely related proteomes can crowd out more distant homologs and narrow the view of the protein family's co-evolution. Proteomes should also be chosen for high-quality gene predictions, preferably supported by RNA-seq data (functional annotations are not necessary), with attention to phylogenetic distance and to how well the set represents the tree of life. The proteomes must be provided as protein FASTA files.
See our publication for further details on how to choose proteomes:
Vignolle GA, Mach RL, Mach-Aigner AR, Derntl C (2022) FunOrder 2.0 – a fully automated method for the identification of co-evolved genes. bioRxiv 2022.01.10.475597. doi: https://doi.org/10.1101/2022.01.10.475597
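To keep an eye on the per-genus limit described above, it can help to tally how many proteome files each genus contributes. The following is a minimal sketch, assuming (hypothetically) that every file is named <Genus>_<species>.fasta; the helper name genus_counts and the example file names are illustrative and not part of FunOrder.

```shell
#!/bin/sh
# Count proteome files per genus.
# Assumption: files are named <Genus>_<species>.fasta (hypothetical convention).
genus_counts() {
    # $1: directory containing the proteome FASTA files
    for f in "$1"/*.fasta; do
        [ -e "$f" ] || continue          # skip if the glob matched nothing
        base=${f##*/}                    # strip the directory part
        printf '%s\n' "${base%%_*}"      # keep everything before the first "_"
    done | sort | uniq -c | awk '{ print $2 "\t" $1 }'
}

# Demo on empty placeholder files in a scratch directory.
tmpdir=$(mktemp -d)
touch "$tmpdir/Trichoderma_reesei.fasta" \
      "$tmpdir/Trichoderma_virens.fasta" \
      "$tmpdir/Aspergillus_niger.fasta"
counts=$(genus_counts "$tmpdir")
printf '%s\n' "$counts"
rm -rf "$tmpdir"
```

Each output line holds a genus name and its file count, separated by a tab. Any genus with a count approaching 20 is a candidate for thinning out.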
How to prepare proteomes
After choosing the set of proteomes (in FASTA format), they have to be prepared before they can be combined into a database compatible with FunOrder 2.0. The first step is to give each proteome a unique identifier while keeping the header of every amino acid sequence unique. This can be achieved with the following command:
gawk -i inplace '/^>/{gsub(/^>/,">YourIdentifier_"++i" ");}1' proteome_01.fasta
This command inserts "YourIdentifier_1 " after the ">" of the first header in the FASTA file, "YourIdentifier_2 " after the ">" of the second header, and so on. The space after the inserted text is important for the database creation. Change only "YourIdentifier", and use a different identifier for each chosen proteome. The next step is to remove duplicates within each proteome separately. This can be done with a Perl script provided with the FunOrder tool:
perl ~/funorder_2.0/removeduplicates.pl proteome_01.fasta
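To see what the header-renaming step does before running it on real data, it can be tried on a throwaway file. The sketch below uses a portable awk variant that writes to a new file instead of editing in place; the identifier "Tri1", the toy headers, and the toy sequences are all made up for illustration.

```shell
#!/bin/sh
# Toy demonstration of the header-renaming step.
# "Tri1" is a hypothetical identifier; headers and sequences are invented.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy.fasta" <<'EOF'
>sp|P1|first protein
MKTAYIAK
>sp|P2|second protein
MSLLTEVE
EOF

# Same idea as the gawk one-liner, but the result goes to a new file:
# every header line gets ">Tri1_<running number> " in place of its leading ">".
awk -v id="Tri1" '/^>/ { sub(/^>/, ">" id "_" (++i) " ") } 1' \
    "$tmpdir/toy.fasta" > "$tmpdir/toy.renamed.fasta"

renamed_headers=$(grep '^>' "$tmpdir/toy.renamed.fasta")
printf '%s\n' "$renamed_headers"
rm -rf "$tmpdir"
```

The two printed headers start with ">Tri1_1 " and ">Tri1_2 ", followed by the original header text. Sequence lines are passed through unchanged.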
How to combine proteomes
After this step, all proteomes can be concatenated into a single file with a simple command:
cat *proteome* > new_db.fasta
Then all "|" characters in new_db.fasta have to be replaced with "_". This can be done with the following command:
sed -i 's,|,_,g' new_db.fasta
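Before formatting the database, a quick sanity check of the concatenated file can catch two common problems: sequence IDs that occur more than once (for example, because the same identifier was reused for two proteomes) and leftover "|" characters. The following is a sketch only; the function names duplicate_ids and pipe_lines are hypothetical, and the demo runs on an invented toy file rather than a real database.

```shell
#!/bin/sh
# Sanity checks for a concatenated database (hypothetical helper functions).

# Number of sequence IDs (first word of a header) that occur more than once.
duplicate_ids() {
    grep '^>' "$1" | awk '{ print $1 }' | sort | uniq -d | wc -l | tr -d ' '
}

# Number of lines that still contain a literal "|" character.
pipe_lines() {
    grep -cF '|' "$1"
}

# Demo on a toy file: ID "A_1" is used twice and one header still has a "|".
tmpdir=$(mktemp -d)
printf '>A_1 x\nMK\n>A_2 y\nML\n>A_1 z|w\nMM\n' > "$tmpdir/toy_db.fasta"
dup=$(duplicate_ids "$tmpdir/toy_db.fasta")
pipes=$(pipe_lines "$tmpdir/toy_db.fasta")
echo "duplicate IDs: $dup, lines with |: $pipes"
rm -rf "$tmpdir"
```

For a correctly prepared new_db.fasta, both numbers should be 0. On the toy file, both are 1.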
How to make a working database
At this point, the database has to be formatted for BLAST and DIAMOND. First move the database to the folder /funorder_2.0/db, then make a BLAST database with sequence identifiers:
makeblastdb -in new_db.fasta -dbtype prot -parse_seqids
Then make a DIAMOND database with the same name as the FASTA file used:
diamond makedb --in new_db.fasta --db new_db
It is important to place your new_db in the /funorder_2.0/db folder, because FunOrder looks for databases there by default.
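After both formatting commands have run, it can be worth confirming that the expected files exist in the database folder. The sketch below assumes the commands were run exactly as shown above, so that the BLAST files are named after the input (new_db.fasta.pin, or new_db.fasta.pal for a multi-volume database) and DIAMOND writes new_db.dmnd. The helper check_db is hypothetical and not part of FunOrder; the demo uses placeholder files in a scratch directory.

```shell
#!/bin/sh
# Report whether the expected BLAST and DIAMOND files exist for a database.
# check_db is a hypothetical helper: check_db <folder> <database name>
check_db() {
    dir=$1; name=$2; status=0
    # makeblastdb without -out names its files after the input FASTA file.
    if [ -e "$dir/$name.fasta.pin" ] || [ -e "$dir/$name.fasta.pal" ]; then
        echo "BLAST index: found"
    else
        echo "BLAST index: missing"; status=1
    fi
    # diamond makedb --db new_db writes new_db.dmnd
    if [ -e "$dir/$name.dmnd" ]; then
        echo "DIAMOND database: found"
    else
        echo "DIAMOND database: missing"; status=1
    fi
    return $status
}

# Demo: only a placeholder BLAST index file is present.
tmpdir=$(mktemp -d)
touch "$tmpdir/new_db.fasta.pin"
if report=$(check_db "$tmpdir" new_db); then result=0; else result=1; fi
printf '%s\n' "$report"
rm -rf "$tmpdir"
```

In the demo the DIAMOND file is deliberately absent, so the helper reports it as missing and returns a non-zero status. For a real check, pass the actual database folder and the database name, for example check_db /funorder_2.0/db new_db.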
How to test your database
Now you can run FunOrder 2.0 with your new database:
sh funorder.sh 20 genbank_file.gbk /home/exact/path/to/outputfolder new_db
To test your database for your specific needs, you should provide a set of known, expected results backed by the respective literature. If this is not possible, take extra care when choosing the proteomes for the database. We provide a set of known, analyzed secondary metabolite biosynthetic gene clusters (BGCs) from ascomycetes, different sets of primary metabolism proteins, negative control gene clusters (GCs), and sequential genes from random locations, also from ascomycetes.
See our publications for further details on how to test your database:
Vignolle GA, Mach RL, Mach-Aigner AR, Derntl C (2022) FunOrder 2.0 – a fully automated method for the identification of co-evolved genes. bioRxiv 2022.01.10.475597. doi: https://doi.org/10.1101/2022.01.10.475597
Vignolle GA, Schaffer D, Zehetner L, Mach RL, Mach-Aigner AR, Derntl C (2021) FunOrder: A robust and semi-automated method for the identification of essential biosynthetic genes through computational molecular co-evolution. PLoS Comput Biol 17(9): e1009372. doi: https://doi.org/10.1371/journal.pcbi.1009372