
FunOrder 2 - DB DIY

How to choose proteomes

The choice of proteomes depends on the phylogenetic depth desired. If one is interested in co-evolution within a specific phylum, the number of proteomes representing a single genus should be kept below 20 (preferably 5-10). The reason is that at most 20 homologous sequences are used for inferring the phylogenetic tree, so too many closely related proteomes can crowd out more distant homologs and interfere with a broader view of protein family co-evolution. Further, proteomes should be chosen for high-quality gene predictions, preferably supported by RNA-seq data (functional annotations are not necessary), and with regard to phylogenetic distance and how well the tree of life is represented. The proteomes need to be provided as protein FASTA files.

See our publications for further details on how to choose proteomes:

Vignolle GA, Mach RL, Mach-Aigner AR, Derntl C (2022) FunOrder 2.0 – a fully automated method for the identification of co-evolved genes. bioRxiv 2022.01.10.475597. doi: https://doi.org/10.1101/2022.01.10.475597

How to prepare proteomes

After choosing the set of proteomes (in FASTA format), they have to be prepared before they can be combined into a database compatible with FunOrder 2.0. The first step is to give each proteome a unique identifier while keeping the header of each amino acid sequence unique. This can be achieved with the following command:

gawk -i inplace '/^>/{gsub(/^>/,">YourIdentifier_"++i" ");}1' proteome_01.fasta

This command inserts “YourIdentifier_1 ” after the first “>” in the FASTA file, “YourIdentifier_2 ” after the second “>”, and so on. The space after the insert is important for the database creation. Change only “YourIdentifier”, and use a different identifier for each chosen proteome. The next step is to remove duplicates within each proteome separately, using a Perl script provided with the FunOrder tool:

perl ~/funorder_2.0/removeduplicates.pl proteome_01.fasta
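
When many proteomes have to be prepared, the tagging step above can be wrapped in a small helper and applied to every file in turn. The following is a minimal sketch, not part of FunOrder: the function names `tag_proteome` and `tag_all` are hypothetical, and it uses plain awk with a temporary file instead of `gawk -i inplace` so that it also runs where gawk is not installed. It assumes that the file name without `.fasta` is an acceptable identifier.

```shell
# Sketch only: tag_proteome and tag_all are hypothetical helpers, not part of FunOrder.

# Prefix every header of one protein FASTA file with "<id>_<n> ".
tag_proteome() {
    id=$1      # unique identifier for this proteome (avoid spaces, "&" and "\")
    file=$2    # protein FASTA file, rewritten in place
    tmp=$(mktemp) || return 1
    # On each header line, replace the leading ">" with ">id_n ",
    # where n counts the headers seen so far in this file.
    awk -v id="$id" '/^>/ { sub(/^>/, ">" id "_" (++n) " ") } 1' "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    rc=$?
    rm -f "$tmp"
    return $rc
}

# Tag each file given as argument, using its name without ".fasta" as identifier
# (an assumption; any scheme works as long as identifiers differ between proteomes).
tag_all() {
    for f in "$@"; do
        tag_proteome "$(basename "$f" .fasta)" "$f" || return 1
        # Duplicate removal would follow here, e.g.:
        # perl ~/funorder_2.0/removeduplicates.pl "$f"
    done
}
```

Calling `tag_all proteome_*.fasta` in the folder that holds the proteomes tags all of them in one go. The helper should be run only once per file, since a second run would add a second prefix to every header.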

How to combine proteomes

After this step, all proteomes can be concatenated into a single file with a simple command:

cat *proteome* > new_db.fasta

Then every “|” in new_db.fasta has to be replaced with “_”. This can be done with the following command:

sed -i 's,|,_,g' new_db.fasta
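
Before formatting the database, it can be worth checking that the combined file looks as intended. The sketch below is a hypothetical helper (`check_db_fasta` is not part of FunOrder) that reports any remaining “|” characters and any sequence identifier that occurs more than once. Duplicate identifiers are likely to cause problems when the BLAST database is built with `-parse_seqids`, and they usually mean that the same “YourIdentifier” was used for two proteomes.

```shell
# Sketch only: check_db_fasta is a hypothetical helper, not part of FunOrder.
check_db_fasta() {
    db=$1
    # Count lines that still contain a literal "|" character.
    pipes=$(grep -cF '|' "$db" || true)
    # Collect identifiers (first word of each header, without ">")
    # that occur more than once in the file.
    dups=$(awk '/^>/ { print substr($1, 2) }' "$db" | sort | uniq -d)
    if [ "$pipes" -ne 0 ]; then
        echo "lines still containing '|': $pipes"
        return 1
    fi
    if [ -n "$dups" ]; then
        echo "duplicate identifiers:"
        echo "$dups"
        return 1
    fi
    # Everything looks fine: report the number of sequences.
    echo "OK: $(grep -c '^>' "$db") sequences"
}
```

Running `check_db_fasta new_db.fasta` prints a short report and returns a non-zero exit status if one of the two checks fails.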

How to make a working database

At this point, the database has to be formatted for BLAST and DIAMOND. First, move the database to the folder /funorder_2.0/db. Then make a BLAST database with sequence identifiers:

makeblastdb -in new_db.fasta -dbtype prot -parse_seqids

Then make a DIAMOND database with the same name as the FASTA file used:

diamond makedb --in new_db.fasta --db new_db

It is important to place your new_db in the /funorder_2.0/db folder, because FunOrder searches for the database there by default.
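
A quick way to confirm that both formatting steps have run is to look for the files they write. The sketch below is a hypothetical helper (`check_db_files` is not part of FunOrder). It rests on two assumptions about the tools' default output names: that `diamond makedb --db new_db` writes `new_db.dmnd`, and that `makeblastdb -in new_db.fasta -dbtype prot` writes protein index files ending in `.pin` next to the FASTA file, possibly split into numbered volumes for large databases.

```shell
# Sketch only: check_db_files is a hypothetical helper, not part of FunOrder.
check_db_files() {
    dir=$1     # database folder, e.g. the db folder of the FunOrder installation
    name=$2    # database name without the .fasta extension, e.g. new_db
    status=0
    # The combined FASTA file itself.
    [ -f "$dir/$name.fasta" ] || { echo "missing: $name.fasta"; status=1; }
    # DIAMOND database (assumed default file name).
    [ -f "$dir/$name.dmnd" ] || { echo "missing: $name.dmnd"; status=1; }
    # BLAST protein index (assumed name; the glob also matches volume files).
    ls "$dir/$name.fasta"*.pin >/dev/null 2>&1 ||
        { echo "missing: BLAST index for $name.fasta"; status=1; }
    return $status
}
```

For example, `check_db_files ~/funorder_2.0/db new_db` prints nothing and returns 0 when all three kinds of files are present.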

How to test your database

Now you can run FunOrder 2.0 with your new database:

sh funorder.sh 20 genbank_file.gbk /home/exact/path/to/outputfolder new_db

To test your database for your specific needs, you should have a set of known, expected results backed by the respective literature. If this is not possible, extra care must be taken when choosing the proteomes for the database. We provide a set of previously analyzed secondary metabolite biosynthetic gene clusters (BGCs) from ascomycetes, different sets of primary metabolism proteins, negative control gene clusters (GCs), and sequential genes from random locations, also from ascomycetes.

See our publications for further details on how to test your database:

Vignolle GA, Schaffer D, Zehetner L, Mach RL, Mach-Aigner AR, Derntl C (2021) FunOrder: A robust and semi-automated method for the identification of essential biosynthetic genes through computational molecular co-evolution. PLoS Comput Biol 17(9): e1009372. doi: https://doi.org/10.1371/journal.pcbi.1009372