Building your own model - Gardner-BinfLab/invasive_salmonella GitHub Wiki
All of the code you will need to achieve this is located in the build_random_forest directory.
To build a model, you need FASTA files of protein coding genes for each of your isolates of interest, and orthology relationships between the proteins from your isolates of interest (we have designed the model building script to take output from Roary).
To generate the FASTA files for a large number of isolates, we have provided a script (vcfandgff_to_fasta.pl) which can take a VCF file for each isolate and the GFF file for the reference genome it was mapped to for SNP calling, and produce edited protein sequences for analysis. Usage below:
./vcfandgff_to_fasta.pl <vcf file> <gff file> <prefix for FASTA files>
It is important to make sure that any sequences with internal stop codons are trimmed, because HMMER will ignore the stop codon and keep scoring residues past this point as if they are translated. We have included a script to do this:
./trimtruncations.pl <FASTA file>
You will need to run an hmmsearch over each of your proteome files and put each output file in a folder called "search".
We recommend using an HMM database from eggNOG (http://eggnogdb.embl.de/) appropriate for your taxonomic group, e.g. we use the gammaproteobacterial models for our Salmonella analysis. Download the full set of HMMs from the website then concatenate all .hmm files into a single hmm database (in this case called eggNOG.hmm):
cat xNOG.hmm/*.hmm > eggNOG.hmm
hmmpress eggNOG.hmm
mkdir search
# for each isolate:
hmmsearch --domtblout search/<strain_id>.search eggNOG.hmm <FASTA file>.trim
Next, you will need to use your hmmsearch output and Roary ortholog calls to put together a table of bitscores for each of your orthologous groups:
./parse_bitscores.pl <Roary gene_presence_absence.csv file path> search
Once you have your table of bitscores, you're ready to run the build_rf.Rmd R notebook. It has instructions about how to run each step.
There is some test data already provided in the build_random_forest directory to help you check that your analysis is running as expected. This dataset is similar to that used in our original analysis, but uses Roary to call orthologs rather than the curated ortholog list we used.
Note: in our model building workflow we set missing values in our dataset (corresponding to orthologous genes that weren't identified for a given genome in the training dataset) as zeroes, corresponding to a complete absence of that gene. If your genomes are of low quality, this is not advisable. Instead, we suggest you impute missing values (the na.roughfix() function from the caret package is a fast option for this). If there are a large number of genuine gene deletions in your dataset that are ignored, this will impact the ability of your model to identify the most severely degraded genes in your dataset, so impute with caution.