Function: gbToIMG - g-e-kenney/prettyClusters GitHub Wiki

`gbToIMG`

This accessory function replaces generateNeighbors. It takes a list of genes of interest and a directory of GenBank files with paired .faa files of amino acid sequences and makes fake IMG-formatted metadata files for them, with the goal of making non-IMG files more compatible with the rest of this toolset. (This includes the generation of faux IMG-style gene_oids, to avoid the heterogeneity of locus_tag numbering.)

Files from the ENA (note: must be converted from EMBL to GenBank) or the NCBI databases often contain less information on the annotation sources used to assign genes, so expectations should be tempered accordingly, and the same goes for some common annotation tools such as RAST. Supplementing the annotation with the incorpIprScan accessory tool is recommended if you have access to a local or cluster-based InterProScan install. (antiSMASH-generated .gbk files will provide protein family info, but depending on how you generated them, you will want to check that the locus_tags/protein_ids match whatever you used to find the original sequences.)

After running, check the metadata - it may require some tidying. Common issues: mangling/loss of strain names (will make things confusing when analyzing data), occasionally column alignment issues (esp. if you're concatenating output with direct-from-IMG metadata - they sometimes add new columns), etc.
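One way to catch the column alignment issue before concatenating is to compare column names first. The following is a minimal sketch using toy data frames: only gene_oid comes from the description above, and the other column names and values are made up for illustration. Real tables would be read from the gbToIMG output and from your IMG export.

```r
# Toy stand-ins for real metadata tables; column names other than gene_oid
# are hypothetical.
gbMeta <- data.frame(gene_oid = c("30000000001", "30000000002"),
                     genome_name = c("Strain A", "Strain B"),
                     stringsAsFactors = FALSE)
imgMeta <- data.frame(gene_oid = "2500000001",
                      genome_name = "Strain C",
                      extra_column = "x",
                      stringsAsFactors = FALSE)

# Columns present in one table but not in the other
onlyInIMG <- setdiff(names(imgMeta), names(gbMeta))
onlyInGb <- setdiff(names(gbMeta), names(imgMeta))

# Pad each table with NA for the columns it lacks, then stack the rows
# using one shared column order
for (col in onlyInIMG) gbMeta[[col]] <- NA
for (col in onlyInGb) imgMeta[[col]] <- NA
combined <- rbind(imgMeta, gbMeta[, names(imgMeta), drop = FALSE])
```

Inspecting onlyInIMG and onlyInGb before combining shows which columns differ, so a shifted or newly added column is noticed rather than silently misaligned.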

Use of `gbToIMG`

Note: your directory should include paired genomes (.gb/.gbk) and amino acid sequence fasta files (.faa), and you'll need a list of locus_tags or protein_ids for your genes of interest. See the workflow if you are unsure how to prep these.
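A quick check that every GenBank file has its paired .faa file can save a failed run. This is a sketch in base R; findUnpaired is a hypothetical helper, not part of prettyClusters, and it assumes pairs share the same file name apart from the suffix.

```r
# Hypothetical helper: return GenBank files in dataFolder with no paired .faa
findUnpaired <- function(dataFolder) {
  gbFiles <- list.files(dataFolder, pattern = "\\.(gb|gbk)$")
  faaFiles <- list.files(dataFolder, pattern = "\\.faa$")
  # Compare file names with their suffixes removed
  gbBase <- tools::file_path_sans_ext(gbFiles)
  faaBase <- tools::file_path_sans_ext(faaFiles)
  gbFiles[!(gbBase %in% faaBase)]
}

# Demonstration on a throwaway folder with one complete pair and one orphan
demoDir <- file.path(tempdir(), "gbToIMG_pair_check")
dir.create(demoDir, showWarnings = FALSE)
file.create(file.path(demoDir, c("genomeA.gbk", "genomeA.faa", "genomeB.gb")))
unpaired <- findUnpaired(demoDir)
```

Any file names returned need a matching .faa file before the folder is passed as dataFolder.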

Once that's done, a basic run will look like this:

gbToIMGOutput <- gbToIMG(dataFolder = "/user/data/gbfiles", goiListInput = "20210101_genE_goiList.txt", neighborNum = 10, geneName = "genE")

This is a fine default, but make sure you adjust your neighborhood size as desired. I'm not using some of the fancier options here, like specifying starting IDs for the genome, gene, or scaffold (which you might want to do if you plan to combine datasets from separate gbToIMG runs).
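If you do plan to combine runs, a second run might look something like the sketch below. The folder path, the goiList file name, and the particular base values are placeholders chosen for illustration. It also assumes that IDs are numbered upward from the base, so that shifting the base keeps the second run's IDs clear of the first; check the combined output for duplicated IDs rather than relying on that.

```r
# Arguments for a hypothetical second run; paths and file names are placeholders
secondRunArgs <- list(
  dataFolder = "/user/data/gbfiles_run2",
  goiListInput = "20210102_genE_goiList.txt",
  geneName = "genE",
  neighborNum = 10,
  scaffoldGenBase = 31000000000, # default is 30000000000
  genomeGenBase = 41000000000    # default is 40000000000
)

# Only call gbToIMG if it is available in this session
if (exists("gbToIMG")) {
  secondRunOutput <- do.call(gbToIMG, secondRunArgs)
}
```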

Required inputs

dataFolder Folder path. For folder containing all .gb/.gbk files to be analyzed and their paired .faa files (same name, different suffix).
goiListInput Filename. For a text file containing a list of genes of interest by the names they are likely to be identified by in their GenBank files (probably locus_tag). Names are provided on single lines, with a header.
geneName Character string. Name of gene family of interest (purely for file naming).
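As a sketch of the goiListInput format described above (one name per line, below a header line), the file could be written from R as follows. The identifiers are placeholders, and the header text "locus_tag" is an assumption; the description only says that a header is present.

```r
# Placeholder identifiers; replace with the locus_tags or protein_ids
# from your own search
genesOfInterest <- c("ABC_00123", "ABC_04567", "XYZ_00890")

# Written to a temporary folder here; use your own working directory in practice
goiFile <- file.path(tempdir(), "20210101_genE_goiList.txt")

# First line is the header ("locus_tag" is an assumed header name),
# followed by one gene name per line
writeLines(c("locus_tag", genesOfInterest), goiFile)
```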

Advanced options

neighborNum Integer. Number of neighbors to be provided for each gene of interest. Defaults to 10.
removeDupes Boolean. Removes duplicated entries (probably an OK default, unless you have copies of genomes with different annotations that you want evaluated independently). Defaults to TRUE.
scaffoldGenBase Integer. Initial value for generating IMG-style faux scaffold IDs and gene_oids (new IDs are generated for each run currently). Defaults to 30000000000 (an order of magnitude bigger than IMG IDs).
genomeGenBase Integer. Initial value for generating IMG-style faux genome IDs (new IDs are generated for each run currently). Defaults to 40000000000 (an order of magnitude bigger than IMG IDs).
includeIPR Boolean. Specifies whether or not InterPro family information (if any) should be extracted. Defaults to FALSE since it's not yet an IMG default.
seqExtract Boolean. Specifies whether or not you want to take in paired amino acid sequences and generate multisequence .fa files for your genes of interest and for their neighbors with their new gene_oids. Defaults to TRUE since you'll want these for later steps.

Output

20210101_gb2img_genE_geneSeqs.fa File. Fasta-formatted file for the protein sequences of genes of interest with simplified headers containing only the IMG-style gene_oids.
20210101_gb2img_genE_neighborSeqs.fa File. Fasta-formatted file for the protein sequences of neighbors of genes of interest with simplified headers containing only the IMG-style gene_oids.
20210101_gb2img_genE_neighborContext.txt File. Tab-delimited table with three columns: gene_oid (the neighbor gene_oid), source_gene_oid (the gene_oid for which the neighbor was generated) and scaffold_id (the scaffold on which the original gene_oid was found).
20210101_gb2img_genE_geneData.txt File. Tab-delimited metadata table, IMG-styled, for genes of interest.
20210101_gb2img_genE_neighborData.txt File. Tab-delimited metadata table, IMG-styled, for neighbors of genes of interest.
gbToIMGOutput List. Contains gbToIMGOutput$geneData (a data frame of IMG-styled metadata for genes of interest), gbToIMGOutput$neighborData (a data frame of IMG-styled metadata for neighbors of genes of interest), and gbToIMGOutput$neighborsContext (a data frame connecting neighbors to the genes of interest they are associated with, along with their scaffolds).
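As an example of working with the returned list, the context table can be used to tally how many neighbors were recorded for each gene of interest. The sketch below uses a toy data frame with the three columns described above in place of real output. The ID values and the threshold of 3 are made up, and storing the IDs as character strings is an assumption.

```r
# Toy stand-in for gbToIMGOutput$neighborsContext
neighborsContext <- data.frame(
  gene_oid = c("30000000011", "30000000012", "30000000013", "30000000021"),
  source_gene_oid = c("30000000010", "30000000010", "30000000010", "30000000020"),
  scaffold_id = c("30000000001", "30000000001", "30000000001", "30000000002"),
  stringsAsFactors = FALSE
)

# Number of neighbors recorded for each gene of interest
neighborCounts <- table(neighborsContext$source_gene_oid)

# Genes of interest with fewer neighbors than a threshold you choose
shortNeighborhoods <- names(neighborCounts)[as.vector(neighborCounts) < 3]
```

Genes listed in shortNeighborhoods are worth a look before moving on to later steps, since their neighborhoods are smaller than the rest.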