Function: prepNeighbors - g-e-kenney/prettyClusters GitHub Wiki

`prepNeighbors`

prepNeighbors takes input from generateNeighbors or gbToIMG/incorpIprScan, i.e. IMG-formatted metadata files for genes of interest and their neighbors, along with fasta-formatted files for their predicted protein sequences. It's actually a wrapper for two subfunctions:

neighborHypothetical This tool identifies sets of hypothetical proteins that are overrepresented in the genomic neighborhood of the genes of interest. Beyond R packages, this tool uses the NCBI blast+ package and MAFFT, which need to be pre-installed. Note that for identifying subgroups of hypothetical proteins, an all-by-all blast step is used, which can lead to long computation time on large datasets! Tidygraph (with ggraph) and/or pvclust are used identify and visualize possible cutoff points for defining those subgroups of hypothetical proteins. For computational reasons, tidygraph is likely to be quicker for large datasets.
neighborTrim This tool identifies genomic neighborhoods where contig ends result in small neighborhoods. Since truncated neighborhoods can mis-weight genome neighborhood clustering, they can optionally be removed from further analyses. Additionally, ostensible neighbors that aren't actually on the same scaffold as the gene of interest are removed.

Use of `prepNeighbors`

A minimal run, just doing basic quality control and trimming any false neighbors that aren't on the same contig/scaffold as our gene of interest.

prepNeighborsOut <- prepNeighbors(imgGenes = "20210101_img_genE.txt", imgNeighbors = "20210101_img_genE_neighbors.txt", geneSeqs = "20210101_generateNeighbors_genE.fa", neighborSeqs = "20210101_img_genE_neighbors.fa", neighborsContext = "20210101_generateNeighbors_genE_context.txt", geneName = "genE", neighborNumber = 10)

A fancier one, in which hypothetical proteins and all peptides (<175 aa) are identified (via an all-by-all blast & some basic clustering/network analysis) and annotated, MAFFT alignments of hypothetical protein subgroups are produced, and truncated contigs (where the number of neighbors is incomplete) are removed. Also specified: the filesystem (to get BLAST & MAFFT to run right) & the number of processor threads to be used (to speed up the run.)

prepNeighborsOut <- prepNeighbors(imgGenes = "20210101_img_genE.txt", imgNeighbors = "20210101_img_genE_neighbors.txt", geneSeqs = "20210101_generateNeighbors_genE.fa", neighborSeqs = "20210101_img_genE_neighbors.fa", neighborsContext = "20210101_generateNeighbors_genE_context.txt", geneName = "genE", neighborNumber = 10, sysTerm = "nix", hypoAnalysis = TRUE, clustMethod = "tidygraph", numThreads = 7, trimShortClusters = TRUE, pepScreen=TRUE, pepMax = 175, alnClust=TRUE)

Additional advanced settings are available!

Required options

These must be provided for all runs.

imgGenes Filename. IMG-formatted metadata file for genes of interest. Required.
imgNeighbors Filename. IMG-formatted metadata file for neighbors of genes of interest. Required.
geneSeqs Filename. FASTA-formatted sequence file for genes of interest. Required. The version of this file created by generateNeighbors trims sequence names down to the gene_oid and can be used for input - annotations in FASTA names occasionally produce errors.
neighborSeqs Filename. FASTA-formatted sequence file for neighbors of genes of interest. Required.
neighborsContext Filename. File generated in generateNeighbors that ties the metadata of the neighbors to the gene_oids and scaffold IDs of the original genes of interest. Required.
geneName Text. the name of your gene family of interest - used for filenames and figure labels. Required.
neighborNum Integer. The number of neighboring genes on either side of the gene of interest that will be investigated. Required.

Advanced options

efiRepnodes Boolean. Just used for filename generation. Defaults to FALSE.

Options related to trimming clusters on truncated contigs

trimShortClusters Boolean. Signals whether or not to remove gene clusters where a truncated contig interrupts the cluster and a sub-sized cluster remains. (Recommended when running further analyses and not just generating cluster diagrams.) Defaults to FALSE.

Options related to identifying hypothetical or custom protein families

hypoAnalysis Boolean. Signals whether or not subsets of similar hypothetical proteins will be identified. Defaults to FALSE.
sysTerm Text. Used to identify the sort of terminal (current options: "wsl" for the Linux subsystem on Windows and "nix" for generic Linux, MacOS, etc.) Used to run blast and mafft commands. Required for protein family analysis. Defaults to "nix".
numThreads Integer. Number of threads to use during analysis. Maximum value depends on your computer (bigger is better). Defaults to 1.
neighborThreshold Number. (0-1.00). Fraction of gene clusters in which a protein family must occur to be included in further analyses. Broadly, used to determine how widely the net for subsets of hypothetical proteins is cast. Defaults to 0.025 (2.5%).
clustMethod Text. The name of the package to be used to identify clusters of similar hypothetical proteins. Tidygraph and pvclust are the current options, with the former currently suggested. Defaults to "tidygraph".
pidCutoff Number. (0-100). BLAST %ID cutoff to be used for edge generation when analyzing protein families using tidygraph. Defaults to 35 (35% ID).
pepScreen Boolean. Specifies whether or not to assign provisional families to all proteins below a size cutoff regardless of annotation status. Defaults to FALSE.
pepMax Number. Specifies the size cutoff (aa) for pepScreen - increase with caution on large datasets, since this uses an all-by-all blast. Defaults to 150.
matchLength Number (0-1). BLAST matches need to above this fraction of of the length of the smaller seq. Defaults to .65; can be appropriate to use a lower cutoff when working with things like RiPP precursor peptides, where some regions are expected to have significant sequence diversity.
alnClust Boolean. Specifies whether or not to make MAFFT multiple sequence alignments for hypothetical protein or peptide families. Defaults to FALSE.
hmmClust Boolean. NOT IMPLEMENTED YET. Specifies whether or not to make an HMM for protein or peptide families using a randomly selected subset of sequences. Defaults to FALSE.
alphaVal Number. (0.0-1.0). Alpha value cutoff for analyzing pvClust data (likely to be deprecated). Defaults to 0.95.
bootStrap Number. Bootstrap number for pvClust (likely to be deprecated). Can be fairly computationally intensive: 100 is a good starting point for small datasets (e.g. 100 genes of interest, 5 genes on either size), and 10 for larger (500 genes of interest or more, 10+ genes in either direction.) Defaults to 10.

Output for next function

20210101_neighborTrim_genE_genes.txt File. IMG-formatted tab-delimited metadata text file of genes of interest with genes from short contigs removed.
20210101_neighborTrim_genE_neighbors.txt File. IMG-formatted tab-delimited metadata text file of neighbors of genes of interest with genes from short contigs or from incorrect scaffolds. Contains Hypofam annotations if neighborHypothetical was run.
20210101_neighborTrim_genE_neighborsContext.txt File. Tab-delimited list of IMG gene_oids for neighbors, their source genes, and the scaffolds of the source genes, with neighbors from short contigs or on incorrect scaffolds removed.
20210101_neighborTrim_genE_geneSeqs.fa File. FASTA-formatted set of amino acid sequences, trimmed to match the updated gene list.
20210101_neighborTrim_genE_neighborSeqs.fa File. FASTA-formatted set of amino acid sequences, trimmed to match the updated neighbor list.

Additional output

When hypothetical protein families are assigned by tidygraph

20210101_neighborHypothetical_genE_hypoSeqs.fa File. FASTA file containing amino acid sequences of all putative hypothetical proteins
20210101_neighborHypothetical_genE_blastFile.txt File. All-by-all BLAST results for hypothetical sequences in a tab-delimited format
20210101_neighborHypothetical_genE_clusterList.txt File. Tab-delimited text file with two columns listing the gene_oid of every hypothetical protein and its cluster group (entered in the annotation as a "Hypofam" family in a "Hypofam" column.)
20210101_neighborHypothetical_genE_networkFull_xx.pdf File. "xx" is the percent ID cutoff. Vector image (PDF) of full network, with the gene of interest highlighted in red (if they are unannotated)
20210101_neighborHypothetical_genE_networkClusters_xx.pdf File. Vector image of tidygraph network at the percent ID cutoff of "xx" with small clusters and unclustered nodes removed.
20210101_neighborHypothetical_genE_tgCluster_x.fa File. In hypoClusters subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number.
20210101_neighborHypothetical_genE_tgCluster_x_mafft.fa File. In hypoClusters subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.

When peptide families are assigned using tidygraph

20210101_neighborHypothetical_genE_pepSeqs.fa File. FASTA file containing amino acid sequences of all putative hypothetical proteins
20210101_neighborHypothetical_genE_pepBlast.txt File. All-by-all BLAST results for hypothetical sequences in a tab-delimited format
20210101_neighborHypothetical_genE_pepClusterList.txt File. Tab-delimited text file with two columns listing the gene_oid of every hypothetical protein and its cluster group (entered in the annotation as a "Pepfam" family in the "Hypofam" column.)
20210101_neighborHypothetical_genE_pepNetworkFull_xx.pdf File. "xx" is the percent ID cutoff. Vector image (PDF) of full network, with the gene of interest highlighted in red (if they are unannotated)
20210101_neighborHypothetical_genE_pepNetworkClusters_xx.pdf File. Vector image of tidygraph network at the percent ID cutoff of "xx" with small clusters and unclustered nodes removed.
20210101_neighborHypothetical_genE_tgCluster_x.fa File. In pepClusters subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number.
20210101_neighborHypothetical_genE_tgCluster_x_mafft.fa File. In pepClusters subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.

When pvclust is used to assign families

Note: likely to be deprecated.

20210101_neighborHypothetical_genE_pvClustFile.txt File. Cluster information in text file format.
20210101_neighborHypothetical_genE_pvClustTree.pdf File. Image of pvclust tree at final alpha value cutoff, as a .pdf file.
20210101_neighborHypothetical_genE_pvCluster_x.fa File. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number.
20210101_neighborHypothetical_genE_pvCluster_x_mafft.fa File. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.