Function: prepNeighbors - g-e-kenney/prettyClusters GitHub Wiki
prepNeighbors
prepNeighbors takes input from generateNeighbors or gbToIMG/incorpIprScan, i.e. IMG-formatted metadata files for genes of interest and their neighbors, along with fasta-formatted files for their predicted protein sequences. It's actually a wrapper for two subfunctions:
neighborHypotheticalThis tool identifies sets of hypothetical proteins that are overrepresented in the genomic neighborhood of the genes of interest. Beyond R packages, this tool uses the NCBI blast+ package and MAFFT, which need to be pre-installed. Note that for identifying subgroups of hypothetical proteins, an all-by-all blast step is used, which can lead to long computation time on large datasets! Tidygraph (with ggraph) and/or pvclust are used identify and visualize possible cutoff points for defining those subgroups of hypothetical proteins. For computational reasons, tidygraph is likely to be quicker for large datasets.neighborTrimThis tool identifies genomic neighborhoods where contig ends result in small neighborhoods. Since truncated neighborhoods can mis-weight genome neighborhood clustering, they can optionally be removed from further analyses. Additionally, ostensible neighbors that aren't actually on the same scaffold as the gene of interest are removed.
Use of prepNeighbors
A minimal run, just doing basic quality control and trimming any false neighbors that aren't on the same contig/scaffold as our gene of interest.
prepNeighborsOut <- prepNeighbors(imgGenes = "20210101_img_genE.txt", imgNeighbors = "20210101_img_genE_neighbors.txt", geneSeqs = "20210101_generateNeighbors_genE.fa", neighborSeqs = "20210101_img_genE_neighbors.fa", neighborsContext = "20210101_generateNeighbors_genE_context.txt", geneName = "genE", neighborNumber = 10)
A fancier one, in which hypothetical proteins and all peptides (<175 aa) are identified (via an all-by-all blast & some basic clustering/network analysis) and annotated, MAFFT alignments of hypothetical protein subgroups are produced, and truncated contigs (where the number of neighbors is incomplete) are removed. Also specified: the filesystem (to get BLAST & MAFFT to run right) & the number of processor threads to be used (to speed up the run.)
prepNeighborsOut <- prepNeighbors(imgGenes = "20210101_img_genE.txt", imgNeighbors = "20210101_img_genE_neighbors.txt", geneSeqs = "20210101_generateNeighbors_genE.fa", neighborSeqs = "20210101_img_genE_neighbors.fa", neighborsContext = "20210101_generateNeighbors_genE_context.txt", geneName = "genE", neighborNumber = 10, sysTerm = "nix", hypoAnalysis = TRUE, clustMethod = "tidygraph", numThreads = 7, trimShortClusters = TRUE, pepScreen=TRUE, pepMax = 175, alnClust=TRUE)
Additional advanced settings are available!
Required options
These must be provided for all runs.
imgGenesFilename. IMG-formatted metadata file for genes of interest. Required.imgNeighborsFilename. IMG-formatted metadata file for neighbors of genes of interest. Required.geneSeqsFilename. FASTA-formatted sequence file for genes of interest. Required. The version of this file created bygenerateNeighborstrims sequence names down to the gene_oid and can be used for input - annotations in FASTA names occasionally produce errors.neighborSeqsFilename. FASTA-formatted sequence file for neighbors of genes of interest. Required.neighborsContextFilename. File generated ingenerateNeighborsthat ties the metadata of the neighbors to the gene_oids and scaffold IDs of the original genes of interest. Required.geneNameText. the name of your gene family of interest - used for filenames and figure labels. Required.neighborNumInteger. The number of neighboring genes on either side of the gene of interest that will be investigated. Required.
Advanced options
efiRepnodesBoolean. Just used for filename generation. Defaults to FALSE.
Options related to trimming clusters on truncated contigs
trimShortClustersBoolean. Signals whether or not to remove gene clusters where a truncated contig interrupts the cluster and a sub-sized cluster remains. (Recommended when running further analyses and not just generating cluster diagrams.) Defaults to FALSE.
Options related to identifying hypothetical or custom protein families
hypoAnalysisBoolean. Signals whether or not subsets of similar hypothetical proteins will be identified. Defaults to FALSE.sysTermText. Used to identify the sort of terminal (current options: "wsl" for the Linux subsystem on Windows and "nix" for generic Linux, MacOS, etc.) Used to run blast and mafft commands. Required for protein family analysis. Defaults to "nix".numThreadsInteger. Number of threads to use during analysis. Maximum value depends on your computer (bigger is better). Defaults to 1.neighborThresholdNumber. (0-1.00). Fraction of gene clusters in which a protein family must occur to be included in further analyses. Broadly, used to determine how widely the net for subsets of hypothetical proteins is cast. Defaults to 0.025 (2.5%).clustMethodText. The name of the package to be used to identify clusters of similar hypothetical proteins. Tidygraph and pvclust are the current options, with the former currently suggested. Defaults to "tidygraph".pidCutoffNumber. (0-100). BLAST %ID cutoff to be used for edge generation when analyzing protein families using tidygraph. Defaults to 35 (35% ID).pepScreenBoolean. Specifies whether or not to assign provisional families to all proteins below a size cutoff regardless of annotation status. Defaults to FALSE.pepMaxNumber. Specifies the size cutoff (aa) for pepScreen - increase with caution on large datasets, since this uses an all-by-all blast. Defaults to 150.matchLengthNumber (0-1). BLAST matches need to above this fraction of of the length of the smaller seq. Defaults to .65; can be appropriate to use a lower cutoff when working with things like RiPP precursor peptides, where some regions are expected to have significant sequence diversity.alnClustBoolean. Specifies whether or not to make MAFFT multiple sequence alignments for hypothetical protein or peptide families. Defaults to FALSE.hmmClustBoolean. NOT IMPLEMENTED YET. Specifies whether or not to make an HMM for protein or peptide families using a randomly selected subset of sequences. Defaults to FALSE.alphaValNumber. (0.0-1.0). Alpha value cutoff for analyzing pvClust data (likely to be deprecated). Defaults to 0.95.bootStrapNumber. Bootstrap number for pvClust (likely to be deprecated). Can be fairly computationally intensive: 100 is a good starting point for small datasets (e.g. 100 genes of interest, 5 genes on either size), and 10 for larger (500 genes of interest or more, 10+ genes in either direction.) Defaults to 10.
Output for next function
20210101_neighborTrim_genE_genes.txtFile. IMG-formatted tab-delimited metadata text file of genes of interest with genes from short contigs removed.20210101_neighborTrim_genE_neighbors.txtFile. IMG-formatted tab-delimited metadata text file of neighbors of genes of interest with genes from short contigs or from incorrect scaffolds. Contains Hypofam annotations if neighborHypothetical was run.20210101_neighborTrim_genE_neighborsContext.txtFile. Tab-delimited list of IMG gene_oids for neighbors, their source genes, and the scaffolds of the source genes, with neighbors from short contigs or on incorrect scaffolds removed.20210101_neighborTrim_genE_geneSeqs.faFile. FASTA-formatted set of amino acid sequences, trimmed to match the updated gene list.20210101_neighborTrim_genE_neighborSeqs.faFile. FASTA-formatted set of amino acid sequences, trimmed to match the updated neighbor list.
Additional output
When hypothetical protein families are assigned by tidygraph
20210101_neighborHypothetical_genE_hypoSeqs.faFile. FASTA file containing amino acid sequences of all putative hypothetical proteins20210101_neighborHypothetical_genE_blastFile.txtFile. All-by-all BLAST results for hypothetical sequences in a tab-delimited format20210101_neighborHypothetical_genE_clusterList.txtFile. Tab-delimited text file with two columns listing the gene_oid of every hypothetical protein and its cluster group (entered in the annotation as a "Hypofam" family in a "Hypofam" column.)20210101_neighborHypothetical_genE_networkFull_xx.pdfFile. "xx" is the percent ID cutoff. Vector image (PDF) of full network, with the gene of interest highlighted in red (if they are unannotated)20210101_neighborHypothetical_genE_networkClusters_xx.pdfFile. Vector image of tidygraph network at the percent ID cutoff of "xx" with small clusters and unclustered nodes removed.20210101_neighborHypothetical_genE_tgCluster_x.faFile. InhypoClusterssubfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number.20210101_neighborHypothetical_genE_tgCluster_x_mafft.faFile. InhypoClusterssubfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.
When peptide families are assigned using tidygraph
20210101_neighborHypothetical_genE_pepSeqs.faFile. FASTA file containing amino acid sequences of all putative hypothetical proteins20210101_neighborHypothetical_genE_pepBlast.txtFile. All-by-all BLAST results for hypothetical sequences in a tab-delimited format20210101_neighborHypothetical_genE_pepClusterList.txtFile. Tab-delimited text file with two columns listing the gene_oid of every hypothetical protein and its cluster group (entered in the annotation as a "Pepfam" family in the "Hypofam" column.)20210101_neighborHypothetical_genE_pepNetworkFull_xx.pdfFile. "xx" is the percent ID cutoff. Vector image (PDF) of full network, with the gene of interest highlighted in red (if they are unannotated)20210101_neighborHypothetical_genE_pepNetworkClusters_xx.pdfFile. Vector image of tidygraph network at the percent ID cutoff of "xx" with small clusters and unclustered nodes removed.20210101_neighborHypothetical_genE_tgCluster_x.faFile. InpepClusterssubfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number.20210101_neighborHypothetical_genE_tgCluster_x_mafft.faFile. InpepClusterssubfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.
When pvclust is used to assign families
Note: likely to be deprecated.
20210101_neighborHypothetical_genE_pvClustFile.txtFile. Cluster information in text file format.20210101_neighborHypothetical_genE_pvClustTree.pdfFile. Image of pvclust tree at final alpha value cutoff, as a .pdf file.20210101_neighborHypothetical_genE_pvCluster_x.faFile. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster. "x" is the cluster number.20210101_neighborHypothetical_genE_pvCluster_x_mafft.faFile. In subfolder with other equivalent files. FASTA file containing amino acid sequences of members of a specific hypothetical protein cluster, MAFFT-aligned. "x" is the cluster number.