Running prettyClusters with GenBank files - g-e-kenney/prettyClusters GitHub Wiki

Are GenBank files compatible with prettyclusters?

Yes, though it may take a bit of extra work. Getting relatively standard GenBank files that have halfway-functional annotations can be annoying: many genomes are poorly annotated, and a surprising number of sources and programs yield malformed GenBank files that will break some of the tools used here. (Common issues: no source attribute for the scaffold, no gene attribute at all, no locus_tag attribute for genes and proteins, locus_tag attributes that are too long or use problem characters, genomes that are randomly not annotated even in NR... the list is endless). I've got a rough guide on some of the steps that might occasionally be needed to get usable GenBank files, depending on the source (unannotated MAGs or WGS datasets, poorly annotated data, etc.). Once you're armed with some relatively well-formed GenBank files, though, gbToIMG and incorpIprScan are functions that can replace generateNeighbors as the entry point into the prettyClusters tool set. Note that antiSMASH-annotated GenBank files are now also accepted.

(While GFF files have a less challenging format, they are often broken in equivalent ways, and are more likely to lack sequences, which we want/need going forward, so GenBank may still be the best starting point.)

Preparing for gb2IMG

  • You'll need a folder full of GenBank files.
  • You'll need a list of locus tags for your genes of interest. If you don't have this, you can re-generate it by extracting amino acid sequences from your GenBank files and redoing BLAST searches.
  • You'll also need fasta files with amino acid sequences that correspond to each genome.

To get those fasta files - and if necessary re-do those blast searches - you have a few options. One option: install some of the broadly useful bit tools via conda, load the environment, and use bit-genbank-to-AA-seqs.

bit-genbank-to-AA-seqs -i sequence.gbk -f sequence.faa   

This handles GenBank files without locus_tags decently, and by default provides sequence names that include the locus_tag and protein_id (helpful for next steps). However, it occasionally fails with poorly formatted GenBank files; a simpler script below (gb2faa.py) may help with that (specify by commenting out lines as described whether you want sequences named by locus_tag or protein_id):

##!/opt/local/bin/python

# Modified from: https://github.com/fhsantanna/bioinfo_scripts/blob/master/gbk2faa.py
# Usage: python gb2faa.py <input> <output> 
# things to import
import sys
from Bio import SeqIO
from Bio import GenBank

# files
gbk_filename = sys.argv[1]
faa_filename = sys.argv[2]
input_handle  = open(gbk_filename, "r")
output_handle = open(faa_filename, "w")

# actual thing
for seq_record in SeqIO.parse(input_handle, "genbank") :
    print("Dealing with GenBank record %s" % seq_record.id)
    for seq_feature in seq_record.features :
        try: # Without "try", it crashes when it finds a CDS without translation (pseudogene).                                                                                      
            if seq_feature.type=="CDS" :
                assert len(seq_feature.qualifiers['translation'])==1
                output_handle.write(">%s\n%s\n" % (
                   # comment the next line if you are dealing with things that have protein_id but no locus_tag
                   seq_feature.qualifiers['locus_tag'][0],
                   # uncomment the next line if you're dealing with things that have protein_id but no locus_tag
                   # seq_feature.qualifiers['protein_id'][0],
                   # if we wanted to include record name (probably not, makes names long)
                   # seq_record.name ">%s from %s\n%s\n" % (                                                                                         
                   seq_feature.qualifiers['translation'][0]))
                pass
        except:
            continue
output_handle.close()
input_handle.close()

You can run file-by-file (though that's slow - see the top line below). You can also get a list of .gb files in that directory and run the .faa generation script by running things on the command line as follows (adapt for ".gbk" if necessary, and use the bit tool instead instead if preferred). If you need to re-identify genes of interest, you can add one last step to glom all the amino acid seqs into one big file for one big BLAST database (bottom line):

python gb2faa.py genome.gb genome.faa

ls *.gb > gb-list.txt
while read -r line; do newname=${line##*/}; newbase=${newname%.gb}; outname=${newbase}.faa; python gb2faa.py ${newname} ${outname}; done < gb-list.txt; 

If you need to blast for your proteins of interest, you can combine your input .faa files into one large one, and then BLAST starting with a .fa-formatted input sequence or multi-fasta set of sequences (here PROTEIN.faa.) Depending on your protein family, you can tweak the blast parameters as needed (e.g. percent identity cutoffs.)

cat *.faa > combo.faa
makeblastdb -in combo.faa -dbtype prot -out combo
blastp -db combo -query PROTEIN.faa -outfmt 6 -out combo.tsv -num_threads 4 

The output format for this is a tab-delimited textfile, and the IDs of subject sequences will be in the sseqid column. You can filter results based on percent ID, e-value, etc. The protein sequence IDs can be used for your locus tag input text file. Note that if there should be just one hit for your protein family per scaffold sequence, you may find it easier to just return the best result, and to do that you can add another term to your blast search(-max_targ_seqs 1) - but be careful, because this may not be good assumption for all protein families.

If the number of hits is smaller than it should have been, there's a good chance some of your GenBank files were malformed after all. Check to see which ones are mangled, and handle them by hand. Yes, this is terrible.

Generating gene and neighbor sequences and metadata with gbToIMG

The gbToIMG function takes a folder full of GenBank files (.gb/.gbk) and corresponding amino acid sequences (.faa) and a list of locus tags for your genes of interest, and uses that to construct faux-IMG-formatted metadata and protein sequence files that can be used with the rest of the prettyClusters toolset. Note that this function thus effectively replaces generateNeighbors. GenBank files with no annotations will be skipped (if you have a lot of these, consider following my DIY annotation guide), as will annotated GenBank files that do not have any locus tags or protein IDs matching entries on your list of genes of interest.

gtiOut <- gbToIMG(dataFolder="/user/data/gbfiles", goiListInput = "goiList.txt", neighborNum=10, geneName="genE", removeDupes=TRUE)

Output for gbToIMG

If everything's gone well, you should have the same output you would for generateNeighbors (followed by IMG-mediated download of neighbor metadata): new amino acid .faa files for protein sequences for genes of interest and their neighbors, and IMG-formatted metadata for genes and their neighbors, with an additional file contextualizing relationships between neighbors and genes of interest. Note that gene, genome, and scaffold numbering are obviously faux-IMG only (the numbers are outside the ranges used there currently, and are generated fresh with every gbToIMG run), and the annotations will only be as good as the input GenBank files. (antiSMASH GenBank files work, though depending how you generated them, locus_tag names may or may not match other databases.)

head(gtiOut$fauxIMGgenes[,1:5])
        Locus.Tag    gene_oid          Gene.Product.Name  Genome.ID    Genome.Name	
1 HMPREF0776_0501  3001600011             beta-lactamase 4001600000    Staphylococcus aureus subsp. aureus USA300_TCH959 	
2 BCAH1134_1083    3025700011             beta-lactamase 4025700000    Bacillus cereus AH1134
3 BCAH1134_C0231   3029100011 cyclic peptide transporter 4029100000    Bacillus cereus AH1134
4 HMPREF0769_10704 3029400011             beta-lactamase 4029400000    Staphylococcus aureus subsp. aureus MN8
5 HMPREF0793_0533  3031900011             beta-lactamase 4031900000    Staphylococcus caprae M23864:W1			

Some categories of annotations are not carried over in the default configuration of this function because they tend to be too heterogeneous; others may be a little ad hoc (protein ids as "locus_tag" for genes that lack them, etc.). Others (especially strain IDs) import somewhat unreliably depending on data source. You WILL want to sanity-check and potentially manually curate some of the annotations here. If significant fixes are required, I'll often make any fixes in the gene-of-interest metadata and then use Terrible Excel Functions like INDEX/MATCH to populate the updated info in the larger neighbor metadata file.

Improving metadata quality with incorpIprScan

Note that actually running InterProScan is outside the purview of this guide, since it's generally linux/cluster only and thus less widely available than this toolset. However, I strongly recommend figuring out a way to do it, since many GenBank files do not actually include explicit protein family information, and your ability to use the rest of the toolset will be significantly affected by mediocre annotation (you'll have a lot of spuriously labeled hypothetical proteins.) Note that since InterPro info is not yet always included in IMG datasets, if you will be incorporating IMG and non-IMG data and you want to use InterPro annotations, you might want to run incorpIprScan on the IMG neighbor sequences/metadata as well, with addPfam and addTigrfam as FALSE and addIPRfam as TRUE. Barring those sorts of situations, the basic command that I use is as follows:

./interproscan.sh -i neighborSeqs.fa -t p -o neighborSeqs_iprScan.txt -f TSV -appl Pfam,TIGRFAM

An additional function - incorpIprScan - can be used to add those protein family annotations back into the faux-IMG metadata. More info on the function wiki page, but basic usage:

incorpIprScanOutput <- incorpIprScan(iprScanSource = "genE_iprScan.txt", imgNeighborsSource = "gbToIMG_genE_neighbors.txt", geneName = "genE", addPfam = TRUE, addTigrfam = TRUE, addIPRfam = FALSE)

Output for incorpIprScan

The only output for this function is the neighbors metadata file - hopefully now with more entries in the Pfam, TIGRfam, and InterPro columns. Note that the sheer number of InterPro entries some proteins have points towards why you may not want to use them for neighborhood clustering - it can result in overweighting of specific types of neighboring genes.

head(incorpIPRscan$fauxIMGneighbors[,21:29]
       Pfam   Tigrfam SMART.ID SUPERFam.ID CATH.FunFam.ID Enzyme KO IMG.Term                                                    InterPro
1 pfam02595 TIGR00045                                                        IPR004381 IPR018197 IPR018197 IPR018193 IPR004381 IPR004381
2 
3 pfam07690 TIGR00710                                                                  IPR011701 IPR020846 IPR036259 IPR004812 IPR020846
4 pfam06445                                                                                      IPR011256 IPR029442 IPR010499 IPR011256
5 pfam13529                                                                                                                    IPR016997
6 pfam00149 TIGR03729                                                                                      IPR022302 IPR004843 IPR029052

Moving onto the rest of the prettyClusters toolset

At this point, one quick note: If you are using a combination of IMG and non-IMG data, this is the point where you will want to combine the datasets. That should include sequence and metadata files for your post-generateNeighbors IMG data and sequence and metadata files for your post-incorpIprScan (or post-gbToIMG) non-IMG data. The laziest way to do this is via the command line, but you can also use your text editor of choice, or something like Excel (for the tabular data):

cat 20210101_generateNeighbors_genE.fa 20210101_gb2img_genE_geneSeqs.fa > 20210101_combo_genE.fa 
cat 20210101_img_genE_neighbors.fa 20210101_gb2img_genE_neighborSeqs.fa > 20210101_combo_genE_neighbors.fa 
{ sed 0,0d 20210101_img_genE.txt ; sed '1d' 20210101_gb2img_genE_geneData.txt } > 20210101_combo_genE.txt 
{ sed 0,0d 20210101_img_genE_neighbors.txt ; sed '1d' 220210101_incorpIprScan_genE_neighborData.txt } > 20210101_combo_genE_neighbors.txt
{ sed 0,0d 20210101_generateNeighbors_genE_context.txt ; sed '1d' 20210101_gb2img_genE_neighborcontext.txt } > 20210101_combo_genE_context.txt

However, two important notes before you move ahead: you'll need to delete the extra set of column headers that will end up in the metadata and context files if you do this (which the sed commands above do)... and you'll want to check and make sure that columns are all in the same order. They should be, but there are occasionally issues - like an InterPro column that is there in this data but not the IMG data, or the other way 'round - and these are admittedly easier to track if you don't auto-trim the extra headers. This is also a good point to go into the metadata and make any changes necessary. There isn't a one-to-one correlation between GenBank features and IMG metadata fields, and while gbToIMG does its best, heterogenous data can cause issues. In particular, the Genome Name & Scaffold Name may not be optimal, and the accession may need to be moved to another column. For MAGs and metagenomes, you'll probably need to find some compromise that re-uses a not-quite-appropriate column for the bin, the metagenome source, etc. The Genome Name column is used in prettyClusterDiagrams, so try to keep that unique, legible, and informative.

Once you've done this, you're ready to move onto the standard workflow, starting with prepNeighbors.

⚠️ **GitHub.com Fallback** ⚠️