Frequently Asked Questions - agmcfarland/GeneGrouper GitHub Wiki

1. Where can I download GenBank-format RefSeq genomes with file extension .gbff?

There are a couple of simple options I use.

Let's try them by downloading all Clostridium genomes with complete- or chromosome-level assemblies.

Option 1 is to use NCBI's Assembly Advanced Search Builder.

  • Using the search builder, select 'Organism' and input 'clostridium'. Next, select 'Assembly Level' and select both 'chromosome' and 'complete genome'. The following query should be generated automatically:

("clostridium"[Organism]) AND (("chromosome"[Assembly Level] OR "complete genome"[Assembly Level]))

  • Press 'Search'

  • Next, click the 'Download Assemblies' button.

  • Make sure the source database is set to 'RefSeq' and the file type is 'Genomic GenBank Format (.gbff)'.

Option 2 is to use ncbi-genome-download.

Run pip install ncbi-genome-download to install the package.

  • On the command line, run the following command:

ncbi-genome-download --section refseq --formats genbank --assembly-levels complete,chromosome --genera Clostridium bacteria

2. Can I build a database using gzipped .gbff genomes?

No. Decompress the genomes first and place all of the uncompressed .gbff files in a single folder.
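If your genomes are still compressed (ncbi-genome-download, for example, saves them as .gbff.gz files in nested folders), a small shell function along these lines could decompress them into one flat folder. This is only a sketch: extract_gbff and the folder names in the usage comment are hypothetical and are not part of GeneGrouper.

```shell
# Hypothetical helper, not a GeneGrouper command.
# Decompress every .gbff.gz found under SRC (searched recursively)
# into a single flat folder DEST. The original .gz files are kept.
extract_gbff() {
    src=$1
    dest=$2
    mkdir -p "$dest" || return 1
    find "$src" -type f -name '*.gbff.gz' -exec sh -c '
        dest=$1; shift
        for gz in "$@"; do
            # Strip the .gz suffix to get the output file name
            out=$dest/$(basename "$gz" .gz)
            gunzip -c "$gz" > "$out"
        done
    ' sh "$dest" {} +
}

# Example (folder names are placeholders):
#   extract_gbff refseq/bacteria genomes
```

The resulting folder can then be passed to GeneGrouper as the genomes folder. Files with the same base name in different subfolders would overwrite each other, so check for duplicates if your source folders overlap.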

3. How do I choose the correct upstream/downstream coordinates?

By default, GeneGrouper extracts 10,000 bp upstream and 10,000 bp downstream of the seed gene. Adjust these settings according to the adjacent genes you are trying to capture. A rule of thumb is that 1,000 bp is roughly the length of one gene. You can also add extra distance to accommodate potential insertions/deletions. If you are unsure of what to expect, use the default values first!

Try different settings and inspect how the regions have grouped. It should be fast!
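As a rough illustration of the rule of thumb, the arithmetic can be sketched in the shell. The region_bp helper and the gene counts below are made up for the example; they are not GeneGrouper options, and the results are only starting points for the upstream/downstream settings.

```shell
# Hypothetical helper: estimate a window size in bp from a gene count,
# assuming about 1,000 bp per gene plus optional padding for indels.
region_bp() {
    genes=$1
    padding_bp=${2:-0}
    echo $(( genes * 1000 + padding_bp ))
}

upstream_bp=$(region_bp 2)          # 2 genes upstream, no padding
downstream_bp=$(region_bp 8 2000)   # 8 genes downstream plus 2,000 bp of slack
echo "upstream=${upstream_bp} downstream=${downstream_bp}"
# prints: upstream=2000 downstream=10000
```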

4. How do I choose the correct identity and coverage cutoff values for my seed genes?

This depends on your research question. If you want more distant homologs, lower both values. Generally, identity values below 30% suggest only very distant homology. More distant homologs will likely return gene regions whose gene content is very different from that of regions containing close homologs of the query gene. But you never know what you might find!

5. What if I want to add more .gbff genomes to my genomes database?

Add the new .gbff genomes to your genomes folder, then run the following command:

GeneGrouper -g /path/to/gbff -d /path/to/output_directory build_database

This will update the database and all necessary files for GeneGrouper to use.
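Before rebuilding, it may be worth confirming that the new files really are uncompressed .gbff files. The check below is a minimal sketch; check_genomes_dir is a hypothetical helper and not part of GeneGrouper.

```shell
# Hypothetical helper, not a GeneGrouper command.
# Print how many .gbff and .gz files sit directly in DIR,
# and return a non-zero status if any .gz files remain.
check_genomes_dir() {
    dir=$1
    n_gbff=$(find "$dir" -maxdepth 1 -type f -name '*.gbff' | wc -l | tr -d ' ')
    n_gz=$(find "$dir" -maxdepth 1 -type f -name '*.gz' | wc -l | tr -d ' ')
    echo "gbff=${n_gbff} gz=${n_gz}"
    [ "$n_gz" -eq 0 ]
}

# Example: only rebuild when no compressed files are left in the folder.
#   check_genomes_dir /path/to/gbff && \
#     GeneGrouper -g /path/to/gbff -d /path/to/output_directory build_database
```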

6. How many genomes can I run GeneGrouper on and how much time should it take?

GeneGrouper has been tested on 1,130 genomes using an i7 quad-core MacBook Pro. It took 6 minutes to build the database. A search takes an average of 3 minutes.

7. What operating systems does GeneGrouper work on?

GeneGrouper has been tested on macOS and Linux, but not on Windows.

8. Does GeneGrouper work with non-RefSeq GenBank format genomes?

GeneGrouper has only been tested on RefSeq GenBank format (.gbff) genomes.

9. Does GeneGrouper work on eukaryotic, archaeal, or viral genomes?

GeneGrouper has only been tested on bacterial genomes. It likely does not work at all on eukaryotic genomes.