GeneGrouper tutorial with data - agmcfarland/GeneGrouper GitHub Wiki
In this section we use GeneGrouper to search for the MexAB-OprM gene cluster in 15 genomes drawn from six taxa and then visualize the results.
If you need help installing GeneGrouper or its dependencies, see the installation help page.
Create the gg_tutorial directory and cd into it:
mkdir gg_tutorial
cd ./gg_tutorial
Afterwards, download the data using either option 1 or option 2:
Option 1: Download data with svn
svn checkout https://github.com/agmcfarland/GeneGrouper/trunk/test_data/genomes
svn checkout https://github.com/agmcfarland/GeneGrouper/trunk/test_data/query_genes
Option 2: Download data with DownGit by clicking the two links below
Make sure to drag and drop the two folders into gg_tutorial and unzip them.
Afterwards, make sure that the following directory structure is present:
├── gg_tutorial
│ ├── genomes
│ ├── query_genes
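The layout above can be checked from the command line. The following is a minimal sketch using a hypothetical helper, `check_layout`, which is not part of GeneGrouper; it only tests that the two folders named in this tutorial exist under a given root.

```shell
# Hypothetical helper: report whether the tutorial folders exist under a root.
#   $1 = root directory to check (defaults to the current directory)
check_layout() {
  root="${1:-.}"
  status=0
  for sub in genomes query_genes; do
    if [ -d "$root/$sub" ]; then
      echo "ok: $root/$sub"
    else
      echo "missing: $root/$sub" >&2
      status=1
    fi
  done
  return "$status"
}

# Run from inside gg_tutorial.
if ! check_layout .; then
  echo "download or unzip the missing folders before continuing" >&2
fi
```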
After the data has been downloaded using either method, gunzip all genomes in the ./genomes folder:
gunzip ./genomes/*.gz
There will be 15 genomes in total: four Salmonella enterica, three Klebsiella pneumoniae, four Citrobacter spp., and four Pseudomonas aeruginosa genomes.
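One way to confirm the download and decompression worked is to count the files in the genomes folder. This sketch uses a hypothetical helper, `count_files`, and assumes the folder contains nothing but the genome files (hidden files are skipped).

```shell
# Hypothetical helper: count non-hidden regular files directly inside a folder.
count_files() {
  find "$1" -maxdepth 1 -type f ! -name '.*' | wc -l | tr -d ' '
}

# Compare against the 15 genomes the tutorial expects.
if [ -d ./genomes ]; then
  n=$(count_files ./genomes)
  if [ "$n" -eq 15 ]; then
    echo "found all 15 genomes"
  else
    echo "expected 15 genomes, found $n" >&2
  fi
fi
```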
All done! Now we can build a database of the genomes using GeneGrouper.
The database only needs to be built once for all the genomes. Afterwards, it can be searched as many times as you want.
Run the following code:
GeneGrouper \
-g genomes -d example_search \
build_database
You should now have the following directory structure:
├── gg_tutorial
│ ├── genomes
│ ├── query_genes
│ ├── example_search
│ │ ├── assemblies
│ │ ├── blast_database
│ │ ├── genomes.db
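Because the database only needs to be built once, a script that may be re-run can guard the build step. The sketch below uses a hypothetical wrapper, `build_db_once`, and assumes that the presence of genomes.db (shown in the tree above) means a previous build completed; GeneGrouper itself may apply different checks.

```shell
# Hypothetical wrapper: build the database only if genomes.db is absent.
#   $1 = folder of genomes, $2 = database folder
build_db_once() {
  genomes_dir="$1"
  db_dir="$2"
  if [ -f "$db_dir/genomes.db" ]; then
    # Assumption: genomes.db marks a finished build.
    echo "database found in $db_dir, skipping build"
  else
    GeneGrouper -g "$genomes_dir" -d "$db_dir" build_database
  fi
}

# Usage from inside gg_tutorial:
#   build_db_once genomes example_search
```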
The MexAB-OprM gene cluster codes for the MexAB-OprM efflux pump.

Only Pseudomonas aeruginosa genomes are expected to carry the MexAB-OprM gene cluster, each in a single copy. Since four of our 15 genomes are Pseudomonas aeruginosa, we expect to find one group containing all four MexAB-OprM gene clusters. Importantly, other groups containing similar but unrelated gene clusters will also be found.
Our search will use mexB as the query gene and find all matching genes with a minimum of 20% identity and 80% coverage. Genes that meet these criteria are called seed genes. All genes within 5,000 base pairs upstream and downstream of each seed gene will be extracted and used for grouping.
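To make the thresholds concrete, here is a small sketch of the filtering and window arithmetic described above. It is not GeneGrouper's implementation: the `filter_seeds` helper, the tab-separated column layout (gene ID, percent identity, percent coverage, start, end), and the three example hits are all invented for illustration. Strand orientation is ignored, and clamping the window start at position 1 is an assumption.

```shell
# Hypothetical helper: keep hits that pass both thresholds and print the
# window around each one. Reads tab-separated hits on standard input.
#   $1 = min identity (%), $2 = min coverage (%), $3 = upstream bp, $4 = downstream bp
filter_seeds() {
  awk -F'\t' -v min_id="$1" -v min_cov="$2" -v us="$3" -v ds="$4" '
    $2 >= min_id && $3 >= min_cov {
      lo = $4 - us
      if (lo < 1) lo = 1          # assumption: windows stop at the contig start
      printf "%s\t%d\t%d\n", $1, lo, $5 + ds
    }'
}

# Invented example hits: gene ID, identity, coverage, start, end.
printf 'geneA\t98.5\t100\t12000\t15100\ngeneB\t18.0\t95\t400\t3500\ngeneC\t35.2\t85\t2000\t5100\n' |
  filter_seeds 20 80 5000 5000
```

With these made-up values, geneB is dropped because its identity is below 20%, geneA gets the window 7000 to 20100, and geneC gets the window 1 to 10100.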
Run the following code:
GeneGrouper \
-n mexb -d example_search -g genomes \
find_regions \
-f query_genes/mexb.txt \
-us 5000 \
-ds 5000 \
-i 20 \
-c 80
You should now have the following directory structure:
├── gg_tutorial
│ ├── genomes
│ ├── query_genes
│ ├── example_search
│ │ ├── assemblies
│ │ ├── blast_database
│ │ ├── genomes.db
│ │ ├── mexb
│ │ │ ├── internal_data
│ │ │ ├── visualizations
│ │ │ ├── group_region_seqs.faa
│ │ │ ├── seed_results.db
│ │ │ ├── group_regions.csv
│ │ │ ├── group_statistics_summmary.csv
│ │ │ ├── group_taxa_summary.csv
│ │ │ ├── representative_group_member_summary.csv
Each of these .csv files contains important grouping information that can be explored further.
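For a first look at the tables without opening a spreadsheet, something like the following may help. `preview_csvs` is a hypothetical helper; it assumes each file has a single header row and ends with a newline, and it makes no assumptions about the column names.

```shell
# Hypothetical helper: for every .csv in a folder, print the number of data
# rows followed by the header and the first two rows.
preview_csvs() {
  dir="$1"
  for f in "$dir"/*.csv; do
    [ -f "$f" ] || continue
    rows=$(( $(wc -l < "$f") - 1 ))   # assumes one header line
    echo "== $(basename "$f"): $rows data rows"
    head -n 3 "$f"
  done
}

# Usage from inside gg_tutorial:
#   preview_csvs ./example_search/mexb
```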
However, we can use GeneGrouper's many visualizations to first get an overview of groups, what gene clusters they contain, and in which taxa they are found.
Run the following code
GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type main
# view visualizations
open ./example_search/mexb/visualizations/*
This command produces three visualizations, all stored in the visualizations folder. I have annotated them for clarity.
1. group_summary_1.png
- This visual shows the number of unique groups found, how many gene clusters are in each group, and how similar the gene clusters within a group are to each other.
- We can see that MexAB-OprM is found in group 1 (g1). Group 1 has four gene clusters, all with a dissimilarity of 0 to each other. In other words, every gene cluster in this group has identical gene content.

2. taxa_searched.png
- This graph displays the number of genomes that were searched in each taxon and how many of them had at least one gene cluster in any group.
- In our search, all genomes had at least one gene cluster in a group. This is expected, as efflux pumps similar to MexAB-OprM are found in all Gram-negative bacteria.

3. groups_by_taxa_1.png
- This heatmap shows the percentage of genomes in each taxon that had a gene cluster in a group. An asterisk indicates that a genome had more than one gene cluster in a group; this does not affect the percentage displayed.
- We can see that some groups are associated with specific taxa and others are found in multiple taxa. Group 1 (g1) contains all MexAB-OprM gene clusters and is found exclusively in Pseudomonas genomes, as we expected.

Sometimes a group warrants further inspection. Let's take a closer look at group -1 (g-1). Group -1 is special because it contains all gene clusters that could not be placed into their own unique groups.
Run the following code
GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type group \
--group_label -1
# view visualizations
open ./example_search/mexb/visualizations/*
Each time a different group is inspected, two additional .csv files for that group are made and placed into the subgroups folder. The visualization is saved to the visualizations folder.
inspect_group_g-1.png
- The visualization shows gene clusters as subgroups. Each subgroup contains all gene clusters that have identical gene content. The number of times each subgroup occurs is displayed on the left. The dissimilarity of gene content relative to subgroup 0 (s0) is shown on the right.
- We can see that some subgroups occur more than once.

We can see how phylogenetically similar the seed genes from each group are to each other.
Run the following code
GeneGrouper \
-n mexb -d example_search \
visualize \
--visual_type tree \
--image_format svg \
--tip_label_type group \
--tip_label_size 4
# view visualizations
open ./example_search/mexb/visualizations/*
This produces the following visualization
representative_seed_phylogeny.png
- The phylogenetic relationships of each seed gene in each group are now visible.

# Create the database (only needs to be done once per set of genomes)
GeneGrouper \
-g genomes -d example_search \
build_database
# Search for MexAB-OprM gene cluster
GeneGrouper \
-n mexb -d example_search -g genomes \
find_regions \
-f query_genes/mexb.txt \
-us 5000 \
-ds 5000 \
-i 20 \
-c 80
# Visualize main grouping results
GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type main
# Inspect the subgroups within group -1
GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type group \
--group_label -1
# Make a phylogenetic tree of each group representative's seed gene
GeneGrouper \
-n mexb -d example_search \
visualize \
--visual_type tree \
--image_format svg \
--tip_label_type group \
--tip_label_size 4
# view visualizations
open ./example_search/mexb/visualizations/*
# view output data
open ./example_search/mexb/
# view subgroup data
open ./example_search/mexb/subgroups
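The `open` commands above are the macOS way of opening files. On Linux desktops, `xdg-open` from xdg-utils usually plays the same role. The sketch below picks between the two; `pick_opener` and `open_all` are hypothetical helpers, and systems with neither command are simply reported.

```shell
# Hypothetical helper: print the name of a file-opening command for this OS.
#   $1 = OS name as reported by `uname -s` (defaults to the current system)
pick_opener() {
  os="${1:-$(uname -s)}"
  case "$os" in
    Darwin)
      echo open
      ;;
    *)
      if command -v xdg-open >/dev/null 2>&1; then
        echo xdg-open
      else
        echo "no known opener for $os; browse the folder manually" >&2
        return 1
      fi
      ;;
  esac
}

# Open every file in a folder, one at a time (xdg-open takes a single path).
open_all() {
  opener=$(pick_opener) || return 1
  for f in "$1"/*; do
    if [ -e "$f" ]; then
      "$opener" "$f"
    fi
  done
}

# Usage from inside gg_tutorial:
#   open_all ./example_search/mexb/visualizations
```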
GeneGrouper produces a lot of data for each search. You can try to further inspect additional groups by changing the --group_label parameter.
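Inspecting several groups in a row can be scripted with a loop over labels. This is a sketch only: `inspect_groups` is a hypothetical wrapper around the same visualize command used earlier, and the labels you pass should be ones that actually appear in your own results (this tutorial mentions -1 and 1).

```shell
# Hypothetical wrapper: run the group visualization once per label given.
inspect_groups() {
  for label in "$@"; do
    GeneGrouper \
      -n ./mexb -d ./example_search \
      visualize \
      --visual_type group \
      --group_label "$label"
  done
}

# Usage from inside gg_tutorial:
#   inspect_groups -1 1
```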
You can also re-run the search but specify a --min_group_size. For example, --min_group_size 2 will make groups with at least two gene clusters.
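A re-run with a minimum group size might look like the sketch below. Two things here are assumptions rather than statements from this tutorial: that --min_group_size is passed alongside the other find_regions options, and that using a new search name (the arbitrary `mexb_min2`) is the right way to keep the original mexb results untouched. `search_min_size` is a hypothetical wrapper; check the advanced usage page before relying on it.

```shell
# Hypothetical wrapper: repeat the tutorial search with a minimum group size.
#   $1 = search name (a new name keeps the original mexb results intact)
#   $2 = minimum number of gene clusters per group
search_min_size() {
  # Assumption: --min_group_size belongs with the find_regions options.
  GeneGrouper \
    -n "$1" -d example_search -g genomes \
    find_regions \
    -f query_genes/mexb.txt \
    -us 5000 \
    -ds 5000 \
    -i 20 \
    -c 80 \
    --min_group_size "$2"
}

# Usage from inside gg_tutorial (mexb_min2 is an arbitrary name):
#   search_min_size mexb_min2 2
```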
See advanced usage for more ideas.
You can also inspect the .csv files for a run to see how you could further parse the data for your own work! See an explanation of output files.
Additional questions? View the FAQ