In this section we use GeneGrouper to search for the MexAB-OprM gene cluster in 15 genomes drawn from six taxa and then visualize the results.

If you need help installing GeneGrouper or its dependencies, please see the installation help page.

Download data

Create the gg_tutorial directory and cd into it.

mkdir gg_tutorial
cd ./gg_tutorial

Afterwards, download the data using either option 1 or option 2:

Option 1: Download data with svn

svn checkout https://github.com/agmcfarland/GeneGrouper/trunk/test_data/genomes
svn checkout https://github.com/agmcfarland/GeneGrouper/trunk/test_data/query_genes
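GitHub ended Subversion support in January 2024, so the svn checkout commands above may fail. One possible alternative, sketched here, fetches only the two test-data folders with a sparse git clone. The helper name `fetch_test_data` is hypothetical, the folder paths are taken from the svn URLs above, and `git sparse-checkout` assumes git 2.25 or newer.

```shell
# Hypothetical helper: copy test_data/genomes and test_data/query_genes
# from a clone of the repository into a destination folder.
fetch_test_data() {
    repo_url="$1"   # repository to clone
    dest="$2"       # folder that will receive genomes/ and query_genes/
    tmp_clone="$(mktemp -d)" || return 1

    # Shallow, sparse clone: only top-level files are checked out at first.
    git clone --quiet --depth 1 --sparse "$repo_url" "$tmp_clone/repo" || return 1

    # Check out just the two tutorial folders.
    git -C "$tmp_clone/repo" sparse-checkout set \
        test_data/genomes test_data/query_genes || return 1

    # Copy the folders out of the temporary clone, then clean up.
    mkdir -p "$dest" &&
        cp -R "$tmp_clone/repo/test_data/genomes" \
              "$tmp_clone/repo/test_data/query_genes" "$dest"/
    status=$?
    rm -rf "$tmp_clone"
    return "$status"
}
```

From inside gg_tutorial, `fetch_test_data https://github.com/agmcfarland/GeneGrouper.git .` should leave the genomes and query_genes folders side by side, as with the other download options.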

Option 2: Download data with DownGit by clicking the two links below

Download genomes

Download query genes

Move the two downloaded folders into gg_tutorial and unzip them.

Afterwards, make sure that the following directory structure is present:

├── gg_tutorial
│   ├── genomes
│   ├── query_genes
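The directory structure above can also be checked from the command line. This is a minimal sketch; `check_layout` is a hypothetical helper name, not a GeneGrouper command.

```shell
# Report whether the two tutorial folders exist under a base directory.
# Returns 0 when both are present and 1 otherwise.
check_layout() {
    base="${1:-.}"
    missing=0
    for sub in genomes query_genes; do
        if [ -d "$base/$sub" ]; then
            echo "found:   $base/$sub"
        else
            echo "missing: $base/$sub"
            missing=1
        fi
    done
    return "$missing"
}
```

Run `check_layout .` from inside gg_tutorial; both folders should be reported as found before continuing.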

After the data has been downloaded using either method, gunzip all genomes in the ./genomes folder:

gunzip ./genomes/*.gz

There will be 15 genomes in total: four Salmonella enterica, three Klebsiella pneumoniae, four Citrobacter spp., and four Pseudomonas aeruginosa genomes.
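To confirm the genome count after decompressing, something like the following could be used. `count_genomes` is a hypothetical helper, and it assumes that every regular file in the folder is one genome.

```shell
# Count uncompressed files and leftover .gz archives in a genomes folder.
count_genomes() {
    dir="${1:-./genomes}"
    # All regular files directly inside the folder.
    total=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
    # Files that are still gzip-compressed.
    gz=$(find "$dir" -maxdepth 1 -type f -name '*.gz' | wc -l | tr -d ' ')
    echo "genomes=$((total - gz)) still_compressed=$gz"
}
```

For the tutorial data, `count_genomes ./genomes` should report 15 genomes and no compressed files, provided the folder holds nothing but the genome files.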

All done! Now we can build a database of the genomes using GeneGrouper.

Step-by-step tutorial

Build the GeneGrouper database

The database only needs to be built once for all the genomes. Afterwards, it can be searched as many times as you want.

Run the following code:

GeneGrouper \
-g genomes -d example_search \
build_database

You should now have the following directory structure:

├── gg_tutorial
│   ├── genomes
│   ├── query_genes
│   ├── example_search
│   │   ├── assemblies
│   │   ├── blast_database
│   │   ├── genomes.db
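Because the database only needs to be built once, a script can skip the build step when genomes.db is already present. The sketch below uses two hypothetical helpers, `needs_build` and `build_once`; it assumes GeneGrouper is on the PATH and that the presence of genomes.db means the earlier build finished successfully.

```shell
# Succeed (return 0) when the database folder has no genomes.db yet.
needs_build() {
    db_dir="$1"
    [ ! -f "$db_dir/genomes.db" ]
}

# Build the database only when needed.
# $1 = folder with genomes, $2 = database folder (as passed to -g and -d).
build_once() {
    if needs_build "$2"; then
        GeneGrouper -g "$1" -d "$2" build_database
    else
        echo "database already present in $2, skipping build"
    fi
}
```

In this tutorial the call would be `build_once genomes example_search`.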

Search for the MexAB-OprM gene cluster and group

The MexAB-OprM gene cluster codes for the MexAB-OprM efflux pump.

Only Pseudomonas aeruginosa genomes are expected to carry the MexAB-OprM gene cluster, each in a single copy. Since four of our 15 genomes are Pseudomonas aeruginosa, we expect to find one group containing all four MexAB-OprM gene clusters. Importantly, other groups containing similar but unrelated gene clusters will also be found.

Our search will use mexB as the query gene and find all matching genes with a minimum of 20% identity and 80% coverage. Genes that meet these criteria are called seed genes. All genes within 5,000 base pairs upstream and downstream of each seed gene will be extracted and used for grouping.
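To make the search parameters concrete, here is a small sketch of the two rules described above: the identity and coverage cutoffs, and the window around a seed gene. The function names and example numbers are made up for illustration. This is not GeneGrouper's own code, and it ignores strand and contig ends, clamping the start at 0 as a simplification.

```shell
# Succeed when a hit meets the minimum identity and coverage (in percent).
# Defaults match the tutorial search: 20% identity, 80% coverage.
is_seed_gene() {
    identity="$1"; coverage="$2"
    min_identity="${3:-20}"; min_coverage="${4:-80}"
    awk -v i="$identity" -v c="$coverage" -v mi="$min_identity" -v mc="$min_coverage" \
        'BEGIN { exit !(i + 0 >= mi + 0 && c + 0 >= mc + 0) }'
}

# Print the start and end of the region extracted around a seed gene.
# Defaults match the tutorial search: 5000 bp on each side.
region_bounds() {
    gene_start="$1"; gene_end="$2"
    upstream="${3:-5000}"; downstream="${4:-5000}"
    region_start=$((gene_start - upstream))
    if [ "$region_start" -lt 0 ]; then
        region_start=0
    fi
    echo "$region_start $((gene_end + downstream))"
}

# Made-up example values:
is_seed_gene 35.2 91.0 && echo "hit kept as seed gene"
is_seed_gene 35.2 60.0 || echo "hit rejected: coverage below 80%"
region_bounds 12000 15000   # prints "7000 20000"
```

The low identity cutoff is what lets the search pick up similar efflux pump genes outside Pseudomonas, which then end up in groups of their own.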

Run the following code:

GeneGrouper \
-n mexb -d example_search -g genomes \
find_regions \
-f query_genes/mexb.txt \
-us 5000 \
-ds 5000 \
-i 20 \
-c 80

You should now have the following directory structure:

├── gg_tutorial
│   ├── genomes
│   ├── query_genes
│   ├── example_search
│   │   ├── assemblies
│   │   ├── blast_database
│   │   ├── genomes.db
│   │   ├── mexb
│   │   │   ├── internal_data
│   │   │   ├── visualizations
│   │   │   ├── group_region_seqs.faa
│   │   │   ├── seed_results.db
│   │   │   ├── group_regions.csv
│   │   │   ├── group_statistics_summmary.csv
│   │   │   ├── group_taxa_summary.csv
│   │   │   ├── representative_group_member_summary.csv

Each of these .csv files contains important grouping information that can be explored further.

However, we can use GeneGrouper's many visualizations to first get an overview of groups, what gene clusters they contain, and in which taxa they are found.
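Before turning to the visualizations, the .csv files can be given a quick first look from the shell. The sketch below prints each file's header line and row count without assuming any particular column names. `summarize_csvs` is a hypothetical helper; it assumes one header line per file and no line breaks inside quoted fields.

```shell
# Print the header and the number of data rows of each .csv file in a folder.
summarize_csvs() {
    dir="${1:-./example_search/mexb}"
    for f in "$dir"/*.csv; do
        # Skip the unexpanded pattern when the folder has no .csv files.
        [ -f "$f" ] || continue
        rows=$(($(wc -l < "$f") - 1))
        echo "$(basename "$f"): $rows data rows"
        echo "  columns: $(head -n 1 "$f")"
    done
}
```

Calling `summarize_csvs ./example_search/mexb` from gg_tutorial lists the four summary files shown in the directory structure above.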

Visualize main grouping results

Run the following code:

GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type main

# view visualizations
open ./example_search/mexb/visualizations/*
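The open command used above is the macOS file opener. On most Linux desktops the equivalent is xdg-open, which takes one file at a time. The following sketch picks whichever is available; `pick_opener` and `view_all` are hypothetical helper names.

```shell
# Print the name of the first available file-opening command.
# Note: on a few Linux systems `open` exists but does something else;
# swap the order of the candidates if that applies to you.
pick_opener() {
    for candidate in open xdg-open; do
        if command -v "$candidate" >/dev/null 2>&1; then
            echo "$candidate"
            return 0
        fi
    done
    return 1
}

# Open every file in a folder, one at a time.
view_all() {
    dir="$1"
    if ! opener="$(pick_opener)"; then
        echo "no opener found; browse to $dir manually"
        return 0
    fi
    for f in "$dir"/*; do
        if [ -e "$f" ]; then
            "$opener" "$f"
        fi
    done
}
```

With this in place, `view_all ./example_search/mexb/visualizations` can stand in for the open command on either platform.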

This command produces three visualizations, all stored in the visualizations folder. I have annotated them for clarity.

group_summary_1.png

This visual shows the number of unique groups found, how many gene clusters are in a group, and how similar gene clusters in a group are to each other.
We can see that MexAB-OprM is found in group 1 (g1). Group 1 has four gene clusters, all with a dissimilarity of 0 to each other. In other words, every gene cluster in this group has identical gene content.

taxa_searched.png

This graph displays the number of genomes that were searched in each taxon and how many of them had at least one gene cluster in any group.
In our search, all genomes had at least one gene cluster in a group. This is expected, as efflux pumps similar to MexAB-OprM are widespread in Gram-negative bacteria.

groups_by_taxa_1.png

This heatmap shows the percentage of genomes in each taxon that had a gene cluster in a group. An asterisk indicates that a genome had more than one gene cluster in a group. This does not affect the percentage displayed.
We can see that some groups are associated with specific taxa and others are found in multiple taxa. Group 1 (g1) contains all MexAB-OprM gene clusters and is found exclusively in Pseudomonas genomes, as we expected.
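The value in each heatmap cell follows from simple counting. The sketch below reproduces that arithmetic for one taxon and one group, given how many gene clusters each genome has in the group. `heatmap_cell` is a hypothetical helper written for illustration, not GeneGrouper's implementation.

```shell
# Arguments: the number of gene clusters each genome of one taxon has in one
# group. Prints the percentage of genomes with at least one cluster, followed
# by "*" when any genome has more than one.
heatmap_cell() {
    echo "$@" | awk '{
        if (NF == 0) { print "NA"; next }
        n_with = 0; multi = 0
        for (i = 1; i <= NF; i++) {
            if ($i > 0) n_with++
            if ($i > 1) multi = 1
        }
        printf "%.0f%%%s\n", 100 * n_with / NF, (multi ? "*" : "")
    }'
}

# Four Pseudomonas aeruginosa genomes, one cluster each, as in group 1:
heatmap_cell 1 1 1 1   # prints "100%"
# Made-up taxon where one genome has two clusters and two have none:
heatmap_cell 2 0 1 0   # prints "50%*"
```

The second call shows why the asterisk does not change the percentage: the genome with two clusters still counts once.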

Inspect a group more closely

Sometimes a group warrants further inspection. Let's take a closer look at group -1 (g-1). Group -1 is special because it contains all gene clusters that could not be placed into their own unique groups.

Run the following code:

GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type group \
--group_label -1

# view visualizations
open ./example_search/mexb/visualizations/*

Each time a different group is inspected, two additional .csv files for that group are made and placed into the folder subgroups. The visualization is saved to the visualizations folder.

inspect_group_g-1.png

The visualization shows gene clusters as subgroups. Each subgroup contains all gene clusters that have identical gene content. The number of times that subgroup occurs is displayed on the left. The dissimilarity of gene content relative to subgroup 0 (s0) is shown on the right.
We can see that some subgroups occur more than once.
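For intuition about what a gene content dissimilarity can look like, the sketch below computes a Jaccard distance between two gene lists. Whether GeneGrouper uses exactly this measure is an assumption made here for illustration only. `gene_content_dissimilarity` is a hypothetical helper and geneX is a made-up gene name.

```shell
# Jaccard distance between two space-separated gene lists:
# 1 - (shared genes / all distinct genes). 0 means identical gene content.
gene_content_dissimilarity() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        inter = 0; nu = 0
        na = split(a, xa, " "); nb = split(b, xb, " ")
        for (i = 1; i <= na; i++) seta[xa[i]] = 1
        for (i = 1; i <= nb; i++) setb[xb[i]] = 1
        for (k in seta) { all[k] = 1; if (k in setb) inter++ }
        for (k in setb) all[k] = 1
        for (k in all) nu++
        if (nu == 0) { print "0.00"; exit }
        printf "%.2f\n", 1 - inter / nu
    }'
}

# Identical gene content, regardless of order:
gene_content_dissimilarity "mexA mexB oprM" "oprM mexB mexA"   # prints "0.00"
# Two of four distinct genes shared:
gene_content_dissimilarity "mexA mexB oprM" "mexA mexB geneX"  # prints "0.50"
```

Under this reading, a subgroup with a value of 0 relative to s0 would share all of its gene content with s0, and larger values would mean fewer shared genes.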

Visualize phylogenetic relationships

We can see how phylogenetically similar the seed genes from each group are to each other.

Run the following code:

GeneGrouper \
-n mexb -d example_search \
visualize \
--visual_type tree \
--image_format svg \
--tip_label_type group \
--tip_label_size 4

# view visualizations
open ./example_search/mexb/visualizations/*

This produces the following visualization:

representative_seed_phylogeny.png

The phylogenetic relationships of each seed gene in each group are now visible.

All-at-once tutorial

# Create the database (only needs to be done once per set of genomes)
GeneGrouper \
-g genomes -d example_search \
build_database

# Search for MexAB-OprM gene cluster
GeneGrouper \
-n mexb -d example_search -g genomes \
find_regions \
-f query_genes/mexb.txt \
-us 5000 \
-ds 5000 \
-i 20 \
-c 80

# Visualize main grouping results
GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type main

# Inspect the subgroups within group -1 
GeneGrouper \
-n ./mexb -d ./example_search \
visualize \
--visual_type group \
--group_label -1

# Make a phylogenetic tree of each group representative's seed gene
GeneGrouper \
-n mexb -d example_search \
visualize \
--visual_type tree \
--image_format svg \
--tip_label_type group \
--tip_label_size 4

# view visualizations
open ./example_search/mexb/visualizations/*

# view output data
open ./example_search/mexb/

# view subgroup data
open ./example_search/mexb/subgroups

What next?

GeneGrouper produces a lot of data for each search. You can try to further inspect additional groups by changing the --group_label parameter.
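Inspecting several groups can be scripted with a loop over group labels. The sketch below reuses the visualize command from this tutorial unchanged; the wrapper names `run` and `inspect_groups` and the DRY_RUN switch are hypothetical additions. Running it for real assumes GeneGrouper is on the PATH and that each label exists in your results.

```shell
# Print the command when DRY_RUN=1, otherwise execute it.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Produce the group visualization for each label given as an argument.
inspect_groups() {
    for label in "$@"; do
        run GeneGrouper -n ./mexb -d ./example_search \
            visualize --visual_type group --group_label "$label"
    done
}
```

For example, `DRY_RUN=1 inspect_groups 1 -1` prints the two commands without running them, which is a cheap way to check the labels before generating the images.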

You can also re-run the search and specify a --min_group_size. For example, --min_group_size 2 will only form groups with at least two gene clusters.

See advanced usage for more ideas.

You can also inspect the .csv files for a run to see how you could further parse the data for your own work! See an explanation of output files.

Additional questions? View the FAQ.

GeneGrouper tutorial with data - agmcfarland/GeneGrouper GitHub Wiki
