Other tutorials - agmcfarland/GeneGrouper GitHub Wiki

In this section we use GeneGrouper to search for two different genes and the gene clusters they represent.

If you need help installing GeneGrouper or its dependencies, please see the installation help page.

Download data

Create the gg_tutorial directory and cd into it.

mkdir gg_tutorial
cd ./gg_tutorial

Afterwards, download the data using either option 1 or option 2:

Option 1: Download data with svn

svn checkout https://github.com/agmcfarland/GeneGrouper/trunk/test_data/genomes
svn checkout https://github.com/agmcfarland/GeneGrouper/trunk/test_data/query_genes
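Note that GitHub ended Subversion support in January 2024, so the svn commands above may fail. As a possible alternative, the sketch below fetches only the two test folders using git's partial clone and sparse-checkout features. The function name fetch_test_data and the temporary gg_repo folder are assumptions made for illustration. The sketch also assumes git 2.25 or newer and that test_data sits on the repository's default branch.

```shell
# Hypothetical helper (not part of GeneGrouper): fetch only the tutorial data with git.
fetch_test_data() {
  # Clone without file contents; --sparse checks out only top-level files at first
  git clone --depth 1 --filter=blob:none --sparse \
    https://github.com/agmcfarland/GeneGrouper.git gg_repo || return 1
  # Materialize only the two test_data folders
  git -C gg_repo sparse-checkout set test_data/genomes test_data/query_genes || return 1
  # Copy them into the current directory, then remove the temporary clone
  cp -r gg_repo/test_data/genomes gg_repo/test_data/query_genes . || return 1
  rm -rf gg_repo
}

# Run from inside gg_tutorial:
# fetch_test_data
```

The function is only defined here; call it from inside gg_tutorial when you want to download the data.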

Option 2: Download data with DownGit by clicking the two links below

Download genomes

Download query genes

Make sure to drag and drop the two folders into gg_tutorial and unzip them.

Afterwards, make sure that the following directory structure is present:

├── gg_tutorial
│   ├── genomes
│   ├── query_genes

After the data has been downloaded using either method, gunzip all genomes in the ./genomes folder:

gunzip ./genomes/*.gz

There will be 15 genomes in total: four Salmonella, three Klebsiella, four Citrobacter, and four Pseudomonas.
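A quick way to confirm the genome count is to list the files in the genomes folder. The sketch below is a hypothetical helper (count_genomes is not a GeneGrouper command) that reports how many files are present and how many are still gzip-compressed.

```shell
# Hypothetical helper: count files in a genomes folder and flag any still compressed.
count_genomes() {
  dir="${1:-./genomes}"
  # Count regular files directly inside the folder (tr strips padding from wc on some systems)
  total=$(find "$dir" -maxdepth 1 -type f | wc -l | tr -d ' ')
  # Count the subset that still ends in .gz
  gz=$(find "$dir" -maxdepth 1 -type f -name '*.gz' | wc -l | tr -d ' ')
  echo "files=$total compressed=$gz"
}

# Only run the check if the tutorial folder is present
if [ -d ./genomes ]; then
  count_genomes ./genomes
fi
```

If the folder holds only the tutorial genomes and gunzip succeeded, the report should show 15 files and no compressed files.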

All done! Now we can build a database of the genomes using GeneGrouper.

Build the GeneGrouper database

GeneGrouper \
-g genomes -d example_search \
build_database

You should now have the following directory structure:

├── gg_tutorial
│   ├── genomes
│   ├── query_genes
│   ├── example_search
│   │   ├── assemblies
│   │   ├── blast_database
│   │   ├── genomes.db

Search for pdu gene cluster

We will start by searching all the genomes for the Pdu gene cluster, using the pduA gene as the query.

The search region will extend 2,000 bp upstream and 18,000 bp downstream of each hit. The blast hit threshold will be set to 30% identity and 80% coverage relative to our pduA query gene.

# start search
GeneGrouper \
-n pdua -d example_search -g genomes \
find_regions \
-f query_genes/pdua.txt \
-us 2000 \
-ds 18000 \
-i 30 \
-c 80
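To make the two thresholds concrete, the sketch below applies a 30% identity and 80% coverage cutoff to a small made-up hit table. It only illustrates what the -i and -c values mean; it is not GeneGrouper's internal code. The function name filter_hits, the three-column layout (hit name, percent identity, percent query coverage), and the hit values are all invented for this example.

```shell
# Hypothetical illustration of identity/coverage filtering (not GeneGrouper's implementation).
# Reads tab-separated rows on stdin: hit_name, percent_identity, percent_query_coverage.
filter_hits() {
  min_identity="$1"
  min_coverage="$2"
  # Keep a hit only if it meets BOTH thresholds
  awk -F'\t' -v i="$min_identity" -v c="$min_coverage" \
    '$2 >= i && $3 >= c { print $1 }'
}

# Made-up hits: only hitA passes both cutoffs
printf 'hitA\t45.2\t95\nhitB\t28.0\t99\nhitC\t62.1\t70\n' | filter_hits 30 80
```

Here hitB fails on identity (28.0 is below 30) and hitC fails on coverage (70 is below 80), so only hitA is printed.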

You should now have the following directory structure:

├── gg_tutorial
│   ├── genomes
│   ├── query_genes
│   ├── example_search
│   │   ├── assemblies
│   │   ├── blast_database
│   │   ├── genomes.db
│   │   ├── pdua
│   │   │   ├── internal_data
│   │   │   ├── visualizations
│   │   │   ├── subgroups
│   │   │   ├── group_region_seqs.faa
│   │   │   ├── seed_results.db
│   │   │   ├── group_regions.csv
│   │   │   ├── group_statistics_summmary.csv
│   │   │   ├── group_taxa_summary.csv
│   │   │   ├── representative_group_member_summary.csv
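Rather than comparing the tree by eye, you could check for each expected entry with a small script. The sketch below is a hypothetical helper (check_search_outputs is not a GeneGrouper command). The entry names are copied from the listing above, including the spelling of group_statistics_summmary.csv.

```shell
# Hypothetical helper: report which expected search outputs are missing from a folder.
check_search_outputs() {
  search_dir="$1"
  missing=0
  # Entry names taken from the tutorial's directory listing
  for entry in internal_data visualizations subgroups \
      group_region_seqs.faa seed_results.db group_regions.csv \
      group_statistics_summmary.csv group_taxa_summary.csv \
      representative_group_member_summary.csv; do
    if [ ! -e "$search_dir/$entry" ]; then
      echo "missing: $entry"
      missing=$((missing + 1))
    fi
  done
  echo "missing_count=$missing"
}

# Only run the check if the search folder is present
if [ -d ./example_search/pdua ]; then
  check_search_outputs ./example_search/pdua
fi
```

A final line of missing_count=0 means every entry in the listing was found.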

Visualize the groups that were obtained by the search.

# make main visualizations
GeneGrouper \
-n pdua -d example_search \
visualize \
--visual_type main

# view visualizations
open ./example_search/pdua/visualizations/*
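The open command used above is specific to macOS. On most Linux desktops the usual equivalent is xdg-open, which takes one file at a time. The sketch below picks a viewer command based on the operating system; the function name view_file_cmd is an assumption made for this example.

```shell
# Hypothetical helper: choose a file-viewer command for the current operating system.
view_file_cmd() {
  case "$(uname -s)" in
    Darwin) echo "open" ;;      # macOS
    Linux)  echo "xdg-open" ;;  # most Linux desktops
    *)      echo "" ;;          # unknown system: no suggestion
  esac
}

opener=$(view_file_cmd)
echo "viewer command: ${opener:-none found}"

# Example (uncomment to use); the loop passes one file per call, as xdg-open expects:
# for f in ./example_search/pdua/visualizations/*; do "$opener" "$f"; done
```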

Two distinct groups appear: the Pdu gene cluster (g0) and the Eut gene cluster (g1). Note that 11 genomes have the Pdu gene cluster and 4 do not. The four genomes missing a Pdu gene cluster all belong to Pseudomonas aeruginosa. The right-hand panel of the three-part visualization shows slight variation in gene content among the members of group 0.

Inspect group 0 to see what kind of variation in gene content exists within the group.

# inspect group label 0
GeneGrouper \
-n pdua -d example_search \
visualize \
--visual_type group \
--group_label 0

# view group 0 visualization
open ./example_search/pdua/visualizations/inspect_group_0_1.png

The inspect group visualization shows how many unique gene architectures are present in the group and the number of members that have each architecture. As you can see, subgroup 0 has seven members with identical architecture. Interestingly, subgroup 3 is missing the pocR regulator. There is instead a transposase insertion! I wonder what effects that may have on regulation of the Pdu gene cluster in this genome?

Search for pst gene cluster

Now we'll explore how the five-gene Pst gene cluster, which is involved in phosphate transport, is distributed in our genomes.

Since we already have the database built, we will simply run a new search with different parameters. Since we are using an E. coli sequence, we will lower the identity and coverage thresholds to account for more distant homology. We will search 8,000 bp upstream and downstream of the seed gene to account for potential variation in gene content surrounding the gene cluster.

# start search
GeneGrouper \
-n psts_ecoli -d example_search -g genomes \
find_regions \
-f query_genes/psts_ecoli.txt \
-us 8000 \
-ds 8000 \
-i 10 \
-c 50

# make main visualizations
GeneGrouper \
-n psts_ecoli -d example_search \
visualize \
--visual_type main

# view visualizations
open ./example_search/psts_ecoli/visualizations/*
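Because the database is built once, the two searches in this tutorial differ only in their name, query file, and thresholds. The sketch below prints both find_regions commands side by side so the parameter differences are easy to compare. It only prints text and does not run GeneGrouper. The function name print_search_cmd is an assumption made for this example; the flags and values are the ones used in this tutorial.

```shell
# Hypothetical helper: print (not run) a find_regions command for a given search.
# Arguments: name, query file, upstream bp, downstream bp, identity %, coverage %
print_search_cmd() {
  printf 'GeneGrouper -n %s -d example_search -g genomes find_regions -f %s -us %s -ds %s -i %s -c %s\n' \
    "$1" "$2" "$3" "$4" "$5" "$6"
}

# The two searches from this tutorial
print_search_cmd pdua query_genes/pdua.txt 2000 18000 30 80
print_search_cmd psts_ecoli query_genes/psts_ecoli.txt 8000 8000 10 50
```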

Our visualizations show that four groups are identified. All taxa except Pseudomonas aeruginosa have pstSCAB/phoU. However, the genes surrounding the Pst gene cluster differ by taxon. This suggests selective pressure to maintain an intact Pst gene cluster, while the surrounding gene content reflects phylogeny.

View phylogenetic tree of seed genes from each representative group

GeneGrouper \
-n psts_ecoli -d example_search \
visualize \
--visual_type tree \
--image_format png \
--tip_label_type group \
--tip_label_size 4

open ./example_search/psts_ecoli/visualizations/representative_seed_phylogeny.png