Genome Alignments - Golob-Minot/geneshot GitHub Wiki
One way to gain understanding from the results of a metagenomic
analysis is to compare the assembled sequences against a reference
database of microbial genomes. Because each sequence in that external
database is thought to correspond to a single organism, its longer
contiguous sequences can be used to visualize the spatial organization
of genetic elements which may have assembled de novo only into smaller
fragments. To quickly process those alignments for geneshot results,
a user may run the Annotation of Microbial Genomes by Microbiome
Association (AMGMA) pipeline.
One of the key outputs of AMGMA is a summary of which CAGs contain genes that align to which genomes. A term used frequently in this analysis is 'containment', which describes how much two sets of genes overlap. For a given CAG and genome, three proportions can be reported: the proportion of genes in the CAG which also align to the genome, the proportion of genes aligning to the genome which also belong to the CAG, and the 'containment' of the CAG/genome pair, which is the proportion of genes in the union of both sets which are also in their intersection.
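The three proportions described above can be sketched with plain Python sets. This is a minimal illustration that follows the definitions in the preceding paragraph; the function name, the gene identifiers, and the dictionary keys are hypothetical, and it is not confirmed here that AMGMA computes its output columns in exactly this way.

```python
def overlap_metrics(cag_genes, genome_genes):
    """Return three overlap proportions for one CAG/genome pair."""
    cag_genes = set(cag_genes)
    genome_genes = set(genome_genes)
    shared = cag_genes & genome_genes   # genes found in both sets
    union = cag_genes | genome_genes    # genes found in either set
    return {
        # Proportion of the CAG's genes which also align to the genome
        "cag_prop": len(shared) / len(cag_genes) if cag_genes else 0.0,
        # Proportion of the genome's aligned genes which belong to the CAG
        "genome_prop": len(shared) / len(genome_genes) if genome_genes else 0.0,
        # Proportion of the union which is also in the intersection
        "containment": len(shared) / len(union) if union else 0.0,
    }


# Hypothetical gene identifiers, for illustration only
cag = {"gene_a", "gene_b", "gene_c", "gene_d"}
genome = {"gene_c", "gene_d", "gene_e"}

metrics = overlap_metrics(cag, genome)
print(metrics)
```

Here two genes are shared, so the CAG proportion is 2 of 4, the genome proportion is 2 of 3, and the union-based containment is 2 of 5.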
After aligning the gene catalog generated by geneshot against a
collection of reference genomes, AMGMA will estimate the relative
abundance of the
genes which align to each genome using the aggregate proportion of gene copies
from each specimen which align to that genome. In this way, each genome
is assigned an 'abundance' value for each specimen in the experiment.
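As a rough sketch of that aggregation, assume each specimen provides the proportion of its gene copies assigned to each gene, and each genome has a set of genes which align to it. The genome's abundance in a specimen is then the sum of the proportions of its aligned genes. All names and values below are hypothetical, and the actual AMGMA implementation may differ in its details.

```python
def genome_abundance(gene_props, genome_genes):
    """Sum the per-gene proportions of every gene aligning to each genome."""
    return {
        genome: sum(gene_props.get(gene, 0.0) for gene in genes)
        for genome, genes in genome_genes.items()
    }


# Hypothetical input: proportion of gene copies per gene, per specimen
specimen_gene_props = {
    "specimen_1": {"gene_a": 0.5, "gene_b": 0.25, "gene_c": 0.25},
    "specimen_2": {"gene_b": 0.5, "gene_d": 0.5},
}

# Hypothetical input: genes which align to each genome
genome_genes = {
    "genome_0": {"gene_a", "gene_b"},
    "genome_1": {"gene_c", "gene_d"},
}

# One abundance value per genome, per specimen
abundance = {
    specimen: genome_abundance(props, genome_genes)
    for specimen, props in specimen_gene_props.items()
}
print(abundance)
```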
Using the abundance values for each genome, it is possible to estimate the
association of the relative abundance of the organisms containing that group
of genes with any experimental design. As implemented in AMGMA, the same
formula used to describe an experimental design in a set of geneshot
outputs will be applied to the AMGMA results, with the same set of
estimated coefficients generated by corncob for each experimental
parameter.
Because each genome name can be quite long, an integer index is created
for each genome and used to refer to it in many of the outputs. The integer
index can be mapped back to the input genome using the /genomes/manifest
table.
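Mapping an integer index back to a genome name amounts to a dictionary lookup once the manifest has been loaded. The sketch below assumes the manifest rows have already been read into a list of dictionaries with the `index`, `id`, and `name` columns used in the example table on this page; how the table is loaded from the HDF5 file is left out.

```python
# Two rows copied from the example manifest table on this page
manifest = [
    {
        "index": 0,
        "id": "ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225",
        "name": "ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225",
    },
    {
        "index": 1,
        "id": "ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383",
        "name": "ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383",
    },
]

# Build lookups from the integer index to the genome id and name
index_to_id = {row["index"]: row["id"] for row in manifest}
index_to_name = {row["index"]: row["name"] for row in manifest}

print(index_to_name[1])
```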
In addition to the external genomes, AMGMA will align the gene catalog
against the set of long contigs (above a given size threshold) generated
de novo by geneshot. For this reason, users of AMGMA will see results
for contig sequences (in which every contig name starts with the name
of the specimen it was assembled from) in addition to the genomes which
were input.
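Because contig names start with the specimen name, results for de novo contigs can be separated from results for input genomes with a prefix check. The sketch below assumes the list of specimen names is known to the user; the helper name is hypothetical. An input genome whose name happens to begin with a specimen name would be misclassified by this approach.

```python
def source_specimen(name, specimen_names):
    """Return the specimen whose name prefixes `name`, or None if there is none."""
    # Check longer names first so that "S1" does not shadow "S10"
    for specimen in sorted(specimen_names, key=len, reverse=True):
        if name.startswith(specimen):
            return specimen
    return None


specimens = ["ERR1203923", "ERR1204060"]
names = [
    "ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225",
    "NC_012781.1",
]

# Contigs map to a specimen name; input genomes map to None
labels = {name: source_specimen(name, specimens) for name in names}
print(labels)
```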
The output of AMGMA consists of three files:
- `*.hdf5`: Alignment information in HDF5 format (easily accessed with Python)
- `*.rdb`: Alignment information in RDB format (used for visualization)
- `*.annotations.hdf5`: Additional annotations for each genome (optional)
The output tables in the HDF5 file are as follows, with examples shown from a small dataset in which the experimental parameters are a series of different bacterial species labels:
Identifies the name and ID of each genome in the analysis.
index | id | name |
---|---|---|
0 | ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225 | ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225 |
1 | ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 | ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 |
2 | ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356 | ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356 |
3 | ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 | ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 |
4 | ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 | ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 |
5 | ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237 | ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237 |
Displays the estimated association of each genome (genome_ix)
with each parameter of the experimental design.
genome_ix | parameter | estimate | p_value | std_error | q_value | neg_log10_qvalue | wald |
---|---|---|---|---|---|---|---|
158 | speciesClostridium scindens | -27 | 1 | 5e+04 | 1 | 7.4e-09 | -0.00054 |
158 | speciesClostridium symbiosum | -27 | 1 | 4.2e+04 | 1 | 7.4e-09 | -0.00064 |
158 | speciesEubacterium rectale | -27 | 1 | 4.6e+04 | 1 | 7.4e-09 | -0.00059 |
158 | speciesRuminococcus gnavus | -4.5 | 4.7e-05 | 0.83 | 0.00025 | 3.6 | -5.4 |
158 | speciesRuminococcus torques | -27 | 1 | 3.8e+04 | 1 | 7.4e-09 | -0.00071 |
159 | speciesClostridium scindens | -30 | 1 | 1.2e+05 | 1 | 7.4e-09 | -0.00025 |
159 | speciesClostridium symbiosum | -29 | 1 | 9.8e+04 | 1 | 7.4e-09 | -0.0003 |
159 | speciesEubacterium rectale | -30 | 1 | 1.1e+05 | 1 | 7.4e-09 | -0.00028 |
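In the rounded example rows above, the `wald` column appears consistent with `estimate / std_error`, and `neg_log10_qvalue` with the negative base-10 logarithm of `q_value`. That is an observation from the displayed values rather than a documented guarantee. The sketch below recomputes both for two of the rows and filters on an illustrative q-value threshold of 0.05, which is an assumption and not a pipeline default.

```python
import math

# Two rows copied (with rounded values) from the example table above
rows = [
    {"genome_ix": 158, "parameter": "speciesRuminococcus gnavus",
     "estimate": -4.5, "std_error": 0.83, "q_value": 0.00025},
    {"genome_ix": 158, "parameter": "speciesClostridium scindens",
     "estimate": -27.0, "std_error": 5e4, "q_value": 1.0},
]


def derived(row):
    """Recompute the two derived columns from the primary ones."""
    return {
        "wald": row["estimate"] / row["std_error"],
        "neg_log10_qvalue": -math.log10(row["q_value"]),
    }


# Illustrative threshold; choose one appropriate to the analysis
significant = [row for row in rows if row["q_value"] <= 0.05]

for row in rows:
    print(row["parameter"], derived(row))
```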
Displays the alignment of a gene catalog against a single genome
(identified with the id from the manifest).
index | contig | gene | pident | contig_start | contig_end | contig_len | genome_id | CAG |
---|---|---|---|---|---|---|---|---|
2126 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_534aa489_677aa | 1e+02 | 22220 | 20190 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2127 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_0476f716_627aa | 1e+02 | 16501 | 14621 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2128 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_c435bcdb_475aa | 93 | 18598 | 20022 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2129 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_5faa9628_467aa | 1e+02 | 23588 | 24988 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2130 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_bf5e732b_418aa | 1e+02 | 25005 | 26258 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2131 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_2632a2fa_422aa | 1e+02 | 5344 | 4079 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2133 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_5b1f5a24_414aa | 1e+02 | 17017 | 18258 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2134 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_8cd358e9_411aa | 1e+02 | 28314 | 27082 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2135 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_96932e78_362aa | 1e+02 | 3119 | 2034 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
2136 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | gene_8b2599b7_284aa | 1e+02 | 7014 | 6163 | 31515 | ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515 | 1 |
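Some rows above have `contig_start` greater than `contig_end`. A plausible reading, assumed here and not confirmed by this page, is that these are alignments on the reverse strand and that the coordinates are inclusive. Under that assumption the aligned span of each hit in the example is three times the amino acid length that appears at the end of the gene name.

```python
def describe_hit(contig_start, contig_end):
    """Return (strand, aligned length in bases), assuming inclusive coordinates."""
    # Assumption: start > end indicates a reverse-strand alignment
    strand = "+" if contig_start <= contig_end else "-"
    length = abs(contig_end - contig_start) + 1
    return strand, length


# Coordinates copied from two rows of the example table above
hits = {
    "gene_534aa489_677aa": (22220, 20190),
    "gene_c435bcdb_475aa": (18598, 20022),
}

described = {gene: describe_hit(*coords) for gene, coords in hits.items()}
print(described)
```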
Describes the degree of overlap between CAG assignment of genes and the alignment of those genes against each genome.
genome | CAG | n_genes | containment | genome_prop | genome_bases | cag_prop |
---|---|---|---|---|---|---|
ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225 | 8 | 11 | 0.67 | 0.67 | 29631 | 0.004 |
ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 | 8 | 50 | 0.9 | 0.9 | 45415 | 0.018 |
ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 | 1 | 1 | 0.032 | 0.032 | 1602 | 0.00025 |
ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356 | 8 | 25 | 0.8 | 0.8 | 20211 | 0.009 |
ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 | 8 | 43 | 0.8 | 0.8 | 31905 | 0.016 |
ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 | 7 | 1 | 0.043 | 0.043 | 1704 | 0.00034 |
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 | 8 | 22 | 0.82 | 0.82 | 16131 | 0.008 |
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 | 0 | 1 | 0.061 | 0.061 | 1191 | 0.00024 |
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 | 4 | 1 | 0.061 | 0.061 | 1191 | 0.00029 |
ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237 | 8 | 51 | 0.85 | 0.85 | 50276 | 0.018 |
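A common use of this table is to find the CAG which best matches each genome. The sketch below keeps, for each genome, the row with the highest `containment` value. It works on a few rows copied from the example above, held as plain tuples, and makes no assumption about how the table is loaded.

```python
# (genome, CAG, containment) copied from rows of the example table above
rows = [
    ("ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383", 8, 0.9),
    ("ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383", 1, 0.032),
    ("ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678", 8, 0.82),
    ("ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678", 0, 0.061),
    ("ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678", 4, 0.061),
]

# Keep the (CAG, containment) pair with the highest containment per genome
best = {}
for genome, cag, containment in rows:
    if genome not in best or containment > best[genome][1]:
        best[genome] = (cag, containment)

for genome, (cag, containment) in best.items():
    print(genome, cag, containment)
```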
contig | type | start | end | orientation | annotation |
---|---|---|---|---|---|
NC_012781.1 | gene | 1 | 1362 | + | ID=gene-EUBREC_RS00010;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=EUBREC_RS00010;old_locus_tag=EUBREC_0001 |
NC_012781.1 | CDS | 1 | 1362 | + | ID=cds-WP_012740936.1;Parent=gene-EUBREC_RS00010;Dbxref=Genbank:WP_012740936.1;Name=WP_012740936.1;gbkey=CDS;gene=dnaA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740936.1;locus_tag=EUBREC_RS00010;product=chromosomal replication initiator protein DnaA;protein_id=WP_012740936.1;transl_table=11 |
NC_012781.1 | gene | 1648 | 2760 | + | ID=gene-EUBREC_RS00015;Name=EUBREC_RS00015;gbkey=Gene;gene_biotype=protein_coding;locus_tag=EUBREC_RS00015;old_locus_tag=EUBREC_0002 |
NC_012781.1 | CDS | 1648 | 2760 | + | ID=cds-WP_012740937.1;Parent=gene-EUBREC_RS00015;Dbxref=Genbank:WP_012740937.1;Name=WP_012740937.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_015517736.1;locus_tag=EUBREC_RS00015;product=DNA polymerase III subunit beta;protein_id=WP_012740937.1;transl_table=11 |
NC_012781.1 | gene | 2769 | 2984 | + | ID=gene-EUBREC_RS00020;Name=EUBREC_RS00020;gbkey=Gene;gene_biotype=protein_coding;locus_tag=EUBREC_RS00020;old_locus_tag=EUBREC_0003 |
NC_012781.1 | CDS | 2769 | 2984 | + | ID=cds-WP_012740938.1;Parent=gene-EUBREC_RS00020;Dbxref=Genbank:WP_012740938.1;Name=WP_012740938.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740938.1;locus_tag=EUBREC_RS00020;product=RNA-binding S4 domain-containing protein;protein_id=WP_012740938.1;transl_table=11 |
NC_012781.1 | gene | 2984 | 4072 | + | ID=gene-EUBREC_RS00025;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=EUBREC_RS00025;old_locus_tag=EUBREC_0004 |
NC_012781.1 | CDS | 2984 | 4072 | + | ID=cds-WP_012740939.1;Parent=gene-EUBREC_RS00025;Dbxref=Genbank:WP_012740939.1;Name=WP_012740939.1;gbkey=CDS;gene=recF;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740939.1;locus_tag=EUBREC_RS00025;product=DNA replication/repair protein RecF;protein_id=WP_012740939.1;transl_table=11 |
NC_012781.1 | gene | 4065 | 6002 | + | ID=gene-EUBREC_RS00030;Name=gyrB;gbkey=Gene;gene=gyrB;gene_biotype=protein_coding;locus_tag=EUBREC_RS00030;old_locus_tag=EUBREC_0005 |
NC_012781.1 | CDS | 4065 | 6002 | + | ID=cds-WP_012740940.1;Parent=gene-EUBREC_RS00030;Dbxref=Genbank:WP_012740940.1;Name=WP_012740940.1;gbkey=CDS;gene=gyrB;inference=COORDINATES: similar to AA sequence:RefSeq:WP_006857737.1;locus_tag=EUBREC_RS00030;product=DNA topoisomerase (ATP-hydrolyzing) subunit B;protein_id=WP_012740940.1;transl_table=11 |
Abundance of each genome in a given specimen
abund | acc |
---|---|
0 | ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225 |
1.7e-05 | ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383 |
0 | ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356 |
5.1e-05 | ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999 |
0 | ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678 |
0 | ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237 |
0 | ERR1204060__GENE__k99_112__flag=1__multi=80.0000__len=17323 |
8.3e-05 | ERR1204060__GENE__k99_113__flag=1__multi=93.0000__len=37406 |
0.00013 | ERR1204060__GENE__k99_114__flag=1__multi=74.0000__len=73606 |
0 | ERR1204060__GENE__k99_115__flag=1__multi=61.0000__len=20828 |