Genome Alignments - Golob-Minot/geneshot GitHub Wiki

Background

One way to interpret the results of a metagenomic analysis is to compare the assembled sequences against a reference database of microbial genomes. Because each genome in such a database is thought to correspond to a single organism, its longer contiguous sequences can be used to visualize the spatial organization of genetic elements that may have assembled de novo only into smaller fragments. To process those alignments quickly for geneshot results, a user may run the Annotation of Microbial Genomes by Microbiome Association (AMGMA) pipeline.

Concepts

Containment

A key output of AMGMA is a summary of which CAGs contain genes that align to which genomes. A term used frequently in this analysis is 'containment', which describes how much two sets of genes overlap. For a given CAG and genome, three proportions can be reported: the proportion of genes in the CAG that also align to the genome; the proportion of genes aligning to the genome that also belong to the CAG; and the 'containment' of the CAG/genome pair, which is the proportion of genes in the union of both sets that are also in their intersection.
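
The three proportions can be illustrated with plain Python sets. This is a minimal sketch rather than AMGMA's own code: the function name `containment_stats`, the key `union_prop`, and the toy gene names are hypothetical, and genes are simply counted here, whereas the pipeline itself may weight genes differently (for example by aligned bases).

```python
def containment_stats(cag_genes, genome_genes):
    """Summarize the overlap between a CAG and the genes aligning to a genome."""
    cag_genes, genome_genes = set(cag_genes), set(genome_genes)
    shared = cag_genes & genome_genes   # genes found in both sets
    union = cag_genes | genome_genes    # genes found in either set
    return {
        "n_genes": len(shared),
        # Proportion of the CAG's genes which also align to the genome
        "cag_prop": len(shared) / len(cag_genes) if cag_genes else 0.0,
        # Proportion of the genome's aligned genes which belong to the CAG
        "genome_prop": len(shared) / len(genome_genes) if genome_genes else 0.0,
        # Proportion of the union which is also in the intersection
        "union_prop": len(shared) / len(union) if union else 0.0,
    }


# Toy example (hypothetical gene names)
cag = {"gene_a", "gene_b", "gene_c", "gene_d"}
genome = {"gene_c", "gene_d", "gene_e"}
stats = containment_stats(cag, genome)
print(stats)
```

In this toy case two genes are shared, so half of the CAG aligns to the genome, two thirds of the genome's aligned genes fall in the CAG, and the intersection makes up two fifths of the union.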

Abundance

After aligning the gene catalog generated by geneshot against a collection of reference genomes, AMGMA will estimate the relative abundance of the genes which align to each genome using the aggregate proportion of gene copies from each specimen which align to that genome. In this way, each genome is assigned an 'abundance' value for each specimen in the experiment.
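
As a rough illustration of this aggregation, the sketch below sums the per-specimen proportion of gene copies over the genes aligning to each genome. It assumes gene-level proportions are already available as a dictionary per specimen; the names `genome_abundance`, `specimen_gene_props`, and `genome_alignments`, and all values, are hypothetical and not taken from AMGMA.

```python
def genome_abundance(gene_props, genome_genes):
    """Sum the proportion of gene copies over the genes aligning to one genome."""
    return sum(gene_props.get(gene, 0.0) for gene in genome_genes)


# Proportion of gene copies per gene, for each specimen (toy values)
specimen_gene_props = {
    "specimen_1": {"gene_a": 0.50, "gene_b": 0.25, "gene_c": 0.25},
    "specimen_2": {"gene_a": 0.10, "gene_c": 0.90},
}

# Genes from the catalog which align to each genome (toy values)
genome_alignments = {
    "genome_X": {"gene_a", "gene_b"},
    "genome_Y": {"gene_c"},
}

# One abundance value per genome, per specimen
abund = {}
for specimen, gene_props in specimen_gene_props.items():
    abund[specimen] = {}
    for genome, genes in genome_alignments.items():
        abund[specimen][genome] = genome_abundance(gene_props, genes)

print(abund)
```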

Association

Using the abundance values for each genome, it is possible to estimate the association of the relative abundance of the organisms containing that group of genes with any experimental design. As implemented in AMGMA, the same formula used to describe an experimental design in a set of geneshot outputs will be applied to the AMGMA results, with the same set of estimated coefficients generated by corncob for each experimental parameter.

Indexing

Because each genome name can be quite long, an integer index is created for each genome and used to refer to it in many of the outputs. The integer index can be mapped back to the input genome using the /genomes/manifest table.
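
The lookup can be sketched with a small in-memory copy of the manifest. This assumes that the `index` column of the manifest is the integer used as `genome_ix` in other tables; the helper `genome_name` is hypothetical, and the two rows are copied from the example manifest shown under Output Files.

```python
# (index, name) pairs, as they would appear in /genomes/manifest
manifest_rows = [
    (0, "ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225"),
    (1, "ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383"),
]

# Build the lookup from integer index to genome name
ix_to_name = dict(manifest_rows)


def genome_name(genome_ix):
    """Return the genome name for an integer index, or a placeholder if unknown."""
    return ix_to_name.get(genome_ix, f"unknown genome ({genome_ix})")


print(genome_name(1))
print(genome_name(999))
```

With a real output file, the same table could presumably be loaded with pandas (`pd.read_hdf(path, "/genomes/manifest")`), assuming the file was written in a pandas-compatible layout.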

Contigs

In addition to the external genomes, AMGMA will align the gene catalog against the set of long contigs (above a given size threshold) generated de novo by geneshot. For this reason, users of AMGMA will see results for contig sequences (in which every contig name starts with the name of the specimen it was assembled from) in addition to the genomes which were input.
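
Contig entries can be told apart from input genomes by their names. The sketch below assumes the naming pattern seen in the examples on this page (`<specimen>__GENE__...__len=<bases>`); the `__GENE__` delimiter and the `len=` field are inferred from those examples rather than from a documented naming contract, and `parse_contig_name` is a hypothetical helper.

```python
def parse_contig_name(name):
    """Extract the specimen and contig length from a contig name.

    Returns None if the name does not follow the assumed contig pattern.
    """
    specimen, delimiter, rest = name.partition("__GENE__")
    if not delimiter:
        return None  # no delimiter found, so probably an input genome
    # Collect the key=value fields, e.g. flag=1, multi=90.0000, len=44225
    fields = dict(
        part.split("=", 1) for part in rest.split("__") if "=" in part
    )
    return {
        "specimen": specimen,
        "length": int(fields["len"]) if "len" in fields else None,
    }


contig = parse_contig_name(
    "ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225"
)
genome = parse_contig_name("NC_012781.1")
print(contig, genome)
```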

Output Files

The output of AMGMA consists of up to three files (the third is optional):

*.hdf5: Alignment information in HDF5 format (easily accessed with Python)
*.rdb: Alignment information in RDB format (used for visualization)
*.annotations.hdf5: Additional annotations for each genome (optional)

The output tables in the HDF5 file are as follows, with examples shown from a small dataset in which the experimental parameters are a series of different bacterial species labels:

Genome Manifest (`/genomes/manifest`)

Identifies the name and ID of each genome in the analysis.

index	id	name
0	ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225	ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225
1	ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383	ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383
2	ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356	ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356
3	ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999	ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999
4	ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678	ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678
5	ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237	ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237

Estimated Associations (`/stats/genome/corncob`)

Displays the estimated association of each genome (genome_ix) with each parameter of the experimental design.

genome_ix	parameter	estimate	p_value	std_error	q_value	neg_log10_qvalue	wald
158	speciesClostridium scindens	-27	1	5e+04	1	7.4e-09	-0.00054
158	speciesClostridium symbiosum	-27	1	4.2e+04	1	7.4e-09	-0.00064
158	speciesEubacterium rectale	-27	1	4.6e+04	1	7.4e-09	-0.00059
158	speciesRuminococcus gnavus	-4.5	4.7e-05	0.83	0.00025	3.6	-5.4
158	speciesRuminococcus torques	-27	1	3.8e+04	1	7.4e-09	-0.00071
159	speciesClostridium scindens	-30	1	1.2e+05	1	7.4e-09	-0.00025
159	speciesClostridium symbiosum	-29	1	9.8e+04	1	7.4e-09	-0.0003
159	speciesEubacterium rectale	-30	1	1.1e+05	1	7.4e-09	-0.00028
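
In the Ruminococcus gnavus row above, `wald` matches `estimate / std_error` and `neg_log10_qvalue` matches `-log10(q_value)` to the printed precision. The sketch below recomputes both for two of the example rows and keeps the associations under a q-value cutoff. The 0.05 cutoff is an arbitrary illustrative choice, not a pipeline default, and the rows are typed in by hand rather than read from the HDF5 file.

```python
import math

# Two rows copied from the example table above
rows = [
    {"genome_ix": 158, "parameter": "speciesClostridium scindens",
     "estimate": -27.0, "std_error": 5e4, "q_value": 1.0},
    {"genome_ix": 158, "parameter": "speciesRuminococcus gnavus",
     "estimate": -4.5, "std_error": 0.83, "q_value": 0.00025},
]

for row in rows:
    # Wald statistic: estimated coefficient divided by its standard error
    row["wald"] = row["estimate"] / row["std_error"]
    # Larger values indicate smaller (more significant) q-values
    row["neg_log10_qvalue"] = -math.log10(row["q_value"])

# Keep associations passing an illustrative q-value cutoff
significant = [row for row in rows if row["q_value"] < 0.05]

for row in significant:
    print(row["genome_ix"], row["parameter"], round(row["wald"], 1))
```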

Detailed Alignments (`/genomes/detail/<CONTIG ID>`)

Displays the alignment of the gene catalog against a single genome (identified with the id from the manifest).

index	contig	gene	pident	contig_start	contig_end	contig_len	genome_id	CAG
2126	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_534aa489_677aa	1e+02	22220	20190	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2127	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_0476f716_627aa	1e+02	16501	14621	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2128	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_c435bcdb_475aa	93	18598	20022	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2129	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_5faa9628_467aa	1e+02	23588	24988	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2130	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_bf5e732b_418aa	1e+02	25005	26258	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2131	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_2632a2fa_422aa	1e+02	5344	4079	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2133	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_5b1f5a24_414aa	1e+02	17017	18258	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2134	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_8cd358e9_411aa	1e+02	28314	27082	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2135	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_96932e78_362aa	1e+02	3119	2034	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
2136	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	gene_8b2599b7_284aa	1e+02	7014	6163	31515	ERR1203923__GENE__k99_100__flag=1__multi=77.0000__len=31515	1
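
In the example rows, `contig_start` is sometimes larger than `contig_end`, and the span between the two is three times the amino-acid length embedded in the gene name (22220 to 20190 covers 2031 bases, and 3 × 677 = 2031). The sketch below derives an orientation and span from the two coordinates. Reading a descending coordinate pair as a reverse-strand alignment is an assumption, and `describe_alignment` is a hypothetical helper.

```python
def describe_alignment(contig_start, contig_end):
    """Derive orientation and span (in bases, inclusive) from two coordinates."""
    # Assumption: start > end means the gene aligns to the reverse strand
    strand = "+" if contig_start <= contig_end else "-"
    left, right = sorted((contig_start, contig_end))
    return {"strand": strand, "left": left, "right": right,
            "span": right - left + 1}


# Coordinates copied from two rows of the example table above
examples = {
    "gene_534aa489_677aa": (22220, 20190),
    "gene_c435bcdb_475aa": (18598, 20022),
}

described = {
    gene: describe_alignment(start, end)
    for gene, (start, end) in examples.items()
}

for gene, info in described.items():
    print(gene, info["strand"], info["span"])
```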

Containment (`/genomes/cags/containment`)

Describes the degree of overlap between CAG assignment of genes and the alignment of those genes against each genome.

genome	CAG	n_genes	containment	genome_prop	genome_bases	cag_prop
ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225	8	11	0.67	0.67	29631	0.004
ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383	8	50	0.9	0.9	45415	0.018
ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383	1	1	0.032	0.032	1602	0.00025
ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356	8	25	0.8	0.8	20211	0.009
ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999	8	43	0.8	0.8	31905	0.016
ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999	7	1	0.043	0.043	1704	0.00034
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678	8	22	0.82	0.82	16131	0.008
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678	0	1	0.061	0.061	1191	0.00024
ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678	4	1	0.061	0.061	1191	0.00029
ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237	8	51	0.85	0.85	50276	0.018
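
A common use of this table is to find the CAG which best matches each genome. The sketch below does this for a few rows copied from the example above, picking the CAG with the highest `containment` per genome. The selection rule and the variable names are illustrative choices, not part of AMGMA.

```python
# (genome, CAG, n_genes, containment) copied from the example table above
rows = [
    ("ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383", 8, 50, 0.9),
    ("ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383", 1, 1, 0.032),
    ("ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678", 8, 22, 0.82),
    ("ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678", 0, 1, 0.061),
    ("ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678", 4, 1, 0.061),
]

# Track the best-matching CAG seen so far for each genome
best_cag = {}
for genome, cag, n_genes, containment in rows:
    current = best_cag.get(genome)
    if current is None or containment > current["containment"]:
        best_cag[genome] = {
            "CAG": cag, "n_genes": n_genes, "containment": containment,
        }

for genome, match in best_cag.items():
    print(genome, match)
```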

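Genome Annotations (`/genomes/annotations/<GENOME ID>`)

Lists the annotated features (such as genes and CDS) of a single genome, with the coordinates and orientation of each feature.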

contig	type	start	end	orientation	annotation
NC_012781.1	gene	1	1362	+	ID=gene-EUBREC_RS00010;Name=dnaA;gbkey=Gene;gene=dnaA;gene_biotype=protein_coding;locus_tag=EUBREC_RS00010;old_locus_tag=EUBREC_0001
NC_012781.1	CDS	1	1362	+	ID=cds-WP_012740936.1;Parent=gene-EUBREC_RS00010;Dbxref=Genbank:WP_012740936.1;Name=WP_012740936.1;gbkey=CDS;gene=dnaA;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740936.1;locus_tag=EUBREC_RS00010;product=chromosomal replication initiator protein DnaA;protein_id=WP_012740936.1;transl_table=11
NC_012781.1	gene	1648	2760	+	ID=gene-EUBREC_RS00015;Name=EUBREC_RS00015;gbkey=Gene;gene_biotype=protein_coding;locus_tag=EUBREC_RS00015;old_locus_tag=EUBREC_0002
NC_012781.1	CDS	1648	2760	+	ID=cds-WP_012740937.1;Parent=gene-EUBREC_RS00015;Dbxref=Genbank:WP_012740937.1;Name=WP_012740937.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_015517736.1;locus_tag=EUBREC_RS00015;product=DNA polymerase III subunit beta;protein_id=WP_012740937.1;transl_table=11
NC_012781.1	gene	2769	2984	+	ID=gene-EUBREC_RS00020;Name=EUBREC_RS00020;gbkey=Gene;gene_biotype=protein_coding;locus_tag=EUBREC_RS00020;old_locus_tag=EUBREC_0003
NC_012781.1	CDS	2769	2984	+	ID=cds-WP_012740938.1;Parent=gene-EUBREC_RS00020;Dbxref=Genbank:WP_012740938.1;Name=WP_012740938.1;gbkey=CDS;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740938.1;locus_tag=EUBREC_RS00020;product=RNA-binding S4 domain-containing protein;protein_id=WP_012740938.1;transl_table=11
NC_012781.1	gene	2984	4072	+	ID=gene-EUBREC_RS00025;Name=recF;gbkey=Gene;gene=recF;gene_biotype=protein_coding;locus_tag=EUBREC_RS00025;old_locus_tag=EUBREC_0004
NC_012781.1	CDS	2984	4072	+	ID=cds-WP_012740939.1;Parent=gene-EUBREC_RS00025;Dbxref=Genbank:WP_012740939.1;Name=WP_012740939.1;gbkey=CDS;gene=recF;inference=COORDINATES: similar to AA sequence:RefSeq:WP_012740939.1;locus_tag=EUBREC_RS00025;product=DNA replication/repair protein RecF;protein_id=WP_012740939.1;transl_table=11
NC_012781.1	gene	4065	6002	+	ID=gene-EUBREC_RS00030;Name=gyrB;gbkey=Gene;gene=gyrB;gene_biotype=protein_coding;locus_tag=EUBREC_RS00030;old_locus_tag=EUBREC_0005
NC_012781.1	CDS	4065	6002	+	ID=cds-WP_012740940.1;Parent=gene-EUBREC_RS00030;Dbxref=Genbank:WP_012740940.1;Name=WP_012740940.1;gbkey=CDS;gene=gyrB;inference=COORDINATES: similar to AA sequence:RefSeq:WP_006857737.1;locus_tag=EUBREC_RS00030;product=DNA topoisomerase (ATP-hydrolyzing) subunit B;protein_id=WP_012740940.1;transl_table=11

Abundances (`/genome/abund/raw/<SPECIMEN>`)

Displays the abundance of each genome in a given specimen.

abund	acc
0	ERR1204060__GENE__k99_100__flag=1__multi=90.0000__len=44225
1.7e-05	ERR1204060__GENE__k99_102__flag=1__multi=84.0000__len=50383
0	ERR1204060__GENE__k99_103__flag=1__multi=73.0000__len=25356
5.1e-05	ERR1204060__GENE__k99_108__flag=1__multi=76.0000__len=39999
0	ERR1204060__GENE__k99_109__flag=1__multi=72.0000__len=19678
0	ERR1204060__GENE__k99_110__flag=1__multi=69.0000__len=59237
0	ERR1204060__GENE__k99_112__flag=1__multi=80.0000__len=17323
8.3e-05	ERR1204060__GENE__k99_113__flag=1__multi=93.0000__len=37406
0.00013	ERR1204060__GENE__k99_114__flag=1__multi=74.0000__len=73606
0	ERR1204060__GENE__k99_115__flag=1__multi=61.0000__len=20828
