Co Abundant Gene Clustering - Golob-Minot/geneshot GitHub Wiki

Background

One key feature of geneshot is the ability to group together genes that are 'co-abundant', meaning they are found at similar abundances across all of the samples in a dataset. This approach has been described as a way to identify genes that are likely present within the same organism across many samples (Nielsen et al., 2014). It may also group together genes from distinct organisms that are always found together at similar relative abundances across all of the samples in an experiment. To make the identification of co-abundant gene groups more computationally tractable, the geneshot workflow uses an Approximate Nearest Neighbor algorithm, an approach we described in a 2019 publication (Minot and Willis, 2019).
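To make the idea of co-abundance concrete, here is a minimal sketch that compares abundance profiles using cosine distance (the default metric listed under Options). It is only an illustration, not code from geneshot: the function name, gene names, and abundance values are all hypothetical.

```python
import math

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

# Hypothetical relative abundances of three genes across four samples.
gene_a = [0.10, 0.00, 0.30, 0.20]
gene_b = [0.20, 0.00, 0.60, 0.40]  # proportional to gene_a in every sample
gene_c = [0.00, 0.50, 0.00, 0.10]  # abundant in different samples

d_ab = cosine_distance(gene_a, gene_b)
d_ac = cosine_distance(gene_a, gene_c)
print(f"gene_a vs gene_b: {d_ab:.3f}")
print(f"gene_a vs gene_c: {d_ac:.3f}")
```

In this toy data, gene_a and gene_b rise and fall together across samples, so their distance is close to zero and they would be candidates for the same group, while gene_c follows a different pattern and sits far from both.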

Options

While the default options should work well for most datasets, you can also customize the criteria used to identify co-abundant gene groups (CAGs).

--distance_metric: Distance metric used to group genes by co-abundance, default: cosine
--distance_threshold: Distance threshold used to group genes by co-abundance. Values range from 0 to 1; smaller values produce smaller CAGs and larger values produce larger CAGs, default: 0.5
--linkage_type: Linkage type used to group genes by co-abundance, default: average (i.e. average linkage)
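
As a rough illustration of how these three options interact, the sketch below runs exhaustive average-linkage clustering with cosine distance on four made-up gene abundance profiles and stops merging once the closest pair of groups is farther apart than the threshold. It is not geneshot's implementation, which relies on an Approximate Nearest Neighbor algorithm to scale to large datasets; the function names, gene names, and abundance values are all hypothetical.

```python
import math
from itertools import combinations

def cosine_distance(a, b):
    """Return 1 - cosine similarity of two equal-length abundance vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def average_linkage_clusters(profiles, threshold):
    """Merge the closest groups until their average distance exceeds threshold."""
    names = sorted(profiles)
    # Precompute the cosine distance between every pair of genes.
    pair_dist = {
        frozenset((a, b)): cosine_distance(profiles[a], profiles[b])
        for a, b in combinations(names, 2)
    }
    clusters = [[name] for name in names]

    def linkage(c1, c2):
        # Average linkage: mean of all pairwise distances between members.
        total = sum(pair_dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        # Find the two groups that are currently closest to each other.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        if linkage(clusters[i], clusters[j]) > threshold:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)

# Hypothetical abundances of four genes across four samples.
profiles = {
    "gene_1": [1, 0, 3, 2],
    "gene_2": [2, 0, 6, 4],  # proportional to gene_1
    "gene_3": [1, 1, 3, 1],  # similar to gene_1, but not identical in shape
    "gene_4": [0, 5, 0, 1],  # a different pattern altogether
}

strict = average_linkage_clusters(profiles, threshold=0.05)
default = average_linkage_clusters(profiles, threshold=0.5)
print("threshold 0.05:", strict)
print("threshold 0.5: ", default)
```

With these made-up profiles, a threshold of 0.05 keeps only the two proportional genes together, while the default of 0.5 also pulls gene_3 into that group, in line with the description of --distance_threshold above.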

References:

Nielsen, H., Almeida, M., Juncker, A. et al. Identification and assembly of genomes and genetic elements in complex metagenomic samples without using reference genomes. Nat Biotechnol 32, 822–828 (2014) doi:10.1038/nbt.2939
Minot, S.S., Willis, A.D. Clustering co-abundant genes identifies components of the gut microbiome that are reproducibly associated with colorectal cancer and inflammatory bowel disease. Microbiome 7, 110 (2019) doi:10.1186/s40168-019-0722-6