Statistical Analysis - Golob-Minot/geneshot GitHub Wiki
We have implemented support for statistical analysis of microbiome survey datasets directly in the `geneshot` tool. This analysis is intended to help the user identify those CAGs (groups of co-abundant microbial genes) whose relative abundance is significantly associated with any of the metadata features provided by the user in the manifest.
To take advantage of this optional feature, the user needs to provide a formula with the `--formula` flag and include the metadata columns named in that formula in the manifest.
For example, let us consider a manifest CSV which contains the following information:
specimen | R1 | R2 | participant | disease | bristol |
---|---|---|---|---|---|
pA_s1 | <> | <> | pA | 0 | 3 |
pA_s2 | <> | <> | pA | 0 | 6 |
... | <> | <> | ... | ... | ... |
pZ_s5 | <> | <> | pZ | 0 | 4 |
pZ_s9 | <> | <> | pZ | 1 | 3 |
In this experiment we have obtained multiple microbiome samples from each of several participants. For each participant, some samples were collected at times when they were experiencing a transient disease process, which is recorded in the `disease` column. The Bristol score has also been recorded for each sample.
It is recommended that binary variables be coded as 0 / 1, categorical variables be coded as strings, and continuous variables be coded as floats.
This is one particular experimental design used for illustrative purposes, and likely does not fit your experiment.
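The coding recommendation above can be checked before launching a run. The following is a minimal sketch, not part of geneshot itself: the manifest text mirrors only the rows shown in the example table, `<>` stands in for the elided R1/R2 paths, and the helper names `classify_column` and `summarize_manifest` are hypothetical.

```python
import csv
import io

# Hypothetical manifest mirroring the example table; "<>" stands in for the
# elided R1/R2 file paths, and only the rows shown in the table are included.
MANIFEST_CSV = """\
specimen,R1,R2,participant,disease,bristol
pA_s1,<>,<>,pA,0,3
pA_s2,<>,<>,pA,0,6
pZ_s5,<>,<>,pZ,0,4
pZ_s9,<>,<>,pZ,1,3
"""


def classify_column(values):
    """Guess how a metadata column is coded: binary, continuous, or categorical."""
    if set(values) <= {"0", "1"}:
        return "binary"
    try:
        for v in values:
            float(v)  # raises ValueError for non-numeric strings
    except ValueError:
        return "categorical"
    return "continuous"


def summarize_manifest(text, skip=("specimen", "R1", "R2")):
    """Map each metadata column name to its inferred coding.

    Assumes the manifest has a header row and at least one data row.
    """
    rows = list(csv.DictReader(io.StringIO(text)))
    return {
        col: classify_column([row[col] for row in rows])
        for col in rows[0]
        if col not in skip
    }


if __name__ == "__main__":
    for column, kind in summarize_manifest(MANIFEST_CSV).items():
        print(f"{column}: {kind}")
```

A column that comes back as a different kind than intended (for example a participant ID made only of digits, which would be read as continuous) is a hint to recode it before running the analysis.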
In order to enable the statistical analysis, use the --formula flag. This formula will be used to run Corncob on each CAG individually, testing for association with those features described in the manifest. Multiple formulae may be specified as a comma-delimited list.
Examples:
- `--formula "disease"`: Test for the association of the relative abundance of every CAG with the binary `disease` label
- `--formula "disease + participant"`: Test for the association of the relative abundance of every CAG with the binary `disease` label, allowing for the intercept to vary by participant (because it is a categorical variable in the provided table)
- `--formula "disease,participant"`: In two independent models, test for the association of the relative abundance of every CAG with (a) the binary `disease` label and (b) the categorical `participant` label
- `--formula "disease + bristol + participant"`: Test for the association of the relative abundance of every CAG with the binary `disease` label and the continuous `bristol` label, while allowing the intercept to vary by participant
- `--formula "disease * bristol + participant"`: Test for the association of the relative abundance of every CAG with the binary `disease` label and the continuous `bristol` label, while allowing for a `disease:bristol` interaction term, while also allowing the intercept to vary by participant
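To make the difference between `,` and `+` in the examples above concrete, here is a small illustrative sketch of how a `--formula` value breaks down into separate models and the manifest columns each one names. It is not geneshot's own parser: the function names are hypothetical, and only the operators that appear in the examples (`+`, `*`, `:`) are handled, not the full R formula syntax.

```python
import re


def split_formulae(formula_arg):
    """Split a comma-delimited --formula value into one formula per model."""
    return [part.strip() for part in formula_arg.split(",") if part.strip()]


def formula_variables(formula):
    """List the manifest columns named in one formula, in order of appearance.

    Only the operators used in the examples (+, *, :) are handled.
    """
    variables = []
    for token in re.split(r"[+*:]", formula):
        name = token.strip()
        if name and name not in variables:
            variables.append(name)
    return variables


if __name__ == "__main__":
    examples = (
        "disease",
        "disease,participant",
        "disease * bristol + participant",
    )
    for arg in examples:
        for model in split_formulae(arg):
            print(f"{arg!r} -> model {model!r} uses {formula_variables(model)}")
```

Under this reading, `"disease,participant"` yields two single-variable models, while `"disease + participant"` yields one model with two variables.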
If the user provides the `--formula` flag, the first step of `geneshot` will be to perform a dry run and ensure that this test can be executed with the manifest provided. Importantly, this test must pass before any large-scale compute is allowed to start.
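One common reason a formula cannot be applied to a manifest is a variable name that does not match any column. The sketch below is a local pre-flight check a user could run before invoking geneshot. It does not reproduce geneshot's dry run, whose internals are not described here. The function name `missing_columns` and the header string are hypothetical, and only the `+`, `*`, and `:` operators are recognized.

```python
import csv
import io
import re


def missing_columns(formula_arg, manifest_text):
    """Return formula variables that are not column names in the manifest."""
    # Read only the header row of the manifest CSV.
    header = next(csv.reader(io.StringIO(manifest_text)))
    columns = {name.strip() for name in header}

    missing = []
    for formula in formula_arg.split(","):  # one formula per model
        for token in re.split(r"[+*:]", formula):
            name = token.strip()
            if name and name not in columns and name not in missing:
                missing.append(name)
    return missing


if __name__ == "__main__":
    # Hypothetical header matching the example manifest columns.
    header_only = "specimen,R1,R2,participant,disease,bristol\n"
    for arg in ("disease * bristol + participant", "disease,age"):
        print(f"{arg!r}: missing columns = {missing_columns(arg, header_only)}")
```

An empty list means every variable in the formula has a matching column. It does not guarantee that the model itself can be fit.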