Notes on Metadata - Golob-Minot/geneshot GitHub Wiki
Some general guidance on how best to format metadata in the manifest:
Column Names
It’s best to have every column name be a single string with no spaces. Having a description of how values are encoded (e.g. 0=no, 1=yes) is a great practice, but it’s probably best to put that information in a separate file.
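As a quick check of the guidance above, the following minimal sketch lists any column names that contain whitespace. It assumes the manifest is a comma-separated file with a header row; the function name `columns_with_whitespace` and the example column names are hypothetical and not part of geneshot.

```python
import csv
import io


def columns_with_whitespace(manifest_text):
    """Return the header names that contain any whitespace character."""
    reader = csv.reader(io.StringIO(manifest_text))
    header = next(reader)
    return [name for name in header if any(ch.isspace() for ch in name)]


# Hypothetical manifest contents, for illustration only.
example = "specimen,collection day,treatment\nS1,0,control\n"
bad = columns_with_whitespace(example)
print(bad)  # ['collection day']
```

Renaming `collection day` to something like `collection_day` would satisfy the recommendation.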
Strings vs. Numbers
To prevent confusion, it’s best to have the entries in a column be either all numbers or all strings. It’s fine to have numbers included in strings (e.g. Day0), of course. Any column containing strings will be treated as a categorical variable, while any column that is all numbers will be treated as continuous (or binary). When deciding which to use, consider how each type of variable is handled in the analysis and plotting cases described below.
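One way to spot columns that mix numbers and strings before running the analysis is sketched below. This is an illustrative check only, not geneshot's own parsing logic; the function name `classify_column` and the label `"mixed"` are assumptions made for this example.

```python
def classify_column(values):
    """Label a column as 'numeric', 'string', or 'mixed' from its raw text values."""

    def is_number(text):
        try:
            float(text)
            return True
        except ValueError:
            return False

    flags = [is_number(value) for value in values]
    if all(flags):
        return "numeric"  # would be treated as continuous (or binary)
    if not any(flags):
        return "string"   # would be treated as categorical
    return "mixed"        # worth cleaning up before analysis


print(classify_column(["0", "1", "2"]))      # numeric
print(classify_column(["Day0", "Day1"]))     # string
print(classify_column(["0", "Day1", "2"]))   # mixed
```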
Categorical Variables
Estimating Coefficients of Association
When a categorical variable is being used as part of a “formula” to calculate the association of microbial abundance with your experimental design, the groups are sorted alphabetically and then every group is compared to the first group in a pairwise fashion. For example, in a column with ‘day0’, ‘day1’, and ‘day2’, the associations will be calculated comparing the groups day1 vs. day0 and day2 vs. day0.
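The comparison scheme described above can be sketched in a few lines. This is a minimal illustration of "sort alphabetically, then compare each group to the first", not the pipeline's actual code; the function name `pairwise_contrasts` is hypothetical.

```python
def pairwise_contrasts(values):
    """Return (group, reference) pairs, using the alphabetically first group as reference."""
    groups = sorted(set(values))
    reference = groups[0]
    return [(group, reference) for group in groups[1:]]


contrasts = pairwise_contrasts(["day1", "day0", "day2", "day0"])
print(contrasts)  # [('day1', 'day0'), ('day2', 'day0')]
```

Note that under a plain string sort such as Python's, a label like `day10` sorts before `day2`, so zero-padded labels (e.g. `day02`, `day10`) may be a safer choice when the order of groups matters.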
Plotting
When a categorical variable is being used for plotting purposes, each set of values is rendered as a distinct group. With the example above used to render a color scale, day0, day1, and day2 would all have completely distinct colors. When used as the x-axis, values from each unique category would be placed in a distinct group.
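As a rough illustration of a categorical color scale, the sketch below assigns each unique value its own color. The palette values and the function name `assign_category_colors` are assumptions chosen for this example and do not reflect the colors the plotting tools actually use.

```python
def assign_category_colors(values, palette):
    """Map each unique category to a distinct color from the palette."""
    categories = sorted(set(values))
    if len(categories) > len(palette):
        raise ValueError("palette has fewer colors than categories")
    return dict(zip(categories, palette))


# Hypothetical palette of three hex colors.
palette = ["#1f77b4", "#ff7f0e", "#2ca02c"]
category_colors = assign_category_colors(["day0", "day1", "day2", "day1"], palette)
print(category_colors)
```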
Continuous Variables
Estimating Coefficients of Association
When a continuous variable is being used as part of a “formula” to calculate the association of microbial abundance with your experimental design, the variable is used as part of a linear regression. In a column labeled ‘day’ with values 0, 1, and 2, the associations will be calculated for the linear association with day.
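To make the idea of a linear association concrete, here is a minimal ordinary least squares sketch. It is for illustration only: the statistical model the pipeline actually fits may differ, the function name `least_squares_fit` is hypothetical, and the abundance values are made up.

```python
def least_squares_fit(x, y):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: abundance rises by one unit per day.
day = [0, 1, 2, 0, 1, 2]
abundance = [1.0, 2.0, 3.0, 1.0, 2.0, 3.0]
slope, intercept = least_squares_fit(day, abundance)
print(slope, intercept)  # 1.0 1.0
```

Here a single coefficient (the slope) summarizes the association with `day`, in contrast to the categorical case, where each group gets its own comparison against the reference group.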
Plotting
When a continuous variable is being used for plotting purposes to form a color scale, it is plotted on a single continuous color scale. When used as the x-axis, each value is laid out on a continuous numeric scale.
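A continuous color scale can be sketched as linear interpolation between two end colors. The blue-to-red endpoints and the function name `continuous_color` are assumptions for this example, not the scale the plotting tools actually use.

```python
def continuous_color(value, vmin, vmax, low=(0, 0, 255), high=(255, 0, 0)):
    """Interpolate an RGB color for value on a scale running from vmin to vmax."""
    fraction = (value - vmin) / (vmax - vmin)
    return tuple(round(lo + fraction * (hi - lo)) for lo, hi in zip(low, high))


# Days 0, 1, and 2 fall at the start, middle, and end of the same scale.
scale_colors = [continuous_color(day, 0, 2) for day in [0, 1, 2]]
print(scale_colors)
```

Day 0 maps to the low end color, day 2 to the high end color, and day 1 to a blend of the two, which is what distinguishes this from the fully distinct colors used for categories.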