Metadata - egenomics/agb2025 GitHub Wiki

For testing, the metadata templates consist of two flat tables per run:

Sample level metadata: one row per stool specimen (Sample_ID).
Run level metadata: one set of values per sequencing batch (RunID), repeated on a row for each sample so that both tables stay tidy and linkable.

Sample level variables

Variable	Description / Allowed Values
`Sample_Accession`	Preferred format: `SAF1030525` (`S` + 2 letters + 1 digit + DDMMYY). Legacy `SHM01180525` still accepted.
`Institution`	Title Case, spaces → `_` (`Hospital_del_mar`)
`Department`	Title Case, spaces → `_` (`Digestology_unit`)
`Collection_Date`	ISO `YYYY-MM-DD`
`Collection_Storage_Temp`	`-80`, `-20`, `4`, `Room_Temperature`
`Analyst_Processor_Name`	Title Case, spaces → `_`
`Gender`	`Female`, `Male`, `Not_specified`
`Age`	Integer or `N/A`
`Ongoing_conditions`	Controlled list (IBD, IBS, NAFLD, …)
`Appendix_removed`	`Yes`, `No`
`Allergies`	Controlled list (Peanuts, Penicillin, …)
`Dietary_Information`	`Omnivore`, `Vegetarian`, `Vegan`, `Keto`, …
`Bowel_movement_quality`	`Constipated`, `Normal`, `Diarrhea`
`Antibiotic_intake`	`Week`, `Month`, `6_months`, `Year`, `Past_year`
`Medications`	Controlled list (PPIs, NSAIDs, …)
`Cancer`	`Yes`, `No`
`Body_Mass_Index`	`Underweight_<18.5`, `Normal_18.5-24.9`, `Overweight_25-29.9`, `Obese_>=30`
`Exercise_frequency`	`Never`, `Rarely`, `1-2_times_per_week`, `3-5_times_per_week`, `Daily`
`Smoking_status`	`Smoker`, `Non-smoker`
`Daily_cigarettes`	`1-5`, `6-10`, `11-15`, `16-20`, `+20`
`Alcohol_consumption`	`Yes`, `No`
`Frequency_of_alcohol_consumption`	Drinks per week (integer)
`Notes_Samples`	Free text (Title Case, `_` for spaces)
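
The sample-level rules above lend themselves to an automated check. The following is a minimal sketch, not the project's actual validator: `validate_sample_row` and `ALLOWED` are hypothetical names, the legacy accession pattern is inferred from the single example `SHM01180525`, and only fields whose allowed values are listed in full are checked (lists ending in `…` are skipped).

```python
import re
from datetime import date

# Preferred: S + 2 letters + 1 digit + DDMMYY, e.g. SAF1030525.
PREFERRED_ACCESSION = re.compile(r"S[A-Z]{2}[0-9][0-9]{6}")
# Assumption: legacy IDs look like the one example given (SHM01180525),
# i.e. S + 2 letters + 2 digits + DDMMYY.
LEGACY_ACCESSION = re.compile(r"S[A-Z]{2}[0-9]{2}[0-9]{6}")
ISO_DATE = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")

# Closed vocabularies copied from the table above.
ALLOWED = {
    "Collection_Storage_Temp": {"-80", "-20", "4", "Room_Temperature"},
    "Gender": {"Female", "Male", "Not_specified"},
    "Appendix_removed": {"Yes", "No"},
    "Bowel_movement_quality": {"Constipated", "Normal", "Diarrhea"},
    "Cancer": {"Yes", "No"},
    "Smoking_status": {"Smoker", "Non-smoker"},
    "Alcohol_consumption": {"Yes", "No"},
}


def is_iso_date(value):
    """True if value is a real calendar date written as YYYY-MM-DD."""
    if not ISO_DATE.fullmatch(value):
        return False
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


def validate_sample_row(row):
    """Return a list of problems found in one sample-level row (a dict of strings)."""
    problems = []
    accession = row.get("Sample_Accession", "")
    if not (PREFERRED_ACCESSION.fullmatch(accession)
            or LEGACY_ACCESSION.fullmatch(accession)):
        problems.append(f"Sample_Accession: unexpected format {accession!r}")
    if not is_iso_date(row.get("Collection_Date", "")):
        problems.append("Collection_Date: expected ISO YYYY-MM-DD")
    age = row.get("Age", "")
    if age != "N/A" and not re.fullmatch(r"[0-9]+", age):
        problems.append(f"Age: expected integer or N/A, got {age!r}")
    # Only fields present in the row are checked against their vocabulary.
    for field, allowed in ALLOWED.items():
        if field in row and row[field] not in allowed:
            problems.append(f"{field}: {row[field]!r} is not an allowed value")
    return problems
```

An empty list means the row passed every check in this sketch; required-field and free-text rules would need to be added separately.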

Run level variables

Variable	Description / Allowed Values
`RunID`	Preferred format: `R01030525` (`R` + batch 01-99 + DDMMYY). Legacy `RHM03_01` still accepted.
`Sample ID`	Study accession (`ERR1328369`)
`Sequencing_Date`	ISO `YYYY-MM-DD`
`Sequencing_Platform`	`MiSeq`, `NextSeq`, `MinION`, …
`Sequencing_Type`	`16S_rRNA`, `Shotgun_metagenomic`, `RNA-Seq`, …
`Expected_read_length`	`2x75_bp`, `2x150_bp`, …
`Sequencing_depth_target`	`<1M`, `1-5M`, `5-10M`, `10-20M`, `20-50M`, `50M`
`Library_preparation_kit`	`Nextera_XT`, `TruSeq_Nano`, `Swift_16S`, …
`Technician_name`	Title Case, spaces → `_`
`Notes_Runs`	Free text
`Sample_Accession`	Foreign key to sample table
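
Because `Sample_Accession` acts as the foreign key, the two tables can be joined row by row. The sketch below illustrates this under stated assumptions: `parse_run_id` and `link_tables` are hypothetical helpers, only the preferred `RunID` format is parsed (legacy IDs such as `RHM03_01` return `None`), and two-digit years are assumed to mean 20YY.

```python
import re
from datetime import date

# Preferred format: R + batch 01-99 + DDMMYY, e.g. R01030525.
RUN_ID = re.compile(r"R([0-9]{2})([0-9]{2})([0-9]{2})([0-9]{2})")


def parse_run_id(run_id):
    """Return (batch, run_date) for a preferred-format RunID, else None."""
    match = RUN_ID.fullmatch(run_id)
    if match is None:
        return None
    batch, day, month, year = (int(part) for part in match.groups())
    if batch == 0:  # batches are numbered 01-99
        return None
    try:
        # Assumption: YY refers to the 2000s.
        run_date = date(2000 + year, month, day)
    except ValueError:
        return None
    return batch, run_date


def link_tables(sample_rows, run_rows):
    """Attach sample-level fields to each run-level row via Sample_Accession."""
    samples = {row["Sample_Accession"]: row for row in sample_rows}
    linked = []
    for run in run_rows:
        accession = run["Sample_Accession"]
        if accession not in samples:
            raise KeyError(f"no sample row for Sample_Accession {accession!r}")
        linked.append({**run, **samples[accession]})
    return linked
```

Raising on a missing accession is a design choice in this sketch: a run row that cannot be linked to a sample is treated as a data-entry error, not silently dropped.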

Healthy-control flag

Each sample is compared against `healthy_controls_AGB2025.csv` downstream; Group 1 only supplies the reference list.

In the real implementation, hospital technicians store the run metadata in the SQL database. We then query the database by the IDs of the FASTQ files analyzed in the run to retrieve all the metadata, both sample and technical. The metadata.csv used in the analysis is therefore retrieved automatically, and its fields are exactly the same as those in the templates.
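
The retrieval step could look roughly like the sketch below. It is illustrative only: the wiki does not describe the database schema, so the table names `run_metadata` and `sample_metadata`, the use of the `Sample ID` column as the FASTQ ID, the function `export_metadata`, and the choice of SQLite (via Python's standard `sqlite3` module) are all assumptions. The demo rows are toy data built from the example IDs in the templates.

```python
import csv
import io
import sqlite3


def export_metadata(conn, fastq_ids, out):
    """Write run + sample metadata for the given FASTQ IDs to `out` as CSV.

    Returns the number of data rows written.
    """
    fastq_ids = list(fastq_ids)
    if not fastq_ids:
        return 0
    # One "?" placeholder per ID keeps the query parameterized.
    placeholders = ",".join("?" for _ in fastq_ids)
    query = f"""
        SELECT r.*, s.*
        FROM run_metadata AS r
        JOIN sample_metadata AS s
          ON s.Sample_Accession = r.Sample_Accession
        WHERE r."Sample ID" IN ({placeholders})
        ORDER BY r."Sample ID"
    """
    cursor = conn.execute(query, fastq_ids)
    names = [column[0] for column in cursor.description]
    # The join key appears in both tables; keep only its first occurrence.
    keep = [i for i, name in enumerate(names) if name not in names[:i]]
    writer = csv.writer(out)
    writer.writerow([names[i] for i in keep])
    written = 0
    for row in cursor:
        writer.writerow([row[i] for i in keep])
        written += 1
    return written


# Toy in-memory database with a hypothetical two-table schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sample_metadata (Sample_Accession TEXT PRIMARY KEY, Gender TEXT);
    CREATE TABLE run_metadata (RunID TEXT, "Sample ID" TEXT, Sample_Accession TEXT);
    INSERT INTO sample_metadata VALUES ('SAF1030525', 'Female');
    INSERT INTO run_metadata VALUES ('R01030525', 'ERR1328369', 'SAF1030525');
""")
buffer = io.StringIO()
export_metadata(conn, ["ERR1328369"], buffer)
print(buffer.getvalue())
```

In a real deployment the output would be written to `metadata.csv` instead of an in-memory buffer, and the connection would point at the hospital's SQL database.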