Metadata - egenomics/agb2025 GitHub Wiki

For testing, the metadata templates consist of two flat tables per run:

  • Sample level metadata: one row per stool specimen (Sample_ID).
  • Run level metadata: one row per sample, with the values for the sequencing batch (RunID) repeated on each row so that both tables stay tidy and linkable.

Sample level variables

| Variable | Description / Allowed Values |
|---|---|
| Sample_Accession | Preferred format: SAF1030525 (S + 2 letters + 1 digit + DDMMYY). Legacy SHM01180525 still accepted. |
| Institution | Title Case, spaces → _ (Hospital_del_mar) |
| Department | Title Case, spaces → _ (Digestology_unit) |
| Collection_Date | ISO YYYY-MM-DD |
| Collection_Storage_Temp | -80, -20, 4, Room_Temperature |
| Analyst_Processor_Name | Title Case, spaces → _ |
| Gender | Female, Male, Not_specified |
| Age | Integer or N/A |
| Ongoing_conditions | Controlled list (IBD, IBS, NAFLD, …) |
| Appendix_removed | Yes, No |
| Allergies | Controlled list (Peanuts, Penicillin, …) |
| Dietary_Information | Omnivore, Vegetarian, Vegan, Keto, … |
| Bowel_movement_quality | Constipated, Normal, Diarrhea |
| Antibiotic_intake | Week, Month, 6_months, Year, Past_year |
| Medications | Controlled list (PPIs, NSAIDs, …) |
| Cancer | Yes, No |
| Body_Mass_Index | Underweight_<18.5, Normal_18.5-24.9, Overweight_25-29.9, Obese_>=30 |
| Exercise_frequency | Never, Rarely, 1-2_times_per_week, 3-5_times_per_week, Daily |
| Smoking_status | Smoker, Non-smoker |
| Daily_cigarettes | 1-5, 6-10, 11-15, 16-20, +20 |
| Alcohol_consumption | Yes, No |
| Frequency_of_alcohol_consumption | Drinks per week (integer) |
| Notes_Samples | Free text (Title Case, _ for spaces) |
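
Several of the rules above can be checked mechanically before a table is accepted. The following is a minimal sketch, not the project's validator: the function names are hypothetical, uppercase letters are assumed, and the legacy layout (S + 2 letters + 2 digits + DDMMYY) is inferred only from the single example SHM01180525.

```python
import re
from datetime import datetime

# Preferred: S + 2 letters + 1 digit + DDMMYY, e.g. SAF1030525.
PREFERRED_SAMPLE = re.compile(r"S[A-Z]{2}\d\d{6}")
# Legacy (assumed layout): S + 2 letters + 2 digits + DDMMYY, e.g. SHM01180525.
LEGACY_SAMPLE = re.compile(r"S[A-Z]{2}\d{2}\d{6}")

# Allowed values for Collection_Storage_Temp, copied from the table.
STORAGE_TEMPS = {"-80", "-20", "4", "Room_Temperature"}


def _is_ddmmyy(text):
    """True if the six digits form a real calendar date in DDMMYY order."""
    try:
        datetime.strptime(text, "%d%m%y")
        return True
    except ValueError:
        return False


def classify_sample_accession(value):
    """Return 'preferred', 'legacy', or None for an unrecognised value."""
    if PREFERRED_SAMPLE.fullmatch(value) and _is_ddmmyy(value[-6:]):
        return "preferred"
    if LEGACY_SAMPLE.fullmatch(value) and _is_ddmmyy(value[-6:]):
        return "legacy"
    return None


def is_iso_date(value):
    """True for a zero-padded YYYY-MM-DD date such as 2025-05-03."""
    if len(value) != 10:
        return False
    try:
        datetime.strptime(value, "%Y-%m-%d")
        return True
    except ValueError:
        return False
```

Checking the embedded date as well as the pattern catches transposed digits (for example a day of 32) that a regular expression alone would let through.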

Run level variables

| Variable | Description / Allowed Values |
|---|---|
| RunID | Preferred format: R01030525 (R + batch 01-99 + DDMMYY). Legacy RHM03_01 still accepted. |
| Sample ID | Study accession (ERR1328369) |
| Sequencing_Date | ISO YYYY-MM-DD |
| Sequencing_Platform | MiSeq, NextSeq, MinION, … |
| Sequencing_Type | 16S_rRNA, Shotgun_metagenomic, RNA-Seq, … |
| Expected_read_length | 2x75_bp, 2x150_bp, … |
| Sequencing_depth_target | <1M, 1-5M, 5-10M, 10-20M, 20-50M, 50M |
| Library_preparation_kit | Nextera_XT, TruSeq_Nano, Swift_16S, … |
| Technician_name | Title Case, spaces → _ |
| Notes_Runs | Free text |
| Sample_Accession | Foreign key to sample table |
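
Because the run table carries Sample_Accession as a foreign key, the two tables can be combined into one row per sample. A minimal sketch using only the Python standard library follows; the helper name `join_metadata` and the inline CSV rows are illustrative assumptions, not project data, and the regular expression covers only the preferred RunID format.

```python
import csv
import io
import re

# Preferred RunID: R + batch 01-99 + DDMMYY, e.g. R01030525.
RUN_ID = re.compile(r"R(0[1-9]|[1-9]\d)\d{6}")


def join_metadata(sample_csv, run_csv):
    """Join run rows to sample rows on Sample_Accession.

    Raises KeyError if a run row references a sample that is not
    present in the sample table.
    """
    samples = {row["Sample_Accession"]: row
               for row in csv.DictReader(sample_csv)}
    merged = []
    for run_row in csv.DictReader(run_csv):
        sample_row = samples[run_row["Sample_Accession"]]
        merged.append({**sample_row, **run_row})
    return merged


# Illustrative rows only; real tables carry every column listed above.
sample_csv = io.StringIO(
    "Sample_Accession,Collection_Date,Gender\n"
    "SAF1030525,2025-05-03,Female\n"
)
run_csv = io.StringIO(
    "RunID,Sequencing_Platform,Sample_Accession\n"
    "R01030525,MiSeq,SAF1030525\n"
)
rows = join_metadata(sample_csv, run_csv)
```

Failing loudly on an unknown Sample_Accession is a deliberate choice in this sketch: a dangling foreign key usually means a typo in one of the two tables.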

Healthy-control flag: Each sample is compared against healthy_controls_AGB2025.csv downstream; Group 1 only supplies the reference list.

In the real implementation, hospital technicians store the run metadata in the SQL database. We then query the database by the FASTQ IDs analyzed in the run to retrieve all the metadata, both sample-level and technical. The metadata.csv used in the analysis is therefore generated automatically, and its fields are exactly the same as those in the templates.
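
The retrieval step could look roughly like the sketch below. The page does not specify the database engine or schema, so everything structural here is an assumption: SQLite stands in for the hospital database, the table names `samples` and `runs` and the `fastq_id` column are hypothetical, and only a handful of the template columns are selected to keep the example short.

```python
import csv
import io
import sqlite3

# Subset of template fields, in output order (hypothetical selection).
COLUMNS = ("fastq_id", "RunID", "Sequencing_Platform",
           "Sample_Accession", "Collection_Date")


def fetch_metadata(conn, fastq_ids):
    """Return one dict per requested FASTQ ID, joining run and sample rows."""
    ids = list(fastq_ids)
    if not ids:
        return []
    # One "?" placeholder per ID keeps the query parameterised.
    placeholders = ",".join("?" * len(ids))
    query = (
        "SELECT r.fastq_id, r.RunID, r.Sequencing_Platform, "
        "r.Sample_Accession, s.Collection_Date "
        "FROM runs AS r JOIN samples AS s "
        "ON s.Sample_Accession = r.Sample_Accession "
        f"WHERE r.fastq_id IN ({placeholders}) ORDER BY r.fastq_id"
    )
    return [dict(zip(COLUMNS, row)) for row in conn.execute(query, ids)]


def write_metadata_csv(rows, handle):
    """Write the rows with a header line, as in metadata.csv."""
    writer = csv.DictWriter(handle, fieldnames=COLUMNS)
    writer.writeheader()
    writer.writerows(rows)


# In-memory demo with made-up rows in the hypothetical schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE samples (Sample_Accession TEXT PRIMARY KEY, Collection_Date TEXT);
CREATE TABLE runs (fastq_id TEXT, RunID TEXT, Sequencing_Platform TEXT,
                   Sample_Accession TEXT REFERENCES samples(Sample_Accession));
INSERT INTO samples VALUES ('SAF1030525', '2025-05-03');
INSERT INTO runs VALUES ('ERR1328369', 'R01030525', 'MiSeq', 'SAF1030525');
""")
rows = fetch_metadata(conn, ["ERR1328369"])
buffer = io.StringIO()
write_metadata_csv(rows, buffer)
```

Fixing the column order in one place means the generated file always has the same header, which is what lets it match the templates field for field.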