Metadata - egenomics/agb2025 GitHub Wiki
For testing, the metadata templates are two flat tables for every run:

- Sample level metadata: one row per stool specimen (`Sample_ID`).
- Run level metadata: one row per sequencing batch (`RunID`), repeated per sample to keep both tables tidy and linkable.

Sample level variables

Variable | Description / Allowed Values |
---|---|
Sample_Accession | Preferred format: `SAF1030525` (S + 2 letters + 1 digit + DDMMYY). Legacy `SHM01180525` still accepted. |
Institution | Title Case, spaces → `_` (`Hospital_del_mar`) |
Department | Title Case, spaces → `_` (`Digestology_unit`) |
Collection_Date | ISO `YYYY-MM-DD` |
Collection_Storage_Temp | `-80`, `-20`, `4`, `Room_Temperature` |
Analyst_Processor_Name | Title Case, spaces → `_` |
Gender | `Female`, `Male`, `Not_specified` |
Age | Integer or `N/A` |
Ongoing_conditions | Controlled list (IBD, IBS, NAFLD, …) |
Appendix_removed | `Yes`, `No` |
Allergies | Controlled list (Peanuts, Penicillin, …) |
Dietary_Information | `Omnivore`, `Vegetarian`, `Vegan`, `Keto`, … |
Bowel_movement_quality | `Constipated`, `Normal`, `Diarrhea` |
Antibiotic_intake | `Week`, `Month`, `6_months`, `Year`, `Past_year` |
Medications | Controlled list (PPIs, NSAIDs, …) |
Cancer | `Yes`, `No` |
Body_Mass_Index | `Underweight_<18.5`, `Normal_18.5-24.9`, `Overweight_25-29.9`, `Obese_>=30` |
Exercise_frequency | `Never`, `Rarely`, `1-2_times_per_week`, `3-5_times_per_week`, `Daily` |
Smoking_status | `Smoker`, `Non-smoker` |
Daily_cigarettes | `1-5`, `6-10`, `11-15`, `16-20`, `+20` |
Alcohol_consumption | `Yes`, `No` |
Frequency_of_alcohol_consumption | Drinks per week (integer) |
Notes_Samples | Free text (Title Case, `_` for spaces) |

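
The sample-level rules above lend themselves to an automated check before a table is accepted. The following is a minimal sketch, not the project's actual validator: the function names are hypothetical, the legacy accession pattern is only inferred from the single example `SHM01180525`, and only a subset of the closed value lists is covered (open-ended lists ending in "…" are skipped).

```python
import re
from datetime import datetime

# Preferred format from the table: S + 2 letters + 1 digit + DDMMYY (SAF1030525).
PREFERRED_ACCESSION = re.compile(r"S[A-Z]{2}\d(\d{6})")
# Assumption: legacy layout guessed from the single example SHM01180525
# (S + 2 letters + 2 digits + DDMMYY).
LEGACY_ACCESSION = re.compile(r"S[A-Z]{2}\d{2}(\d{6})")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")

# A subset of the closed value lists, copied from the table.
ALLOWED_VALUES = {
    "Collection_Storage_Temp": {"-80", "-20", "4", "Room_Temperature"},
    "Gender": {"Female", "Male", "Not_specified"},
    "Appendix_removed": {"Yes", "No"},
    "Bowel_movement_quality": {"Constipated", "Normal", "Diarrhea"},
    "Cancer": {"Yes", "No"},
    "Smoking_status": {"Smoker", "Non-smoker"},
    "Alcohol_consumption": {"Yes", "No"},
}


def _parses(text, fmt):
    """Return True if text is a real calendar date in the given format."""
    try:
        datetime.strptime(text, fmt)
    except ValueError:
        return False
    return True


def is_valid_sample_accession(value):
    """Check the accession shape and that its DDMMYY suffix is a real date."""
    for pattern in (PREFERRED_ACCESSION, LEGACY_ACCESSION):
        match = pattern.fullmatch(value)
        if match and _parses(match.group(1), "%d%m%y"):
            return True
    return False


def validate_sample_row(row):
    """Return a list of problems found in one sample row (dict of strings)."""
    problems = []
    if not is_valid_sample_accession(row.get("Sample_Accession", "")):
        problems.append("Sample_Accession: unrecognised format")
    date = row.get("Collection_Date", "")
    if not (ISO_DATE.fullmatch(date) and _parses(date, "%Y-%m-%d")):
        problems.append("Collection_Date: expected ISO YYYY-MM-DD")
    age = row.get("Age", "")
    if age != "N/A" and not age.isdigit():
        problems.append("Age: expected integer or N/A")
    for field, allowed in ALLOWED_VALUES.items():
        if field in row and row[field] not in allowed:
            problems.append(f"{field}: {row[field]!r} not in allowed values")
    return problems


example = {
    "Sample_Accession": "SAF1030525",
    "Collection_Date": "2025-05-03",
    "Age": "N/A",
    "Gender": "Female",
}
print(validate_sample_row(example))
```

A row that passes returns an empty list, so the check can be applied to every row of a table read with `csv.DictReader`.
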
Run level variables

Variable | Description / Allowed Values |
---|---|
RunID | Preferred format: `R01030525` (R + batch 01-99 + DDMMYY). Legacy `RHM03_01` still accepted. |
Sample ID | Study accession (`ERR1328369`) |
Sequencing_Date | ISO `YYYY-MM-DD` |
Sequencing_Platform | `MiSeq`, `NextSeq`, `MinION`, … |
Sequencing_Type | `16S_rRNA`, `Shotgun_metagenomic`, `RNA-Seq`, … |
Expected_read_length | `2x75_bp`, `2x150_bp`, … |
Sequencing_depth_target | `<1M`, `1-5M`, `5-10M`, `10-20M`, `20-50M`, `50M` |
Library_preparation_kit | `Nextera_XT`, `TruSeq_Nano`, `Swift_16S`, … |
Technician_name | Title Case, spaces → `_` |
Notes_Runs | Free text |
Sample_Accession | Foreign key to sample table |

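
Because `Sample_Accession` acts as the foreign key, the two tables can be linked with a simple lookup. The sketch below is illustrative only: `is_valid_run_id` and `link_tables` are hypothetical helpers, the legacy `RunID` pattern is inferred from the single example `RHM03_01`, and the check covers the shape of the ID without validating the embedded date.

```python
import re

# Preferred format from the table: R + batch 01-99 + DDMMYY (R01030525).
PREFERRED_RUN_ID = re.compile(r"R(0[1-9]|[1-9]\d)\d{6}")
# Assumption: legacy layout guessed from the single example RHM03_01.
LEGACY_RUN_ID = re.compile(r"R[A-Z]{2}\d{2}_\d{2}")


def is_valid_run_id(value):
    """Check only the shape of a RunID (the date part is not validated)."""
    return bool(
        PREFERRED_RUN_ID.fullmatch(value) or LEGACY_RUN_ID.fullmatch(value)
    )


def link_tables(sample_rows, run_rows):
    """Attach sample-level fields to each run row via Sample_Accession.

    Returns the merged rows and the accessions of run rows whose
    Sample_Accession has no match in the sample table.
    """
    samples = {row["Sample_Accession"]: row for row in sample_rows}
    merged, orphans = [], []
    for run in run_rows:
        sample = samples.get(run["Sample_Accession"])
        if sample is None:
            orphans.append(run["Sample_Accession"])
        else:
            merged.append({**sample, **run})
    return merged, orphans


sample_rows = [{"Sample_Accession": "SAF1030525", "Gender": "Female"}]
run_rows = [
    {"RunID": "R01030525", "Sample_Accession": "SAF1030525"},
    {"RunID": "R01030525", "Sample_Accession": "SAF2030525"},
]
merged, orphans = link_tables(sample_rows, run_rows)
print(merged)
print(orphans)
```

Reporting orphaned accessions separately, rather than dropping them silently, makes a broken link between the two tables visible early.
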

Healthy-control flag: each sample is compared against `healthy_controls_AGB2025.csv` downstream; Group 1 only supplies the reference list.
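
The page does not specify how the downstream comparison works or how `healthy_controls_AGB2025.csv` is laid out. As one possible reading, the sketch below treats the file as a list of control accessions and marks each sample by membership. The `Sample_Accession` column in the reference file, the `Healthy_control` output column, and both function names are assumptions made for illustration.

```python
import csv
import io


def load_control_ids(handle, id_column="Sample_Accession"):
    """Read the reference list and return the set of control accessions.

    Assumption: the file has a header row containing id_column.
    """
    return {row[id_column] for row in csv.DictReader(handle)}


def add_healthy_control_flag(rows, control_ids):
    """Return copies of rows with a hypothetical Healthy_control column."""
    return [
        {
            **row,
            "Healthy_control": (
                "Yes" if row["Sample_Accession"] in control_ids else "No"
            ),
        }
        for row in rows
    ]


# In-memory stand-in for healthy_controls_AGB2025.csv (made-up content).
reference = io.StringIO("Sample_Accession\nSAF1030525\n")
control_ids = load_control_ids(reference)

samples = [
    {"Sample_Accession": "SAF1030525"},
    {"Sample_Accession": "SAF2030525"},
]
flagged = add_healthy_control_flag(samples, control_ids)
print(flagged)
```

With a real file, the `io.StringIO` stand-in would be replaced by `open("healthy_controls_AGB2025.csv", newline="")`.
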

In the real implementation, hospital technicians store the run metadata in the SQL database. We then query the database by the IDs of the FASTQ files analyzed in the run to retrieve all the metadata, both sample-level and technical. The `metadata.csv` used in the analysis is therefore generated automatically, and its fields are exactly the same as those in the templates.
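
The retrieval step can be sketched as a parameterised join followed by a CSV export. This is only an illustration under stated assumptions: the real database engine and schema are not described here, so the example uses an in-memory SQLite database, hypothetical table names `samples` and `runs`, a `Sample_ID` column standing in for the run table's "Sample ID" field, a small subset of the template columns, and made-up demo rows.

```python
import csv
import io
import sqlite3


def build_demo_db():
    """Create an in-memory stand-in for the hospital database (hypothetical)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE samples (
            Sample_Accession TEXT PRIMARY KEY,
            Collection_Date TEXT
        );
        CREATE TABLE runs (
            Sample_ID TEXT PRIMARY KEY,
            RunID TEXT,
            Sample_Accession TEXT REFERENCES samples (Sample_Accession)
        );
        """
    )
    conn.execute(
        "INSERT INTO samples VALUES (?, ?)", ("SAF1030525", "2025-05-03")
    )
    conn.executemany(
        "INSERT INTO runs VALUES (?, ?, ?)",
        [
            ("ERR1328369", "R01030525", "SAF1030525"),
            ("DEMO_FASTQ_2", "R01030525", "SAF1030525"),
        ],
    )
    return conn


def fetch_metadata(conn, fastq_ids):
    """Return (header, rows) of joined metadata for the given FASTQ IDs."""
    fastq_ids = list(fastq_ids)
    if not fastq_ids:
        raise ValueError("at least one FASTQ ID is required")
    # One "?" per ID keeps the query parameterised.
    placeholders = ",".join("?" for _ in fastq_ids)
    query = f"""
        SELECT r.Sample_ID AS Sample_ID,
               r.RunID AS RunID,
               s.Sample_Accession AS Sample_Accession,
               s.Collection_Date AS Collection_Date
        FROM runs AS r
        JOIN samples AS s ON s.Sample_Accession = r.Sample_Accession
        WHERE r.Sample_ID IN ({placeholders})
        ORDER BY r.Sample_ID
    """
    cursor = conn.execute(query, fastq_ids)
    header = [column[0] for column in cursor.description]
    return header, cursor.fetchall()


def write_metadata_csv(handle, header, rows):
    """Write the header and rows as CSV to an open text handle."""
    writer = csv.writer(handle)
    writer.writerow(header)
    writer.writerows(rows)


conn = build_demo_db()
header, rows = fetch_metadata(conn, ["ERR1328369"])
buffer = io.StringIO()  # replace with open("metadata.csv", "w", newline="")
write_metadata_csv(buffer, header, rows)
print(buffer.getvalue())
```

Selecting columns explicitly, in template order, is one way to keep the exported `metadata.csv` aligned with the templates.
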