# Data Management Plan (DMP) - egenomics/agb2025 GitHub Wiki

**Project:** Microbiota Classification Pipeline - HdMBioinfo Group
**Institution:** Hospital del Mar
**Primary Use Case:** Classify stool samples as healthy or non-healthy for clinical intervention
## 1. Types of Data Generated
### 1.1. Metadata

Structured information describing each sample and run, used for interpretation and downstream analysis.

- **Sample Metadata**: Clinical and contextual data per sample (e.g., ID, age, sex, BMI, diagnosis, treatment, collection date).
- **Run Metadata**: Technical information per sequencing run (e.g., pipeline version, operator, processing date, parameters).
- **Merged Metadata**: Combined clinical, technical, and QC metrics for each sample.
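The merged metadata can be produced with a plain key join on the sample identifier. A minimal sketch using only the standard library; the file names and the `SampleID` key column are illustrative assumptions, not the project's fixed schema:

```python
import csv

def merge_metadata(sample_csv, run_csv, out_csv, key="SampleID"):
    """Join clinical sample metadata with technical run metadata on `key`."""
    with open(run_csv, newline="") as fh:
        runs = {row[key]: row for row in csv.DictReader(fh)}
    with open(sample_csv, newline="") as fh:
        samples = list(csv.DictReader(fh))
    # Column order: clinical fields first, then any new technical fields.
    fields = list(samples[0].keys())
    if runs:
        fields += [f for f in next(iter(runs.values())) if f not in fields]
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=fields)
        writer.writeheader()
        for row in samples:
            # Samples without a matching run keep empty technical columns.
            writer.writerow({**row, **runs.get(row[key], {})})
```

Keeping the join in one reviewed script, rather than ad-hoc spreadsheet edits, makes the merged table reproducible from its two inputs.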
### 1.2. Raw Sequencing Data

Primary data from the sequencing instrument, used as input for downstream processing.

- **Paired-End FASTQ Files**: Raw sequencing reads for each sample (`R1` and `R2`, usually `.fastq.gz`).
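As a sanity check on incoming runs, the R1/R2 mates can be paired programmatically before processing. A sketch assuming Illumina-style `_R1`/`_R2` tokens in the file names; adjust the split rule to the facility's actual naming scheme:

```python
from pathlib import Path

def pair_fastqs(fastq_dir):
    """Group *_R1*.fastq.gz with the matching *_R2*.fastq.gz per sample."""
    pairs = {}
    for fq in sorted(Path(fastq_dir).glob("*.fastq.gz")):
        if "_R1" in fq.name:
            pairs.setdefault(fq.name.split("_R1")[0], {})["R1"] = fq
        elif "_R2" in fq.name:
            pairs.setdefault(fq.name.split("_R2")[0], {})["R2"] = fq
    # A sample with only one mate file indicates an incomplete transfer.
    incomplete = [s for s, p in pairs.items() if len(p) != 2]
    if incomplete:
        raise ValueError(f"Samples missing a mate file: {incomplete}")
    return pairs
```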
### 1.3. Cleaned and Filtered Sequencing Data

Sequencing data processed to remove quality problems and contaminants.

- **Trimmed and Filtered Reads**: Reads with adapters and low-quality bases removed (`.fastq.gz`).
- **Contamination-Free Reads (optional)**: Reads with contaminant sequences removed (e.g., via Kraken2).
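Production trimming is done with dedicated tools, but purely to illustrate what "trimmed and filtered" means, here is a toy mean-quality/length filter over a gzipped FASTQ (Phred+33 encoding assumed):

```python
import gzip

def filter_fastq(in_path, out_path, min_len=100, min_q=20):
    """Drop reads shorter than min_len or with mean Phred quality below min_q.

    Illustrative only; real runs use dedicated trimmers (e.g., fastp, cutadapt).
    """
    kept = dropped = 0
    with gzip.open(in_path, "rt") as fin, gzip.open(out_path, "wt") as fout:
        while True:
            record = [fin.readline() for _ in range(4)]  # 4 lines per read
            if not record[0]:
                break
            seq, qual = record[1].strip(), record[3].strip()
            mean_q = sum(ord(c) - 33 for c in qual) / max(len(qual), 1)  # Phred+33
            if len(seq) >= min_len and mean_q >= min_q:
                fout.writelines(record)
                kept += 1
            else:
                dropped += 1
    return kept, dropped
```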
### 1.4. Quality Control Data

Reports and summaries that assess sequencing quality.

- **FastQC Reports**: Per-sample quality reports (`.html`, `.zip`).
- **MultiQC Report**: Aggregated report summarizing all QC metrics (`multiqc_report.html`).
- **Summary Tables**: Tabular metrics (CSV/TSV) such as total reads, %GC, and duplication rate.
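Per-sample QC verdicts can be folded into one summary table. A sketch that assumes each sample's FastQC `summary.txt` has been copied out as `<sample>_summary.txt`; the tab-separated `STATUS<TAB>module<TAB>file` line format should be verified against the FastQC version in use:

```python
import csv
from pathlib import Path

def aggregate_fastqc(summary_dir, out_csv):
    """Fold FastQC-style summary verdicts into one CSV, one row per sample."""
    rows, modules = [], []
    for path in sorted(Path(summary_dir).glob("*_summary.txt")):
        verdicts = {"sample": path.name[: -len("_summary.txt")]}
        for line in path.read_text().splitlines():
            # Assumed line shape: PASS\tPer base sequence quality\tS1.fastq.gz
            status, module, _filename = line.split("\t")
            verdicts[module] = status
            if module not in modules:
                modules.append(module)
        rows.append(verdicts)
    with open(out_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["sample"] + modules, restval="NA")
        writer.writeheader()
        writer.writerows(rows)
```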
### 1.5. Intermediate Analysis Artifacts (QIIME 2)

Internal data objects produced by QIIME 2 during analysis.

- **QIIME Artifacts (`.qza`)**: Feature tables, representative sequences, denoising stats, taxonomy assignments, trees, distance matrices.
- **QIIME Visualizations (`.qzv`)**: Interactive views of feature tables, taxonomy bar plots, diversity metrics, etc.
### 1.6. Analysis Results

Biological and statistical results of the microbiome analysis.

- **Taxonomic Profiles**: Tables or plots showing microbial composition per sample/group.
- **Diversity Metrics**: Alpha diversity (e.g., Shannon, Simpson) and beta diversity (e.g., UniFrac, Bray-Curtis).
- **Differential Abundance**: Statistically significant taxa (e.g., via DESeq2, ANCOM).
- **Correlation Results**: Associations between microbial features and clinical/lifestyle variables.
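For reference, the two alpha-diversity indices named above reduce to short formulas over per-taxon counts. QIIME 2 computes these internally; this standalone sketch is for illustration only:

```python
import math

def shannon_index(counts):
    """Shannon alpha diversity H' = -sum(p_i * ln p_i) over taxon counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def simpson_index(counts):
    """Simpson diversity 1 - sum(p_i^2): probability two random reads differ."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts) if total else 0.0
```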
### 1.7. Reports and Visualizations

User-friendly summaries for interpretation and dissemination.

- **Plots and Figures**: Barplots, boxplots, scatterplots, PCoA/NMDS ordinations (static or interactive).
- **Final Reports**: Downloadable reports (PDF, HTML, XLSX) with metadata, key findings, and figures.
### Summary Table

| Category | Description | Examples |
|---|---|---|
| Metadata | Contextual info on samples and runs | `metadata_sample.csv`, `run_metadata.csv` |
| Raw Data | Raw sequencing reads | Paired-end `.fastq.gz` files |
| Cleaned Data | Filtered/trimmed reads | Clean `.fastq.gz` files |
| QC Data | Quality reports and metrics | FastQC, MultiQC, CSV summaries |
| Intermediate Artifacts | QIIME 2 internal data | `.qza`, `.qzv` files |
| Analysis Results | Biological/statistical insights | Diversity metrics, taxonomy tables, correlations |
| Reports & Visuals | Human-readable summaries | PDFs, HTML reports, charts, dashboards |
## 2. Data Storage Strategy

### Storage Location

- **Primary Storage:** Servers within the hospital IT infrastructure.
- Accessible only to authorized personnel within the HdMBioinfo group and the Digestology Unit.

### Capacity Planning

- Initial capacity of ~2 TB, based on the estimated FASTQ size and intermediate data from MiSeq runs.

### Backup Protocols

- **Backups:** Automated nightly backups to a local server.
- **Integrity Checks:** Regular checksum (MD5/SHA-256) validation for critical datasets.
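The integrity check can be a streamed SHA-256 pass over a `sha256sum`-style manifest. A sketch; the manifest location and naming are assumptions of this example:

```python
import hashlib
from pathlib import Path

def sha256sum(path, chunk=1 << 20):
    """Stream a file through SHA-256 in 1 MiB chunks (suits multi-GB FASTQs)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(chunk), b""):
            digest.update(block)
    return digest.hexdigest()

def verify_manifest(manifest_path):
    """Check '<hex>  <relative/path>' lines, as written by the sha256sum tool."""
    base = Path(manifest_path).parent
    failures = []
    for line in Path(manifest_path).read_text().splitlines():
        expected, name = line.split(maxsplit=1)
        if sha256sum(base / name) != expected:
            failures.append(name)
    return failures  # empty list means every file matched its checksum
```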
## 3. Data Retention and Lifecycle Policy

### Retention Periods

| Data Type | Retention Duration | Notes |
|---|---|---|
| Raw FASTQ files | 3 years | Long-term retention for reproducibility and regulatory reasons |
| Intermediate files | 1 year | Renewable if the associated project remains active |
| Metadata from hospital DB | 3 years | Synchronized with clinical data retention norms |
| Final analysis results | 3 years | Includes classification outputs and reports |
| Pipeline logs and execution records | 3 years | For reproducibility |

### Deletion and Archiving Criteria

- **Deletion:** Automatic purging of intermediate files after 1 year unless flagged for retention.
- **Archiving:** Final results and raw data are archived after 1 year of inactivity.
- Annual review of inactive projects to decide between continued storage and purging.
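The 1-year purge rule can be approximated by an age scan that respects a flag marker. A sketch in which the `.keep` marker file is this example's convention, not an established project one; the function only lists candidates, so deletion stays a manual, reviewed step:

```python
import time
from pathlib import Path

ONE_YEAR = 365 * 24 * 3600  # seconds

def purge_candidates(intermediate_dir, keep_flag=".keep", max_age=ONE_YEAR):
    """List intermediate files older than max_age, skipping flagged directories.

    A directory containing a `.keep` marker is treated as flagged
    (marker name is an assumption for this sketch).
    """
    now = time.time()
    stale = []
    for path in Path(intermediate_dir).rglob("*"):
        if not path.is_file() or path.name == keep_flag:
            continue
        if (path.parent / keep_flag).exists():
            continue  # directory flagged: retain regardless of age
        if now - path.stat().st_mtime > max_age:
            stale.append(path)
    return stale  # review this list before actually deleting anything
```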
## 4. Access Control Policies

The following table summarizes who has access to each type of data across the microbiome pipeline. Access control follows the principle of least privilege: each group can read or edit only the data it needs to perform its tasks.

- **Group 1** manages metadata and ensures its consistency. Only they can modify metadata files.
- **Group 2A** handles raw sequencing and preprocessing, with exclusive write access to FASTQ and QC outputs.
- **Group 2B** runs taxonomic and diversity analyses on the cleaned data and produces the intermediate QIIME outputs.
- **Group 3** builds visualizations and reports. They rely on metadata and analysis outputs but don't modify upstream data.
- **Project Leads** have read access to all stages for oversight and coordination.
- **Public** access is allowed only for anonymized visualizations and final reports.

Sensitive metadata (e.g., clinical details) should be shared via secured drives and excluded from public repositories. Whenever possible, access is managed using GitHub roles and file-specific permissions within shared folders.
| Data Type | G1 | G2A | G2B | G3 | Leads | Public |
|---|---|---|---|---|---|---|
| Sample Metadata | RW | R | R | R | R | ✗ |
| Raw FASTQ | R | RW | R | ✗ | R | ✗ |
| Filtered FASTQ | ✗ | RW | R | ✗ | ✗ | ✗ |
| QC Reports | ✗ | RW | ✗ | R | ✗ | ✗ |
| QIIME artifacts | ✗ | ✗ | RW | R | ✗ | ✗ |
| Analysis results | ✗ | ✗ | RW | RW | R | ✗ |
| Dashboard & Reports | ✗ | ✗ | ✗ | RW | R | ✓ (if anonymized) |
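For scripted checks (e.g., in CI or a permissions audit), the matrix can be transcribed into a lookup table. A sketch; the group and data-type keys are shorthand for the table rows above, and the public entry carries the anonymization caveat as a comment:

```python
# Access matrix transcribed from the table above: "RW" = read/write, "R" = read-only,
# absence = no access.
ACCESS = {
    "sample_metadata":  {"G1": "RW", "G2A": "R", "G2B": "R", "G3": "R", "Leads": "R"},
    "raw_fastq":        {"G1": "R", "G2A": "RW", "G2B": "R", "Leads": "R"},
    "filtered_fastq":   {"G2A": "RW", "G2B": "R"},
    "qc_reports":       {"G2A": "RW", "G3": "R"},
    "qiime_artifacts":  {"G2B": "RW", "G3": "R"},
    "analysis_results": {"G2B": "RW", "G3": "RW", "Leads": "R"},
    # Public may read dashboards/reports only when fully anonymized.
    "reports":          {"G3": "RW", "Leads": "R", "Public": "R"},
}

def can(group, action, data_type):
    """True if `group` may perform `action` ('read' or 'write') on `data_type`."""
    level = ACCESS.get(data_type, {}).get(group, "")
    return {"read": "R" in level, "write": "W" in level}[action]
```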
## 5. Data Security Measures

To protect sensitive patient data throughout the microbiome analysis pipeline, the following data security measures are in place:

- **Anonymization:** All patient-identifying information (e.g., names, ID numbers, contact info) must be removed or replaced with pseudonyms before data is shared or uploaded.
- **Secure Storage:**
  - Sensitive metadata is stored in encrypted, access-controlled cloud storage (e.g., Google Drive with restricted access).
  - Raw and processed data are stored on institutional servers or GitHub repositories with proper access controls.
  - No personal health data is committed to public repositories.
- **Access Control:**
  - Only authorized team members have access to identifiable or sensitive data.
  - Access levels are assigned according to group roles and enforced via GitHub and shared-drive permissions.
- **Data Transfer:**
  - Data is shared only through secure platforms (e.g., encrypted drives, password-protected links).
  - No patient data is sent via email or unencrypted messaging platforms.
- **Audit and Logging:**
  - Access logs and activity histories are maintained where possible (e.g., Google Drive, GitHub) to track who accessed or modified files.
- **Ethical Compliance:**
  - All team members are expected to follow ethical guidelines and institutional data protection policies, including GDPR where applicable.

**Reminder:** Always double-check that any files made public (e.g., visualizations, reports) contain no direct or indirect identifiers.
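The pseudonym replacement mentioned under Anonymization can be made deterministic yet non-reversible with a keyed hash. A sketch; the `P` prefix and 12-character truncation are arbitrary choices of this example, and the secret key must live outside the repository:

```python
import hashlib
import hmac

def pseudonymize(patient_id, secret_key):
    """Replace a patient identifier with a stable, non-reversible pseudonym.

    A keyed HMAC-SHA256 (rather than a bare hash) prevents identifiers from
    being recovered by brute-forcing the ID space without the key.
    """
    mac = hmac.new(secret_key.encode(), patient_id.encode(), hashlib.sha256)
    return "P" + mac.hexdigest()[:12]
```

The same ID always maps to the same pseudonym under one key, so pseudonymized tables remain joinable across files while the mapping stays unrecoverable without the key.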
## 6. Data Sharing Policies

To ensure responsible and reproducible research, the following data sharing policies apply:

- **Internal Collaboration:**
  - All project data is shared through centralized platforms (e.g., GitHub, institutional drives) with controlled access based on group roles.
  - Only non-sensitive data (e.g., pipeline outputs, summaries) are shared openly within the team.
  - Sensitive metadata is shared only with relevant team members via encrypted, access-controlled folders.
- **External Collaborators:**
  - Collaborators outside the core team must sign data use agreements before receiving access to any patient-related data.
  - Only pseudonymized or aggregated data will be shared externally unless explicit consent and ethical approval are in place.
- **Publication and Public Release:**
  - Raw sequencing data and relevant processed outputs may be deposited in public repositories (e.g., ENA, NCBI SRA) after de-identification and approval.
  - Sample metadata shared publicly will be minimal and fully anonymized.
  - Visualizations and summary reports (e.g., dashboard outputs, PDF reports) may be shared publicly as long as they contain no identifiable patient information.
- **Licensing:**
  - Public data and code will be released under an open license (e.g., CC BY 4.0 or MIT), unless restricted by data source agreements.

Before sharing any data externally, ensure that patient privacy is preserved and institutional policies are followed.
## 7. Versioning and Documentation Strategy

Proper versioning and documentation are essential to ensure reproducibility, traceability, and regulatory compliance in the microbiota classification pipeline. This section outlines how both raw data and analysis workflows are tracked over time.

### Data Versioning

To ensure that each analysis is reproducible and auditable, versioning will be implemented at both the data and pipeline levels:

- **Pipeline Versioning**
  - Each release of the analysis pipeline will be versioned using semantic versioning (e.g., `v1.2.0`).
  - Major changes (e.g., algorithm updates or parameter overhauls) increment the major version.
  - Minor improvements or non-breaking updates increment the minor version.
- **Data Version Control**
  - Raw and processed datasets will be tracked using DVC (Data Version Control).
  - Each dataset snapshot will be associated with:
    - a commit hash from the Git repository
    - the pipeline version used
- **Sample-Level Tracking**
  - Each sample is assigned a persistent, unique SampleID from the hospital barcode system.
  - Analysis outputs are tagged with metadata including the sample origin, sequencing run ID, and processing date.
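The semantic-versioning rules and the snapshot association can be scripted. A sketch; `snapshot_tag` is a hypothetical helper for labeling DVC dataset snapshots, not a DVC command:

```python
import re

SEMVER = re.compile(r"^v(\d+)\.(\d+)\.(\d+)$")

def bump(version, part):
    """Increment a vMAJOR.MINOR.PATCH tag; lower components reset on a bump."""
    major, minor, patch = map(int, SEMVER.match(version).groups())
    if part == "major":
        major, minor, patch = major + 1, 0, 0
    elif part == "minor":
        minor, patch = minor + 1, 0
    elif part == "patch":
        patch += 1
    else:
        raise ValueError(f"unknown part: {part}")
    return f"v{major}.{minor}.{patch}"

def snapshot_tag(pipeline_version, git_hash):
    """Label a dataset snapshot with pipeline version + short commit hash."""
    return f"{pipeline_version}+{git_hash[:8]}"
```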
### Documentation Protocols

All datasets and pipeline runs will be documented through standardized and automated methods:

- **Pipeline Execution Logs**
  - Every run generates logs capturing:
    - execution timestamp
    - user/analyst ID
    - hardware/compute environment
    - parameters used
  - Logs are stored as `.log` files alongside the output directories.
- **Metadata Records**
  - Structured using JSON/YAML/CSV.
  - Fields include:
    - SampleID
    - collection date
    - DNA extraction protocol
    - sequencing run details
    - ...
- **README Files**
  - Each analysis directory will include a `README.md` detailing:
    - purpose of the run
    - date and responsible person
    - pipeline version
    - key findings or issues
- **CHANGELOGs**
  - Maintained at the repository level.
  - Each entry records an update to the pipeline with timestamp, author, and description of changes.
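The execution-log fields listed above can be captured in one machine-readable record per run. A sketch; the JSON field names and the `run_record.json` file name are this example's choices, not a fixed project schema:

```python
import json
import os
import platform
import sys
import time
from pathlib import Path

def write_run_record(output_dir, pipeline_version, params):
    """Drop a machine-readable run record alongside the pipeline outputs."""
    record = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),       # execution time
        "analyst": os.environ.get("USER", "unknown"),           # user/analyst ID
        "platform": platform.platform(),                        # compute environment
        "python": sys.version.split()[0],
        "pipeline_version": pipeline_version,
        "parameters": params,                                   # run parameters
    }
    out = Path(output_dir) / "run_record.json"
    out.write_text(json.dumps(record, indent=2))
    return out
```

Because the record is plain JSON, it can be diffed in Git and joined back to the run metadata table by pipeline version and timestamp.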