Data Management Plan (DMP) - egenomics/agb2025 GitHub Wiki

Project: Microbiota Classification Pipeline – HdMBioinfo Group

Institution: Hospital del Mar

Primary Use Case: Classify stool samples as healthy or non-healthy for clinical intervention

1. Types of Data Generated

1.1. Metadata

Structured information describing each sample and run. Used for interpretation and downstream analysis.

  • Sample Metadata
    Clinical and contextual data per sample (e.g., ID, age, sex, BMI, diagnosis, treatment, collection date).

  • Run Metadata
    Technical information per sequencing run (e.g., pipeline version, operator, processing date, parameters).

  • Merged Metadata
    Combined clinical + technical + QC metrics for each sample.
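
As a sketch of how merged metadata could be produced, the snippet below left-joins run metadata onto sample metadata with pandas. The column names (`SampleID`, `RunID`, `PipelineVersion`) are illustrative assumptions, not the project's actual schema.

```python
# Sketch: merge per-sample clinical metadata with per-run technical metadata.
# Column names (SampleID, RunID, PipelineVersion) are assumptions, not the
# project's actual schema.
import pandas as pd

def merge_metadata(sample_df: pd.DataFrame, run_df: pd.DataFrame) -> pd.DataFrame:
    """Left-join run metadata onto sample metadata via the shared RunID column.

    validate="many_to_one" fails fast if a RunID appears twice in run metadata.
    """
    return sample_df.merge(run_df, on="RunID", how="left", validate="many_to_one")

if __name__ == "__main__":
    samples = pd.DataFrame({"SampleID": ["S1", "S2"], "RunID": ["R1", "R1"], "BMI": [22.5, 27.1]})
    runs = pd.DataFrame({"RunID": ["R1"], "PipelineVersion": ["v1.2.0"]})
    print(merge_metadata(samples, runs))
```

A left join keeps every sample row even when run metadata is missing, so incomplete runs surface as NaN rather than silently dropping samples.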


1.2. Raw Sequencing Data

Primary data from the sequencing instrument, used as input for downstream processing.

  • Paired-End FASTQ Files
    Raw sequencing reads for each sample (R1 and R2, usually .fastq.gz).
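
Before processing, it is worth checking that every R1 file has an R2 partner. The sketch below assumes an `_R1`/`_R2` naming convention, which may differ from the actual instrument output.

```python
# Sketch: verify every R1 FASTQ file has a matching R2 partner before processing.
# The "_R1"/"_R2" filename convention is an assumption about this project's files.
from pathlib import Path

def find_unpaired(fastq_dir: Path) -> list[str]:
    """Return sample prefixes whose R1 file lacks an R2 counterpart (or vice versa)."""
    prefixes: dict[str, set[str]] = {}
    for f in fastq_dir.glob("*.fastq.gz"):
        for tag in ("_R1", "_R2"):
            if tag in f.name:
                prefix = f.name.split(tag)[0]
                prefixes.setdefault(prefix, set()).add(tag)
    return sorted(p for p, tags in prefixes.items() if tags != {"_R1", "_R2"})
```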

1.3. Cleaned and Filtered Sequencing Data

Sequencing data processed to remove low-quality bases, adapters, and contaminants.

  • Trimmed and Filtered Reads
    Reads with adapters and low-quality bases removed (.fastq.gz).

  • Contamination-Free Reads (optional)
    Reads with contaminants removed (e.g., via Kraken2).


1.4. Quality Control Data

Reports and summaries that assess sequencing quality.

  • FastQC Reports
    Per-sample quality reports (.html, .zip).

  • MultiQC Report
    Aggregated report summarizing all QC metrics (multiqc_report.html).

  • Summary Tables
    Tabular metrics (CSV/TSV) such as total reads, %GC, and duplication rate.
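
Summary tables like these can drive automated sample flagging. The sketch below reads a QC summary CSV and flags samples below a read-count threshold; the column names (`sample`, `total_reads`) and the threshold are assumptions.

```python
# Sketch: flag samples in a QC summary table that fall below a read-count
# threshold. Column names ("sample", "total_reads") are assumed, not the
# pipeline's actual CSV layout.
import csv

def flag_low_read_samples(csv_path: str, min_reads: int = 10_000) -> list[str]:
    """Return sample IDs whose total read count is below min_reads."""
    flagged = []
    with open(csv_path, newline="") as fh:
        for row in csv.DictReader(fh):
            if int(row["total_reads"]) < min_reads:
                flagged.append(row["sample"])
    return flagged
```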


1.5. Intermediate Analysis Artifacts (QIIME 2)

Internal data objects produced by QIIME 2 for analysis.

  • QIIME Artifacts (.qza)
    Feature tables, rep-seqs, denoising stats, taxonomy, trees, distance matrices.

  • QIIME Visualizations (.qzv)
    Interactive views of feature tables, taxonomy bar plots, diversity metrics, etc.


1.6. Analysis Results

Biological and statistical results of microbiome analysis.

  • Taxonomic Profiles
    Tables or plots showing microbial composition per sample/group.

  • Diversity Metrics
    Alpha (e.g., Shannon, Simpson) and beta diversity (e.g., UniFrac, Bray-Curtis).

  • Differential Abundance
    Statistically significant taxa (e.g., via DESeq2, ANCOM).

  • Correlation Results
    Associations between microbial features and clinical/lifestyle variables.
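
For reference, the Shannon index named above can be computed directly from per-sample taxon counts; this minimal sketch uses the natural-log convention (QIIME 2 and other tools may use log base 2).

```python
# Sketch: alpha diversity (Shannon index, natural-log convention) from a
# sample's taxon count vector.
import math

def shannon_index(counts: list[int]) -> float:
    """Shannon diversity H' = -sum(p_i * ln p_i) over nonzero taxon counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)
```

A perfectly even community of 4 taxa gives H' = ln(4) ≈ 1.386; a single-taxon sample gives 0.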


1.7. Reports and Visualizations

User-friendly summaries for interpretation and dissemination.

  • Plots and Figures
    Barplots, boxplots, scatterplots, PCoA/NMDS (static or interactive).

  • Final Reports
    Downloadable reports (PDF, HTML, XLSX) with metadata, key findings, and figures.


Summary Table

| Category | Description | Examples |
|---|---|---|
| Metadata | Contextual info on samples and runs | metadata_sample.csv, run_metadata.csv |
| Raw Data | Raw sequencing reads | Paired-end .fastq.gz files |
| Cleaned Data | Filtered/trimmed reads | Clean .fastq.gz files |
| QC Data | Quality reports and metrics | FastQC, MultiQC, CSV summaries |
| Intermediate Artifacts | QIIME 2 internal data | .qza, .qzv files |
| Analysis Results | Biological/statistical insights | Diversity metrics, taxonomy tables, correlations |
| Reports & Visuals | Human-readable summaries | PDFs, HTML reports, charts, dashboards |

2. Data Storage Strategy

Storage Location

Primary Storage:

Servers within the hospital IT infrastructure.

Accessible only to authorized personnel within the HdMBioinfo group and Digestology Unit.


Capacity Planning

Initial capacity of ~2 TB, based on estimated FASTQ file sizes and intermediate data volumes from MiSeq runs.
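
A back-of-envelope check of the ~2 TB figure can be sketched as below. The per-sample sizes and overhead factor are illustrative assumptions, not measured values from this project's MiSeq runs.

```python
# Back-of-envelope storage estimate. The per-sample GB figures and the 25%
# overhead (QC reports, logs, staging) are illustrative assumptions only.
def estimate_storage_gb(n_samples: int,
                        raw_gb: float = 0.5,
                        cleaned_gb: float = 0.4,
                        intermediate_gb: float = 0.3,
                        overhead: float = 1.25) -> float:
    """Total storage in GB for n_samples, including a fixed overhead factor."""
    return n_samples * (raw_gb + cleaned_gb + intermediate_gb) * overhead
```

Under these assumptions, ~1,000 samples lands around 1.5 TB, consistent with planning for ~2 TB of headroom.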


Backup Protocols

Automated nightly backups to a local server.

Integrity Checks: Regular checksum (MD5/SHA-256) validation of critical datasets.
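
The integrity-check step above can be sketched as a streaming SHA-256 computation, so large FASTQ files never need to be loaded into memory at once:

```python
# Sketch: compute and verify SHA-256 checksums for critical files, as in the
# integrity-check step described above.
import hashlib
from pathlib import Path

def sha256sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream the file in 1 MiB chunks and return its hex SHA-256 digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify(path: Path, expected: str) -> bool:
    """True if the file's current digest matches the recorded one."""
    return sha256sum(path) == expected
```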

3. Data Retention and Lifecycle Policy

Retention Periods

| Data Type | Retention Duration | Notes |
|---|---|---|
| Raw FASTQ files | 3 years | Long-term for reproducibility and regulatory reasons |
| Intermediate files | 1 year | Renewable if the associated project remains active |
| Metadata from hospital DB | 3 years | Synchronized with clinical data retention norms |
| Final analysis results | 3 years | Includes classification outputs and reports |
| Pipeline logs and execution records | 3 years | For reproducibility |

Deletion and Archiving Criteria

Deletion: Automatic purging of intermediate files after 1 year unless flagged.

Archiving: Final results and raw data archived after 1 year of inactivity.

Annual review for inactive projects to determine continued storage or purging.

4. Access Control Policies

The following table summarizes who has access to each type of data across the microbiome pipeline. Access control follows the principle of least privilege: each group can read or edit only the data they need to perform their tasks.

  • Group 1 manages metadata and ensures its consistency. Only they can modify metadata files.
  • Group 2A handles raw sequencing and preprocessing, with exclusive write access to FASTQ and QC outputs.
  • Group 2B runs taxonomic and diversity analysis using cleaned data and produces intermediate QIIME outputs.
  • Group 3 builds visualizations and reports. They rely on metadata and analysis outputs but don’t modify upstream data.
  • Project Leads have read access to all stages for oversight and coordination.
  • Public access is allowed only for anonymized visualizations and final reports.

Sensitive metadata (e.g., clinical details) should be shared via secured drives and excluded from public repositories.

Whenever possible, access is managed using GitHub roles and file-specific permissions within shared folders.

| Data Type | G1 | G2A | G2B | G3 | Leads | Public |
|---|---|---|---|---|---|---|
| Sample Metadata | RW | R | R | R | R | ❌ |
| Raw FASTQ | R | RW | R | ❌ | R | ❌ |
| Filtered FASTQ | ❌ | RW | R | ❌ | ❌ | ❌ |
| QC Reports | ❌ | RW | ❌ | R | ❌ | ❌ |
| QIIME artifacts | ❌ | ❌ | RW | R | ❌ | ❌ |
| Analysis results | ❌ | ❌ | RW | RW | R | ❌ |
| Dashboard & Reports | ❌ | ❌ | ❌ | RW | R | ✅ (if anonymized) |

5. Data Security Measures

To protect sensitive patient data throughout the microbiome analysis pipeline, the following data security measures are in place:

  • Anonymization: All patient-identifying information (e.g., names, ID numbers, contact info) must be removed or replaced with pseudonyms before data is shared or uploaded.

  • Secure Storage:

    • Sensitive metadata is stored in encrypted and access-controlled cloud storage (e.g., Google Drive with restricted access).
    • Raw and processed data are stored on institutional servers or GitHub repositories with proper access controls.
    • No personal health data is committed to public repositories.
  • Access Control:

    • Only authorized team members have access to identifiable or sensitive data.
    • Access levels are assigned according to group roles and enforced via GitHub and shared drive permissions.
  • Data Transfer:

    • Data is only shared through secure platforms (e.g., encrypted drives, password-protected links).
    • No patient data is sent via email or unencrypted messaging platforms.
  • Audit and Logging:

    • Access logs and activity histories are maintained where possible (e.g., Google Drive, GitHub) to track who accessed or modified files.
  • Ethical Compliance:

    • All team members are expected to follow ethical guidelines and institutional data protection policies, including GDPR if applicable.

Reminder: Always double-check that any files made public (e.g., visualizations, reports) do not contain direct or indirect identifiers.
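
One common way to implement the anonymization requirement above is a keyed hash, which gives each patient a stable pseudonym without a lookup table in the repository. This is a sketch; the key must be stored outside the repo, and the field names are assumptions.

```python
# Sketch: replace patient identifiers with stable pseudonyms via an HMAC-SHA-256
# keyed hash. The secret key must live outside the repository; truncating to 12
# hex characters is an illustrative choice, not a project requirement.
import hashlib
import hmac

def pseudonymize(patient_id: str, secret_key: bytes) -> str:
    """Deterministic pseudonym: the same patient always maps to the same code."""
    digest = hmac.new(secret_key, patient_id.encode(), hashlib.sha256).hexdigest()
    return "P-" + digest[:12]
```

Because the mapping is keyed, re-identification requires the secret, unlike a plain hash of the ID, which can be brute-forced over the small space of hospital IDs.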

6. Data Sharing Policies

To ensure responsible and reproducible research, the following data sharing policies apply:

  • Internal Collaboration:

    • All project data is shared through centralized platforms (e.g., GitHub, institutional drives) with controlled access based on group roles.
    • Only non-sensitive data (e.g., pipeline outputs, summaries) are shared openly within the team.
    • Sensitive metadata is shared only with relevant team members via encrypted, access-controlled folders.
  • External Collaborators:

    • Collaborators outside the core team must sign data use agreements before receiving access to any patient-related data.
    • Only pseudonymized or aggregated data will be shared externally unless explicit consent and ethical approval are in place.
  • Publication and Public Release:

    • Raw sequencing data and relevant processed outputs may be deposited in public repositories (e.g., ENA, NCBI SRA) after de-identification and approval.
    • Sample metadata shared publicly will be minimal and fully anonymized.
    • Visualizations and summary reports (e.g., dashboard outputs, PDF reports) may be shared publicly as long as they contain no identifiable patient information.
  • Licensing:

    • Public data and code will be released under an open license (e.g., CC BY 4.0 or MIT), unless restricted by data source agreements.

Before sharing any data externally, ensure that patient privacy is preserved and institutional policies are followed.

7. Versioning and Documentation Strategy

Proper versioning and documentation are essential to ensure reproducibility, traceability, and regulatory compliance in the microbiota classification pipeline. This section outlines how both raw data and analysis workflows are tracked over time.


Data Versioning

To ensure that each analysis is reproducible and auditable, versioning will be implemented at both the data and pipeline levels:

  • Pipeline Versioning

    • Each release of the analysis pipeline will be versioned using semantic versioning (e.g., v1.2.0).
    • Major changes (e.g., algorithm updates or parameter overhauls) will increment the major version.
    • Minor improvements or non-breaking updates will increment the minor version.
  • Data Version Control

    • Raw and processed datasets will be tracked using DVC (Data Version Control).
    • Each dataset snapshot will be associated with:
      • A commit hash from the Git repository
      • Pipeline version used
  • Sample-Level Tracking

    • Each sample is assigned a persistent, unique SampleID from the hospital barcode system.
    • Analysis outputs are tagged with metadata including the sample origin, sequencing run ID, and processing date.

Documentation Protocols

All datasets and pipeline runs will be documented through standardized and automated methods:

  • Pipeline Execution Logs

    • Every run generates logs capturing:
      • Execution timestamp
      • User/analyst ID
      • Hardware/compute environment
      • Parameters used
    • Logs stored as .log files alongside output directories.
  • Metadata Records

    • Structured using JSON/YAML/CSV.
    • Fields include:
      • SampleID
      • Collection date
      • DNA extraction protocol
      • Sequencing run details
      • ...
  • README Files

    • Each analysis directory will include a README.md detailing:
      • Purpose of the run
      • Date and responsible person
      • Pipeline version
      • Key findings or issues
  • CHANGELOGs

    • Maintained at the repository level.
    • Records each update to the pipeline with timestamps, authors, and descriptions of changes.
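
The execution-log fields listed above can be captured as one JSON line per run, appended to a log file alongside the outputs. This is a minimal sketch; the field names and JSON-lines format are assumptions, not the pipeline's actual logging scheme.

```python
# Sketch: append one JSON line per pipeline run capturing timestamp, analyst,
# compute environment, and parameters. Field names and the JSON-lines format
# are assumptions for illustration.
import json
import os
import platform
from datetime import datetime, timezone
from pathlib import Path

def log_run(log_path: Path, params: dict) -> dict:
    """Append a run record to log_path and return it."""
    entry = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "analyst": os.environ.get("USER") or os.environ.get("USERNAME") or "unknown",
        "environment": platform.platform(),
        "parameters": params,
    }
    with open(log_path, "a") as fh:
        fh.write(json.dumps(entry) + "\n")
    return entry
```

One record per line keeps the log append-only and trivially parseable, which suits the audit and reproducibility goals above.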