Directory Structure - egenomics/agb2025 GitHub Wiki
The pipeline follows a clean, modular structure, organized as follows:
HdMBioinfo-MicrobiotaPipeline/
├── main.nf                  # Entry point of the pipeline
├── nextflow.config
│
├── scripts/                 # Helper scripts and small executables
│   ├── convert_qiime_to_long.py
│   ├── api_csv.py
│   ├── metadata_parsing.R
│   ├── Reorder.R
│   └── process_metadata.py
│
├── modules/                 # DSL2-style modules for each pipeline step
│   ├── local/
│   │   ├── import_reads.nf
│   │   ├── denoise_dada2.nf
│   │   ...
│   │   └── create_results_summary.nf
│   └── nf-core/
│       ├── fastqc/
│       ├── trimmomatic/
│       └── kraken2/kraken2
│
├── data/
│   ├── adapters/
│   └── dev.csv
│
├── databases/
│   ├── silva-138-99-nb-classifier.qza
│   └── k2_Human_20230629.tar.gz
│
├── controls/
│   ├── metadata
│   │   ├── curated
│   │   │   └── metadata_cleaned.csv
│   │   └── non-curated
│   │   │   └── metadata_cleaned.csv
│
├── metadata/
│   ├── run_development_dataset
│   │   ├── curated
│   │   │   └── metadata_cleaned.csv
│   │   └── non_curated
│   │   │   ├── metadata_run.csv
│   │   │   └── metadata_sample.csv
│   └── templates
│       ├── metadata_run.csv
│       └── metadata_sample.csv
│
├── runs/
│   └── R[01-99][DDMMYY] (e.g. R01030525)/
│       ├── raw_data/
│       │   ├── S[A-Z]{2}[1-9][DDMMYY][R1-R2].fastq.gz
│       │   └── e.g. SAF1030525R1.fastq.gz
│       ├── metadata/
│       │   ├── metadata_cleaned.csv
│       │   ├── metadata_sample.csv
│       │   └── metadata_run.csv
│       ├── qc/
│       │   ├── raw/
│       │   │   ├── *_[1-2]_fastqc.html
│       │   │   └── *_[1-2]_fastqc.zip
│       │   └── trimmed/
│       │       ├── *_trimmed_[1-2]_fastqc.html
│       │       └── *_trimmed_[1-2]_fastqc.zip
│       ├── trimmed_reads/
│       │   ├── *.paired.trim_[1-2].fastq.gz
│       │   └── e.g. SAF1030525R1.paired.trim_1.fastq.gz
│       ├── logs/
│       │   └── trimmomatic/
│       │       ├── *_out.log
│       │       ├── *.summary
│       │       └── *_trim.log
│       ├── kraken/
│       │   ├── *.kraken2.classifiedreads.txt
│       │   └── *.kraken2.report.txt
│       ├── multiqc/
│       │   ├── multiqc_data/
│       │   └── multiqc_report.html
│       ├── qiime2/
│       │   ├── 00_reports
│       │   │   └── truncation_suggestion_report.txt
│       │   ├── 01_artifacts_input
│       │   │   ├── demux.qza
│       │   │   └── demux.qzv
│       │   ├── 02_denosied
│       │   │   ├── denoising-stats.qza
│       │   │   ├── rep-seqs.qza
│       │   │   └── table.qza
│       │   ├── 03_summaries
│       │   │   ├── rep-seqs.qzv
│       │   │   └── table.qzv
│       │   ├── 04_taxonomy
│       │   │   └── taxonomy.qza
│       │   ├── 05_phylogeny
│       │   │   ├── aligned-rep-seqs.qza
│       │   │   ├── masked-aligned-rep-seqs.qza
│       │   │   ├── rooted-tree.qza
│       │   │   └── unrooted-tree.qza
│       │   ├── 06_diversity
│       │   │   ├── alpha_rarefaction.qzv
│       │   │   ├── bray_curtis_distance_matrix.qza
│       │   │   ├── bray_curtis_emperor.qzv
│       │   │   ├── bray_curtis_pcoa_results.qza
│       │   │   ├── evenness_vector.qza
│       │   │   ├── faith_pd_vector.qza
│       │   │   ├── jaccard_distance_matrix.qza
│       │   │   ├── jaccard_emperor.qzv
│       │   │   ├── jaccard_pcoa_results.qza
│       │   │   ├── observed_features_vector.qza
│       │   │   ├── rarefied_table.qza
│       │   │   ├── shannon_vector.qza
│       │   │   ├── unweighted_unifrac_distance_matrix.qza
│       │   │   ├── unweighted_unifrac_emperor.qzv
│       │   │   ├── unweighted_unifrac_pcoa_results.qza
│       │   │   ├── weighted_unifrac_distance_matrix.qza
│       │   │   ├── weighted_unifrac_emperor.qzv
│       │   │   └── weighted_unifrac_pcoa_results.qza
│       │   ├── exported_results
│       │   │   ├── alpha_rarefaction/
│       │   │   │   └── index.html
│       │   │   ├── feature_table.tsv
│       │   │   ├── phylogenetic_tree.nwk
│       │   │   ├── representative_sequences.fasta
│       │   │   └── taxonomy.tsv
│       │   ├── analysis_summary.txt
│       │   └── feature_table_with_taxonomy.tsv
│       └── rarefaction_threshold/
│           ├── rarefaction_summary.txt
│           └── rarefaction_threshold.txt
│
├── visualization/
│   ├── report_template.Rmd
│   ├── shiny_dashboard_results_app.R
│   └── shiny_log.txt
│
├── INSTALLME.sh
├── create_run.sh
├── download_samples.sh
├── shiny_app.sh
├── Dockerfile
├── modules.json
│
├── .gitignore
├── LICENSE
└── README.md
Core Files
- main.nf: Entry point of the pipeline.
- nextflow.config: Global configuration for resources and parameters.
scripts/
Helper scripts for data preprocessing, metadata formatting, and automation.
modules/
DSL2-style pipeline modules.
- local/: Custom-built modules for pipeline steps, mainly QIIME2 and MultiQC.
- nf-core/: Modules sourced from the nf-core repository (e.g., fastqc/, kraken2/ and trimmomatic/).
data/
Input resources and datasets for development/testing.
- Includes the adapters used by Trimmomatic (specified in nextflow.config) and a small dev.csv with links for downloading the development raw data. The create_run.sh script uses these files.
databases/
Pretrained reference databases used during taxonomic classification (e.g., SILVA, Kraken2).
controls/
Metadata related to control samples for quality benchmarking.
- Organized into curated/ and non-curated/ sets.
metadata/
Example and template metadata files used for development.
- Includes development test sets (used by create_run.sh) and CSV templates.
runs/
Contains all per-run data and results. Each run is stored in its own folder, named according to the convention described on the File Nomenclature wiki page.
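The naming patterns shown in the directory tree (R[01-99][DDMMYY] for run folders and S[A-Z]{2}[1-9][DDMMYY][R1-R2].fastq.gz for raw reads) can be checked programmatically. The following is a minimal Python sketch, not part of the pipeline: the function names are hypothetical, and the regular expressions are one reading of the patterns above (the File Nomenclature page remains the authoritative definition).

```python
import re
from datetime import datetime

# Assumed reading of the convention shown in the directory tree:
#   run folder: R + run number 01-99 + date DDMMYY            (e.g. R01030525)
#   FASTQ file: S + 2 capital letters + digit 1-9 + DDMMYY
#               + R1 or R2 + ".fastq.gz"                      (e.g. SAF1030525R1.fastq.gz)
RUN_RE = re.compile(r"R(0[1-9]|[1-9][0-9])(\d{6})")
FASTQ_RE = re.compile(r"S([A-Z]{2})([1-9])(\d{6})(R[12])\.fastq\.gz")


def _is_valid_date(ddmmyy):
    """Return True if the six digits form a real calendar date (DDMMYY)."""
    try:
        datetime.strptime(ddmmyy, "%d%m%y")
        return True
    except ValueError:
        return False


def is_valid_run_name(name):
    """Check a run folder name such as 'R01030525' (hypothetical helper)."""
    match = RUN_RE.fullmatch(name)
    return bool(match) and _is_valid_date(match.group(2))


def is_valid_fastq_name(name):
    """Check a raw read file name such as 'SAF1030525R1.fastq.gz' (hypothetical helper)."""
    match = FASTQ_RE.fullmatch(name)
    return bool(match) and _is_valid_date(match.group(3))
```

Both example names from the tree (R01030525 and SAF1030525R1.fastq.gz) pass these checks, while a name with run number 00, a read suffix other than R1/R2, or an impossible date is rejected.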
Subfolders include:
- raw_data/: Raw FASTQ files.
- metadata/: Cleaned and raw metadata.
- qc/: FastQC outputs (raw and trimmed).
- trimmed_reads/: Trimmed reads.
- logs/: Log files (e.g., Trimmomatic).
- kraken/: Kraken2 results for contamination detection.
- multiqc/: Aggregated quality reports.
- qiime2/: Artifacts and results organized by QIIME2 step.
- rarefaction_threshold/: Rarefaction analysis data.
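To see which of these subfolders a given run already contains, a small check like the one below may help. It is only a sketch under stated assumptions: the helper name is hypothetical, the folder list is copied from this page, and a run that has not been fully processed is expected to lack the later output folders.

```python
from pathlib import Path

# Subfolders listed on this page for each run directory.
EXPECTED_SUBFOLDERS = (
    "raw_data",
    "metadata",
    "qc",
    "trimmed_reads",
    "logs",
    "kraken",
    "multiqc",
    "qiime2",
    "rarefaction_threshold",
)


def missing_subfolders(run_dir):
    """Return the expected subfolder names that are absent from run_dir.

    Hypothetical helper: it only tests for directory existence and does not
    inspect the files inside each subfolder.
    """
    run_path = Path(run_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (run_path / name).is_dir()]
```

For example, calling missing_subfolders("runs/R01030525") on a run that has only been created, but not yet processed, would be expected to report the output folders (qc, kraken, qiime2 and so on) as missing.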
visualization/
R/Shiny dashboard files and templates for generating final reports.
Other Files
- Dockerfile: Container environment setup.
- modules.json: nf-core module tracking.
- INSTALLME.sh: Bash script to install the Kraken2 database and the SILVA classifier.
- shiny_app.sh: Bash script to run the Shiny app.
- create_run.sh: Bash script that uses the development samples to create new runs following the naming convention.
- download_samples.sh: Bash script to download the development samples for any use.
- .gitignore, README.md, LICENSE: Standard project files.