Directory Structure - egenomics/agb2025 GitHub Wiki

The pipeline follows a clean, modular structure, organized as follows:

HdMBioinfo-MicrobiotaPipeline/
├── main.nf                        # Entry point of the pipeline     
├── nextflow.config 
│
├── scripts/                       # Helper scripts and small executables
│   ├── convert_qiime_to_long.py
│   ├── api_csv.py
│   ├── metadata_parsing.R
│   ├── Reorder.R
│   └── process_metadata.py
│
├── modules/                       # DSL2-style modules for each pipeline step
│   ├── local/
│   │   ├── import_reads.nf
│   │   ├── denoise_dada2.nf
│   │   ...
│   │   └── create_results_summary.nf
│   └── nf-core/
│       ├── fastqc/
│       ├── trimmomatic/
│       └── kraken2/kraken2
│
├── data/
│   ├── adapters/
│   └── dev.csv
│
├── databases/
│   ├── silva-138-99-nb-classifier.qza
│   └── k2_Human_20230629.tar.gz                
│
├── controls/
│   └── metadata
│       ├── curated
│       │   └── metadata_cleaned.csv
│       └── non-curated
│           └── metadata_cleaned.csv
│
├── metadata/
│   ├── run_development_dataset
│   │   ├── curated
│   │   │   └── metadata_cleaned.csv
│   │   └── non_curated
│   │       ├── metadata_run.csv
│   │       └── metadata_sample.csv
│   └── templates
│       ├── metadata_run.csv
│       └── metadata_sample.csv
│
├── runs/
│   └── R[01-99][DDMMYY] (e.g. R01030525)/
│       ├── raw_data/
│       │   ├── S[A-Z]{2}[1-9][DDMMYY][R1-R2].fastq.gz 
│       │   └── e.g. SAF1030525R1.fastq.gz
│       ├── metadata/
│       │   ├── metadata_cleaned.csv
│       │   ├── metadata_sample.csv
│       │   └── metadata_run.csv
│       ├── qc/         
│       │   ├── raw/
│       │   │   ├── *_[1-2]_fastqc.html
│       │   │   └── *_[1-2]_fastqc.zip
│       │   └── trimmed/
│       │       ├── *_trimmed_[1-2]_fastqc.html
│       │       └── *_trimmed_[1-2]_fastqc.zip
│       ├── trimmed_reads/
│       │   ├── *.paired.trim_[1-2].fastq.gz
│       │   └── e.g. SAF1030525R1.paired.trim_1.fastq.gz
│       ├── logs/
│       │   └── trimmomatic/
│       │       ├── *_out.log
│       │       ├── *.summary
│       │       └── *_trim.log
│       ├── kraken/
│       │   ├── *.kraken2.classifiedreads.txt
│       │   └── *.kraken2.report.txt
│       ├── multiqc/
│       │   ├── multiqc_data/
│       │   └── multiqc_report.html
│       ├── qiime2/
│       │   ├── 00_reports
│       │   │   └── truncation_suggestion_report.txt
│       │   ├── 01_artifacts_input
│       │   │   ├── demux.qza
│       │   │   └── demux.qzv
│       │   ├── 02_denosied
│       │   │   ├── denoising-stats.qza
│       │   │   ├── rep-seqs.qza
│       │   │   └── table.qza
│       │   ├── 03_summaries
│       │   │   ├── rep-seqs.qzv
│       │   │   └── table.qzv
│       │   ├── 04_taxonomy
│       │   │   └── taxonomy.qza
│       │   ├── 05_phylogeny
│       │   │   ├── aligned-rep-seqs.qza
│       │   │   ├── masked-aligned-rep-seqs.qza
│       │   │   ├── rooted-tree.qza
│       │   │   └── unrooted-tree.qza
│       │   ├── 06_diversity
│       │   │   ├── alpha_rarefaction.qzv            
│       │   │   ├── bray_curtis_distance_matrix.qza  
│       │   │   ├── bray_curtis_emperor.qzv          
│       │   │   ├── bray_curtis_pcoa_results.qza     
│       │   │   ├── evenness_vector.qza              
│       │   │   ├── faith_pd_vector.qza              
│       │   │   ├── jaccard_distance_matrix.qza      
│       │   │   ├── jaccard_emperor.qzv              
│       │   │   ├── jaccard_pcoa_results.qza         
│       │   │   ├── observed_features_vector.qza
│       │   │   ├── rarefied_table.qza
│       │   │   ├── shannon_vector.qza
│       │   │   ├── unweighted_unifrac_distance_matrix.qza
│       │   │   ├── unweighted_unifrac_emperor.qzv
│       │   │   ├── unweighted_unifrac_pcoa_results.qza
│       │   │   ├── weighted_unifrac_distance_matrix.qza
│       │   │   ├── weighted_unifrac_emperor.qzv
│       │   │   └── weighted_unifrac_pcoa_results.qza
│       │   ├── exported_results
│       │   │   ├── alpha_rarefaction/
│       │   │   │   └── index.html
│       │   │   ├── feature_table.tsv
│       │   │   ├── phylogenetic_tree.nwk
│       │   │   ├── representative_sequences.fasta
│       │   │   └── taxonomy.tsv
│       │   ├── analysis_summary.txt
│       │   └── feature_table_with_taxonomy.tsv
│       └── rarefaction_threshold/
│           ├── rarefaction_summary.txt
│           └── rarefaction_threshold.txt
│
├── visualization/
│   ├── report_template.Rmd
│   ├── shiny_dashboard_results_app.R
│   └── shiny_log.txt
│
├── INSTALLME.sh
├── create_run.sh
├── download_samples.sh
├── shiny_app.sh
├── Dockerfile
├── modules.json
│
├── .gitignore
├── LICENSE
└── README.md

Core Files

  • main.nf: Entry point of the pipeline.
  • nextflow.config: Global configuration for resources and parameters.

scripts/

Helper scripts for data preprocessing, metadata formatting, and automation.

modules/

DSL2-style pipeline modules.

  • local/: Custom-built modules for pipeline steps, mainly QIIME2 and MULTIQC.
  • nf-core/: Modules sourced from the nf-core repository (e.g., fastqc/, kraken2/, and trimmomatic/).

data/

Input resources and datasets for development/testing.

  • Includes the adapter sequences used by Trimmomatic (specified in nextflow.config) and a small dev.csv file with links for downloading the development raw data. The create_run.sh script uses these files.

databases/

Pretrained reference databases used during taxonomic classification (e.g., SILVA, Kraken2).

controls/

Metadata related to control samples for quality benchmarking.

  • Organized into curated/ and non-curated/ sets.

metadata/

Example and template metadata files used for development.

  • Includes development test sets (used by create_run.sh) and CSV templates.

runs/

Contains all per-run data and results. Each run is stored in a folder named according to the convention described on the File Nomenclature wiki page.
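
The patterns shown in the tree above (R[01-99][DDMMYY] for run folders and S[A-Z]{2}[1-9][DDMMYY][R1-R2].fastq.gz for raw reads) can be checked with a few lines of Python. This is only a minimal sketch: the function names are hypothetical and not part of the pipeline, and reading the DDMMYY block as a calendar date that must be valid is an assumption. The File Nomenclature page remains the authoritative definition.

```python
import re
from datetime import datetime

# Patterns transcribed from the directory tree; helper names are hypothetical.
RUN_RE = re.compile(r"^R(0[1-9]|[1-9][0-9])(\d{6})$")
SAMPLE_RE = re.compile(r"^S([A-Z]{2})([1-9])(\d{6})(R[12])\.fastq\.gz$")


def _is_valid_ddmmyy(value: str) -> bool:
    """Return True if the six digits form a real DDMMYY calendar date (assumption)."""
    try:
        datetime.strptime(value, "%d%m%y")
        return True
    except ValueError:
        return False


def is_valid_run_name(name: str) -> bool:
    """Check a run folder name such as R01030525."""
    match = RUN_RE.match(name)
    return bool(match) and _is_valid_ddmmyy(match.group(2))


def is_valid_sample_file(name: str) -> bool:
    """Check a raw read file name such as SAF1030525R1.fastq.gz."""
    match = SAMPLE_RE.match(name)
    return bool(match) and _is_valid_ddmmyy(match.group(3))


if __name__ == "__main__":
    for candidate in ("R01030525", "R00030525", "SAF1030525R1.fastq.gz"):
        ok = is_valid_run_name(candidate) or is_valid_sample_file(candidate)
        print(candidate, "valid" if ok else "invalid")
```

A check like this could be run before a new run folder is processed, so that misnamed files are caught early rather than partway through the pipeline.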

Subfolders include:

  • raw_data/: Raw FASTQ files.
  • metadata/: Cleaned and raw metadata.
  • qc/: FastQC outputs (raw and trimmed).
  • trimmed_reads/: Trimmed reads.
  • logs/: Log files (e.g., Trimmomatic).
  • kraken/: Kraken2 results for contamination detection.
  • multiqc/: Aggregated quality reports.
  • qiime2/: Artifacts and results organized by QIIME2 step.
  • rarefaction_threshold/: Rarefaction analysis data.
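
To see which of the subfolders listed above already exist in a given run folder, a small check like the following may help. It is a sketch under assumptions: the helper name is hypothetical, the folder list is copied from this page, and a partially processed run will legitimately be missing the later folders.

```python
from pathlib import Path

# Subfolder names copied from the list above.
EXPECTED_SUBFOLDERS = (
    "raw_data",
    "metadata",
    "qc",
    "trimmed_reads",
    "logs",
    "kraken",
    "multiqc",
    "qiime2",
    "rarefaction_threshold",
)


def missing_subfolders(run_dir):
    """Return the expected subfolder names that are not present in run_dir."""
    run_path = Path(run_dir)
    return [name for name in EXPECTED_SUBFOLDERS if not (run_path / name).is_dir()]


if __name__ == "__main__":
    # Example run folder taken from the tree; adjust the path as needed.
    missing = missing_subfolders("runs/R01030525")
    print("Missing subfolders:", ", ".join(missing) if missing else "none")
```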

visualization/

R/Shiny dashboard files and templates for generating final reports.

Other Files

  • Dockerfile: Container environment setup.
  • modules.json: nf-core module tracking.
  • INSTALLME.sh: Bash script that installs the Kraken2 database and the SILVA classifier.
  • shiny_app.sh: Bash script that runs the Shiny app.
  • create_run.sh: Bash script that uses the development samples to create new runs following the naming convention.
  • download_samples.sh: Bash script that downloads the development samples for any use.
  • .gitignore, README.md, LICENSE: Standard project files.