Group 2 A - egenomics/agb2025 GitHub Wiki

Group 2A

input we need

  • Paired-end fastq files located in raw_data/ or in a folder structure like outputs/run_<run_id>/raw_data/, depending on the version of the pipeline being used.
  • Sample metadata (metadata_sample.csv) provided by Group 1.
  • Run metadata (run_metadata.csv) including pipeline version and parameters.

output we will create

All outputs are organized under outputs/run_<run_id>/, including:

  • Filtered fastq files in processed_data/data_filtered/
  • fastqc reports in qc_reports/
  • A merged multiqc report in qc_reports/multiqc_report.html
  • Updated sample metadata (merged with qc stats)
  • Optionally, tar files with filtered reads for sharing with downstream groups

objectives of the group

  • Pipeline orchestration with Nextflow
    The pipeline is modular and reproducible, built with Nextflow and using separate Docker containers per module via -profile docker.

  • Adapter and quality trimming
    Using fastqc and trimmomatic from nf-core modules, we trim adapters, low-quality bases, and short reads while keeping paired-end sync.

  • Contamination removal
    We use kraken2 with a customizable database. Due to Docker isolation, integration is ongoing and tracked in a separate kraken branch. The main branch contains a minimal runnable pipeline.

    The Kraken database download is scripted; the script checks whether the database already exists locally before downloading. Issues with curl and container image compatibility are still open.

  • Quality control
    fastqc is run pre- and post-trimming. Results are compiled into multiqc. Due to large file sizes, only summary metrics are merged into metadata.

  • Sample metadata integration
    We merge multiqc outputs with the sample metadata using awk, sed, and csvjoin. The output is saved as metadata_sample_merged.csv.
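
The metadata integration step above is essentially a left join of QC summary metrics onto the sample metadata. The sketch below illustrates that join with Python's standard library only; it is not the pipeline's awk/sed/csvjoin implementation. The column names (sample_id, Sample, %GC, Total Sequences), the helper name merge_qc_into_metadata, and the example rows are assumptions made for illustration, and the sketch assumes one QC row per sample identifier.

```python
import csv
import io


def merge_qc_into_metadata(metadata_rows, qc_rows, key="sample_id",
                           qc_key="Sample",
                           qc_fields=("%GC", "Total Sequences")):
    """Left-join selected QC summary fields onto the sample metadata.

    Every metadata row is kept. Samples without a matching QC row get
    empty strings for the QC fields. Column names are assumptions.
    """
    qc_by_sample = {row[qc_key]: row for row in qc_rows}
    merged = []
    for row in metadata_rows:
        qc = qc_by_sample.get(row[key], {})
        out = dict(row)
        for field in qc_fields:
            out[field] = qc.get(field, "")
        merged.append(out)
    return merged


# Hypothetical in-memory inputs standing in for the metadata CSV and the
# tab-separated multiqc summary table.
metadata_csv = "sample_id,body_site\nS1,gut\nS2,gut\n"
multiqc_tsv = "Sample\t%GC\tTotal Sequences\nS1\t52.0\t10000\n"

metadata_rows = list(csv.DictReader(io.StringIO(metadata_csv)))
qc_rows = list(csv.DictReader(io.StringIO(multiqc_tsv), delimiter="\t"))
merged = merge_qc_into_metadata(metadata_rows, qc_rows)

# Write the merged table as CSV text (a real run would write to a file).
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(merged[0].keys()))
writer.writeheader()
writer.writerows(merged)
merged_csv = buffer.getvalue()
```

A left join keeps samples that have no QC row, so a missing report shows up as empty QC fields rather than as a silently dropped sample.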

implementation decisions

  • The main.nf pipeline now accepts a --run_id argument to determine where raw_data and outputs live.
  • Two download scripts are used:
    • download_samples.sh: for pulling dev/test samples (small scale)
    • create_run_and_download_samples.sh: simulates a real run folder
  • Final runs will assume that the fastq files are present in outputs/run_<run_id>/raw_data/.
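
To make the run layout concrete, the following sketch derives the per-run directories from a run identifier, using the folder names listed on this page. It is an illustration in Python, not code from main.nf; the function name run_paths, the base parameter, and the run id "R01" are hypothetical.

```python
from pathlib import Path


def run_paths(run_id, base="outputs"):
    """Return the directories used by one run, following the layout
    outputs/run_<run_id>/... described on this page."""
    run_dir = Path(base) / f"run_{run_id}"
    qc_reports = run_dir / "qc_reports"
    return {
        "run_dir": run_dir,
        "raw_data": run_dir / "raw_data",
        "filtered": run_dir / "processed_data" / "data_filtered",
        "qc_reports": qc_reports,
        "multiqc_report": qc_reports / "multiqc_report.html",
    }


# "R01" is a made-up run id used only for this example.
paths = run_paths("R01")
for name, path in paths.items():
    print(f"{name}: {path.as_posix()}")
```

Deriving every path from the run id in one place is one possible way to approach the "simplify run_id usage" task, since no other step then needs to hard-code a folder name.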

current challenges

  • Kraken2 integration requires curl inside its container and fails without a base image.
  • Multiple issues with nextflow config (summary not printed, modules fail silently).
  • GitHub blocks sample upload due to file size limits (>100 MB).
  • Group agreed to limit runs to 15 samples for feasibility.
  • Shared chat is our main communication and coordination tool.

suggested pipeline usage

# create a run folder and download raw data
bash create_run_and_download_samples.sh

# run the pipeline
nextflow run main.nf --run_id <run_id> -profile docker

outstanding tasks

  • Integrate kraken2 when stable.
  • Finalize metadata format for multiqc metrics.
  • Simplify run_id usage or script it automatically.
  • Document everything clearly for group 2B integration.

sample classification based on GC content

As part of our quality control pipeline, we classified each sample based on its GC content using the summary output from multiqc. Specifically:

We extracted the %GC value for each sample from the multiqc_fastqc.txt file.

These values were then merged with our sample metadata (from sample_metadata.tsv) using a common sample identifier.

We applied a simple classification rule:

  • PASS: if %GC ≥ 40%

  • FAIL: if %GC < 40%

This rule was implemented using a small python script (pandas) in our classify_quality process in the nextflow pipeline. The final result, sample_metadata_classified.csv, contains all original metadata fields plus a new column called quality_flag indicating whether each sample passed or failed the GC content threshold.
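
The classification rule can be sketched as follows. This is a minimal standard-library illustration, not the pandas script used in the classify_quality process. The function names classify_gc and add_quality_flag, the column name %GC on the merged rows, and the example samples are assumptions.

```python
GC_THRESHOLD = 40.0  # percent; samples at or above this value pass


def classify_gc(gc_percent, threshold=GC_THRESHOLD):
    """Return "PASS" if %GC >= threshold, otherwise "FAIL"."""
    return "PASS" if float(gc_percent) >= threshold else "FAIL"


def add_quality_flag(rows, gc_column="%GC"):
    """Return copies of the rows with an added quality_flag column.

    Rows are dicts, for example as produced by csv.DictReader. A missing
    or non-numeric %GC value raises an error instead of being silently
    flagged.
    """
    return [{**row, "quality_flag": classify_gc(row[gc_column])}
            for row in rows]


# Hypothetical merged rows (sample metadata plus %GC from multiqc).
rows = [
    {"sample_id": "S1", "%GC": "52.0"},
    {"sample_id": "S2", "%GC": "39.9"},
    {"sample_id": "S3", "%GC": "40"},
]
flagged = add_quality_flag(rows)
```

In this sketch a sample at exactly 40% passes, matching the ≥ in the rule above.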

Low GC samples (<40%) may indicate:

  • Host DNA contamination (e.g. human epithelial cells).

  • Amplification bias (AT-rich preferential amplification).

  • Poor sequencing quality or degraded samples.

  • Unusual or unexpected organisms (e.g. parasites, fungi, or environmental contaminants).

So the FAIL flag signals the need for further inspection of that sample.

agb2025-python: custom python container with pandas and csvjoin

Used in metadata handling processes of the AGB2025 pipeline to process and merge sample metadata and QC data using python and pandas.

docker build -t docker.io/agb2025-python -f Dockerfile .

You must use the full docker.io/ prefix so that Nextflow can find the image when you explicitly reference it in your config. The image exists only on the machine where it was built, so all users must either:

  • Build it manually using the Dockerfile, OR

  • Push it to a shared Docker registry (e.g., Docker Hub under your lab's org)
