# Group 2A
## Inputs

- Paired-end fastq files located in `raw_data/` or in a folder structure like `outputs/run_<run_id>/raw_data/`, depending on the version of the pipeline being used.
- Sample metadata (`metadata_sample.csv`) provided by Group 1.
- Run metadata (`run_metadata.csv`) including pipeline version and parameters.
## Outputs

All outputs are organized under `outputs/run_<run_id>/`, including:

- Filtered fastq files in `processed_data/data_filtered/`
- FastQC reports in `qc_reports/`
- A merged MultiQC report in `qc_reports/multiqc_report.html`
- Updated sample metadata (merged with QC stats)
- Optionally, tar files with filtered reads for sharing with downstream groups
## Key features

- **Pipeline orchestration with Nextflow.** The pipeline is modular and reproducible, built with Nextflow, with a separate Docker container per module enabled via `-profile docker`.
- **Adapter and quality trimming.** Using `fastqc` and `trimmomatic` from nf-core modules, we trim adapters, low-quality bases, and short reads while keeping paired-end files in sync.
- **Contamination removal.** We use `kraken2` with a customizable database. Due to Docker isolation issues, integration is ongoing and tracked in a separate `kraken` branch; the main branch contains a minimal runnable pipeline. The Kraken database download is scripted and checks for a local copy before downloading. Issues with `curl` and image compatibility are ongoing.
- **Quality control.** FastQC is run pre- and post-trimming, and the results are compiled into a MultiQC report. Due to large file sizes, only summary metrics are merged into the metadata.
- **Sample metadata integration.** We merge MultiQC outputs with the sample metadata using `awk`, `sed`, and `csvjoin`; the output is saved as `metadata_sample_merged.csv` (see the sketch after this list).
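For reference, a pandas sketch of this merge step is shown below. The pipeline itself does it with `awk`, `sed`, and `csvjoin`; here the MultiQC file location, the `sample_id` key column, and the exact metric column names are assumptions to be checked against your MultiQC output.

```python
# Illustrative pandas equivalent of the awk/sed/csvjoin merge step.
# Paths, the 'sample_id' key, and metric column names are assumptions.
import pandas as pd

# Per-sample FastQC summary table written by MultiQC (tab-separated).
qc = pd.read_csv("qc_reports/multiqc_data/multiqc_fastqc.txt", sep="\t")

# Carry only summary metrics into the metadata (large tables stay out).
metrics = qc[["Sample", "%GC", "Total Sequences", "avg_sequence_length"]]

# Sample metadata provided by Group 1.
meta = pd.read_csv("metadata_sample.csv")

# Join QC metrics onto the metadata via the shared sample identifier.
merged = meta.merge(metrics, left_on="sample_id", right_on="Sample",
                    how="left").drop(columns=["Sample"])

merged.to_csv("metadata_sample_merged.csv", index=False)
```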
## Recent changes

- The `main.nf` pipeline now accepts a `--run_id` argument to determine where raw data and outputs live.
- Two download scripts are used:
  - `download_samples.sh`: pulls small dev/test samples
  - `create_run_and_download_samples.sh`: simulates a real run folder
- Final runs will assume that the fastq files are present in `outputs/run_<run_id>/raw_data/`.
## Known issues and decisions

- Kraken2 integration requires `curl` inside its container and fails when the base image does not provide it.
- Multiple issues with the Nextflow config (summary not printed, modules fail silently).
- GitHub blocks sample uploads due to size limits (>100 MB).
- The group agreed to limit runs to 15 samples for feasibility.
- The shared chat is our main communication and coordination tool.
## Usage

```bash
# create a run folder and download raw data
bash create_run_and_download_samples.sh

# run the pipeline
nextflow run main.nf --run_id <run_id> -profile docker
```
## Next steps

- Integrate kraken2 when stable.
- Finalize the metadata format for MultiQC metrics.
- Simplify `run_id` usage or script it automatically.
- Document everything clearly for Group 2B integration.
## GC content classification

As part of our quality control pipeline, we classified each sample based on its GC content using the summary output from MultiQC. Specifically:

- We extracted the %GC value for each sample from the `multiqc_fastqc.txt` file.
- These values were then merged with our sample metadata (from `sample_metadata.tsv`) using a common sample identifier.
- We applied a simple classification rule:
  - PASS: %GC ≥ 40%
  - FAIL: %GC < 40%

The rule was implemented with a small Python script (pandas) in the `classify_quality` process of the Nextflow pipeline; a sketch is shown below. The final result, `sample_metadata_classified.csv`, contains all original metadata fields plus a new column called `quality_flag` indicating whether each sample passed or failed the GC content threshold.
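A minimal sketch of that `classify_quality` logic, assuming the metadata keys on a `sample_id` column and that the MultiQC table uses the `Sample` and `%GC` column names; the actual script may differ in details:

```python
# Minimal sketch of the classify_quality step: extract %GC from the MultiQC
# FastQC summary, merge it into the sample metadata, and flag each sample.
import pandas as pd

qc = pd.read_csv("multiqc_fastqc.txt", sep="\t")     # MultiQC FastQC summary
meta = pd.read_csv("sample_metadata.tsv", sep="\t")  # sample metadata

# 'sample_id' is an assumed key column in the metadata.
merged = meta.merge(qc[["Sample", "%GC"]], left_on="sample_id",
                    right_on="Sample", how="left").drop(columns=["Sample"])

# PASS if %GC >= 40, FAIL otherwise (missing %GC values also end up FAIL).
merged["quality_flag"] = merged["%GC"].map(
    lambda gc: "PASS" if gc >= 40 else "FAIL"
)

merged.to_csv("sample_metadata_classified.csv", index=False)
```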
Low GC samples (<40%) may indicate:

- Host DNA contamination (e.g. human epithelial cells).
- Amplification bias (preferential amplification of AT-rich fragments).
- Poor sequencing quality or degraded samples.
- Unusual or unexpected organisms (e.g. parasites, fungi, or environmental contaminants).

The FAIL flag therefore signals that the sample needs further inspection.
## Python container

This container is used in the metadata handling processes of the AGB2025 pipeline to process and merge sample metadata and QC data with Python and pandas.

```bash
docker build -t docker.io/agb2025-python -f Dockerfile .
```

You must use the full `docker.io/` prefix so that Nextflow can find the image when you reference it explicitly in your config. Because the image is built locally, all users must either:

- build it manually using the Dockerfile, or
- push it to a shared Docker registry (e.g., Docker Hub under your lab's org).