bb_fastq_stats - ampinzonv/BB3 GitHub Wiki

Function: `bb_fastq_stats`

Generate basic statistics from a FASTQ file, with support for random subsampling.

WARNING: For large FASTQ files, bb_fastq_stats can take a very long time to output statistics because it does not natively subsample the library, meaning that it uses every sequence in the file to calculate statistics. Our advice is to subsample the library first and then calculate the statistics:

bb_fastq_subsample --input file.fq --sample_size 5 --quiet | bb_fastq_stats --input -

The command above will randomly subsample 5% of the library and then calculate statistics.

🔍 Description

This function analyzes the reads in a FASTQ file and produces a statistical summary that includes:

Total number of reads
Total sequence length
Minimum and maximum read length
Average read length
Percentage of bases with quality ≥ Q20 and ≥ Q30
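
To make the metrics above concrete, here is a minimal sketch of how they could be computed with awk. It is not the BB3 implementation: the function name `fastq_stats_sketch` and the column order are hypothetical, and it assumes strict 4-line FASTQ records with Phred+33 quality encoding.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of BB3: summarize a FASTQ stream from STDIN.
# Assumes 4-line records and Phred+33 quality encoding.
# Output columns (an assumption for this sketch):
# reads, total_bases, min_len, max_len, mean_len, pct_q20, pct_q30
fastq_stats_sketch() {
  awk '
    BEGIN {
      # Build a character -> ASCII code lookup for the printable range.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
    }
    NR % 4 == 2 {                       # sequence line
      len = length($0)
      reads++; total += len
      if (reads == 1 || len < min) min = len
      if (len > max) max = len
    }
    NR % 4 == 0 {                       # quality line
      n = length($0)
      for (i = 1; i <= n; i++) {
        q = ord[substr($0, i, 1)] - 33  # Phred score of this base
        if (q >= 20) q20++
        if (q >= 30) q30++
      }
    }
    END {
      if (reads == 0 || total == 0) { print "no bases found" > "/dev/stderr"; exit 1 }
      printf "%d\t%d\t%d\t%d\t%.2f\t%.2f\t%.2f\n",
             reads, total, min, max, total / reads,
             100 * q20 / total, 100 * q30 / total
    }
  '
}
```

For example, `printf '@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nII55+#\n' | fastq_stats_sketch` prints 2 reads, 10 bases, lengths 4 to 6, mean 5.00, 80.00% of bases at Q20 or higher and 60.00% at Q30 or higher.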

By default, it samples 10% of the reads to speed up processing. You can change this with --sample_size.

📥 Input

A FASTQ file, plain or gzip-compressed.
You can also use standard input with --input -.
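
The three input modes can be handled uniformly by a small reader function. The sketch below is an illustration only: `read_fastq_sketch` is a hypothetical name, and it assumes compressed files end in `.gz`. It uses `gzip -dc`, which is available on both macOS and Linux.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of BB3: stream a FASTQ file to STDOUT
# whether it is plain text, gzip-compressed, or arriving on STDIN ("-").
read_fastq_sketch() {
  local input="$1"
  if [ "$input" = "-" ]; then
    cat                                  # pass STDIN through unchanged
  elif [ ! -r "$input" ]; then
    echo "cannot read: $input" >&2
    return 1
  else
    case "$input" in
      *.gz) gzip -dc "$input" ;;         # assumption: .gz suffix means gzip
      *)    cat "$input" ;;              # anything else is treated as plain text
    esac
  fi
}
```

The output of such a reader can be piped into any tool that accepts `--input -`.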

📤 Output

A single-line summary report (tabular) with key metrics.

🧪 Examples

Analyze a file with default settings:

bb_fastq_stats --input reads.fastq

Use a 5% subsample:

bb_fastq_stats --input reads.fastq --sample_size 5

Save the output:

bb_fastq_stats --input reads.fastq --outfile stats.tsv

Use in a pipe:

cat reads.fastq | bb_fastq_stats --input -

⚙️ Usage

bb_fastq_stats --input FILE [--outfile FILE] [--sample_size PCT] [--quiet] [--force]

🧵 Options

| Option | Description |
| --- | --- |
| `--input FILE` | Input FASTQ file, or `-` for STDIN (required) |
| `--outfile FILE` | File to save output (optional, default: STDOUT) |
| `--sample_size PCT` | Percent of reads to randomly sample (1–100, default: 10) |
| `--quiet` | Suppress log messages |
| `--force` | Overwrite output file if it exists |
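
When calling the function from a script, it can help to check the `--sample_size` value against the documented 1–100 range before starting a long run. The sketch below is a hypothetical pre-flight check (`validate_sample_size_sketch` is not part of BB3) and assumes the value must be a whole number.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of BB3: succeed only if the argument is a
# whole number between 1 and 100 inclusive.
validate_sample_size_sketch() {
  local pct="$1"
  case "$pct" in
    ''|*[!0-9]*) return 1 ;;             # empty or contains a non-digit
  esac
  [ "$pct" -ge 1 ] && [ "$pct" -le 100 ]
}

# Example use in a wrapper script:
# validate_sample_size_sketch "$pct" || { echo "sample size must be 1-100" >&2; exit 1; }
```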

📌 Notes

Compatible with compressed .gz FASTQ files (requires gzcat on macOS, zcat on Linux).
Uses internal random shuffling to select sampled reads.
Larger sample sizes yield more precise estimates.
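
To illustrate percentage-based subsampling of FASTQ records, here is a minimal sketch. It deliberately uses a different technique from the shuffling mentioned above: each 4-line record is kept independently with probability PCT/100 (Bernoulli sampling), so the number of reads returned is only approximately PCT percent. The name `subsample_fastq_sketch` and the optional seed argument are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of BB3: keep roughly PCT percent of the
# 4-line FASTQ records read from STDIN. Each record is kept independently
# with probability PCT/100 (Bernoulli sampling, not a shuffle).
# Usage: subsample_fastq_sketch PCT [SEED]
subsample_fastq_sketch() {
  local pct="$1" seed="${2:-42}"
  awk -v pct="$pct" -v seed="$seed" '
    BEGIN { srand(seed) }                          # fixed seed gives repeatable output
    NR % 4 == 1 { keep = (rand() * 100 < pct) }    # decide once per record header
    keep                                           # print every line of a kept record
  '
}
```

With a fixed seed the same reads are selected on every run with the same awk implementation, which makes it easier to check how much the statistics change as the sample size grows.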