bb_fastq_stats - ampinzonv/BB3 GitHub Wiki

Function: bb_fastq_stats

Generate basic statistics from a FASTQ file, with support for random subsampling.

WARNING: For large FASTQ files, bb_fastq_stats can take a very long time to output statistics, because it does not natively subsample the library: it uses every sequence in the FASTQ file to calculate the statistics. We recommend subsampling the library first and then calculating the statistics:

bb_fastq_subsample --input file.fq --sample_size 5 --quiet | bb_fastq_stats --input -

The command above will randomly subsample 5% of the library and then calculate statistics.


๐Ÿ” Description

This function analyzes the reads in a FASTQ file and produces a statistical summary that includes:

  • Total number of reads
  • Total sequence length
  • Minimum and maximum read length
  • Average read length
  • Percent of bases with quality ≥ Q20 and Q30
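
The metrics listed above can be approximated with a short awk pass over the file. This is only a minimal sketch, not the BB3 implementation: the function name fastq_stats_sketch is hypothetical, the column order of the output line is an assumption, and the code assumes 4-line FASTQ records with Phred+33 quality encoding.

```shell
# Hypothetical helper, not part of BB3: summarize a FASTQ stream from STDIN.
# Assumptions: 4-line records, Phred+33 qualities.
# Output columns (assumed order, tab-separated):
#   reads, total bases, min length, max length, mean length, %>=Q20, %>=Q30
fastq_stats_sketch() {
  awk '
    BEGIN {
      # awk has no ord(); build a character -> ASCII code lookup table.
      for (i = 33; i <= 126; i++) ord[sprintf("%c", i)] = i
      min = -1
    }
    NR % 4 == 2 {                      # sequence line
      len = length($0)
      reads++; bases += len
      if (min < 0 || len < min) min = len
      if (len > max) max = len
    }
    NR % 4 == 0 {                      # quality line
      n = length($0)
      for (i = 1; i <= n; i++) {
        q = ord[substr($0, i, 1)] - 33
        if (q >= 20) q20++
        if (q >= 30) q30++
      }
    }
    END {
      if (reads == 0 || bases == 0) {
        print "no reads found" > "/dev/stderr"
        exit 1
      }
      printf "%d\t%d\t%d\t%d\t%.2f\t%.2f\t%.2f\n",
        reads, bases, min, max, bases / reads,
        100 * q20 / bases, 100 * q30 / bases
    }'
}

# Example (assumes reads.fastq exists):
#   fastq_stats_sketch < reads.fastq
```

Counting quality per base rather than per read matches the "percent of bases" wording above. Whether bb_fastq_stats computes it the same way is not confirmed here.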

By default, it samples 10% of the reads to speed up processing. You can change this with --sample_size.


📥 Input

  • A FASTQ file, plain or gzip-compressed.
  • You can also use standard input with --input -.
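
The three input modes above (plain file, gzip-compressed file, STDIN) could be handled by a small dispatcher like the one sketched here. The name read_fastq_sketch is hypothetical, and the sketch uses gzip -dc instead of the gzcat/zcat pair the tool itself relies on, since gzip -dc behaves the same on macOS and Linux.

```shell
# Hypothetical helper, not part of BB3: write a FASTQ input to STDOUT.
#   "-"     -> pass STDIN through unchanged
#   *.gz    -> decompress with gzip -dc (assumed stand-in for gzcat/zcat)
#   other   -> treat as a plain-text file
read_fastq_sketch() {
  if [ "$1" = "-" ]; then
    cat
  elif [ "${1%.gz}" != "$1" ]; then
    gzip -dc -- "$1"
  else
    cat -- "$1"
  fi
}

# Example (assumes reads.fastq.gz exists):
#   read_fastq_sketch reads.fastq.gz | head -n 4
```

Detecting compression by file extension is a simplification. A more defensive version would inspect the file's magic bytes instead.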

📤 Output

  • A single-line summary report (tabular) with key metrics.

🧪 Examples

Analyze all reads:

bb_fastq_stats --input reads.fastq

Use a 5% subsample:

bb_fastq_stats --input reads.fastq --sample_size 5

Save the output:

bb_fastq_stats --input reads.fastq --outfile stats.tsv

Use in a pipe:

cat reads.fastq | bb_fastq_stats --input -

โš™๏ธ Usage

bb_fastq_stats --input FILE [--outfile FILE] [--sample_size PCT] [--quiet] [--force]

🧵 Options

Option             Description
--input FILE       Input FASTQ file (or - for STDIN) (required)
--outfile FILE     File to save output (optional, default: STDOUT)
--sample_size PCT  Percent of reads to randomly sample (1–100, default: 10)
--quiet            Suppress log messages
--force            Overwrite output file if it exists

📌 Notes

  • Compatible with compressed .gz FASTQ files (requires gzcat on macOS, zcat on Linux).
  • Uses internal random shuffling to select sampled reads.
  • Larger sample sizes yield more precise estimates, at the cost of longer run times.
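
To illustrate the shuffle-based sampling mentioned in the notes above, here is a minimal sketch that selects a percentage of reads with a partial Fisher-Yates shuffle in awk. It is an assumption about how such sampling could work, not the code BB3 uses. The name subsample_fastq_sketch and the optional seed argument are hypothetical, and the sketch holds all records in memory, so it only suits small inputs.

```shell
# Hypothetical helper, not part of BB3: keep a random PCT percent of reads.
#   $1 = percent of reads to keep (1-100)
#   $2 = optional random seed, for reproducible output
# Reads 4-line FASTQ records from STDIN and writes the sample to STDOUT.
subsample_fastq_sketch() {
  awk -v pct="$1" -v seed="${2:-}" '
    BEGIN { if (seed != "") srand(seed); else srand() }
    {
      # Group every four lines into one record, indexed from 0.
      idx = int((NR - 1) / 4)
      rec[idx] = rec[idx] $0 "\n"
    }
    END {
      n = int(NR / 4)                  # number of complete records
      k = int(n * pct / 100 + 0.5)     # number of records to keep
      # Partial Fisher-Yates shuffle: only the first k positions are drawn.
      for (i = 0; i < k; i++) {
        j = i + int(rand() * (n - i))
        tmp = rec[i]; rec[i] = rec[j]; rec[j] = tmp
        printf "%s", rec[i]
      }
    }'
}

# Example (assumes reads.fastq exists): keep 5% of reads with seed 42
#   subsample_fastq_sketch 5 42 < reads.fastq
```

Because records are swapped as whole units, each sampled read keeps its header, sequence, separator, and quality line together. The output order is random rather than the input order.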