bb_fastq_subsampling - ampinzonv/BB3 GitHub Wiki

Function: bb_fastq_subsampling

Randomly subsample a percentage of sequences from a FASTQ file.


🔍 Description

This function takes a FASTQ file and selects a random subset of the reads, based on a specified sampling percentage. It preserves the original FASTQ format and can be used to downsample large datasets before quality control or testing.
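
The selection step can be pictured with a minimal sketch. It is not the BB3 implementation: the function name `subsample_fastq_sketch`, its positional arguments, and the optional seed are hypothetical, and the approach (counting reads first, then selection sampling in awk) is only one way to draw a fixed percentage of four-line FASTQ records while keeping their original order.

```shell
# Hypothetical helper, NOT part of BB3: subsample PCT percent of reads
# from a plain (uncompressed) FASTQ file and print them to STDOUT.
subsample_fastq_sketch() {
    local fastq="$1" pct="$2" seed="${3:-42}"
    local total keep

    # Each FASTQ record spans exactly four lines.
    total=$(( $(wc -l < "$fastq") / 4 ))
    keep=$(( total * pct / 100 ))

    # Return at least one read when the input is not empty.
    if [ "$keep" -lt 1 ] && [ "$total" -gt 0 ]; then
        keep=1
    fi

    awk -v total="$total" -v keep="$keep" -v seed="$seed" '
        BEGIN { srand(seed); remaining = total; needed = keep }
        NR % 4 == 1 {
            # Selection sampling: take this record with
            # probability needed / remaining.
            take = (remaining > 0 && rand() * remaining < needed)
            if (take) needed--
            remaining--
        }
        # Print all four lines of a selected record.
        take { print }
    ' "$fastq"
}
```

Under these assumptions, `subsample_fastq_sketch reads.fastq 10 > subset.fastq` would write roughly 10% of the reads. Because the sketch counts reads before sampling, it needs a regular file rather than STDIN, and passing the same seed gives the same selection on a given awk implementation.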


📥 Input

  • A FASTQ file, plain or compressed (.gz).
  • STDIN is supported using --input -.

📤 Output

  • A FASTQ file containing the subsampled reads.
  • Output goes to STDOUT by default or to a file using --outfile.

🧪 Examples

Subsample 10% of the reads:

bb_fastq_subsampling --input reads.fastq --sample_size 10

Subsample 25% and save to file:

bb_fastq_subsampling --input reads.fastq --sample_size 25 --outfile subset.fastq

Use in a pipeline:

cat reads.fastq | bb_fastq_subsampling --input - --sample_size 5

Process a compressed FASTQ file:

bb_fastq_subsampling --input reads.fastq.gz --sample_size 20 --outfile subset.fastq
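
To check that a subsample has roughly the expected size, the reads in the input and output files can be counted. The helper below is a hypothetical sketch, not a BB3 function; it assumes well-formed FASTQ with exactly four lines per record and uses `gzip -dc` for compressed files.

```shell
# Hypothetical helper, NOT part of BB3: count reads in a plain or
# gzip-compressed FASTQ file.
count_fastq_reads() {
    local fastq="$1" lines
    case "$fastq" in
        *.gz) lines=$(gzip -dc "$fastq" | wc -l) ;;
        *)    lines=$(wc -l < "$fastq") ;;
    esac
    # Four lines per FASTQ record.
    echo $(( lines / 4 ))
}
```

Running `count_fastq_reads reads.fastq` and `count_fastq_reads subset.fastq` after a 25% subsample should report a second number close to a quarter of the first.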

⚙️ Usage

bb_fastq_subsampling --input FILE [--sample_size PCT] [--outfile FILE] [--quiet] [--force]

🧵 Options

Option               Description
--input FILE         Input FASTQ file or - for STDIN (required)
--sample_size PCT    Percentage of reads to subsample (1–100, default: 10)
--outfile FILE       Output file (default: STDOUT)
--quiet              Suppress log messages
--force              Overwrite output file if it exists

📌 Notes

  • On macOS, this function uses gzcat to read compressed files. On Linux, it uses zcat.
  • If the number of reads in the input is very low, the function ensures at least one read is returned.
  • Selection is random, so results can differ between runs. The output is reproducible only if the random seed is controlled outside the script.
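
The platform difference in the first note can be sketched as follows. This is an illustration only: `pick_decompressor` is a hypothetical name, and the check on `uname -s` is an assumption about how such a choice could be made, not a description of the BB3 source.

```shell
# Hypothetical sketch, NOT the BB3 source: choose the command used to
# read .gz files based on the operating system name.
pick_decompressor() {
    # Use the given OS name, or ask uname when none is supplied.
    local os="${1:-$(uname -s)}"
    case "$os" in
        Darwin) echo "gzcat" ;;  # macOS
        *)      echo "zcat" ;;   # Linux and other systems
    esac
}
```

A caller could then run `"$(pick_decompressor)" reads.fastq.gz | head -n 4` to peek at the first record on either platform.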