bb_fastq_subsampling - ampinzonv/BB3 GitHub Wiki
bb_fastq_subsampling
Function: Randomly subsample a percentage of sequences from a FASTQ file.
Description
This function takes a FASTQ file and selects a random subset of the reads, based on a specified sampling percentage. It preserves the original FASTQ format and can be used to downsample large datasets before quality control or testing.
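The idea of percentage-based subsampling can be pictured with a minimal sketch. This is not the BB3 implementation: the function name `sketch_subsample` and its seed argument are hypothetical. It assumes well-formed four-line FASTQ records and keeps each read independently with probability PCT/100, so the output size is only approximately the requested percentage and may be empty for very small inputs.

```shell
# Hypothetical sketch, not the BB3 code: keep each 4-line FASTQ record
# with probability PCT/100. Assumes well-formed records
# (header, sequence, '+', quality) with no wrapped lines.
sketch_subsample() {
  local pct="$1"        # sampling percentage, 0-100
  local seed="${2:-1}"  # assumed seed argument so a run can be repeated
  awk -v pct="$pct" -v seed="$seed" '
    BEGIN { srand(seed) }
    # Decide once per record, on its header line.
    NR % 4 == 1 { keep = (rand() * 100 < pct) }
    # Print every line of a record that was kept.
    keep { print }
  '
}

# Demo on a tiny in-memory FASTQ with three reads.
demo_fastq=$(printf '@r%s\nACGT\n+\nIIII\n' 1 2 3)
printf '%s\n' "$demo_fastq" | sketch_subsample 50 7
```

Deciding on the header line and reusing that decision for the next three lines is what keeps records intact in the output.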
Input
- A FASTQ file, plain or compressed (.gz).
- STDIN is supported using `--input -`.
Output
- A FASTQ file containing the subsampled reads.
- Output goes to STDOUT by default, or to a file with `--outfile`.
Examples
Subsample 10% of the reads:
bb_fastq_subsampling --input reads.fastq --sample_size 10
Subsample 25% and save to file:
bb_fastq_subsampling --input reads.fastq --sample_size 25 --outfile subset.fastq
Use in a pipeline:
cat reads.fastq | bb_fastq_subsampling --input - --sample_size 5
Process a compressed FASTQ file:
bb_fastq_subsampling --input reads.fastq.gz --sample_size 20 --outfile subset.fastq
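To sanity-check a subsample, one option is to compare read counts before and after. The helper below is a hypothetical sketch, not part of BB3; it assumes an uncompressed FASTQ file with exactly four lines per read.

```shell
# Hypothetical helper: count reads in an uncompressed FASTQ file.
# Assumes four lines per read (no wrapped sequence or quality lines).
count_reads() {
  local lines
  # tr strips the padding some wc implementations add.
  lines=$(wc -l < "$1" | tr -d ' ')
  echo $(( lines / 4 ))
}

# Possible use after one of the commands above:
#   bb_fastq_subsampling --input reads.fastq --sample_size 25 --outfile subset.fastq
#   count_reads reads.fastq
#   count_reads subset.fastq
```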
Usage
bb_fastq_subsampling --input FILE [--sample_size PCT] [--outfile FILE] [--quiet] [--force]
Options
Option | Description |
---|---|
`--input FILE` | Input FASTQ file, or `-` for STDIN (required) |
`--sample_size PCT` | Percentage of reads to subsample (1–100, default: 10) |
`--outfile FILE` | Output file (default: STDOUT) |
`--quiet` | Suppress log messages |
`--force` | Overwrite output file if it exists |
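When the percentage comes from a variable in a wrapper script, it may be worth checking it against the documented range before calling the function. The sketch below is an assumption-laden illustration: `valid_pct` is a hypothetical helper, and it assumes whole-number percentages, which the table above does not confirm.

```shell
# Hypothetical helper: check a --sample_size value against the documented
# range (1-100, default 10). Assumes whole numbers only.
valid_pct() {
  local pct="${1:-10}"   # fall back to the documented default
  case "$pct" in
    *[!0-9]*) return 1 ;;   # reject anything that is not all digits
  esac
  [ "$pct" -ge 1 ] && [ "$pct" -le 100 ]
}

# Possible use in a wrapper script:
#   valid_pct "$pct" && bb_fastq_subsampling --input reads.fastq --sample_size "$pct"
```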
Notes
- On macOS, this function uses `gzcat` to read compressed files. On Linux, it uses `zcat`.
- If the input contains very few reads, the function ensures that at least one read is returned.
- Selection is random. Runs are reproducible only if the random seed is controlled outside the script.
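The first two notes can be sketched as follows. Both helpers are hypothetical and are not taken from the BB3 source: `pick_zcat` assumes the platform is detected with `uname -s`, and `reads_to_keep` assumes the target count is computed with integer arithmetic and then raised to one when needed.

```shell
# Hypothetical helper: pick the decompress-to-stdout command per platform,
# mirroring the macOS/Linux split described in the Notes.
pick_zcat() {
  case "$(uname -s)" in
    Darwin) echo "gzcat" ;;
    *)      echo "zcat" ;;
  esac
}

# Hypothetical helper: number of reads to keep for TOTAL reads at PCT percent,
# never fewer than one when the input is non-empty.
reads_to_keep() {
  local total="$1" pct="$2"
  local n=$(( total * pct / 100 ))   # integer division rounds down
  if [ "$n" -lt 1 ] && [ "$total" -gt 0 ]; then
    n=1
  fi
  echo "$n"
}

reads_to_keep 3 10     # 3 * 10 / 100 rounds down to 0, so this prints 1
reads_to_keep 200 25   # prints 50
```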