BaseBuddy Functionality and Capabilities - ChromatinCloud/SeqForge GitHub Wiki
BaseBuddy offers a suite of tools to simulate and manage NGS data. Here's a breakdown of its core functionalities:
Simulate realistic short reads from a reference FASTA, primarily using the ART toolkit.
- Command: `basebuddy short [options]`
- Key Parameters:
  - `--reference <path>`: Path to your reference FASTA file. It is auto-indexed if the `.fai` index is missing, unless `--no-auto-index-fasta` is given.
  - `--depth <int>`: Desired mean sequencing depth.
  - `--readlen <int>`: Length of simulated reads (e.g., 100, 150).
  - `--profile <name>`: ART sequencing system profile (e.g., `HS25` for Illumina HiSeq 2500, `MSv3` for MiSeq).
  - `--art-platform <name>`: ART simulation platform (default: `illumina`).
  - `--fragmean <int>`: Mean fragment length for paired-end reads (default: 200 bp).
  - `--fragstd <int>`: Standard deviation of fragment length (default: 10 bp).
  - `--single-end`: Generate single-end reads (default is paired-end).
  - `--output-root <dir>`: Global root directory for outputs.
  - `--run-name <name>`: Specific name for this simulation run's subdirectory.
  - `--timeout <seconds>`: Timeout for the ART simulation.
- Outputs (within `output_root/run_name/`):
  - Paired-end FASTQ files (e.g., `simulated_illumina_reads1.fq`, `simulated_illumina_reads2.fq`).
  - Single-end FASTQ file (e.g., `simulated_illumina_reads.fq`).
  - ART alignment file (`.aln`), if generated by ART.
  - `manifest.json`: Records parameters and output file paths.
  - This step produces no BAM or VCF directly, so `short` alone usually does not generate an IGV session.
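
To illustrate how the parameters above fit together, the following Python sketch assembles a `basebuddy short` invocation and estimates how many read pairs a requested depth implies. The helper names, file paths, and the 5 Mb genome length are hypothetical. The flags are the ones listed above, and the estimate uses the standard coverage relation (depth = total sequenced bases / genome length), which may not match how ART rounds internally.

```python
import math
import shlex


def build_short_command(reference, depth, readlen, profile,
                        output_root, run_name, single_end=False):
    """Assemble an argv list for `basebuddy short` using the documented flags."""
    argv = [
        "basebuddy", "short",
        "--reference", reference,
        "--depth", str(depth),
        "--readlen", str(readlen),
        "--profile", profile,
        "--output-root", output_root,
        "--run-name", run_name,
    ]
    if single_end:
        argv.append("--single-end")
    return argv


def estimate_read_pairs(genome_length, depth, readlen):
    """Approximate pairs needed: each pair contributes 2 * readlen bases."""
    return math.ceil(depth * genome_length / (2 * readlen))


# Hypothetical inputs: a 5 Mb reference simulated at 30x with 150 bp reads.
argv = build_short_command("ref.fa", 30, 150, "HS25",
                           "basebuddy_outputs", "demo_run")
print(shlex.join(argv))
print(estimate_read_pairs(5_000_000, 30, 150))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to `subprocess.run`.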
Introduce known variants from a VCF file into an existing BAM file. This command acts as a wrapper around a conceptual `addsnv.py` script or similar tool.
- Command: `basebuddy spike [options]`
- Key Parameters:
  - `--reference <path>`: Path to the reference FASTA (auto-indexed if needed).
  - `--in_bam <path>`: Path to the input BAM file. It is auto-indexed if the `.bai` index is missing, unless `--no-auto-index-input-bam` is given.
  - `--vcf <path>`: Path to the VCF file containing variants to be introduced.
  - `--out_prefix_name <prefix>`: Prefix for the output BAM file (e.g., `spiked_sample`); `.bam` will be appended.
  - `--output-root <dir>`, `--run-name <name>`: For output management.
  - `--timeout <seconds>`: Timeout for the spiking process.
- Outputs (within `output_root/run_name/`):
  - Modified BAM file (e.g., `spiked_sample.bam`) containing reads with spiked variants.
  - BAM index file (`.bai`).
  - `manifest.json`.
  - `igv_session_spike.xml`: An IGV session file to visualize the output BAM and input VCF against the reference.
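
As a small sketch of how the `spike` parameters map to the outputs listed above, the Python snippet below builds the command line and derives the expected output paths from `--out_prefix_name`, `--output-root`, and `--run-name`. The function name and input paths are hypothetical. The BAM index is left out because the text above gives only its extension, not its full file name.

```python
from pathlib import Path


def spike_plan(reference, in_bam, vcf, out_prefix_name, output_root, run_name):
    """Return (argv, expected output paths) for a `basebuddy spike` run."""
    argv = [
        "basebuddy", "spike",
        "--reference", reference,
        "--in_bam", in_bam,
        "--vcf", vcf,
        "--out_prefix_name", out_prefix_name,
        "--output-root", output_root,
        "--run-name", run_name,
    ]
    run_dir = Path(output_root) / run_name
    expected = [
        run_dir / f"{out_prefix_name}.bam",   # `.bam` is appended to the prefix
        run_dir / "manifest.json",
        run_dir / "igv_session_spike.xml",
    ]
    return argv, expected


# Hypothetical inputs for illustration only.
argv, expected = spike_plan("ref.fa", "sample.bam", "variants.vcf",
                            "spiked_sample", "basebuddy_outputs", "spike_demo")
for path in expected:
    print(path)
```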
Download reference genomes or other files from URLs, verify their integrity, and prepare them for use.
- Command: `basebuddy download-ref [options]`
- Key Parameters:
  - `--url <URL>`: URL of the file to download (HTTP/FTP).
  - `--filename <name>`: Desired filename for the saved file within the run's output directory.
  - `--checksum <hash>`: Expected checksum (e.g., SHA256) of the file for verification.
  - `--algo <sha256|md5|...>`: Checksum algorithm used (default: `sha256`).
  - `--output-root <dir>`, `--run-name <name>`: For output management.
  - `--timeout <seconds>`: Timeout for the download process.
- Functionality & Outputs (within `output_root/run_name/`):
  - Downloads the file (e.g., `reference.fasta.gz`).
  - Verifies the checksum.
  - If the downloaded file is a FASTA (`.fa`, `.fasta`, `.fna`), it will be automatically indexed using `samtools faidx`.
  - The downloaded file itself.
  - FASTA index (`.fai`) if applicable.
  - `manifest.json`.
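
The checksum verification step can be sketched in Python with the standard `hashlib` module. This is a minimal illustration of the technique, not BaseBuddy's actual implementation. The function name `verify_checksum` and the demo file are hypothetical, and the `algo` argument mirrors the `--algo` option with its `sha256` default.

```python
import hashlib
import tempfile
from pathlib import Path


def verify_checksum(path, expected, algo="sha256", chunk_size=1 << 20):
    """Hash the file in chunks and compare against the expected hex digest."""
    digest = hashlib.new(algo)
    with open(path, "rb") as handle:
        # Read in fixed-size chunks so large references do not fill memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected.strip().lower()


# Demo on a small temporary file standing in for a downloaded reference.
with tempfile.TemporaryDirectory() as tmp:
    fasta = Path(tmp) / "reference.fa"
    fasta.write_bytes(b">chr1\nACGTACGT\n")
    good = hashlib.sha256(fasta.read_bytes()).hexdigest()
    matches = verify_checksum(fasta, good)
    mismatch = verify_checksum(fasta, "0" * 64)
    md5_matches = verify_checksum(
        fasta, hashlib.md5(fasta.read_bytes()).hexdigest(), algo="md5")
    print(matches, mismatch, md5_matches)
```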
BaseBuddy helps keep your simulation outputs organized.
- Standardized Output Structure:
  - All outputs go into a main directory specified by `--output-root` (defaults to `./basebuddy_outputs`).
  - Each execution (run) creates a unique subdirectory, either named by `--run-name` or auto-generated (e.g., `short_YYYYMMDD_HHMMSS`).
- Run Manifests:
  - Every run generates a `manifest.json` file in its output subdirectory.
  - This JSON file records:
    - The command executed.
    - Timestamp and run status.
    - All parameters used for the run.
    - A list of key output files (with relative paths) and their types.
    - Path to the reference genome used.
- Listing Outputs:
  - Command: `basebuddy list-outputs [pattern]` (alias: `basebuddy ls [pattern]`)
  - Scans the `--output-root` for run directories (identified by `manifest.json`).
  - Lists summary information for each run.
  - Can filter runs by a name pattern (e.g., `basebuddy ls "short_*"`).
  - `--show-all-files` option lists all files in a run directory, even those not in the manifest.
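
The output conventions described above (timestamped run directories, a `manifest.json` per run, and pattern-based listing) can be sketched in Python as follows. All function names and the manifest key names (`command`, `status`, `parameters`, `outputs`) are hypothetical, since the exact manifest schema is not given here. The listing logic only assumes what the text states: a directory counts as a run if it contains `manifest.json`.

```python
import fnmatch
import json
import tempfile
from datetime import datetime
from pathlib import Path


def resolve_run_dir(output_root, command, run_name=None, now=None):
    """Use the given run name, else a <command>_YYYYMMDD_HHMMSS name."""
    if run_name is None:
        stamp = (now or datetime.now()).strftime("%Y%m%d_%H%M%S")
        run_name = f"{command}_{stamp}"
    return Path(output_root) / run_name


def write_manifest(run_dir, command, parameters, outputs):
    """Write a manifest.json; the key names are assumptions for illustration."""
    run_dir.mkdir(parents=True, exist_ok=True)
    manifest = {
        "command": command,
        "status": "completed",
        "parameters": parameters,
        "outputs": outputs,  # relative paths plus a type label
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))


def find_runs(output_root, pattern="*"):
    """Return run directories (those holding manifest.json) matching pattern."""
    manifests = sorted(Path(output_root).glob("*/manifest.json"))
    return [m.parent for m in manifests
            if fnmatch.fnmatch(m.parent.name, pattern)]


with tempfile.TemporaryDirectory() as root:
    auto_dir = resolve_run_dir(root, "short", now=datetime(2024, 1, 2, 3, 4, 5))
    named_dir = resolve_run_dir(root, "spike", run_name="spike_demo")
    write_manifest(auto_dir, "short", {"depth": 30},
                   [{"path": "simulated_illumina_reads1.fq", "type": "FASTQ"}])
    write_manifest(named_dir, "spike", {"vcf": "variants.vcf"},
                   [{"path": "spiked_sample.bam", "type": "BAM"}])
    (Path(root) / "scratch").mkdir()  # no manifest, so not treated as a run

    all_runs = [run.name for run in find_runs(root)]
    short_runs = [run.name for run in find_runs(root, "short_*")]
    print(all_runs)
    print(short_runs)
```

Storing output paths relative to the run directory, as the manifest description suggests, keeps a run self-describing even if the whole output root is moved.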
Facilitates quick visualization of genomic data (BAMs, VCFs) using the Integrative Genomics Viewer (IGV).
- Automatic Session File Generation:
  - For commands that produce BAM or VCF files (like `spike`), BaseBuddy automatically generates an IGV session file (e.g., `igv_session_spike.xml`).
  - This XML file is saved in the run's output directory.
- Content: The session file is pre-configured to load:
  - The reference genome FASTA used for the run.
  - The relevant output tracks (e.g., the generated BAM file, input VCF file).
- Usage:
  - Download and install IGV.
  - Open IGV and select "File" > "Open Session..." and choose the `.xml` file generated by BaseBuddy.
  - Alternatively, paths to the session file and key data files are printed to the console upon successful run completion and can be seen via `basebuddy list-outputs`.
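
For readers curious what such a session file might contain, here is a minimal Python sketch that writes an IGV-style session with a reference and two tracks. IGV session files are XML with a `Session` root and `Resource` entries. The exact attributes and layout BaseBuddy emits are not documented here, so the structure, the `version` value, and the file names below are assumptions.

```python
import xml.etree.ElementTree as ET


def build_igv_session(reference, tracks):
    """Return session XML naming a reference genome and a list of track paths."""
    session = ET.Element("Session", genome=reference, version="8")
    resources = ET.SubElement(session, "Resources")
    for track in tracks:
        ET.SubElement(resources, "Resource", path=track)
    return ET.tostring(session, encoding="unicode")


# Hypothetical file names mirroring the spike outputs described earlier.
xml_text = build_igv_session("ref.fa", ["spiked_sample.bam", "variants.vcf"])
print(xml_text)
```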
BaseBuddy is designed with reliability and ease of use in mind.
- Pre-flight Checks: Before running external tools, BaseBuddy checks for their existence in your PATH and verifies that input files exist and are readable.
- Clear Error Reporting: When errors occur (either within BaseBuddy or from an external tool), BaseBuddy provides informative messages rather than raw stack traces, often including the stderr from the failed tool.
- Logging: Comprehensive logging (configurable level and optional file output) helps in debugging and tracking operations.
- Automatic Indexing:
  - FASTA: If a reference FASTA provided to `short` or `spike` is missing its `.fai` index, BaseBuddy will attempt to create it using `samtools faidx` (can be disabled with `--no-auto-index-fasta`).
  - BAM: If an input BAM for `spike` is missing its `.bai` index, BaseBuddy will attempt to create it using `samtools index` (can be disabled with `--no-auto-index-input-bam`).
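
The pre-flight checks and automatic indexing described above can be approximated with a few standard-library calls. The Python sketch below is an illustration under stated assumptions, not BaseBuddy's code: the function names are hypothetical, and the index-file naming (`<file>.fai` for FASTA, `<file>.bam.bai` or `<file>.bai` for BAM) follows common samtools conventions. It only reports which `samtools` command would be needed and does not run it.

```python
import os
import shutil
import tempfile
from pathlib import Path


def preflight(tools, inputs):
    """Collect human-readable problems instead of failing on the first one."""
    problems = []
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"tool not found in PATH: {tool}")
    for path in inputs:
        if not (os.path.isfile(path) and os.access(path, os.R_OK)):
            problems.append(f"input missing or unreadable: {path}")
    return problems


def index_command(path):
    """Return the samtools command needed to index path, or None if indexed."""
    path = Path(path)
    if path.suffix == ".bam":
        if Path(f"{path}.bai").exists() or path.with_suffix(".bai").exists():
            return None
        return ["samtools", "index", str(path)]
    if Path(f"{path}.fai").exists():
        return None
    return ["samtools", "faidx", str(path)]


with tempfile.TemporaryDirectory() as tmp:
    ref = Path(tmp) / "ref.fa"
    ref.write_text(">chr1\nACGT\n")
    Path(f"{ref}.fai").write_text("chr1\t4\t6\t4\t5\n")  # index already present
    bam = Path(tmp) / "sample.bam"
    bam.write_bytes(b"")  # placeholder file, no index next to it

    problems = preflight(["no-such-tool-for-demo"],
                         [ref, Path(tmp) / "missing.vcf"])
    ref_cmd = index_command(ref)
    bam_cmd = index_command(bam)
    print(problems)
    print(ref_cmd, bam_cmd[:2])
```

Gathering every problem before reporting lets a user fix a missing tool and a bad path in one pass rather than rerunning after each failure.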