BaseBuddy Functionality and Capabilities - ChromatinCloud/SeqForge GitHub Wiki

BaseBuddy offers a suite of tools to simulate and manage NGS data. Here's a breakdown of its core functionalities:

1. Short Read Simulation (via ART)

Simulate realistic short reads from a reference FASTA, primarily using the ART toolkit.

Command: basebuddy short [options]
Key Parameters:
- --reference <path>: Path to your reference FASTA file. If the .fai index is missing, it is created automatically unless --no-auto-index-fasta is set.
- --depth <int>: Desired mean sequencing depth.
- --readlen <int>: Length of simulated reads (e.g., 100, 150).
- --profile <name>: ART sequencing system profile (e.g., HS25 for Illumina HiSeq 2500, MSv3 for MiSeq).
- --art-platform <name>: ART simulation platform (default: illumina).
- --fragmean <int>: Mean fragment length for paired-end reads (default: 200 bp).
- --fragstd <int>: Standard deviation of fragment length (default: 10 bp).
- --single-end: Generate single-end reads (default is paired-end).
- --output-root <dir>: Global root directory for outputs.
- --run-name <name>: Specific name for this simulation run's subdirectory.
- --timeout <seconds>: Timeout for the ART simulation.
Outputs (within output_root/run_name/):
- Paired-end FASTQ files by default (e.g., simulated_illumina_reads1.fq, simulated_illumina_reads2.fq).
- A single FASTQ file instead when --single-end is set (e.g., simulated_illumina_reads.fq).
- ART alignment file (.aln), if generated by ART.
- manifest.json: Records parameters and output file paths.
- No BAM or VCF is produced by this step, so short on its own does not normally generate an IGV session file.
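
To get a feel for how --depth and --readlen translate into output size, the usual coverage relation (depth = sequenced bases / reference length) can be inverted. The sketch below is only a back-of-the-envelope estimate; the function name fragments_for_depth is hypothetical and is not part of BaseBuddy or ART.

```python
import math


def fragments_for_depth(genome_length: int, depth: float, readlen: int,
                        paired: bool = True) -> int:
    """Estimate how many fragments give the requested mean depth.

    A fragment is one read pair in paired-end mode, or one read in
    single-end mode. Hypothetical helper for illustration only.
    """
    if genome_length <= 0 or depth <= 0 or readlen <= 0:
        raise ValueError("genome_length, depth and readlen must be positive")
    bases_needed = genome_length * depth
    # A paired-end fragment contributes two reads of length readlen.
    bases_per_fragment = readlen * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_fragment)


if __name__ == "__main__":
    # 1 Mb reference at 30x depth with 150 bp paired-end reads.
    print(fragments_for_depth(1_000_000, 30, 150))
```

With these numbers, 30,000,000 bases are needed and each pair contributes 300, so the estimate is 100,000 read pairs (or 200,000 reads with --single-end).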

2. Variant Spiking

Introduce known variants from a VCF file into an existing BAM file. The command is a wrapper around an external spiking tool, conceptually an addsnv.py-style script or something similar.

Command: basebuddy spike [options]
Key Parameters:
- --reference <path>: Path to the reference FASTA (auto-indexed if needed).
- --in_bam <path>: Path to the input BAM file. If the .bai index is missing, it is created automatically unless --no-auto-index-input-bam is set.
- --vcf <path>: Path to the VCF file containing variants to be introduced.
- --out_prefix_name <prefix>: Prefix for the output BAM file (e.g., spiked_sample). .bam will be appended.
- --output-root <dir>, --run-name <name>: For output management.
- --timeout <seconds>: Timeout for the spiking process.
Outputs (within output_root/run_name/):
- Modified BAM file (e.g., spiked_sample.bam) containing reads with spiked variants.
- BAM index file (.bai).
- manifest.json.
- igv_session_spike.xml: An IGV session file to visualize the output BAM and input VCF against the reference.
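
When spiking is driven from a script, it can help to assemble the command line in one place. The sketch below builds an argument list from the parameters listed above; the helper build_spike_command and the example file names are hypothetical, and only flags documented on this page are used.

```python
from typing import List, Optional


def build_spike_command(reference: str, in_bam: str, vcf: str,
                        out_prefix_name: str,
                        output_root: Optional[str] = None,
                        run_name: Optional[str] = None,
                        timeout: Optional[int] = None) -> List[str]:
    """Assemble an argv list for `basebuddy spike` (hypothetical helper)."""
    cmd = [
        "basebuddy", "spike",
        "--reference", reference,
        "--in_bam", in_bam,
        "--vcf", vcf,
        "--out_prefix_name", out_prefix_name,
    ]
    # Optional flags are appended only when a value was supplied.
    if output_root is not None:
        cmd += ["--output-root", output_root]
    if run_name is not None:
        cmd += ["--run-name", run_name]
    if timeout is not None:
        cmd += ["--timeout", str(timeout)]
    return cmd


if __name__ == "__main__":
    # Example file names are placeholders, not files shipped with BaseBuddy.
    print(" ".join(build_spike_command(
        "ref.fa", "sample.bam", "variants.vcf", "spiked_sample",
        run_name="spike_demo")))
```

The resulting list can be passed to subprocess.run(cmd, check=True); building it as a list rather than a single string avoids shell quoting problems with paths that contain spaces.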

3. Reference Genome Download & Verification

Download reference genomes or other files from URLs, verify their integrity, and prepare them for use.

Command: basebuddy download-ref [options]
Key Parameters:
- --url <URL>: URL of the file to download (HTTP/FTP).
- --filename <name>: Desired filename for the saved file within the run's output directory.
- --checksum <hash>: Expected checksum of the file, computed with the algorithm given by --algo, used for verification.
- --algo <sha256|md5|...>: Checksum algorithm used (default: sha256).
- --output-root <dir>, --run-name <name>: For output management.
- --timeout <seconds>: Timeout for the download process.
Functionality:
- Downloads the file (e.g., reference.fasta.gz).
- Verifies the checksum.
- If the downloaded file is a FASTA (.fa, .fasta, .fna), it will be automatically indexed using samtools faidx.
Outputs (within output_root/run_name/):
- The downloaded file itself.
- FASTA index (.fai) if applicable.
- manifest.json.
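
The verification step can be illustrated with Python's standard hashlib module. This is a minimal sketch of checksum verification in general, not BaseBuddy's actual code; the function names file_checksum and verify_checksum are hypothetical, and the algo argument mirrors the --algo option.

```python
import hashlib


def file_checksum(path: str, algo: str = "sha256",
                  chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks."""
    digest = hashlib.new(algo)  # raises ValueError for unknown algorithms
    with open(path, "rb") as handle:
        # Chunked reads keep memory use flat for multi-gigabyte references.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected: str, algo: str = "sha256") -> bool:
    """Compare the file's digest with the expected hex string."""
    return file_checksum(path, algo) == expected.strip().lower()
```

A download would be accepted only if verify_checksum(path, expected, algo) returns True; normalizing the expected string tolerates trailing whitespace and upper-case hex digits.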

4. Output Management & Listing

BaseBuddy helps keep your simulation outputs organized.

Standardized Output Structure:
- All outputs go into a main directory specified by --output-root (defaults to ./basebuddy_outputs).
- Each execution (run) creates a unique subdirectory, either named by --run-name or auto-generated (e.g., short_YYYYMMDD_HHMMSS).
Run Manifests:
- Every run generates a manifest.json file in its output subdirectory.
- This JSON file records:
  - The command executed.
  - Timestamp and run status.
  - All parameters used for the run.
  - A list of key output files (with relative paths) and their types.
  - Path to the reference genome used.
Listing Outputs:
- Command: basebuddy list-outputs [pattern] (alias: basebuddy ls [pattern])
- Scans the --output-root for run directories (identified by manifest.json).
- Lists summary information for each run.
- Can filter runs by a name pattern (e.g., basebuddy ls "short_*").
- --show-all-files option lists all files in a run directory, even those not in the manifest.
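
The listing behaviour described above (find run directories by their manifest.json, then filter by a name pattern) can be sketched as follows. This is an assumption-laden illustration rather than BaseBuddy's implementation: the function list_runs is hypothetical, and no particular manifest field names are assumed.

```python
import fnmatch
import json
from pathlib import Path
from typing import Dict, List, Tuple


def list_runs(output_root: str, pattern: str = "*") -> List[Tuple[str, Dict]]:
    """Return (run_name, manifest) pairs for matching run directories."""
    root = Path(output_root)
    runs: List[Tuple[str, Dict]] = []
    if not root.is_dir():
        return runs
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        manifest_path = run_dir / "manifest.json"
        # A directory counts as a run only if it contains a manifest.
        if not manifest_path.is_file():
            continue
        # Shell-style matching, as in the "short_*" example.
        if not fnmatch.fnmatch(run_dir.name, pattern):
            continue
        with open(manifest_path, encoding="utf-8") as handle:
            runs.append((run_dir.name, json.load(handle)))
    return runs
```

Calling list_runs("./basebuddy_outputs", "short_*") would return only the runs whose directory name starts with short_, sorted by name.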

5. IGV Integration

Facilitates quick visualization of genomic data (BAMs, VCFs) using the Integrative Genomics Viewer (IGV).

Automatic Session File Generation:
- For commands that produce BAM or VCF files (like spike), BaseBuddy automatically generates an IGV session file (e.g., igv_session_spike.xml).
- This XML file is saved in the run's output directory.
Content: The session file is pre-configured to load:
- The reference genome FASTA used for the run.
- The relevant output tracks (e.g., the generated BAM file, input VCF file).
Usage:
- Download and install IGV.
- Open IGV and select "File" > "Open Session..." and choose the .xml file generated by BaseBuddy.
- The paths to the session file and key data files are printed to the console when a run completes successfully, and they can also be seen via basebuddy list-outputs.
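
For readers curious what such a session file contains, the sketch below writes a minimal one with the standard library. The element and attribute names (Session, Resources, Resource, genome, path) follow the commonly seen IGV session layout; the exact file BaseBuddy writes may differ, and build_igv_session is a hypothetical helper.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_igv_session(reference_fasta: str, track_paths: Iterable[str]) -> str:
    """Return IGV session XML that loads a reference and a set of tracks."""
    # The genome attribute points at the reference FASTA used for the run.
    session = ET.Element("Session", {"genome": reference_fasta, "version": "8"})
    resources = ET.SubElement(session, "Resources")
    for path in track_paths:
        # One Resource element per track (e.g., output BAM, input VCF).
        ET.SubElement(resources, "Resource", {"path": path})
    return ET.tostring(session, encoding="unicode")


if __name__ == "__main__":
    # Placeholder file names for illustration.
    print(build_igv_session("ref.fa", ["spiked_sample.bam", "variants.vcf"]))
```

Using ElementTree instead of string formatting means that characters such as & in file paths are escaped correctly.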

6. Robustness & User Experience

BaseBuddy is designed with reliability and ease of use in mind.

Pre-flight Checks: Before running external tools, BaseBuddy checks for their existence in your PATH and verifies that input files exist and are readable.
Clear Error Reporting: When errors occur (either within BaseBuddy or from an external tool), BaseBuddy provides informative messages rather than raw stack traces, often including the stderr from the failed tool.
Logging: Comprehensive logging (configurable level and optional file output) helps in debugging and tracking operations.
Automatic Indexing:
- FASTA: If a reference FASTA provided to short or spike is missing its .fai index, BaseBuddy will attempt to create it using samtools faidx (can be disabled with --no-auto-index-fasta).
- BAM: If an input BAM for spike is missing its .bai index, BaseBuddy will attempt to create it using samtools index (can be disabled with --no-auto-index-input-bam).
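
The pre-flight checks and FASTA auto-indexing described in this section can be sketched in a few lines. This is an illustrative approximation, not BaseBuddy's code: check_tool, check_readable, and ensure_fasta_index are hypothetical names, and the auto_index argument stands in for the inverse of --no-auto-index-fasta.

```python
import os
import shutil
import subprocess
from pathlib import Path


def check_tool(name: str) -> str:
    """Return the full path of an external tool, or raise if not in PATH."""
    path = shutil.which(name)
    if path is None:
        raise FileNotFoundError(f"required tool not found in PATH: {name}")
    return path


def check_readable(path: str) -> None:
    """Raise if the input file is missing or cannot be read."""
    if not os.path.isfile(path) or not os.access(path, os.R_OK):
        raise FileNotFoundError(f"input file missing or unreadable: {path}")


def ensure_fasta_index(fasta: str, auto_index: bool = True) -> Path:
    """Return the .fai path, creating it with samtools faidx if allowed."""
    check_readable(fasta)
    fai = Path(str(fasta) + ".fai")  # samtools writes <fasta>.fai
    if fai.exists():
        return fai
    if not auto_index:
        raise FileNotFoundError(f"missing FASTA index: {fai}")
    check_tool("samtools")
    # check=True turns a non-zero exit status into CalledProcessError.
    subprocess.run(["samtools", "faidx", str(fasta)], check=True)
    return fai
```

The same pattern would apply to BAM inputs, with samtools index and a .bai suffix in place of samtools faidx and .fai.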