BaseBuddy Functionality and Capabilities - ChromatinCloud/SeqForge GitHub Wiki

BaseBuddy offers a suite of tools to simulate and manage NGS data. Here's a breakdown of its core functionalities:

1. Short Read Simulation (via ART)

Simulate realistic short reads from a reference FASTA, primarily using the ART toolkit.

  • Command: basebuddy short [options]
  • Key Parameters:
    • --reference <path>: Path to your reference FASTA file. It is auto-indexed if the .fai is missing, unless --no-auto-index-fasta is set.
    • --depth <int>: Desired mean sequencing depth.
    • --readlen <int>: Length of simulated reads (e.g., 100, 150).
    • --profile <name>: ART sequencing system profile (e.g., HS25 for Illumina HiSeq 2500, MSv3 for MiSeq).
    • --art-platform <name>: ART simulation platform (default: illumina).
    • --fragmean <int>: Mean fragment length for paired-end reads (default: 200 bp).
    • --fragstd <int>: Standard deviation of fragment length (default: 10 bp).
    • --single-end: Generate single-end reads (default is paired-end).
    • --output-root <dir>: Global root directory for outputs.
    • --run-name <name>: Specific name for this simulation run's subdirectory.
    • --timeout <seconds>: Timeout for the ART simulation.
  • Outputs (within output_root/run_name/):
    • Paired-end FASTQ files (e.g., simulated_illumina_reads1.fq, simulated_illumina_reads2.fq) in the default paired-end mode.
    • A single FASTQ file (e.g., simulated_illumina_reads.fq) when --single-end is used.
    • ART alignment file (.aln), if generated by ART.
    • manifest.json: Records parameters and output file paths.
    • Note: this step does not produce a BAM or VCF directly, so short on its own usually does not generate an IGV session file.
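
To make the relationship between --depth and --readlen concrete, the sketch below applies the standard coverage arithmetic (total bases = genome length × depth). The function name and the example genome size are illustrative assumptions; how BaseBuddy or ART compute read counts internally is not stated on this page.

```python
import math


def reads_for_depth(genome_length: int, depth: float, read_length: int,
                    paired: bool = True) -> int:
    """Approximate number of reads (or read pairs, if paired) for a mean depth.

    Hypothetical helper for illustration only; not part of BaseBuddy.
    """
    if genome_length <= 0 or read_length <= 0 or depth < 0:
        raise ValueError("genome_length and read_length must be positive, depth non-negative")
    bases_needed = genome_length * depth
    # A read pair contributes two reads' worth of bases.
    bases_per_unit = read_length * (2 if paired else 1)
    return math.ceil(bases_needed / bases_per_unit)


if __name__ == "__main__":
    # Example: a 1 Mb reference at 30x depth with 150 bp reads.
    print(reads_for_depth(1_000_000, 30, 150, paired=True))
    print(reads_for_depth(1_000_000, 30, 150, paired=False))
```

This is only a planning aid for estimating output size before a run; the actual number of reads is determined by ART.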

2. Variant Spiking

Introduce known variants from a VCF file into an existing BAM file. This command acts as a wrapper around a conceptual addsnv.py script or similar tool.

  • Command: basebuddy spike [options]
  • Key Parameters:
    • --reference <path>: Path to the reference FASTA (auto-indexed if needed).
    • --in_bam <path>: Path to the input BAM file. It is auto-indexed if the .bai is missing, unless --no-auto-index-input-bam is set.
    • --vcf <path>: Path to the VCF file containing variants to be introduced.
    • --out_prefix_name <prefix>: Prefix for the output BAM file (e.g., spiked_sample); the .bam extension is appended automatically.
    • --output-root <dir>, --run-name <name>: For output management.
    • --timeout <seconds>: Timeout for the spiking process.
  • Outputs (within output_root/run_name/):
    • Modified BAM file (e.g., spiked_sample.bam) containing reads with spiked variants.
    • BAM index file (.bai).
    • manifest.json.
    • igv_session_spike.xml: An IGV session file to visualize the output BAM and input VCF against the reference.
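
When spike is driven from a script, the command line can be assembled from the flags listed above. The following is a minimal sketch: the helper name build_spike_command and all file names are hypothetical, and only flags documented in this section are used.

```python
from typing import List, Optional


def build_spike_command(reference: str, in_bam: str, vcf: str,
                        out_prefix_name: str,
                        output_root: Optional[str] = None,
                        run_name: Optional[str] = None) -> List[str]:
    """Assemble an argument list for `basebuddy spike`.

    Hypothetical helper; it only arranges the flags documented above.
    """
    cmd = [
        "basebuddy", "spike",
        "--reference", str(reference),
        "--in_bam", str(in_bam),
        "--vcf", str(vcf),
        "--out_prefix_name", out_prefix_name,
    ]
    # Output-management flags are optional, so add them only when given.
    if output_root is not None:
        cmd += ["--output-root", str(output_root)]
    if run_name is not None:
        cmd += ["--run-name", run_name]
    return cmd


if __name__ == "__main__":
    # Placeholder file names; substitute real paths before running.
    print(" ".join(build_spike_command("ref.fa", "sample.bam", "variants.vcf",
                                       "spiked_sample", run_name="spike_demo")))
```

The resulting list can be passed to subprocess.run(cmd, check=True), which avoids shell quoting problems with paths that contain spaces.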

3. Reference Genome Download & Verification

Download reference genomes or other files from URLs, verify their integrity, and prepare them for use.

  • Command: basebuddy download-ref [options]
  • Key Parameters:
    • --url <URL>: URL of the file to download (HTTP/FTP).
    • --filename <name>: Desired filename for the saved file within the run's output directory.
    • --checksum <hash>: Expected checksum (e.g., SHA256) of the file for verification.
    • --algo <sha256|md5|...>: Checksum algorithm used (default: sha256).
    • --output-root <dir>, --run-name <name>: For output management.
    • --timeout <seconds>: Timeout for the download process.
  • Functionality:
    • Downloads the file (e.g., reference.fasta.gz).
    • Verifies the checksum.
    • If the downloaded file is a FASTA (.fa, .fasta, .fna), indexes it automatically with samtools faidx.
  • Outputs (within output_root/run_name/):
    • The downloaded file.
    • FASTA index (.fai), if applicable.
    • manifest.json.
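
The checksum verification step can be illustrated with Python's standard hashlib module. This is a sketch of the general technique, not BaseBuddy's own code; the function names are assumptions, and the algo argument mirrors the --algo option described above.

```python
import hashlib


def file_checksum(path: str, algo: str = "sha256", chunk_size: int = 1 << 20) -> str:
    """Return the hex digest of a file, reading it in chunks to bound memory use."""
    digest = hashlib.new(algo)  # raises ValueError for an unknown algorithm name
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksum(path: str, expected: str, algo: str = "sha256") -> bool:
    """Compare a file's digest with the expected value, ignoring case and whitespace."""
    return file_checksum(path, algo).lower() == expected.strip().lower()
```

Reading in chunks matters for reference genomes, which are often several gigabytes and should not be loaded into memory at once.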

4. Output Management & Listing

BaseBuddy helps keep your simulation outputs organized.

  • Standardized Output Structure:
    • All outputs go into a main directory specified by --output-root (defaults to ./basebuddy_outputs).
    • Each execution (run) creates a unique subdirectory, either named by --run-name or auto-generated (e.g., short_YYYYMMDD_HHMMSS).
  • Run Manifests:
    • Every run generates a manifest.json file in its output subdirectory.
    • This JSON file records:
      • The command executed.
      • Timestamp and run status.
      • All parameters used for the run.
      • A list of key output files (with relative paths) and their types.
      • Path to the reference genome used.
  • Listing Outputs:
    • Command: basebuddy list-outputs [pattern] (alias: basebuddy ls [pattern])
    • Scans the --output-root for run directories (identified by manifest.json).
    • Lists summary information for each run.
    • Can filter runs by a name pattern (e.g., basebuddy ls "short_*").
    • --show-all-files option lists all files in a run directory, even those not in the manifest.
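
The listing behavior described above (find run directories by their manifest.json, then filter by a name pattern) can be sketched as follows. The manifest keys "command" and "status" are assumptions for illustration; this page does not give the exact manifest schema.

```python
import fnmatch
import json
from pathlib import Path
from typing import Dict, List


def list_runs(output_root: str, pattern: str = "*") -> List[Dict[str, object]]:
    """Summarize run directories under output_root that contain a manifest.json.

    Hypothetical helper; the "command" and "status" keys are assumed names.
    """
    root = Path(output_root)
    runs: List[Dict[str, object]] = []
    if not root.is_dir():
        return runs
    for run_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        manifest_path = run_dir / "manifest.json"
        # A directory counts as a run only if it has a manifest.
        if not manifest_path.is_file():
            continue
        if not fnmatch.fnmatch(run_dir.name, pattern):
            continue
        try:
            manifest = json.loads(manifest_path.read_text())
        except json.JSONDecodeError:
            manifest = {}
        if not isinstance(manifest, dict):
            manifest = {}
        runs.append({
            "run_name": run_dir.name,
            "command": manifest.get("command"),
            "status": manifest.get("status"),
        })
    return runs


if __name__ == "__main__":
    for run in list_runs("./basebuddy_outputs", "short_*"):
        print(run)
```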

5. IGV Integration

Facilitates quick visualization of genomic data (BAMs, VCFs) using the Integrative Genomics Viewer (IGV).

  • Automatic Session File Generation:
    • For commands that produce BAM or VCF files (like spike), BaseBuddy automatically generates an IGV session file (e.g., igv_session_spike.xml).
    • This XML file is saved in the run's output directory.
  • Content: The session file is pre-configured to load:
    • The reference genome FASTA used for the run.
    • The relevant output tracks (e.g., the generated BAM file, input VCF file).
  • Usage:
    • Download and install IGV.
    • Open IGV and select "File" > "Open Session..." and choose the .xml file generated by BaseBuddy.
    • To locate the file, note that the paths to the session file and key data files are printed to the console when a run completes successfully, and are also shown by basebuddy list-outputs.
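
For orientation, a session file of this kind can be sketched with Python's xml.etree.ElementTree. The layout below (a Session element with a genome attribute and a Resources list) follows the commonly seen IGV session structure, but it is an assumption that BaseBuddy's generated file looks the same; the function name and file names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Iterable


def build_igv_session(reference_fasta: str, track_paths: Iterable[str]) -> str:
    """Return a minimal IGV-style session XML string.

    Illustrative only: the real igv_session_*.xml written by BaseBuddy
    may contain additional elements and attributes.
    """
    session = ET.Element("Session", genome=str(reference_fasta), version="8")
    resources = ET.SubElement(session, "Resources")
    for path in track_paths:
        # One Resource entry per track (e.g., a BAM or VCF file).
        ET.SubElement(resources, "Resource", path=str(path))
    return ET.tostring(session, encoding="unicode")


if __name__ == "__main__":
    # Placeholder paths for illustration.
    print(build_igv_session("ref.fa", ["spiked_sample.bam", "variants.vcf"]))
```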

6. Robustness & User Experience

BaseBuddy is designed with reliability and ease of use in mind.

  • Pre-flight Checks: Before running external tools, BaseBuddy checks for their existence in your PATH and verifies that input files exist and are readable.
  • Clear Error Reporting: When errors occur (either within BaseBuddy or from an external tool), BaseBuddy provides informative messages rather than raw stack traces, often including the stderr from the failed tool.
  • Logging: Comprehensive logging (configurable level and optional file output) helps in debugging and tracking operations.
  • Automatic Indexing:
    • FASTA: If a reference FASTA provided to short or spike is missing its .fai index, BaseBuddy will attempt to create it using samtools faidx (can be disabled with --no-auto-index-fasta).
    • BAM: If an input BAM for spike is missing its .bai index, BaseBuddy will attempt to create it using samtools index (can be disabled with --no-auto-index-input-bam).
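
The pre-flight checks and index detection described above can be approximated with the standard library. This is a sketch under stated assumptions: the function names are hypothetical, and the index naming conventions checked (reference.fa.fai, sample.bam.bai or sample.bai) are the usual samtools ones rather than anything confirmed by this page.

```python
import os
import shutil
from pathlib import Path
from typing import Iterable, List


def preflight(tools: Iterable[str], input_files: Iterable[str]) -> List[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems: List[str] = []
    for tool in tools:
        if shutil.which(tool) is None:
            problems.append(f"required tool not found on PATH: {tool}")
    for name in input_files:
        path = Path(name)
        if not path.is_file():
            problems.append(f"input file does not exist: {path}")
        elif not os.access(path, os.R_OK):
            problems.append(f"input file is not readable: {path}")
    return problems


def needs_fasta_index(fasta: str) -> bool:
    """True if no .fai file sits next to the FASTA."""
    return not Path(str(fasta) + ".fai").is_file()


def needs_bam_index(bam: str) -> bool:
    """True if neither sample.bam.bai nor sample.bai exists."""
    bam_path = Path(bam)
    candidates = [Path(str(bam_path) + ".bai"), bam_path.with_suffix(".bai")]
    return not any(c.is_file() for c in candidates)
```

Collecting all problems into one list, rather than stopping at the first, lets a tool report every missing dependency and input in a single message.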