Installation and Setup Guide - ChromatinCloud/SeqForge GitHub Wiki

This guide provides step-by-step instructions for installing BaseBuddy and preparing the necessary reference genome data.

  1. Overview

A correct installation is the foundation for using BaseBuddy. We offer two primary installation methods to suit different user needs:

Python with Conda (Recommended): This is the ideal method for most users on Linux or macOS. The Conda package manager handles the installation of Python, all required Python libraries, and all external command-line tools (e.g., SAMtools, ART), preventing conflicts with other software on your system.
Docker: This method provides a completely self-contained, pre-configured environment. It is the best option for ensuring perfect reproducibility and for avoiding installation issues on complex systems or Windows (via WSL2).
  2. Installation via Python (Conda)

Prerequisites:

A working Conda installation (Miniconda or Anaconda). We recommend using Mamba, a much faster drop-in replacement for Conda, if possible.
Git, for cloning the software repository.

Steps:

Clone the Repository:
Open a terminal and clone the BaseBuddy source code from GitHub.

git clone https://github.com/yourusername/BaseBuddy.git
cd BaseBuddy

Create Conda Environment: The environment.yml file in the repository lists all dependencies. Create the environment from this file.

# Using Mamba (much faster)
mamba env create -f environment.yml

# If you don't have Mamba, use Conda
conda env create -f environment.yml

This single command installs everything: samtools, art, bamsurgeon, fastqc, and all Python packages.
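The choice between Mamba and Conda can also be made automatically. The following is a minimal sketch, not part of BaseBuddy: the helper name pick_conda_frontend is hypothetical, and it only checks which of the two commands is on your PATH.

```shell
# pick_conda_frontend: print "mamba" if it is on PATH, otherwise "conda".
# Returns 1 (and prints a message to stderr) if neither command is available.
# The function name is hypothetical; it is not shipped with BaseBuddy.
pick_conda_frontend() {
  if command -v mamba >/dev/null 2>&1; then
    echo mamba
  elif command -v conda >/dev/null 2>&1; then
    echo conda
  else
    echo "neither mamba nor conda found on PATH" >&2
    return 1
  fi
}

# Usage, from the repository root:
#   frontend=$(pick_conda_frontend) && "$frontend" env create -f environment.yml
```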

Activate the Environment: You must activate the environment each time you want to use BaseBuddy.

mamba activate basebuddy
# or
conda activate basebuddy

Install BaseBuddy: This command installs BaseBuddy in editable mode, so the installed command runs directly from the cloned source code.

pip install -e .

Verify Installation: Check that the command-line tool is working.

basebuddy version
# Expected output: BaseBuddy 0.1.0
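Beyond the version check, it can help to confirm that the external tools are reachable from the activated environment. The sketch below is an illustration rather than a BaseBuddy feature: check_tools is a hypothetical helper, and the executable names in the usage comment (for example art_illumina for ART) are assumptions about what the Conda packages install.

```shell
# check_tools NAME...: report every named command that is not on PATH.
# Returns 0 if all commands were found, 1 if at least one is missing.
# Hypothetical helper; not part of BaseBuddy.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found:   $tool"
    else
      echo "MISSING: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# Usage, after activating the environment (executable names are assumptions):
#   check_tools basebuddy samtools art_illumina fastqc
```

If anything is reported as missing, make sure the basebuddy environment is active in the current terminal before investigating further.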
  3. Installation via Docker

Prerequisites:

A working Docker installation.
Git.

Steps:

Clone the Repository:

git clone https://github.com/yourusername/BaseBuddy.git
cd BaseBuddy

Build the Docker Image: From the repository root (where the Dockerfile is), run the build command.

DOCKER_BUILDKIT=1 docker build -t basebuddy:latest .

Verify Installation: Run the version command inside a temporary container.

docker run --rm basebuddy:latest version
# Expected output: BaseBuddy 0.1.0

Running Commands with Docker: To use the Dockerized BaseBuddy, you must mount your local data directory into the container using the -v flag. Paths passed to BaseBuddy must then refer to the mount point inside the container (/data in this example), not to the host path.

# Example: Run short-read simulation
# This mounts the current directory into the /data directory inside the container
docker run --rm -v "$(pwd):/data" basebuddy:latest \
  short /data/my_ref.fa --outdir /data/sim_output
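Typing the full docker run invocation for every command gets repetitive. One option is a small shell wrapper, sketched here under stated assumptions: bb_docker and the BB_DRY_RUN variable are hypothetical names, and the image tag and /data mount point are taken from the example above.

```shell
# bb_docker ARGS...: run BaseBuddy in a temporary container with the current
# directory mounted at /data. With BB_DRY_RUN=1 the command is only printed.
# Hypothetical wrapper; not shipped with BaseBuddy.
bb_docker() {
  if [ "${BB_DRY_RUN:-0}" = "1" ]; then
    echo "docker run --rm -v $(pwd):/data basebuddy:latest $*"
  else
    docker run --rm -v "$(pwd):/data" basebuddy:latest "$@"
  fi
}

# Usage:
#   bb_docker version
#   bb_docker short /data/my_ref.fa --outdir /data/sim_output
```

The dry-run mode is there so you can inspect the exact command before anything is executed.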

Note for macOS/Windows Users: Ensure your project directory is included in Docker Desktop's list of approved directories for file sharing (in Preferences/Settings).

  4. Preparing a Reference FASTA

Nearly every function in BaseBuddy requires a reference genome in FASTA format, which must be indexed.

Steps:

Download a Reference:
Obtain a standard reference genome from a public repository like NCBI, Ensembl, or UCSC.

# Example for GRCh38 human genome
wget ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/000/001/405/GCA_000001405.15_GRCh38/GRCh38_latest_genomic.fna.gz
gunzip GRCh38_latest_genomic.fna.gz

Index the FASTA: This step creates a .fai index file next to the FASTA, allowing tools to access specific genomic locations quickly. This step is required.

samtools faidx GRCh38_latest_genomic.fna
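Because the index is required, a quick check before running anything else can save a failed run. The following is a minimal sketch: needs_index is a hypothetical helper that only looks at whether a .fai file exists next to the FASTA and whether the FASTA was modified after it.

```shell
# needs_index FASTA: succeed (exit 0) when FASTA has no .fai file next to it,
# or when the FASTA was modified more recently than its .fai.
# Hypothetical helper; not part of BaseBuddy or samtools.
needs_index() {
  fasta=$1
  if [ ! -f "$fasta.fai" ]; then
    return 0
  fi
  if [ "$fasta" -nt "$fasta.fai" ]; then
    return 0
  fi
  return 1
}

# Usage:
#   if needs_index GRCh38_latest_genomic.fna; then
#     samtools faidx GRCh38_latest_genomic.fna
#   fi
```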

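(Optional) Create a Locus-Specific FASTA: For testing, it's much faster to work with a small genomic region. Extract a locus using samtools faidx. The sequence name in the region (chr10 here) must exactly match a name in the first column of the .fai file; NCBI downloads often use accession-style names instead, so check the index before running the command.

# Extract a 300 kb region on chromosome 10 for the FGFR2 locus
# (verify that these coordinates cover the gene in your assembly)
samtools faidx GRCh38_latest_genomic.fna chr10:122950000-123250000 > fgfr2_locus.fa

# IMPORTANT: You must index the new, smaller FASTA file too!
samtools faidx fgfr2_locus.fa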

Using fgfr2_locus.fa instead of the full genome makes test runs much faster, because the tools process a 300 kb sequence instead of roughly 3 Gb.
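Before extracting a locus, you may want to confirm that the region exists in the indexed reference. The sketch below reads the .fai file directly, relying on its tab-separated layout in which column 1 is the sequence name and column 2 is the sequence length. The helper name check_region is hypothetical.

```shell
# check_region FAI NAME START END: verify that NAME is listed in the .fai index
# and that the 1-based, inclusive range START-END lies within its length.
# Prints "ok" on success; prints a reason and returns 1 otherwise.
# Hypothetical helper; not part of BaseBuddy or samtools.
check_region() {
  fai=$1; contig=$2; start=$3; end=$4
  awk -F '\t' -v c="$contig" -v s="$start" -v e="$end" '
    $1 == c { seen = 1; if (s >= 1 && s <= e && e <= $2) ok = 1 }
    END {
      if (!seen) { print "sequence not found in index: " c; exit 1 }
      if (!ok)   { print "region out of bounds for " c; exit 1 }
      print "ok"
    }' "$fai"
}

# Usage:
#   check_region GRCh38_latest_genomic.fna.fai chr10 122950000 123250000
```

If the sequence is reported as not found, list the available names with cut -f1 on the .fai file and adjust the region accordingly.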