Developer Overview - Lobster AI Architecture

🏗️ Overview

This guide introduces development in the Lobster AI codebase: its architecture patterns, design principles, and development workflows. Lobster AI is a multi-agent bioinformatics analysis platform that combines specialized AI agents with proven scientific tools.

🎯 Core Design Principles

1. Agent-Based Architecture

Specialized Agents: Each agent handles specific bioinformatics domains (transcriptomics, proteomics)
Centralized Registry: Single source of truth for agent configuration via AGENT_REGISTRY
Natural Language Interface: Users describe analyses in plain English

2. Modular Service Design

Stateless Services: All analysis services are stateless and return (processed_adata, statistics_dict)
Separation of Concerns: Agents coordinate workflows, services handle computation
Reusable Components: Services can be used independently or composed in workflows

3. Multi-Modal Data Management

DataManagerV2: Centralized orchestrator for multi-omics data with modality management
Professional Naming: Consistent naming conventions for dataset versions and analysis stages
Provenance Tracking: W3C-PROV compliant analysis history for reproducibility

4. Cloud/Local Hybrid Architecture

BaseClient Interface: Consistent API for local and cloud execution
Seamless Switching: Automatic detection and fallback between cloud and local modes
Unified CLI: Single interface supporting both execution environments

🏛️ Architecture Components

Core Directories

lobster/
├── agents/          # Specialized AI agents for bioinformatics domains
├── core/            # Data management, client infrastructure, interfaces
├── tools/           # Stateless analysis services
├── config/          # Configuration management and agent registry
├── cli.py           # Modern terminal interface with autocomplete
└── utils/           # Shared utilities and logging

Key Architectural Patterns

1. Agent Registry Pattern

# lobster/config/agent_registry.py
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentRegistryConfig:
    name: str                          # Unique identifier
    display_name: str                  # Human-readable name
    description: str                   # Agent capabilities
    factory_function: str              # Module path to factory
    handoff_tool_name: Optional[str]   # Auto-generated tool name

AGENT_REGISTRY = {
    'data_expert_agent': AgentRegistryConfig(...),
    'transcriptomics_expert': AgentRegistryConfig(...),
    'proteomics_expert': AgentRegistryConfig(...),
    # ... more agents
}
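Because factory_function is stored as a string, something has to turn that string into a callable before an agent can be built. The sketch below shows one common way to do that with importlib. It assumes the path has the form "package.module.function", which the registry snippet above does not confirm, and resolve_factory is a hypothetical helper name rather than part of the Lobster API. The demo resolves a standard-library function so that it runs without Lobster installed.

```python
import importlib
from typing import Any, Callable


def resolve_factory(dotted_path: str) -> Callable[..., Any]:
    """Import 'pkg.module.func' and return the callable it names.

    Hypothetical helper: the dotted-path format is an assumption,
    not something the registry documentation confirms.
    """
    module_path, _, attr_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"Expected 'module.function', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    factory = getattr(module, attr_name)
    if not callable(factory):
        raise TypeError(f"{dotted_path!r} does not name a callable")
    return factory


# Demonstrated with a standard-library path so the sketch runs anywhere.
dumps = resolve_factory("json.dumps")
print(dumps({"agent": "data_expert_agent"}))
```

Resolving lazily like this keeps the registry importable even when an individual agent's dependencies are missing; a wrong path then fails with an ImportError or AttributeError at resolution time.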

2. Service Pattern

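from typing import Any, Dict, Tuple

import anndata


class QualityService:
    """Stateless service for data quality assessment."""

    def assess_quality(
        self, adata: anndata.AnnData, **params: Any
    ) -> Tuple[anndata.AnnData, Dict[str, Any]]:
        """
        Returns:
            Tuple of (processed_adata, statistics_dict)
        """
        # Stateless processing logic: work on a copy, never mutate the input
        processed_adata = adata.copy()
        # Placeholder statistics; real QC metrics would be computed here
        statistics = {"n_obs": processed_adata.n_obs, "n_vars": processed_adata.n_vars}
        return processed_adata, statistics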
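The (data, statistics) return convention is what makes services composable: the first element of one result feeds the next service, and each statistics dict can be reported separately. The sketch below illustrates this with two hypothetical services, FilterService and NormalizeService. It uses a plain list of dicts as a stand-in for AnnData so that it runs with the standard library only, and the thresholds are illustrative values, not Lobster defaults.

```python
from typing import Any, Dict, List, Tuple

# Stand-in for AnnData: each "cell" is a dict with a total count and a
# mitochondrial percentage. This is an assumption made for the sketch.
Cells = List[Dict[str, float]]


class FilterService:
    """Hypothetical stateless service: drops low-quality cells."""

    def filter_cells(
        self, cells: Cells, min_counts: float = 500.0, max_pct_mt: float = 20.0
    ) -> Tuple[Cells, Dict[str, Any]]:
        # Build a new list of copies instead of mutating the input.
        kept = [
            dict(c)
            for c in cells
            if c["total_counts"] >= min_counts and c["pct_mt"] <= max_pct_mt
        ]
        return kept, {"n_input": len(cells), "n_kept": len(kept)}


class NormalizeService:
    """Hypothetical stateless service: records a per-cell scale factor."""

    def normalize(
        self, cells: Cells, target_sum: float = 10_000.0
    ) -> Tuple[Cells, Dict[str, Any]]:
        # Assumes total_counts > 0, which the filter step guarantees here.
        scaled = [
            {**c, "scale_factor": target_sum / c["total_counts"]} for c in cells
        ]
        return scaled, {"n_cells": len(scaled), "target_sum": target_sum}


raw = [
    {"total_counts": 1200.0, "pct_mt": 4.0},
    {"total_counts": 300.0, "pct_mt": 3.0},    # too few counts
    {"total_counts": 2500.0, "pct_mt": 35.0},  # too much mitochondrial signal
]

# Compose the two services: the output data of one is the input of the next.
filtered, filter_stats = FilterService().filter_cells(raw)
normalized, norm_stats = NormalizeService().normalize(filtered)
print(filter_stats, norm_stats)
```

Neither service keeps instance state or modifies raw, so either one can be called on its own or reordered in a different workflow.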

3. Agent Tool Pattern

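# Assumes `tool`, `data_manager`, `service`, and `formatted_response`
# are already defined in the enclosing scope.
@tool
def assess_data_quality(modality_name: str, **params) -> str:
    """Standard pattern for all agent tools."""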
    # 1. Validate modality exists
    if modality_name not in data_manager.list_modalities():
        raise ModalityNotFoundError(f"Modality '{modality_name}' not found")

    # 2. Get data and call stateless service
    adata = data_manager.get_modality(modality_name)
    result_adata, stats = service.assess_quality(adata, **params)

    # 3. Store results with descriptive naming
    new_modality = f"{modality_name}_quality_assessed"
    data_manager.modalities[new_modality] = result_adata

    # 4. Log operation for provenance
    data_manager.log_tool_usage("assess_data_quality", params, stats)

    return formatted_response(stats, new_modality)
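The four steps above (validate, call the service, store under a descriptive name, log for provenance) can be exercised end to end without the real infrastructure. The following sketch does that under stated assumptions: InMemoryDataManager is a hypothetical stand-in that mimics only the DataManagerV2 members used in the snippet above, threshold_service is a toy service, and the data manager is passed in explicitly instead of being captured from an enclosing scope.

```python
from typing import Any, Dict, List, Tuple


class ModalityNotFoundError(Exception):
    """Raised when a requested modality is not loaded."""


class InMemoryDataManager:
    """Hypothetical stand-in for DataManagerV2, reduced to what the tool uses."""

    def __init__(self) -> None:
        self.modalities: Dict[str, Any] = {}
        self.tool_log: List[Dict[str, Any]] = []

    def list_modalities(self) -> List[str]:
        return list(self.modalities)

    def get_modality(self, name: str) -> Any:
        return self.modalities[name]

    def log_tool_usage(
        self, tool: str, params: Dict[str, Any], stats: Dict[str, Any]
    ) -> None:
        self.tool_log.append({"tool": tool, "params": params, "stats": stats})


def threshold_service(
    values: List[float], threshold: float
) -> Tuple[List[float], Dict[str, Any]]:
    """Toy stateless service: keep values at or above a threshold."""
    kept = [v for v in values if v >= threshold]
    return kept, {"n_input": len(values), "n_kept": len(kept)}


def assess_data_quality(
    dm: InMemoryDataManager, modality_name: str, threshold: float = 1.0
) -> str:
    # 1. Validate that the modality exists
    if modality_name not in dm.list_modalities():
        raise ModalityNotFoundError(f"Modality '{modality_name}' not found")
    # 2. Get data and call the stateless service
    result, stats = threshold_service(dm.get_modality(modality_name), threshold)
    # 3. Store the result under a descriptive name
    new_modality = f"{modality_name}_quality_assessed"
    dm.modalities[new_modality] = result
    # 4. Log the operation for provenance
    dm.log_tool_usage("assess_data_quality", {"threshold": threshold}, stats)
    return f"Stored '{new_modality}': kept {stats['n_kept']} of {stats['n_input']}"


dm = InMemoryDataManager()
dm.modalities["geo_gse12345"] = [0.2, 1.5, 3.0]
print(assess_data_quality(dm, "geo_gse12345"))
```

The original modality stays untouched, so each analysis stage remains available for inspection or for re-running with different parameters.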

4. Client Adapter Pattern

# lobster/core/interfaces/base_client.py
from abc import ABC, abstractmethod
from typing import Any, Dict

class BaseClient(ABC):
    @abstractmethod
    def query(self, user_input: str, stream: bool = False) -> Dict[str, Any]:
        pass

    @abstractmethod
    def get_status(self) -> Dict[str, Any]:
        pass

# Implementations: AgentClient (local), CloudLobsterClient (cloud)
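The principles section describes automatic detection and fallback between cloud and local modes, but the selection logic itself is not shown here. The sketch below is one plausible shape for it, assuming that a non-empty LOBSTER_CLOUD_KEY requests cloud mode and that a failed cloud setup falls back to local. LocalClient and CloudClient are simplified stand-ins for AgentClient and CloudLobsterClient, and select_client is a hypothetical function name.

```python
import os
from abc import ABC, abstractmethod
from typing import Any, Dict, Mapping, Optional


class BaseClient(ABC):
    """Reduced version of the interface above, for illustration only."""

    @abstractmethod
    def get_status(self) -> Dict[str, Any]:
        ...


class LocalClient(BaseClient):
    """Stand-in for AgentClient."""

    def get_status(self) -> Dict[str, Any]:
        return {"mode": "local"}


class CloudClient(BaseClient):
    """Stand-in for CloudLobsterClient; rejects blank keys to simulate failure."""

    def __init__(self, api_key: str) -> None:
        if not api_key.strip():
            raise ValueError("Cloud API key is blank")
        self.api_key = api_key

    def get_status(self) -> Dict[str, Any]:
        return {"mode": "cloud"}


def select_client(env: Optional[Mapping[str, str]] = None) -> BaseClient:
    """Pick cloud when a key is set, otherwise (or on failure) use local."""
    env = os.environ if env is None else env
    key = env.get("LOBSTER_CLOUD_KEY", "")
    if key:
        try:
            return CloudClient(key)
        except (ValueError, ConnectionError):
            pass  # Fall through to local mode
    return LocalClient()


print(select_client({}).get_status())
print(select_client({"LOBSTER_CLOUD_KEY": "key"}).get_status())
```

Since callers only see BaseClient, the CLI code that consumes the returned client does not need to know which mode was selected.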

🔧 Development Setup

1. Environment Setup

# Clone repository
git clone <repository-url>
cd lobster

# Install development dependencies
make dev-install

# Activate environment
source .venv/bin/activate

# Verify installation
python -m lobster --help

2. Required Environment Variables

# Required API Keys
export AWS_BEDROCK_ACCESS_KEY="your-aws-access-key"
export AWS_BEDROCK_SECRET_ACCESS_KEY="your-aws-secret-key"

# Optional
export NCBI_API_KEY="your-ncbi-api-key"
export LOBSTER_CLOUD_KEY="your-cloud-api-key"  # Enables cloud mode

3. Development Commands

# Run all tests
make test

# Fast parallel testing
make test-fast

# Code formatting
make format

# Linting
make lint

# Type checking
make type-check

# Start CLI
lobster chat

🧪 Scientific Workflows

Professional Naming Convention

geo_gse12345                          # Raw downloaded data
├── geo_gse12345_quality_assessed     # QC metrics added
├── geo_gse12345_filtered_normalized  # Preprocessed data
├── geo_gse12345_doublets_detected    # Doublet annotations
├── geo_gse12345_clustered            # Leiden clustering + UMAP
├── geo_gse12345_markers              # Differential expression
├── geo_gse12345_annotated            # Cell type annotations
└── geo_gse12345_pseudobulk           # Aggregated for DE analysis
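A naming convention is easier to keep consistent when names are generated rather than typed by hand. The sketch below derives the names in the tree above from a GEO accession and a stage suffix. The helper names raw_modality_name and stage_modality_name are hypothetical, and the list of allowed stages is taken only from the example tree, so the real set may be larger.

```python
import re

# Stage suffixes copied from the example tree; the real list may differ.
STAGE_SUFFIXES = (
    "quality_assessed",
    "filtered_normalized",
    "doublets_detected",
    "clustered",
    "markers",
    "annotated",
    "pseudobulk",
)


def raw_modality_name(geo_accession: str) -> str:
    """Build the raw dataset name, e.g. 'GSE12345' -> 'geo_gse12345'."""
    accession = geo_accession.strip().lower()
    if not re.fullmatch(r"gse\d+", accession):
        raise ValueError(f"Not a GEO series accession: {geo_accession!r}")
    return f"geo_{accession}"


def stage_modality_name(base_name: str, stage: str) -> str:
    """Append a known stage suffix to a base dataset name."""
    if stage not in STAGE_SUFFIXES:
        raise ValueError(f"Unknown analysis stage: {stage!r}")
    return f"{base_name}_{stage}"


base = raw_modality_name("GSE12345")
print(base)
print(stage_modality_name(base, "clustered"))
```

Rejecting unknown stages early catches typos such as "clusterd" before they create a modality that later steps cannot find.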

Data Flow Architecture

User Input (CLI)
    ↓
LobsterClientAdapter → BaseClient (AgentClient | CloudLobsterClient)
    ↓
Agent Registry → Specialized Agent (data_expert, transcriptomics_expert, etc.)
    ↓
Agent Tools → Stateless Services (QualityService, ClusteringService, etc.)
    ↓
DataManagerV2 → Modality Management → Storage Backends (H5AD, MuData)
    ↓
Results → CLI Response with Visualizations

🎨 Code Style Guidelines

1. Python Standards

Follow PEP 8 style guidelines
Use type hints for all functions and methods
Line length: 88 characters (Black formatting)
Comprehensive docstrings for all public functions

2. Scientific Accuracy

Prioritize scientific accuracy over performance optimizations
Include comprehensive QC metrics at each analysis step
Support batch effect detection and correction
Implement proper missing value handling strategies

3. Error Handling

# Use specific exceptions
class ModalityNotFoundError(Exception):
    pass

class ServiceError(Exception):
    pass

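# Proper error handling in tools
# (run_analysis is an illustrative name; service and logger come from scope)
def run_analysis(data) -> str:
    try:
        result = service.process(data)
    except ServiceError as e:
        logger.error(f"Service error: {e}")
        return f"Analysis failed: {e}"
    return f"Analysis complete: {result}"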

🚀 Development Workflow

1. Adding New Features

Design First: Consider how the feature fits into existing patterns
Use Registry: For agents, add to AGENT_REGISTRY instead of manual graph edits
Follow Patterns: Use established service, tool, and adapter patterns
Test Thoroughly: Include unit, integration, and scientific validation tests
Document: Update relevant documentation files

2. Code Quality Checklist

Type hints on all functions
Comprehensive docstrings
Error handling with specific exceptions
Unit tests with 80%+ coverage
Integration tests with real data
Scientific validation where applicable
CLI compatibility (local and cloud)

3. Pre-commit Hooks

# Install pre-commit hooks
pre-commit install

# Run manually
pre-commit run --all-files

📊 Performance Considerations

1. Memory Management

Use memory-efficient data loading for large datasets
Implement lazy loading where possible
Monitor memory usage in long-running analyses

2. Computation Optimization

Leverage GPU acceleration when available (scVI, RAPIDS)
Use efficient algorithms for large-scale data
Implement progress tracking for long operations

3. Caching Strategy

File operations: results are cached for 60 s in cloud mode and 10 s in local mode
Intelligent caching for expensive computations
Clear cache invalidation strategies
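The time-based caching of file operations can be pictured as a small time-to-live (TTL) cache with explicit invalidation. The sketch below is a minimal illustration, not Lobster's implementation: TTLCache and its methods are hypothetical names, the two constants simply restate the 60 s and 10 s figures above, and the clock is injectable so that expiry can be tested without sleeping.

```python
import time
from typing import Any, Callable, Dict, Hashable, Optional, Tuple

CLOUD_TTL_SECONDS = 60.0  # Figure from the guide, cloud mode
LOCAL_TTL_SECONDS = 10.0  # Figure from the guide, local mode


class TTLCache:
    """Hypothetical cache whose entries expire after a fixed number of seconds."""

    def __init__(
        self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic
    ) -> None:
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: Dict[Hashable, Tuple[float, Any]] = {}

    def get_or_compute(self, key: Hashable, compute: Callable[[], Any]) -> Any:
        """Return a fresh cached value, or compute and store a new one."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]
        value = compute()
        self._entries[key] = (now, value)
        return value

    def invalidate(self, key: Optional[Hashable] = None) -> None:
        """Drop one entry, or every entry when no key is given."""
        if key is None:
            self._entries.clear()
        else:
            self._entries.pop(key, None)


cache = TTLCache(LOCAL_TTL_SECONDS)
listing = cache.get_or_compute("workspace", lambda: ["a.h5ad", "b.h5ad"])
print(listing)
```

A tool that writes a new file would call invalidate("workspace") afterwards, so the next listing reflects the change immediately instead of waiting for the entry to expire.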

🔍 Debugging and Troubleshooting

1. Common Issues

Import Errors: Check environment activation and dependencies
Agent Registry: Verify factory function paths are correct
Data Loading: Check file permissions and formats
Cloud Integration: Verify API keys and network connectivity

2. Debugging Tools

# Use structured logging
import logging

from lobster.utils.logger import get_logger
logger = get_logger(__name__)

# Enable debug mode
logger.setLevel(logging.DEBUG)

# Check system status (run in a shell, then type /status inside the chat)
lobster chat
/status

3. Testing Connectivity

# Test agent registry
python -c "from lobster.config.agent_registry import AGENT_REGISTRY; print(list(AGENT_REGISTRY.keys()))"

# Test CLI with both clients
LOBSTER_CLOUD_KEY="" python -m lobster chat  # Local mode
LOBSTER_CLOUD_KEY="key" python -m lobster chat  # Cloud mode

📚 Further Reading

Creating Agents Guide - Detailed agent development
Creating Services Guide - Service implementation patterns
Creating Adapters Guide - Data adapter development
Testing Guide - Comprehensive testing framework
CLAUDE.md - Complete architectural documentation

🎯 Quick Reference

Key Files to Know

lobster/config/agent_registry.py - Agent configuration registry
lobster/core/interfaces/base_client.py - Client interface definition
lobster/core/data_manager_v2.py - Multi-modal data orchestrator
lobster/cli.py - CLI implementation with autocomplete
tests/conftest.py - Test configuration and fixtures

Essential Commands

make dev-install    # Development setup
make test           # Run all tests
lobster chat        # Start interactive CLI
/help               # Show available commands (inside lobster chat)
/status             # System status (inside lobster chat)
/files              # List workspace files (inside lobster chat)

This overview provides the foundation for contributing to Lobster AI. Each component follows established patterns that promote consistency, maintainability, and scientific rigor.