Architecture Diagram - the-omics-os/lobster-local GitHub Wiki

# Lobster AI - Cloud & Local Architecture

๐Ÿ—๏ธ System Architecture Overview

Lobster AI is a powerful multi-agent bioinformatics platform with seamless cloud and local deployment capabilities. The system automatically detects your configuration and routes requests appropriately.

โ˜๏ธ Cloud/Local Architecture Pattern

graph TB
    subgraph "User Interface Layer"
        CLI[๐Ÿฆž Lobster CLI<br/>lobster chat]
        STREAMLIT[๐Ÿ“Š Streamlit Web UI<br/>streamlit_app.py]
        API[๐Ÿ”Œ FastAPI Server<br/>lobster serve]
    end

    subgraph "Smart Client Detection"
        DETECT[Environment Check<br/>LOBSTER_CLOUD_KEY?]
        INIT[init_client()<br/>๐Ÿ”„ Automatic Switching]
    end

    subgraph "โ˜๏ธ Cloud Mode (LOBSTER_CLOUD_KEY set)"
        CLOUD_CLIENT[CloudLobsterClient<br/>๐ŸŒฉ๏ธ HTTP API Calls]
        CLOUD_API[Lobster Cloud API<br/>api.lobster.omics-os.com]
        AWS_INFRA[AWS Infrastructure<br/>๐Ÿ—๏ธ Scalable Compute]
    end

    subgraph "๐Ÿ’ป Local Mode (No cloud key or fallback)"
        LOCAL_CLIENT[AgentClient<br/>๐Ÿ–ฅ๏ธ Local LangGraph Processing]
        LOCAL_AGENTS[Local AI Agents<br/>๐Ÿค– Full Agent Pipeline]
        LOCAL_DATA[DataManagerV2<br/>๐Ÿ“Š Local Data Management]
    end

    subgraph "๐Ÿ”„ Unified Interface"
        BASE_CLIENT[BaseClient Interface<br/>๐Ÿ“‹ Common Methods Contract]
        METHODS[query(), get_status()<br/>read_file(), export_session()]
    end

    CLI --> INIT
    STREAMLIT --> INIT
    API --> INIT

    INIT --> DETECT
    DETECT --> |Cloud Key Present| CLOUD_CLIENT
    DETECT --> |No Key/Fallback| LOCAL_CLIENT

    CLOUD_CLIENT --> |HTTP/REST| CLOUD_API
    CLOUD_API --> AWS_INFRA

    LOCAL_CLIENT --> LOCAL_AGENTS
    LOCAL_CLIENT --> LOCAL_DATA

    CLOUD_CLIENT -.-> |Implements| BASE_CLIENT
    LOCAL_CLIENT -.-> |Implements| BASE_CLIENT
    BASE_CLIENT --> METHODS

    classDef ui fill:#e3f2fd,stroke:#0277bd,stroke-width:2px
    classDef detect fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef cloud fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px
    classDef local fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef interface fill:#fce4ec,stroke:#c2185b,stroke-width:2px

    class CLI,STREAMLIT,API ui
    class DETECT,INIT detect
    class CLOUD_CLIENT,CLOUD_API,AWS_INFRA cloud
    class LOCAL_CLIENT,LOCAL_AGENTS,LOCAL_DATA local
    class BASE_CLIENT,METHODS interface
Loading
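The BaseClient contract shown in the diagram can be sketched as a minimal abstract base class. The method names come from the diagram; the signatures and return types are assumptions for illustration, not the package's actual API.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict


class BaseClient(ABC):
    """Common contract implemented by both CloudLobsterClient and AgentClient."""

    @abstractmethod
    def query(self, user_input: str) -> Dict[str, Any]:
        """Run a natural-language request and return a standardized response."""

    @abstractmethod
    def get_status(self) -> Dict[str, Any]:
        """Report backend health; used to test cloud connectivity."""

    @abstractmethod
    def read_file(self, path: str) -> str:
        """Load a file into the current session."""

    @abstractmethod
    def export_session(self, destination: str) -> str:
        """Export the current session and return the export location."""
```

Because both clients satisfy the same contract, the UI layer can hold a `BaseClient` reference and never branch on which mode is active.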

## 🔄 Seamless Mode Switching Flow

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant init_client
    participant CloudClient
    participant LocalClient
    participant CloudAPI

    User->>CLI: lobster chat
    CLI->>init_client: Initialize client

    Note over init_client: Check LOBSTER_CLOUD_KEY

    alt Cloud Key Present
        init_client->>CloudClient: new CloudLobsterClient(api_key)
        CloudClient->>CloudAPI: Test connection (get_status)

        alt Connection Success
            CloudAPI-->>CloudClient: ✅ Status OK
            CloudClient-->>init_client: ✅ Cloud client ready
            init_client-->>CLI: CloudLobsterClient instance
            CLI-->>User: 🌩️ "Cloud mode active"
        else Connection Failed
            CloudAPI-->>CloudClient: ❌ Connection failed
            CloudClient-->>init_client: ❌ Error
            init_client->>LocalClient: Fallback to local
            LocalClient-->>init_client: ✅ Local client ready
            init_client-->>CLI: AgentClient instance
            CLI-->>User: 💻 "Using local mode (cloud unavailable)"
        end
    else No Cloud Key
        init_client->>LocalClient: new AgentClient()
        LocalClient-->>init_client: ✅ Local client ready
        init_client-->>CLI: AgentClient instance
        CLI-->>User: 💻 "Using local mode"
    end

    User->>CLI: "Analyze my RNA-seq data"
    CLI->>init_client: client.query(user_input)

    alt Using Cloud Client
        init_client->>CloudAPI: POST /query
        CloudAPI-->>init_client: Analysis response
    else Using Local Client
        init_client->>LocalClient: Process with local agents
        LocalClient-->>init_client: Analysis response
    end

    init_client-->>CLI: Standardized response
    CLI-->>User: 🦞 Analysis results
```
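The detection-and-fallback flow above can be sketched in a few lines. This is a hedged illustration: the class names follow the diagram, but the stub classes here only stand in for the real implementations in the lobster package.

```python
import os


class CloudLobsterClient:
    """Stand-in for the real cloud client."""
    def __init__(self, api_key):
        self.api_key = api_key

    def get_status(self):
        # The real client performs an HTTP call to api.lobster.omics-os.com here;
        # this sketch always fails to demonstrate the fallback path.
        raise ConnectionError("cloud unreachable in this sketch")


class AgentClient:
    """Stand-in for the fully functional local client."""
    def get_status(self):
        return {"status": "ok", "mode": "local"}


def init_client():
    """Return a cloud client when LOBSTER_CLOUD_KEY is set and reachable,
    otherwise fall back to the local client."""
    api_key = os.environ.get("LOBSTER_CLOUD_KEY")
    if api_key:
        client = CloudLobsterClient(api_key)
        try:
            client.get_status()   # connection test from the diagram
            return client         # cloud mode active
        except Exception:
            pass                  # connection failed: fall through to local
    return AgentClient()          # local mode
```

Whichever branch wins, the caller receives an object with the same interface, so `client.query(...)` works identically in both modes.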

## 📦 Clean Single Package Structure

```mermaid
graph TB
    subgraph "Lobster AI Open Source"
        MAIN[lobster/<br/>🦞 Complete Bioinformatics Platform<br/>All Features Included]

        subgraph "Core Components"
            CLI[cli.py<br/>🔄 Smart CLI with Cloud Detection]
            AGENTS[agents/<br/>🤖 All AI Agents]
            CORE[core/<br/>📊 DataManagerV2 & Client]
            TOOLS[tools/<br/>🔧 Analysis Services]
            UTILS[utils/<br/>🛠️ Utilities & Callbacks]
            CONFIG[config/<br/>⚙️ Configuration System]
        end

        MAIN --> CLI
        MAIN --> AGENTS
        MAIN --> CORE
        MAIN --> TOOLS
        MAIN --> UTILS
        MAIN --> CONFIG
    end

    subgraph "Smart CLI Detection"
        DETECT[Environment Check<br/>LOBSTER_CLOUD_KEY?]
        NO_KEY[💻 Use Local Mode<br/>Full Functionality]
        WITH_KEY[☁️ Show Cloud Message<br/>→ cloud.lobster.ai<br/>→ Fallback to Local]
    end

    subgraph "Local Processing (100% Functional)"
        CLIENT[AgentClient<br/>💻 Complete Features]
        DM[DataManagerV2<br/>📊 All Data Management]
        AI_PIPELINE[AI Agent Pipeline<br/>🧬 Full Analysis Power]
    end

    CLI --> DETECT
    DETECT --> |No Cloud Key| NO_KEY
    DETECT --> |Cloud Key Set| WITH_KEY
    NO_KEY --> CLIENT
    WITH_KEY --> CLIENT

    CLIENT --> DM
    CLIENT --> AI_PIPELINE

    classDef main fill:#4caf50,stroke:#2e7d32,stroke-width:4px
    classDef component fill:#81c784,stroke:#4caf50,stroke-width:2px
    classDef cli fill:#2196f3,stroke:#1976d2,stroke-width:2px
    classDef processing fill:#ff9800,stroke:#f57c00,stroke-width:2px

    class MAIN main
    class CLI,AGENTS,CORE,TOOLS,UTILS,CONFIG component
    class DETECT,NO_KEY,WITH_KEY cli
    class CLIENT,DM,AI_PIPELINE processing
```

## 🌟 Cloud Platform (Coming Soon)

```mermaid
graph LR
    LOCAL[🖥️ Local Installation<br/>Free & Open Source] --> UPGRADE{Want Cloud?}
    UPGRADE --> |Yes| CLOUD[☁️ Lobster Cloud<br/>Managed Platform]
    UPGRADE --> |No| LOCAL

    CLOUD --> FEATURES[🚀 Scalable Computing<br/>👥 Team Collaboration<br/>💾 Persistent Storage]

    classDef free fill:#4caf50,stroke:#2e7d32,stroke-width:2px
    classDef cloud fill:#2196f3,stroke:#1976d2,stroke-width:3px

    class LOCAL free
    class CLOUD,FEATURES cloud
```

## System Architecture Overview - Post Migration

```mermaid
graph TB
    %% Data Sources
    subgraph "Data Sources"
        GEO[GEO Database<br/>GSE Datasets]
        CSV[CSV Files<br/>Local Data]
        EXCEL[Excel Files<br/>Lab Data]
        H5AD[H5AD Files<br/>Processed Data]
        MTX[10X MTX<br/>Single-cell Data]
    end

    %% NEW: Agent Registry System
    subgraph "Agent Registry System"
        AREG[Agent Registry<br/>🏛️ Single Source of Truth]
        ACONF[Agent Configurations<br/>📋 Factory Functions & Metadata]
        HDTOOLS[Handoff Tools<br/>🔄 Dynamic Tool Generation]
    end

    %% NEW: Supervisor Configuration System (v0.2+)
    subgraph "Supervisor Configuration (v0.2+)"
        SCONF[SupervisorConfig<br/>⚙️ Dynamic Configuration]
        CAPEXT[AgentCapabilityExtractor<br/>🔍 Auto-Discovery]
        MODES[Operation Modes<br/>🎯 Research/Production/Dev]
    end

    %% Agents Layer
    subgraph "AI Agents - Dynamically Loaded"
        DE[Data Expert<br/>🔄 Data Loading & Management]
        RA[Research Agent<br/>🔍 Literature Discovery & Dataset ID<br/>📄 Method Extraction from Publications]
        TE[Transcriptomics Expert<br/>🧬 Single-Cell & Bulk RNA-seq Analysis<br/>📊 Unified Transcriptomics Workflows]
        MLE[ML Expert<br/>🤖 Machine Learning & scVI]
        VIZ[Visualization Expert<br/>📈 Publication-Quality Plots]
    end

    %% Proteomics Agent (Unified)
    subgraph "Proteomics Agent"
        PE[Proteomics Expert<br/>🔬 Mass Spectrometry & Affinity Analysis<br/>🎯 Unified Proteomics Workflows]
    end

    %% NEW: Analysis Services Layer (Stateless)
    subgraph "Analysis Services - Stateless & Modular"
        PREP[PreprocessingService<br/>🔧 Filter & Normalize]
        QUAL[QualityService<br/>📊 QC Assessment]
        CLUST[ClusteringService<br/>🎯 Leiden & UMAP]
        SCELL[EnhancedSingleCellService<br/>🔬 Doublets & Annotation]
        BULK[BulkRNASeqService<br/>📈 Bulk Analysis & pyDESeq2]
        GEO_SVC[GEOService<br/>💾 Data Download]
        CONCAT_SVC[ConcatenationService<br/>🔗 Sample Concatenation & Code Deduplication<br/>Memory-Efficient Multi-Modal Merging]

        subgraph "Pseudobulk & Differential Expression Services"
            PBULK[PseudobulkService<br/>🧬 Single-cell to Pseudobulk Aggregation]
            FORMULA[DifferentialFormulaService<br/>📊 R-style Formula Parsing & Design Matrix<br/>🤖 Agent-Guided Formula Construction<br/>🔄 Iterative Analysis Support]
            PBADAP[PseudobulkAdapter<br/>🔄 Schema Validation & QC]
            WFLOW[WorkflowTracker<br/>🔄 DE Iteration Management<br/>📊 Result Comparison & Analytics]
        end

        subgraph "Proteomics Services - Professional Grade"
            PPREP[ProteomicsPreprocessingService<br/>🧪 MS/Affinity Filtering & Normalization]
            PQUAL[ProteomicsQualityService<br/>📊 Missing Value & CV Analysis]
            PANAL[ProteomicsAnalysisService<br/>🔬 Statistical Testing & PCA]
            PDIFF[ProteomicsDifferentialService<br/>📈 Differential Expression & Pathways]
            PVIS[ProteomicsVisualizationService<br/>📊 Volcano Plots & Networks]
        end
    end

    %% Publication Services Layer (v0.2.0+ Three-Tier Architecture)
    subgraph "Publication & Literature Services"
        CONTENT_SVC[ContentAccessService<br/>🎯 Capability-Based Coordinator]:::coordinator
        PROVIDER_REG[ProviderRegistry<br/>🔀 Priority Routing]:::service
        ABSTRACT_PROV[AbstractProvider<br/>⚡ Priority 10: Fast Abstracts]:::tier1
        PMC_PROV[PMCProvider<br/>📄 Priority 10: PMC XML API]:::tier1
        PUBMED[PubMedProvider<br/>📚 Priority 10: Literature Search]:::tier1
        GEOPROV[GEOProvider<br/>🧬 Priority 10: Dataset Discovery]:::tier1
        WEBPAGE_PROV[WebpageProvider<br/>🌐 Priority 50: Webpage Fallback]:::tier2
        DOCLING_SVC[DoclingService<br/>📄 Internal to WebpageProvider]:::foundation
        METADATA_VAL[MetadataValidationService<br/>✅ Dataset Validation]:::service
    end

    %% DataManagerV2 Orchestration
    subgraph "DataManagerV2 Orchestration"
        DM2[DataManagerV2<br/>🎯 Modality Coordinator]
        MODALITIES[Modality Storage<br/>📊 AnnData Objects]
        PROV[Provenance Tracker<br/>📋 Analysis History]
        ERROR[Error Handling<br/>⚠️ Professional Exceptions]
    end

    %% Modality Adapters
    subgraph "Modality Adapters"
        TRA[TranscriptomicsAdapter<br/>🧬 RNA-seq Loading]
        PRA[ProteomicsAdapter<br/>🧪 Protein Loading]

        subgraph "Transcriptomics Types"
            TRSC[Single-cell RNA-seq<br/>Schema & Validation]
            TRBL[Bulk RNA-seq<br/>Schema & Validation]
        end

        subgraph "Proteomics Types"
            PRMS[Mass Spectrometry<br/>Missing Value Handling]
            PRAF[Affinity Proteomics<br/>Antibody Arrays]
        end

        TRA --> TRSC
        TRA --> TRBL
        PRA --> PRMS
        PRA --> PRAF
    end

    %% Storage Backends
    subgraph "Storage Backends"
        H5BE[H5ADBackend<br/>💾 Single Modality<br/>S3-Ready]
        MUBE[MuDataBackend<br/>🔗 Multi-Modal<br/>Integrated Analysis]
    end

    %% Schema & Validation
    subgraph "Schema System"
        TSCH[TranscriptomicsSchema<br/>📋 RNA-seq Rules]
        PSCH[ProteomicsSchema<br/>📋 Protein Rules]
        FVAL[FlexibleValidator<br/>⚠️ Warning-based QC]
    end

    %% Interfaces
    subgraph "Core Interfaces"
        IBACK[IDataBackend<br/>🔌 Storage Contract]
        IADAP[IModalityAdapter<br/>🔌 Processing Contract]
        IVAL[IValidator<br/>🔌 Validation Contract]
    end

    %% Data Flow Connections
    GEO --> DE
    CSV --> DE
    EXCEL --> DE
    H5AD --> DE
    MTX --> DE

    %% Agent to Service connections
    DE --> GEO_SVC
    RA --> GEOPROV

    %% Three-tier publication access flow (v0.2.0+ with capability-based routing)
    RA --> CONTENT_SVC
    CONTENT_SVC --> PROVIDER_REG
    PROVIDER_REG --> ABSTRACT_PROV
    PROVIDER_REG --> PMC_PROV
    PROVIDER_REG --> PUBMED
    PROVIDER_REG --> GEOPROV
    PROVIDER_REG --> WEBPAGE_PROV
    WEBPAGE_PROV --> DOCLING_SVC
    ABSTRACT_PROV -.-> PUBMED

    %% Transcriptomics Expert connections (handles both single-cell and bulk)
    TE --> PREP
    TE --> QUAL
    TE --> CLUST
    TE --> SCELL
    TE --> PBULK
    TE --> FORMULA
    TE --> BULK

    %% Proteomics Expert connections (handles both MS and affinity)
    PE --> PPREP
    PE --> PQUAL
    PE --> PANAL
    PE --> PDIFF
    PE --> PVIS

    %% Service to DataManager connections
    PREP --> |AnnData Processing| DM2
    QUAL --> |QC Metrics| DM2
    CLUST --> |Clustering Results| DM2
    SCELL --> |Annotations| DM2
    GEO_SVC --> |Dataset Loading| DM2
    CONTENT_SVC --> |Publication Metadata| DM2
    PBULK --> |Pseudobulk Matrices| DM2
    FORMULA --> |Design Matrices| DM2
    BULK --> |DE Results| DM2

    %% Proteomics Service to DataManager connections
    PPREP --> |Proteomics Processing| DM2
    PQUAL --> |Proteomics QC| DM2
    PANAL --> |Statistical Analysis| DM2
    PDIFF --> |Differential Results| DM2
    PVIS --> |Visualization Data| DM2

    %% DataManager orchestration
    DM2 --> TRA
    DM2 --> PRA
    DM2 --> MODALITIES
    DM2 --> PROV
    DM2 --> ERROR

    TRA --> TSCH
    PRA --> PSCH
    TSCH --> FVAL
    PSCH --> FVAL

    MODALITIES --> H5BE
    MODALITIES --> MUBE

    %% Interface implementations
    TRA -.-> IADAP
    PRA -.-> IADAP
    H5BE -.-> IBACK
    MUBE -.-> IBACK
    FVAL -.-> IVAL

    %% Styling
    classDef agent fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef service fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef orchestrator fill:#f3e5f5,stroke:#4a148c,stroke-width:3px
    classDef adapter fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef backend fill:#fce4ec,stroke:#c2185b,stroke-width:2px
    classDef schema fill:#f1f8e9,stroke:#388e3c,stroke-width:2px
    classDef interface fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px,stroke-dasharray: 5 5
    classDef source fill:#f5f5f5,stroke:#616161,stroke-width:1px
    classDef tier1 fill:#90EE90,stroke:#228B22,stroke-width:3px,color:#000
    classDef tier2 fill:#87CEEB,stroke:#4682B4,stroke-width:3px,color:#000
    classDef coordinator fill:#FFB6C1,stroke:#C71585,stroke-width:3px,color:#000
    classDef foundation fill:#DDA0DD,stroke:#8B008B,stroke-width:3px,color:#000
    classDef disabled fill:#E0E0E0,stroke:#757575,stroke-width:2px,stroke-dasharray:5 5,color:#424242

    class DE,RA,TE,PE,MLE,VIZ agent
    class PREP,QUAL,CLUST,SCELL,BULK,GEO_SVC,CONCAT_SVC,PBULK,FORMULA,PBADAP,WFLOW,PPREP,PQUAL,PANAL,PDIFF,PVIS service
    class DM2,MODALITIES,PROV,ERROR orchestrator
    class TRA,PRA,TRSC,TRBL,PRMS,PRAF adapter
    class H5BE,MUBE backend
    class TSCH,PSCH,FVAL schema
    class IBACK,IADAP,IVAL interface
    class GEO,CSV,EXCEL,H5AD,MTX source
```

## Data Flow Diagram - Modular Service Architecture

```mermaid
sequenceDiagram
    participant User
    participant DataExpert as Data Expert Agent
    participant TransExpert as Transcriptomics Expert
    participant DM2 as DataManagerV2
    participant Service as Analysis Service
    participant Adapter as Modality Adapter
    participant Schema as Schema Validator
    participant Backend as Storage Backend

    %% Data Loading Flow
    User->>DataExpert: "Download GSE12345"
    DataExpert->>DM2: load_modality("geo_gse12345", source, "transcriptomics_single_cell")

    DM2->>Adapter: from_source(source_data)
    Adapter->>Adapter: Detect format, load data
    Adapter->>Schema: validate(adata)
    Schema-->>Adapter: ValidationResult (warnings/errors)
    Adapter-->>DM2: AnnData with schema compliance

    DM2->>DM2: Store as modality
    DM2-->>DataExpert: Modality loaded successfully
    DataExpert-->>User: "Loaded geo_gse12345: 5000 cells × 20000 genes"

    %% NEW: Modular Analysis Flow
    User->>TransExpert: "Filter and normalize the data"
    TransExpert->>DM2: get_modality("geo_gse12345")
    DM2-->>TransExpert: AnnData object

    TransExpert->>Service: PreprocessingService.filter_and_normalize_cells(adata, params)
    Service->>Service: Professional QC filtering
    Service->>Service: Scanpy normalization
    Service-->>TransExpert: (processed_adata, processing_stats)

    TransExpert->>DM2: Store new modality("geo_gse12345_filtered_normalized")
    TransExpert->>DM2: log_tool_usage(operation_details)
    TransExpert-->>User: "Filtering complete: 4500 cells retained (90%)"

    User->>TransExpert: "Run clustering analysis"
    TransExpert->>DM2: get_modality("geo_gse12345_filtered_normalized")
    DM2-->>TransExpert: Processed AnnData

    TransExpert->>Service: ClusteringService.cluster_and_visualize(adata, params)
    Service->>Service: HVG detection, PCA, neighbors graph
    Service->>Service: Leiden clustering, UMAP embedding
    Service-->>TransExpert: (clustered_adata, clustering_stats)

    TransExpert->>DM2: Store new modality("geo_gse12345_clustered")
    TransExpert->>DM2: log_tool_usage(clustering_results)
    TransExpert-->>User: "Clustering complete: 8 clusters identified"

    User->>TransExpert: "Find marker genes"
    TransExpert->>DM2: get_modality("geo_gse12345_clustered")
    TransExpert->>Service: EnhancedSingleCellService.find_marker_genes(adata, params)
    Service->>Service: Differential expression analysis
    Service-->>TransExpert: (marker_adata, marker_stats)

    TransExpert->>DM2: Store new modality("geo_gse12345_markers")
    TransExpert-->>User: "Marker genes identified for all clusters"

    %% Provenance and Error Handling
    Note over DM2: All operations tracked<br/>Professional error handling<br/>Complete provenance trail
```
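The pattern in this flow is that every analysis step reads a modality, stores the result under a new name, and logs the operation. A minimal stand-in illustrates it; the method names follow the diagram, while the internals are assumptions, not the real `DataManagerV2` implementation.

```python
from datetime import datetime, timezone


class DataManagerV2Sketch:
    """Minimal stand-in: keeps modalities by name plus a provenance trail."""

    def __init__(self):
        self.modalities = {}
        self.provenance = []

    def load_modality(self, name, data, modality_type):
        self.modalities[name] = data
        self.log_tool_usage({"op": "load", "name": name, "type": modality_type})

    def get_modality(self, name):
        return self.modalities[name]

    def store_modality(self, name, data):
        # Results go under a new name; input modalities are never overwritten.
        self.modalities[name] = data
        self.log_tool_usage({"op": "store", "name": name})

    def log_tool_usage(self, details):
        entry = {"time": datetime.now(timezone.utc).isoformat(), **details}
        self.provenance.append(entry)


dm = DataManagerV2Sketch()
dm.load_modality("geo_gse12345", {"cells": 5000}, "transcriptomics_single_cell")
raw = dm.get_modality("geo_gse12345")
filtered = {"cells": 4500}  # pretend this came from PreprocessingService
dm.store_modality("geo_gse12345_filtered_normalized", filtered)
```

Because each step writes a new modality, earlier states remain addressable and the provenance list records the full analysis history.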

## Component Interaction Matrix

```mermaid
graph LR
    subgraph "Agents → DataManagerV2"
        DE[Data Expert] --> |load_modality<br/>save_modality| DM2[DataManagerV2]
        RA[Research Agent] --> |literature_context<br/>method_extraction| DM2
        SCE[Single-Cell Expert] --> |get_modality<br/>cluster_analyze| DM2
        BRE[Bulk RNA-seq Expert] --> |get_modality<br/>differential_expression| DM2
        PE[Proteomics Expert] --> |get_modality<br/>analyze_patterns| DM2
    end

    subgraph "DataManagerV2 → Adapters"
        DM2 --> |from_source| TRA[TranscriptomicsAdapter]
        DM2 --> |from_source| PRA[ProteomicsAdapter]
    end

    subgraph "Adapters → Validation"
        TRA --> |validate| TSCH[TranscriptomicsSchema]
        PRA --> |validate| PSCH[ProteomicsSchema]
        TSCH --> FVAL[FlexibleValidator]
        PSCH --> FVAL
    end

    subgraph "DataManagerV2 → Storage"
        DM2 --> |save/load| H5BE[H5ADBackend]
        DM2 --> |save/load| MUBE[MuDataBackend]
    end

    classDef agent fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef orchestrator fill:#f3e5f5,stroke:#4a148c,stroke-width:3px
    classDef adapter fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef backend fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef schema fill:#fce4ec,stroke:#880e4f,stroke-width:2px

    class DE,RA,SCE,BRE,PE agent
    class DM2 orchestrator
    class TRA,PRA adapter
    class H5BE,MUBE backend
    class TSCH,PSCH,FVAL schema
```

## ๐Ÿ›๏ธ Centralized Agent Registry System

### Overview

The Lobster AI system now features a **centralized agent registry** that serves as the single source of truth for all agent configurations. This eliminates redundancy and reduces errors when adding new agents to the system.

### Agent Registry Architecture

```mermaid
graph TB
    subgraph "Agent Registry System"
        AREG[Agent Registry<br/>lobster/config/agent_registry.py]
        ACONF[AgentConfig Objects<br/>🔧 Metadata & Factory Functions]
        HELPERS["Helper Functions<br/>🛠️ get_worker_agents()<br/>get_all_agent_names()"]
    end

    subgraph "System Integration"
        GRAPH[Graph Creation<br/>lobster/agents/graph.py]
        CALLBACKS[Callback System<br/>lobster/utils/callbacks.py]
        SETTINGS[Settings Integration<br/>lobster/config/settings.py]
    end

    subgraph "Dynamic Loading"
        IMPORT["Dynamic Import<br/>import_agent_factory()"]
        TOOLS["Tool Generation<br/>create_custom_handoff_tool()"]
        DETECT["Agent Detection<br/>get_all_agent_names()"]
    end

    AREG --> ACONF
    AREG --> HELPERS

    HELPERS --> GRAPH
    HELPERS --> CALLBACKS
    HELPERS --> SETTINGS

    ACONF --> IMPORT
    ACONF --> TOOLS
    HELPERS --> DETECT

    %% Supervisor Configuration connections
    subgraph "Supervisor Configuration (v0.2+)"
        SCONF[SupervisorConfig<br/>⚙️ Dynamic Configuration]
        CAPEXT[AgentCapabilityExtractor<br/>🔍 Auto-Discovery]
        MODES[Operation Modes<br/>🎯 Research/Production/Dev]
    end

    SCONF --> MODES
    CAPEXT --> AREG
    CAPEXT --> ACONF
    SCONF --> GRAPH
    MODES --> SCONF

    classDef registry fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px
    classDef integration fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef dynamic fill:#fff3e0,stroke:#f57c00,stroke-width:2px
    classDef config fill:#f3e5f5,stroke:#4a148c,stroke-width:2px

    class AREG,ACONF,HELPERS registry
    class GRAPH,CALLBACKS,SETTINGS integration
    class IMPORT,TOOLS,DETECT dynamic
    class SCONF,CAPEXT,MODES config
```

### Agent Configuration Schema

Each agent in the registry is defined using an `AgentConfig` dataclass:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AgentConfig:
    """Configuration for an agent in the system."""
    name: str                                 # Unique agent identifier
    display_name: str                         # Human-readable name
    description: str                          # Agent's purpose/capability
    factory_function: str                     # Module path to factory function
    handoff_tool_name: Optional[str]          # Name of handoff tool
    handoff_tool_description: Optional[str]   # Tool description
```

### Current Agent Registry

```python
AGENT_REGISTRY: Dict[str, AgentConfig] = {
    'data_expert_agent': AgentConfig(
        name='data_expert_agent',
        display_name='Data Expert',
        description='Handles data fetching and download tasks',
        factory_function='lobster.agents.data_expert.data_expert',
        handoff_tool_name='handoff_to_data_expert',
        handoff_tool_description='Assign data fetching/download tasks to the data expert'
    ),
    'transcriptomics_expert': AgentConfig(
        name='transcriptomics_expert',
        display_name='Transcriptomics Expert',
        description='Handles both single-cell and bulk RNA-seq analysis tasks',
        factory_function='lobster.agents.transcriptomics_expert.transcriptomics_expert',
        handoff_tool_name='handoff_to_transcriptomics_expert',
        handoff_tool_description='Assign transcriptomics analysis tasks (single-cell or bulk RNA-seq) to the transcriptomics expert'
    ),
    'research_agent': AgentConfig(
        name='research_agent',
        display_name='Research Agent',
        description='Handles literature discovery, dataset identification, and method extraction from publications. Replaces deprecated method_expert_agent.',
        factory_function='lobster.agents.research_agent.research_agent',
        handoff_tool_name='handoff_to_research_agent',
        handoff_tool_description='Assign literature search, dataset discovery, PDF extraction, and method extraction tasks to the research agent'
    ),
    'machine_learning_expert_agent': AgentConfig(
        name='machine_learning_expert_agent',
        display_name='ML Expert',
        description='Handles machine-learning tasks such as transforming data into the desired format for downstream analysis',
        factory_function='lobster.agents.machine_learning_expert.machine_learning_expert',
        handoff_tool_name='handoff_to_machine_learning_expert',
        handoff_tool_description='Assign all machine learning related tasks (scVI, classification, etc.) to the machine learning expert agent'
    ),
    'visualization_expert_agent': AgentConfig(
        name='visualization_expert_agent',
        display_name='Visualization Expert',
        description='Creates publication-quality visualizations through supervisor-mediated workflows',
        factory_function='lobster.agents.visualization_expert.visualization_expert',
        handoff_tool_name='handoff_to_visualization_expert',
        handoff_tool_description='Delegate visualization tasks to the visualization expert agent'
    ),
    'proteomics_expert': AgentConfig(
        name='proteomics_expert',
        display_name='Proteomics Expert',
        description='Handles both mass spectrometry and affinity proteomics analysis tasks',
        factory_function='lobster.agents.proteomics_expert.proteomics_expert',
        handoff_tool_name='handoff_to_proteomics_expert',
        handoff_tool_description='Assign proteomics analysis tasks (mass spectrometry or affinity proteomics) to the proteomics expert'
    ),
}
```

### System Integration Flow

```mermaid
sequenceDiagram
    participant AREG as Agent Registry
    participant GRAPH as Graph Creation
    participant CB as Callbacks
    participant AGENT as Created Agent

    Note over AREG: System Startup
    GRAPH->>AREG: get_worker_agents()
    AREG-->>GRAPH: Dict[agent_name, AgentConfig]

    loop For each agent config
        GRAPH->>AREG: import_agent_factory(config.factory_function)
        AREG-->>GRAPH: Agent factory function
        GRAPH->>GRAPH: Create agent instance
        GRAPH->>GRAPH: Create handoff tool
    end

    Note over GRAPH: All agents loaded dynamically

    Note over CB: Runtime Execution
    CB->>AREG: get_all_agent_names()
    AREG-->>CB: List of all agent names
    CB->>CB: Monitor for agent transitions
    CB->>AGENT: Detect agent handoffs
```

### Benefits of Centralized Registry

#### Before (Legacy System)

```text
Adding new agents required updating:
├── lobster/agents/graph.py          # Import statements
├── lobster/agents/graph.py          # Agent creation code
├── lobster/agents/graph.py          # Handoff tool definitions
├── lobster/utils/callbacks.py       # Hardcoded agent name list
└── Multiple imports throughout codebase
```

#### After (Registry System)

```text
Adding new agents only requires:
└── lobster/config/agent_registry.py  # Single registry entry

Everything else is handled automatically:
├── ✅ Dynamic agent loading
├── ✅ Automatic handoff tool creation
├── ✅ Callback system integration
├── ✅ Type-safe configuration
└── ✅ Professional error handling
```

### How to Add New Agents

#### Step 1: Create Agent Implementation

```python
# lobster/agents/new_agent.py
def new_agent(data_manager, callback_handler=None, agent_name='new_agent', handoff_tools=None):
    """Create a new specialized agent."""
    # ... build and configure the agent here ...
    return agent_instance
```

#### Step 2: Register in Agent Registry

```python
# lobster/config/agent_registry.py
AGENT_REGISTRY = {
    # ... existing agents ...
    'new_agent': AgentConfig(
        name='new_agent',
        display_name='New Agent',
        description='Handles specialized new functionality',
        factory_function='lobster.agents.new_agent.new_agent',
        handoff_tool_name='handoff_to_new_agent',
        handoff_tool_description='Assign specialized tasks to the new agent'
    ),
}
```

#### Step 3: Done!

The system automatically handles:

- ✅ Agent loading in graph creation
- ✅ Handoff tool generation
- ✅ Callback system detection
- ✅ Error handling and logging
- ✅ Integration with existing workflows

### Registry Helper Functions

The registry provides several utility functions:

```python
# Get all worker agents with configurations
worker_agents = get_worker_agents()
# Returns: Dict[str, AgentConfig]

# Get all agent names (including system agents)
all_agents = get_all_agent_names()
# Returns: List[str]

# Get specific agent configuration
config = get_agent_config('data_expert_agent')
# Returns: AgentConfig or None

# Dynamically import agent factory
factory = import_agent_factory('lobster.agents.data_expert.data_expert')
# Returns: Callable
```
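To make the dynamic-import mechanism concrete, here is a minimal sketch of how a function like `import_agent_factory` can resolve a dotted path using only the standard library. The real implementation in `agent_registry.py` may differ; this just demonstrates the technique.

```python
import importlib


def import_agent_factory(dotted_path: str):
    """Resolve 'package.module.attribute' to the attribute itself.

    The last dot separates the factory function's name from its module.
    """
    module_path, _, attr_name = dotted_path.rpartition('.')
    if not module_path:
        raise ImportError(f"'{dotted_path}' is not a dotted module path")
    module = importlib.import_module(module_path)
    try:
        return getattr(module, attr_name)
    except AttributeError as exc:
        raise ImportError(
            f"Module '{module_path}' has no attribute '{attr_name}'"
        ) from exc
```

For example, `import_agent_factory('os.path.join')` returns the `os.path.join` function, which is exactly how a registry entry's `factory_function` string becomes a callable at graph-creation time.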

### Error Prevention

The registry system prevents common errors:

#### Runtime Validation

- ✅ Factory function existence validation
- ✅ Import path verification
- ✅ Configuration completeness checks
- ✅ Duplicate agent name detection

#### Development Safety

- ✅ Type hints for all configurations
- ✅ Consistent naming conventions
- ✅ Comprehensive error messages
- ✅ Centralized documentation

#### Maintenance Benefits

- ✅ Single source of truth
- ✅ Easy to audit and review
- ✅ Reduced cognitive load
- ✅ Professional code organization

### Testing the Registry

The system includes comprehensive testing:

```python
# tests/test_agent_registry.py
def test_agent_registry():
    """Test the agent registry functionality."""
    # Test 1: Verify all agents are registered
    worker_agents = get_worker_agents()
    assert len(worker_agents) > 0

    # Test 2: Validate factory function imports
    for agent_name, config in worker_agents.items():
        factory = import_agent_factory(config.factory_function)
        assert callable(factory)

    # Test 3: Check agent name consistency
    all_agents = get_all_agent_names()
    assert 'supervisor' in all_agents
    assert 'data_expert_agent' in all_agents
```

Run the test with:

```bash
python tests/test_agent_registry.py
```

This centralized approach keeps agent management consistent, maintainable, and far less error-prone across the entire Lobster AI system.

## 🔗 ConcatenationService: Code Deduplication & Memory Efficiency

### Overview

The ConcatenationService eliminates code duplication and provides memory-efficient, modality-agnostic concatenation of biological samples. It replaces the redundant concatenation logic that previously lived in both data_expert.py and geo_service.py.

### Architecture Pattern

```mermaid
graph TB
    subgraph "Before: Code Duplication Problem"
        DE_OLD["data_expert.py<br/>concatenate_samples()<br/>200+ lines of code"]
        GEO_OLD["geo_service.py<br/>_concatenate_stored_samples()<br/>300+ lines of code"]
        DUPLICATION[❌ 450+ lines of duplicated logic<br/>❌ Memory inefficiency<br/>❌ Maintenance overhead]

        DE_OLD -.-> DUPLICATION
        GEO_OLD -.-> DUPLICATION
    end

    subgraph "After: Centralized Service"
        CONCAT_SERVICE[ConcatenationService<br/>🔗 Single Source of Truth<br/>810 lines of professional code]

        subgraph "Strategy Pattern"
            SMART[SmartSparseStrategy<br/>🧬 Single-cell optimized]
            MEMORY[MemoryEfficientStrategy<br/>💾 Large dataset chunked processing]
        end

        subgraph "Refactored Clients"
            DE_NEW["data_expert.py<br/>concatenate_samples()<br/>30 lines (delegates to service)"]
            GEO_NEW["geo_service.py<br/>_concatenate_stored_samples()<br/>20 lines (delegates to service)"]
        end

        CONCAT_SERVICE --> SMART
        CONCAT_SERVICE --> MEMORY
        DE_NEW --> CONCAT_SERVICE
        GEO_NEW --> CONCAT_SERVICE
    end

    classDef old fill:#ffebee,stroke:#c62828,stroke-width:2px
    classDef problem fill:#ffcdd2,stroke:#d32f2f,stroke-width:3px
    classDef new fill:#e8f5e8,stroke:#2e7d32,stroke-width:3px
    classDef strategy fill:#e3f2fd,stroke:#1976d2,stroke-width:2px
    classDef client fill:#f3e5f5,stroke:#7b1fa2,stroke-width:2px

    class DE_OLD,GEO_OLD old
    class DUPLICATION problem
    class CONCAT_SERVICE new
    class SMART,MEMORY strategy
    class DE_NEW,GEO_NEW client
```

Key Benefits

๐ŸŽฏ Code Reduction

  • data_expert.py: 200+ lines โ†’ 30 lines (85% reduction)
  • geo_service.py: 300+ lines โ†’ 20 lines (93% reduction)
  • Total elimination: 450+ lines of duplicated code

๐Ÿ’พ Memory Efficiency

  • Smart memory estimation with automatic strategy recommendation
  • Chunked processing for datasets exceeding memory limits
  • 50%+ memory reduction for large concatenation operations
  • Real-time memory monitoring during processing
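The memory-aware strategy selection described above can be sketched as follows. This is an illustrative model, not the actual ConcatenationService code: the function name, the float32 assumption, and the 80%-of-budget threshold are all assumptions for demonstration.

```python
# Illustrative sketch of memory-aware strategy recommendation.
# Names and thresholds are assumptions, not the real Lobster implementation.
from dataclasses import dataclass

@dataclass
class MemoryEstimate:
    required_gb: float   # projected peak memory for the concatenated matrix
    available_gb: float  # memory budget the service may use
    recommended: str     # "smart_sparse" or "memory_efficient"

def recommend_strategy(n_cells: int, n_genes: int,
                       density: float, available_gb: float) -> MemoryEstimate:
    """Estimate the footprint of the concatenated matrix and recommend
    chunked processing when it would exceed the memory budget."""
    bytes_per_value = 4  # float32
    required_gb = n_cells * n_genes * density * bytes_per_value / 1024**3
    strategy = ("smart_sparse" if required_gb < 0.8 * available_gb
                else "memory_efficient")
    return MemoryEstimate(required_gb, available_gb, strategy)
```

A small single-cell dataset stays on the sparse path, while a multi-million-cell concatenation tips the recommendation over to chunked processing.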

๐Ÿงฌ Modality-Agnostic Design

  • Strategy Pattern: Different algorithms for different data types
  • Single-cell optimization: Sparse matrix handling with batch tracking
  • Bulk transcriptomics: Optimized for dense matrix operations
  • Proteomics support: Handle missing values appropriately
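The strategy pattern behind these bullets can be illustrated with a minimal sketch. The class names mirror the diagram above, but the bodies work on plain lists rather than AnnData objects and are assumptions for demonstration only.

```python
# Minimal strategy-pattern sketch; class names follow the diagram,
# bodies are illustrative (plain lists stand in for AnnData matrices).
from abc import ABC, abstractmethod

class ConcatenationStrategy(ABC):
    @abstractmethod
    def concatenate(self, samples: list) -> list:
        """Combine per-sample row collections into one dataset."""

class SmartSparseStrategy(ConcatenationStrategy):
    def concatenate(self, samples: list) -> list:
        # Single-cell path: everything fits, concatenate in one pass.
        out = []
        for rows in samples:
            out.extend(rows)
        return out

class MemoryEfficientStrategy(ConcatenationStrategy):
    def __init__(self, chunk_size: int = 2):
        self.chunk_size = chunk_size

    def concatenate(self, samples: list) -> list:
        # Large-dataset path: process samples in fixed-size chunks so
        # peak memory stays bounded (chunking only simulated here).
        out = []
        for i in range(0, len(samples), self.chunk_size):
            for rows in samples[i:i + self.chunk_size]:
                out.extend(rows)
        return out

def concat(samples: list, strategy: ConcatenationStrategy) -> list:
    return strategy.concatenate(samples)
```

Because both strategies share one interface, callers such as data_expert.py never need to know which algorithm runs underneath.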

๐Ÿ”ง Professional Architecture

  • Single source of truth for all concatenation logic
  • Comprehensive error handling with custom exceptions
  • Progress tracking with Rich console integration
  • Extensive testing with 400+ lines of unit tests

Service Interface

# Primary concatenation method
concatenated_adata, statistics = concat_service.concatenate_samples(
    sample_adatas=sample_list,
    strategy=ConcatenationStrategy.SMART_SPARSE,
    batch_key="batch",
    use_intersecting_genes_only=True
)

# Concatenate from modality names
concatenated_adata, statistics = concat_service.concatenate_from_modalities(
    modality_names=["sample1", "sample2", "sample3"],
    output_name="concatenated_dataset",
    use_intersecting_genes_only=True
)

# Auto-detect samples by pattern
sample_modalities = concat_service.auto_detect_samples("geo_gse12345")

# Validate before processing
validation_result = concat_service.validate_concatenation_inputs(sample_list)

# Estimate memory requirements
memory_info = concat_service.estimate_memory_usage(sample_list)

Integration with DataManagerV2

The ConcatenationService integrates deeply with DataManagerV2 for seamless modality management:

sequenceDiagram
    participant DE as Data Expert Agent
    participant CS as ConcatenationService
    participant DM2 as DataManagerV2
    participant Strategy as Strategy Implementation

    DE->>CS: concatenate_from_modalities(modality_names)
    CS->>DM2: get_modality() for each sample
    DM2-->>CS: List of AnnData objects

    CS->>Strategy: concatenate(sample_adatas, **kwargs)
    Strategy->>Strategy: Apply concatenation algorithm
    Strategy->>Strategy: Add batch information
    Strategy->>Strategy: Monitor memory usage
    Strategy-->>CS: ConcatenationResult

    CS->>DM2: load_modality(output_name, concatenated_data)
    DM2->>DM2: Store as new modality
    DM2-->>CS: Confirmation
    CS-->>DE: (concatenated_adata, statistics)
    DE-->>DE: Format response for user

Testing & Quality Assurance

The ConcatenationService includes comprehensive testing:

  • Unit Tests: Strategy pattern, validation functions, memory estimation
  • Integration Tests: DataManagerV2 interaction, modality storage
  • Performance Tests: Memory usage, processing time benchmarks
  • Error Handling Tests: Exception scenarios, graceful degradation

This architecture improvement ensures reliable, maintainable, and efficient sample concatenation across the entire Lobster AI platform.

๐ŸŒŸ Open Source Benefits

๐Ÿ†“ What You Get for Free

  • Complete Bioinformatics Platform: All analysis capabilities included
  • AI-Powered Analysis: Natural language interface to bioinformatics
  • Publication-Ready Outputs: Professional visualizations and reports
  • Extensible Architecture: Add custom analysis methods easily
  • Active Development: Regular updates and community contributions

๐Ÿ“ˆ Why Choose Local Installation

  • Privacy: Your data never leaves your computer
  • Customization: Full control over analysis parameters
  • Learning: Study the source code to understand methods
  • Contribution: Help improve the platform for everyone
  • Cost: Completely free (you pay only for your own API keys)

โ˜๏ธ Interested in Cloud?

For teams needing scalable infrastructure, managed services, or collaborative features, we're developing a cloud platform.

Join the Waitlist โ†’

Architecture Migration Summary

๐ŸŽฏ Migration Goals Achieved

The Lobster AI system has been successfully migrated from a dual-system architecture (legacy DataManager + DataManagerV2) to a clean, professional, modular DataManagerV2-only implementation.

โœ… Key Improvements

1. Modular Service Architecture

  • Before: Agents contained mixed responsibilities with dual code paths
  • After: Clean separation with stateless analysis services and orchestration agents

2. Professional Error Handling

  • Custom Exception Hierarchy:
    • TranscriptomicsError, PreprocessingError, QualityError, etc.
    • ModalityNotFoundError for specific validation
  • Comprehensive Logging: All operations tracked with parameters and results
  • Graceful Error Recovery: Informative error messages with suggested fixes
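A minimal sketch of such an exception hierarchy is shown below. The class names come from the list above; their relationships and the lookup helper are assumptions for illustration.

```python
# Sketch of the custom exception hierarchy named above; class names are
# from the text, relationships and messages are illustrative assumptions.
class TranscriptomicsError(Exception):
    """Base class for transcriptomics service errors."""

class PreprocessingError(TranscriptomicsError):
    """Raised when filtering or normalization fails."""

class QualityError(TranscriptomicsError):
    """Raised when QC assessment cannot be completed."""

class ModalityNotFoundError(Exception):
    """Raised when a requested modality is not registered."""

def get_modality(modalities: dict, name: str):
    """Fetch a modality, raising an informative error with a suggested fix."""
    if name not in modalities:
        raise ModalityNotFoundError(
            f"Modality '{name}' not found. Available: {sorted(modalities)}"
        )
    return modalities[name]
```

Catching the base class lets agent tools handle any service failure uniformly while still logging the specific subclass.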

3. Stateless Services Design

  • PreprocessingService: AnnData filtering, normalization, batch correction
  • QualityService: Comprehensive QC assessment with statistical metrics
  • ClusteringService: Leiden clustering, PCA, UMAP visualization
  • EnhancedSingleCellService: Doublet detection, cell type annotation
  • GEOService: Professional dataset downloading and processing
  • PubMedService: Literature mining and method extraction

๐Ÿ—๏ธ New Architecture Pattern

Agent Tool Pattern

@tool
def tool_name(modality_name: str, **params) -> str:
    """Professional tool with comprehensive error handling."""
    try:
        # 1. Validate modality exists
        if modality_name not in data_manager.list_modalities():
            raise ModalityNotFoundError(f"Modality '{modality_name}' not found")

        # 2. Get AnnData from modality
        adata = data_manager.get_modality(modality_name)

        # 3. Call stateless service
        result_adata, stats = service.method_name(adata, **params)

        # 4. Save new modality with descriptive name
        new_modality_name = f"{modality_name}_processed"
        data_manager.modalities[new_modality_name] = result_adata

        # 5. Log operation for provenance
        data_manager.log_tool_usage("tool_name", params, "summary of the operation")

        # 6. Format professional response
        return format_professional_response(stats, new_modality_name)

    except ServiceError as e:
        logger.error(f"Service error: {e}")
        return f"Service error: {str(e)}"
    except Exception as e:
        logger.error(f"Unexpected error: {e}")
        return f"Unexpected error: {str(e)}"

Service Method Pattern

def service_method(
    self,
    adata: anndata.AnnData,
    **parameters
) -> Tuple[anndata.AnnData, Dict[str, Any]]:
    """
    Stateless service method working with AnnData directly.

    Returns:
        Tuple of (processed_adata, processing_statistics)

    """
    try:
        # 1. Create working copy
        adata_processed = adata.copy()

        # 2. Apply analysis algorithms
        # ... processing logic ...

        # 3. Calculate comprehensive statistics
        processing_stats = {
            "analysis_type": "method_type",
            "parameters_used": parameters,
            "results": {...}
        }

        return adata_processed, processing_stats

    except Exception as e:
        raise ServiceError(f"Method failed: {str(e)}")

๐Ÿ“Š Modality Management System

Descriptive Naming Convention

Each analysis step creates new modalities with descriptive, traceable names:

geo_gse12345                    # Raw downloaded data
โ”œโ”€โ”€ geo_gse12345_quality_assessed    # With QC metrics
โ”œโ”€โ”€ geo_gse12345_filtered_normalized # Preprocessed data
โ”œโ”€โ”€ geo_gse12345_doublets_detected   # With doublet annotations
โ”œโ”€โ”€ geo_gse12345_clustered          # With clustering results
โ”œโ”€โ”€ geo_gse12345_markers           # With marker genes
โ””โ”€โ”€ geo_gse12345_annotated        # With cell type annotations
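The suffix convention shown in the tree can be captured in a tiny helper. The step-to-suffix mapping follows the tree above; the function itself is a hypothetical illustration, not part of DataManagerV2.

```python
# Illustrative helper for the suffix-based naming convention; suffixes
# are taken from the tree above, the function is a hypothetical sketch.
STEP_SUFFIXES = {
    "quality": "quality_assessed",
    "preprocess": "filtered_normalized",
    "doublets": "doublets_detected",
    "cluster": "clustered",
    "markers": "markers",
    "annotate": "annotated",
}

def derive_modality_name(base: str, step: str) -> str:
    """Build a traceable modality name such as 'geo_gse12345_clustered'."""
    if step not in STEP_SUFFIXES:
        raise ValueError(f"Unknown analysis step: {step!r}")
    return f"{base}_{STEP_SUFFIXES[step]}"
```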

Professional Modality Tracking

  • Provenance: Complete analysis history with parameters
  • Statistics: Comprehensive metrics for each processing step
  • Validation: Schema enforcement and quality checks
  • Storage: Automatic saving with professional file naming

๐Ÿ”ฌ Analysis Workflow Excellence

Standard Single-cell RNA-seq Pipeline

1. check_data_status() โ†’ Review available modalities
2. assess_data_quality(modality_name) โ†’ Professional QC assessment
3. filter_and_normalize_modality(...) โ†’ Clean and normalize
4. detect_doublets_in_modality(...) โ†’ Remove doublets
5. cluster_modality(...) โ†’ Leiden clustering + UMAP
6. find_marker_genes_for_clusters(...) โ†’ Differential expression
7. annotate_cell_types(...) โ†’ Automated annotation
8. create_analysis_summary() โ†’ Comprehensive report

Quality Control Standards

  • Professional QC Thresholds: Evidence-based filtering parameters
  • Multi-metric Assessment: Total counts, gene counts, mitochondrial%, ribosomal%
  • Statistical Validation: Z-score outlier detection and percentile thresholds
  • Batch Effect Handling: Automatic batch detection and correction options
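The z-score outlier detection mentioned above can be sketched in a few lines. The |z| > 3 cutoff is a common default and an assumption here, not a confirmed Lobster threshold.

```python
# Sketch of z-score outlier flagging for a QC metric (e.g. total counts).
# The |z| > 3 threshold is a common default, assumed for illustration.
import statistics

def flag_outliers(values, z_thresh: float = 3.0):
    """Return a boolean mask marking cells whose metric deviates more
    than z_thresh standard deviations from the mean."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return [False] * len(values)
    return [abs((v - mean) / sd) > z_thresh for v in values]
```

In practice the mask would be combined with percentile cutoffs and mitochondrial-fraction limits before filtering.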

Error Handling & Recovery

  • Input Validation: Comprehensive parameter and data validation
  • Graceful Degradation: Fallback methods when specialized tools unavailable
  • Informative Messages: Clear error descriptions with suggested solutions
  • Operation Logging: Complete audit trail for debugging and reproducibility

๐Ÿš€ Benefits of New Architecture

Code Quality Improvements

  • 50% Reduction in agent code complexity (450+ โ†’ 200+ lines)
  • Zero Duplication: No more dual code paths or is_v2 checks
  • Professional Standards: Type hints, comprehensive docstrings, error handling
  • Testability: Stateless services are easily unit tested

Maintainability Enhancements

  • Single Responsibility: Each service handles one analysis domain
  • Modular Design: Services can be used independently or combined
  • Clean Interfaces: Consistent patterns across all analysis tools
  • Version Control: Clear separation enables independent service updates

Performance & Reliability

  • Memory Efficiency: Stateless services with minimal memory footprint
  • Fault Tolerance: Comprehensive error handling prevents pipeline failures
  • Reproducibility: Complete parameter logging and provenance tracking
  • Scalability: Services can be distributed or parallelized in future versions

Migration Impact Analysis

๐Ÿ“ˆ Before Migration (Legacy System)

transcriptomics_expert.py: 450+ lines
โ”œโ”€โ”€ Dual code paths (is_v2 checks everywhere)
โ”œโ”€โ”€ Mixed responsibilities (orchestration + analysis)
โ”œโ”€โ”€ Redundant implementations
โ”œโ”€โ”€ Complex error handling
โ””โ”€โ”€ Maintenance overhead

๐ŸŽ‰ After Migration (Modular System)

transcriptomics_expert.py: 280 lines (clean)
โ”œโ”€โ”€ Single DataManagerV2 path
โ”œโ”€โ”€ Professional tool orchestration only
โ”œโ”€โ”€ Stateless service delegation
โ”œโ”€โ”€ Comprehensive error handling
โ””โ”€โ”€ Minimal maintenance overhead

Analysis Services: 4 refactored services
โ”œโ”€โ”€ PreprocessingService: AnnData โ†’ (filtered_adata, stats)
โ”œโ”€โ”€ QualityService: AnnData โ†’ (qc_adata, assessment)
โ”œโ”€โ”€ ClusteringService: AnnData โ†’ (clustered_adata, results)
โ””โ”€โ”€ EnhancedSingleCellService: AnnData โ†’ (annotated_adata, metrics)

๐Ÿ”ง Technical Architecture Benefits

Service Layer Advantages

  • Reusability: Services can be used by multiple agents
  • Testability: Each service can be independently tested
  • Flexibility: Easy to add new analysis methods
  • Performance: Optimized algorithms with professional implementations

Agent Layer Improvements

  • Orchestration Focus: Agents handle modality management and user interaction
  • Clean Tool Interface: Consistent ~20-30 line tool implementations
  • Professional Responses: Formatted outputs with comprehensive statistics
  • Error Management: Hierarchical error handling with specific exceptions

DataManagerV2 Integration

  • Modality-Centric: All data operations centered around named modalities
  • Provenance Tracking: Complete analysis history with tool usage logging
  • Schema Validation: Automatic validation ensures data integrity
  • Storage Management: Professional file naming and workspace organization

This architecture provides a solid foundation for professional bioinformatics analysis with excellent maintainability, extensibility, and reliability.

๐Ÿงฌ Agent-Guided Formula Construction Integration

Enhanced Bulk RNA-seq Expert Agent Tools

The bulk_rnaseq_expert agent includes 5 new tools for conversational formula construction:

graph LR
    subgraph "Agent Layer"
        BRE[Transcriptomics Expert<br/>๐Ÿค– Enhanced with Formula Tools]
    end

    subgraph "New Agent Tools (5)"
        T1[suggest_formula_for_design<br/>๐Ÿ“Š Metadata Analysis]
        T2[construct_de_formula_interactive<br/>๐Ÿ”ง Formula Building]
        T3[run_differential_expression_with_formula<br/>๐Ÿงฌ pyDESeq2 Execution]
        T4[iterate_de_analysis<br/>๐Ÿ”„ Iterative Workflows]
        T5[compare_de_iterations<br/>๐Ÿ“ˆ Result Comparison]
    end

    subgraph "Enhanced Services"
        FORMULA[DifferentialFormulaService<br/>๐Ÿ“Š 3 New Methods Added]
        WFLOW[WorkflowTracker<br/>๐Ÿ”„ New Iteration Management]
        BULK[BulkRNASeqService<br/>๐Ÿ“ˆ pyDESeq2 Integration]
    end

    BRE --> T1
    BRE --> T2
    BRE --> T3
    BRE --> T4
    BRE --> T5

    T1 --> FORMULA
    T2 --> FORMULA
    T3 --> BULK
    T4 --> WFLOW
    T5 --> WFLOW

    classDef agent fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef tool fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef service fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px

    class BRE agent
    class T1,T2,T3,T4,T5 tool
    class FORMULA,WFLOW,BULK service

Service Enhancement Details

  • DifferentialFormulaService: Added suggest_formulas(), preview_design_matrix(), estimate_statistical_power()
  • WorkflowTracker: New lightweight class for DE iteration tracking and comparison
  • Integration: All data stored in AnnData.uns for seamless workflow integration
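One plausible shape for suggest_formulas() is a heuristic that proposes the condition of interest alone, then with each low-cardinality covariate as an adjustment. The heuristic below is an assumption based on the description above, not the actual DifferentialFormulaService code.

```python
# Hedged sketch of metadata-driven formula suggestion; the heuristic
# and signature are assumptions, not the real service implementation.
def suggest_formulas(metadata: dict, condition: str) -> list:
    """Propose DESeq2-style design formulas: the condition of interest
    last, with low-cardinality covariates offered as adjustments."""
    covariates = [
        col for col, vals in metadata.items()
        # Keep columns that vary but are not unique per sample
        # (unique-per-sample columns like IDs cannot be modeled).
        if col != condition and 1 < len(set(vals)) < len(vals)
    ]
    formulas = [f"~ {condition}"]
    for cov in covariates:
        formulas.append(f"~ {cov} + {condition}")
    return formulas
```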

Workflow Coverage Impact

  • โœ… Step 8: Formula Construction โ†’ Agent-guided conversation
  • โœ… Step 12: Iterative Workflows โ†’ Natural iteration and comparison
  • ๐ŸŽฏ Result: 92% workflow coverage (11/12 steps complete)

๐Ÿ”„ Workspace Restoration System (New in v0.2)

Seamless Session Continuity

Lobster AI now features intelligent workspace restoration that automatically detects and restores previous analysis sessions:

Key Features

  • Automatic Detection: Scans .lobster_workspace/data/ for available datasets on startup
  • Session Persistence: Maintains .session.json with active modalities and usage history
  • Lazy Loading: Load specific datasets on-demand with load_dataset()
  • Pattern-Based Restoration: Support for recent/all/glob patterns via /restore
  • Memory Management: Enforced memory limits prevent out-of-memory issues

New CLI Commands

  • /restore [pattern] - Restore datasets from previous sessions
  • /workspace list - View available datasets without loading
  • /workspace load <name> - Load specific dataset by name
  • Autocomplete Support: Tab completion for dataset names and patterns

Implementation Highlights

  • DataManagerV2 Enhanced: Added _scan_workspace(), load_dataset(), restore_session()
  • Session Tracking: Automatic .session.json updates on modality changes
  • H5PY Integration: Efficient metadata extraction without full dataset loading
  • Professional UX: Startup prompt shows workspace status with helpful commands

This transformation enables users to seamlessly continue their work across sessions without manual dataset reloading.
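The .session.json persistence described above can be sketched minimally. The file name matches the text; the payload fields and helper names are assumptions for illustration.

```python
# Illustrative sketch of .session.json persistence for workspace
# restoration; fields and helper names are assumptions.
import json
import time
from pathlib import Path

SESSION_FILE = ".session.json"

def save_session(workspace: Path, active_modalities: list) -> None:
    """Record the active modalities so the next session can offer /restore."""
    payload = {
        "active_modalities": active_modalities,
        "updated_at": time.strftime("%Y-%m-%dT%H:%M:%S"),
    }
    (workspace / SESSION_FILE).write_text(json.dumps(payload, indent=2))

def load_session(workspace: Path) -> list:
    """Return the previously active modalities, or [] for a fresh workspace."""
    path = workspace / SESSION_FILE
    if not path.exists():
        return []
    return json.loads(path.read_text()).get("active_modalities", [])
```

On startup, a non-empty result would trigger the workspace prompt; datasets themselves stay on disk until explicitly loaded.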

๐Ÿ› ๏ธ System Utilities Centralization

Performance Optimization

The system now features centralized platform utilities that eliminate redundant OS detection and provide unified cross-platform operations:

Before โ†’ After Transformation

  • Platform Detection: 5 ร— platform.system() calls โ†’ 1 ร— (at import time)
  • Code Reduction: ~50 lines of duplicate subprocess logic โ†’ 5 lines at call sites
  • Performance: 80% improvement in system operation speed
  • Architecture: Clean lobster/utils/system.py module with open_file(), open_folder(), open_path() functions
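A module in the spirit of lobster/utils/system.py might look like the sketch below. The one-time platform check matches the description; the command tables and function names are illustrative assumptions.

```python
# Sketch of centralized cross-platform open helpers; the import-time
# platform check matches the text, commands are illustrative assumptions.
import platform
import subprocess

# Detect the OS once at import time instead of on every call.
_SYSTEM = platform.system()

_OPENERS = {
    "Darwin": ["open"],
    "Windows": ["cmd", "/c", "start", ""],
    "Linux": ["xdg-open"],
}

def build_open_command(path: str, system: str = None) -> list:
    """Return the platform-appropriate command to open a file or folder."""
    opener = _OPENERS.get(system or _SYSTEM)
    if opener is None:
        raise RuntimeError(f"Unsupported platform: {system or _SYSTEM}")
    return [*opener, path]

def open_path(path: str) -> None:
    """Open a file or folder with the OS default handler (runs CLI-side)."""
    subprocess.run(build_open_command(path), check=False)
```

Separating command construction from execution keeps the dispatch logic testable without actually spawning a viewer.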

Cloud-Agnostic Design

All file opening operations run on the CLI side regardless of cloud vs local mode, ensuring consistent behavior across deployment types.

Integration Points

  • CLI Commands: open <file>, /open <file>, /plot, /plot <ID>
  • GPU Detection: Apple Silicon detection in gpu_detector.py
  • Future Extensions: Natural extension point for additional system utilities

๐ŸŽ›๏ธ Supervisor Configuration System (v0.2+)

Dynamic Agent Discovery & Configuration

The supervisor agent now features automatic agent discovery and configurable behavior, eliminating manual updates when adding new agents:

Architecture Overview

graph TB
    subgraph "Configuration Sources"
        ENV[Environment Variables<br/>SUPERVISOR_*]
        CODE[Code Configuration<br/>SupervisorConfig()]
        DEFAULT[Default Settings<br/>Backward Compatible]
    end

    subgraph "Discovery System"
        REGISTRY[Agent Registry<br/>All Registered Agents]
        CAPEXT[Capability Extractor<br/>@tool Discovery]
        ACTIVE[Active Agents<br/>Successfully Loaded]
    end

    subgraph "Prompt Builder"
        SECTIONS[Modular Sections<br/>Role, Agents, Rules]
        CONTEXT[Dynamic Context<br/>Data & Workspace]
        OPTIMIZE[Size Optimization<br/>Mode-Based]
    end

    ENV --> CONFIG[SupervisorConfig]
    CODE --> CONFIG
    DEFAULT --> CONFIG

    REGISTRY --> DISCOVER[Agent Discovery]
    CAPEXT --> DISCOVER
    ACTIVE --> DISCOVER

    CONFIG --> BUILD[create_supervisor_prompt()]
    DISCOVER --> BUILD
    BUILD --> SECTIONS
    BUILD --> CONTEXT
    BUILD --> OPTIMIZE

    OPTIMIZE --> PROMPT[Dynamic Prompt<br/>8K-11K chars]

    classDef config fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
    classDef discover fill:#e8f5e8,stroke:#2e7d32,stroke-width:2px
    classDef build fill:#fff3e0,stroke:#f57c00,stroke-width:2px

    class ENV,CODE,DEFAULT,CONFIG config
    class REGISTRY,CAPEXT,ACTIVE,DISCOVER discover
    class BUILD,SECTIONS,CONTEXT,OPTIMIZE,PROMPT build

Key Improvements

| Feature | Before (Static) | After (Dynamic) | Impact |
|---|---|---|---|
| Agent Discovery | Manual updates in supervisor.py | Automatic from registry | Zero maintenance |
| Missing Agents | 3 agents not included | All 8 agents included | Complete coverage |
| Configuration | Hardcoded behavior | 20+ env variables | Full flexibility |
| Prompt Size | Fixed ~9.5K chars | 8K-11K adaptive | 15% smaller in production |
| Adding Agents | Update 3+ files | Update registry only | 66% less work |

Operation Modes

# Research Mode - Interactive exploration
SUPERVISOR_ASK_QUESTIONS=true
SUPERVISOR_WORKFLOW_GUIDANCE=detailed
# Result: 11K char prompt with full guidance

# Production Mode - Automated pipelines
SUPERVISOR_ASK_QUESTIONS=false
SUPERVISOR_WORKFLOW_GUIDANCE=minimal
# Result: 8K char prompt, 1.4K chars saved

# Development Mode - Debugging
SUPERVISOR_VERBOSE=true
SUPERVISOR_INCLUDE_SYSTEM=true
# Result: Detailed explanations with system info
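Reading these SUPERVISOR_* variables into a config object could look like the sketch below. The variable names follow the convention above; field names and defaults are assumptions chosen to mirror the "backward compatible" default behavior.

```python
# Hedged sketch of env-driven supervisor configuration; SUPERVISOR_*
# names follow the text, fields and defaults are assumptions.
import os
from dataclasses import dataclass

@dataclass
class SupervisorConfig:
    ask_questions: bool = True
    workflow_guidance: str = "detailed"   # "detailed" | "minimal"
    verbose: bool = False

    @classmethod
    def from_env(cls, env=os.environ) -> "SupervisorConfig":
        """Build the config from SUPERVISOR_* variables, falling back to
        defaults so existing deployments keep their previous behavior."""
        def flag(name, default):
            return env.get(name, str(default)).strip().lower() in ("true", "1", "yes")
        return cls(
            ask_questions=flag("SUPERVISOR_ASK_QUESTIONS", cls.ask_questions),
            workflow_guidance=env.get("SUPERVISOR_WORKFLOW_GUIDANCE",
                                      cls.workflow_guidance),
            verbose=flag("SUPERVISOR_VERBOSE", cls.verbose),
        )
```

The prompt builder would then branch on these fields to decide which sections and how much guidance to emit.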

Implementation Benefits

  • ๐Ÿš€ Zero Maintenance: Add agents to registry only, supervisor auto-discovers
  • โš™๏ธ Flexible Behavior: Configure interaction style per environment
  • ๐Ÿ“Š Context Aware: Includes current data/workspace state dynamically
  • ๐ŸŽฏ Mode Optimized: Different prompt sizes for different use cases
  • โ™ป๏ธ Backward Compatible: Default config matches previous behavior exactly
โš ๏ธ **GitHub.com Fallback** โš ๏ธ