workspace content service - the-omics-os/lobster-local GitHub Wiki
Workspace Content Service
Overview
The WorkspaceContentService provides structured, type-safe caching of research content (publications, datasets, metadata) in the DataManagerV2 workspace. Introduced in Lobster v0.2+, it replaces manual JSON file operations with a centralized service using Pydantic schemas for validation and enum-based type safety.
Key Benefits:
- Type Safety: Pydantic models validate all cached content
- Enum-Based Validation: ContentType and RetrievalLevel enums prevent string typos
- Automatic File Management: Professional naming conventions and directory organization
- Level-Based Retrieval: Flexible detail levels (summary/methods/samples/platform/full)
- Workspace Integration: Seamless integration with DataManagerV2 and research_agent tools
Two-Tier Architecture:
research_agent tools (write_to_workspace, get_content_from_workspace)
↓
WorkspaceContentService (validation, file I/O)
↓
DataManagerV2 workspace directory
↓
literature/ data/ metadata/ exports/ (JSON/CSV files)
Architecture
Content Types (Enum)
from lobster.tools.workspace_content_service import ContentType
class ContentType(str, Enum):
    PUBLICATION = "publication"              # Research papers (PubMed, PMC, bioRxiv)
    DATASET = "dataset"                      # GEO, SRA, PRIDE datasets
    METADATA = "metadata"                    # Sample mappings, validation results, QC reports
    EXPORTS = "exports"                      # Analysis results and data exports
    DOWNLOAD_QUEUE = "download_queue"        # Download queue entries (JSONL)
    PUBLICATION_QUEUE = "publication_queue"  # Publication queue entries (JSONL)
Workspace Directory Mapping:
- `ContentType.PUBLICATION` → `workspace/literature/*.json`
- `ContentType.DATASET` → `workspace/data/*.json`
- `ContentType.METADATA` → `workspace/metadata/*.json`
- `ContentType.EXPORTS` → `workspace/exports/*.*`
- `ContentType.DOWNLOAD_QUEUE` → `workspace/.lobster/queues/download_queue.jsonl`
- `ContentType.PUBLICATION_QUEUE` → `workspace/.lobster/queues/publication_queue.jsonl`
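The mapping above amounts to a simple lookup table. A standalone sketch (the names `CONTENT_TYPE_DIRS` and `content_dir` are illustrative assumptions, not the service's actual internals):

```python
from enum import Enum
from pathlib import Path

class ContentType(str, Enum):
    PUBLICATION = "publication"
    DATASET = "dataset"
    METADATA = "metadata"

# Hypothetical lookup table mirroring the mapping above
CONTENT_TYPE_DIRS = {
    ContentType.PUBLICATION: "literature",
    ContentType.DATASET: "data",
    ContentType.METADATA: "metadata",
}

def content_dir(workspace: Path, content_type: ContentType) -> Path:
    """Resolve the workspace subdirectory for a given content type."""
    return workspace / CONTENT_TYPE_DIRS[content_type]
```

Because `ContentType` mixes in `str`, each member also compares equal to its string value, which is what makes the string-to-enum mapping in the tool layer cheap.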
Retrieval Levels (Enum)
from lobster.tools.workspace_content_service import RetrievalLevel
class RetrievalLevel(str, Enum):
    SUMMARY = "summary"    # Key-value overview (title, authors, sample count)
    METHODS = "methods"    # Methods section (publications only)
    SAMPLES = "samples"    # Sample IDs and metadata (datasets only)
    PLATFORM = "platform"  # Platform/technology info (datasets only)
    FULL = "full"          # All available content
Level-Specific Fields:
| Content Type | Summary | Methods | Samples | Platform | Full |
|---|---|---|---|---|---|
| Publication | identifier, title, authors, journal, year, keywords | identifier, title, methods | N/A | N/A | All fields |
| Dataset | identifier, title, sample_count, organism | N/A | identifier, sample_count, samples | identifier, platform, platform_id | All fields |
| Metadata | identifier, content_type, description, related_datasets | N/A | N/A | N/A | All fields |
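The level-specific filtering in the table above can be pictured as a field whitelist keyed by (content type, level). This is a simplified illustration under that assumption, not the service's actual implementation:

```python
# Hypothetical per-(content type, level) whitelists mirroring the table above
LEVEL_FIELDS = {
    ("publication", "summary"): {"identifier", "title", "authors", "journal", "year", "keywords"},
    ("publication", "methods"): {"identifier", "title", "methods"},
    ("dataset", "summary"): {"identifier", "title", "sample_count", "organism"},
    ("dataset", "samples"): {"identifier", "sample_count", "samples"},
    ("dataset", "platform"): {"identifier", "platform", "platform_id"},
}

def filter_by_level(content: dict, content_type: str, level: str) -> dict:
    """Return only the fields visible at the requested retrieval level."""
    if level == "full":
        return content  # "full" passes everything through
    allowed = LEVEL_FIELDS[(content_type, level)]
    return {k: v for k, v in content.items() if k in allowed}
```

The payoff of level-based retrieval is context economy: an agent asking for a dataset summary never has to page through the full sample dictionary.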
Pydantic Content Schemas
PublicationContent
from lobster.tools.workspace_content_service import PublicationContent
pub = PublicationContent(
    identifier="PMID:35042229",
    title="Single-cell RNA-seq reveals...",
    authors=["Smith J", "Jones A"],
    journal="Nature",
    year=2022,
    abstract="We performed single-cell RNA-seq...",
    methods="Cells were processed using 10X Chromium...",
    full_text="...",  # Complete paper text
    keywords=["single-cell", "RNA-seq", "cancer"],
    source="PMC",  # PMC, PubMed, bioRxiv
    cached_at="2025-01-12T10:30:00",  # ISO 8601 timestamp
    url="https://pubmed.ncbi.nlm.nih.gov/35042229/"
)
Fields:
- `identifier` (required): PMID, DOI, or bioRxiv ID
- `title`, `authors`, `journal`, `year`: Bibliographic metadata
- `abstract`, `methods`, `full_text`: Content sections
- `keywords`: Publication keywords (MeSH terms, author keywords)
- `source` (required): Provider (PMC, PubMed, bioRxiv, medRxiv)
- `cached_at` (required): ISO 8601 timestamp
- `url`: Publication URL
DatasetContent
from lobster.tools.workspace_content_service import DatasetContent
dataset = DatasetContent(
    identifier="GSE123456",
    title="Single-cell RNA-seq of aging brain",
    platform="Illumina NovaSeq 6000",
    platform_id="GPL24676",
    organism="Homo sapiens",
    sample_count=12,
    samples={
        "GSM1": {"age": 25, "tissue": "brain"},
        "GSM2": {"age": 65, "tissue": "brain"}
    },
    experimental_design="Age comparison: young (n=6) vs old (n=6)",
    summary="Dataset comparing transcriptional changes...",
    pubmed_ids=["35042229"],
    source="GEO",  # GEO, SRA, PRIDE
    cached_at="2025-01-12T10:30:00",
    url="https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE123456"
)
Fields:
- `identifier` (required): GSE, SRA, PRIDE accession
- `title`, `summary`: Dataset descriptions
- `platform`, `platform_id`: Technology information
- `organism`: Species (e.g., Homo sapiens, Mus musculus)
- `sample_count` (required): Number of samples (≥0)
- `samples`: Dictionary mapping sample IDs to metadata
- `experimental_design`: Study design description
- `pubmed_ids`: Associated publications
- `source` (required): Repository (GEO, SRA, PRIDE)
- `cached_at` (required): ISO 8601 timestamp
- `url`: Dataset URL
MetadataContent
from lobster.tools.workspace_content_service import MetadataContent
metadata = MetadataContent(
    identifier="gse12345_to_gse67890_mapping",
    content_type="sample_mapping",
    description="Sample ID mapping between two datasets",
    data={
        "exact_matches": 10,
        "fuzzy_matches": 5,
        "unmapped": 2,
        "mapping_rate": 0.88
    },
    related_datasets=["GSE12345", "GSE67890"],
    source="SampleMappingService",
    cached_at="2025-01-12T10:30:00"
)
Fields:
- `identifier` (required): Unique metadata identifier
- `content_type` (required): Type descriptor (sample_mapping, validation, qc_report, etc.)
- `description`: Human-readable description
- `data` (required): Arbitrary JSON-serializable content
- `related_datasets`: Related dataset accessions
- `source` (required): Tool or service name
- `cached_at` (required): ISO 8601 timestamp
Service API
Initialization
from lobster.core.data_manager_v2 import DataManagerV2
from lobster.tools.workspace_content_service import WorkspaceContentService
data_manager = DataManagerV2(workspace_path="~/.lobster_workspace")
workspace_service = WorkspaceContentService(data_manager=data_manager)
Directory Structure Created:
workspace_path/
├── literature/ # Publications (PublicationContent)
├── data/ # Datasets (DatasetContent)
└── metadata/ # Metadata (MetadataContent)
Writing Content
from lobster.tools.workspace_content_service import (
    PublicationContent,
    ContentType,
    WorkspaceContentService
)
from datetime import datetime

# Create content model
pub_content = PublicationContent(
    identifier="PMID:35042229",
    title="Single-cell analysis of aging",
    authors=["Smith J", "Jones A"],
    journal="Nature",
    year=2022,
    abstract="Abstract text...",
    methods="Methods text...",
    source="PMC",
    cached_at=datetime.now().isoformat()
)

# Write to workspace
cache_path = workspace_service.write_content(
    content=pub_content,
    content_type=ContentType.PUBLICATION
)
# Returns: "/workspace/literature/pmid_35042229.json"
Naming Convention:
- Identifier sanitized: lowercase, special characters → underscores
  - `PMID:35042229` → `pmid_35042229.json`
  - `GSE123456` → `gse123456.json`
  - `DOI:10.1038/s41586-021-12345-6` → `doi_10_1038_s41586_021_12345_6.json`
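The examples above are consistent with a single rule: lowercase the identifier and collapse every run of non-alphanumeric characters into one underscore. A minimal sketch of that rule (`sanitize_identifier` is an illustrative name, not necessarily the service's helper):

```python
import re

def sanitize_identifier(identifier: str) -> str:
    """Lowercase and replace runs of non-alphanumeric characters with underscores."""
    return re.sub(r"[^a-z0-9]+", "_", identifier.lower()).strip("_")
```

The trailing `strip("_")` guards against identifiers that begin or end with punctuation producing filenames like `_gse123.json`.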
Reading Content
Basic Retrieval
from lobster.tools.workspace_content_service import ContentType, RetrievalLevel
# Read full content
full_content = workspace_service.read_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION,
    level=RetrievalLevel.FULL
)
# Returns: Dict with all fields

# Read summary only
summary = workspace_service.read_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION,
    level=RetrievalLevel.SUMMARY
)
# Returns: Dict with identifier, title, authors, journal, year, keywords

# Read methods section
methods = workspace_service.read_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION,
    level=RetrievalLevel.METHODS
)
# Returns: Dict with identifier, title, methods
Dataset Retrieval Examples
# Get dataset summary
summary = workspace_service.read_content(
    identifier="GSE123456",
    content_type=ContentType.DATASET,
    level=RetrievalLevel.SUMMARY
)
# Returns: identifier, title, sample_count, organism

# Get sample metadata
samples = workspace_service.read_content(
    identifier="GSE123456",
    content_type=ContentType.DATASET,
    level=RetrievalLevel.SAMPLES
)
# Returns: identifier, sample_count, samples, experimental_design

# Get platform information
platform = workspace_service.read_content(
    identifier="GSE123456",
    content_type=ContentType.DATASET,
    level=RetrievalLevel.PLATFORM
)
# Returns: identifier, platform, platform_id, organism
Listing Content
# List all cached content
all_content = workspace_service.list_content()
# Returns: List[Dict] with all publications, datasets, metadata

# List only publications
publications = workspace_service.list_content(
    content_type=ContentType.PUBLICATION
)
# Returns: List[Dict] with publication metadata

# List only datasets
datasets = workspace_service.list_content(
    content_type=ContentType.DATASET
)
# Returns: List[Dict] with dataset metadata
List Result Format:
[
    {
        "identifier": "PMID:35042229",
        "title": "Single-cell analysis...",
        "authors": ["Smith J", "Jones A"],
        "cached_at": "2025-01-12T10:30:00",
        "_content_type": "publication",  # Added by service
        "_file_path": "/workspace/literature/pmid_35042229.json"  # Added by service
    },
    # ... more items
]
Deleting Content
# Delete cached publication
deleted = workspace_service.delete_content(
    identifier="PMID:35042229",
    content_type=ContentType.PUBLICATION
)
# Returns: True if deleted, False if not found
Workspace Statistics
stats = workspace_service.get_workspace_stats()
# Returns:
# {
#     "total_items": 42,
#     "publications": 15,
#     "datasets": 20,
#     "metadata": 7,
#     "total_size_mb": 12.5,
#     "cache_dir": "/workspace/cache/content"
# }
Centralized Exports Directory (v1.0+)
As of version 1.0, all user-facing data exports (CSV, TSV, Excel) are written to a centralized exports directory for easy discovery.
Directory Structure:
workspace_path/
├── literature/ # Publications (PublicationContent)
├── data/ # Datasets (DatasetContent)
├── metadata/ # Metadata (MetadataContent)
└── exports/ # 🆕 User-facing CSV/TSV/Excel exports (v1.0+)
Why Centralized Exports?
- Single Location: Customers know exactly where to find exported files
- Easy Discovery: No hunting across multiple subdirectories
- Clean Organization: Separates cached JSON (metadata/) from final outputs (exports/)
- Predictable: All tools write to same location
Getting Exports Directory:
exports_dir = workspace_service.get_exports_directory(create=True)
# Returns: Path("workspace_path/exports")
Listing Export Files:
# List all exports
files = workspace_service.list_export_files()
# Returns: [
#     {
#         "name": "aggregated_samples.csv",
#         "path": Path("workspace_path/exports/aggregated_samples.csv"),
#         "size": 1024567,
#         "modified": "2025-01-12T14:30:00",
#         "category": "metadata"  # metadata, results, plots, custom
#     },
#     ...
# ]
# Filter by pattern
csv_files = workspace_service.list_export_files(pattern="*.csv")
# Filter by category
metadata_exports = workspace_service.list_export_files(category="metadata")
File Categorization: Files are automatically categorized based on naming conventions:
- `metadata_*` → "metadata" (sample tables, mappings)
- `results_*` → "results" (analysis outputs)
- `plot_*` → "plots" (visualizations)
- Other → "custom"
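The prefix conventions above can be sketched as a small classifier (the function name `categorize_export` is an assumption for illustration; the service's own categorizer may differ):

```python
def categorize_export(filename: str) -> str:
    """Bucket an export file by its name prefix, per the conventions above."""
    prefixes = [
        ("metadata_", "metadata"),
        ("results_", "results"),
        ("plot_", "plots"),
    ]
    for prefix, category in prefixes:
        if filename.startswith(prefix):
            return category
    return "custom"  # anything without a recognized prefix
```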
Usage in Custom Code:
# In execute_custom_code, OUTPUT_DIR variable is pre-configured
df.to_csv(OUTPUT_DIR / "my_results.csv") # Saves to workspace/exports/
Unified Metadata View:
The /metadata CLI command now shows exports alongside other sources:
sources = workspace_service.get_all_metadata_sources()
# Returns: {
#     "in_memory": [...],        # metadata_store entries
#     "workspace_files": [...],  # workspace/metadata/*.json
#     "exports": [...],          # workspace/exports/*.csv
#     "deprecated": [...]        # workspace/metadata/exports/*.csv (old location)
# }
Deprecation Warning:
The old workspace/metadata/exports/ location is deprecated. A warning is shown if files exist there:
⚠️ Found 3 files in deprecated location: workspace/metadata/exports/
New exports go to workspace/exports/. Consider migrating:
mv workspace/metadata/exports/* workspace/exports/
Integration with research_agent Tools
The research_agent provides two tools that use WorkspaceContentService under the hood:
write_to_workspace Tool
Purpose: Cache research content for persistent access and specialist handoff.
Usage Pattern:
# In research_agent tool
from lobster.tools.workspace_content_service import (
    ContentType,
    PublicationContent,
    WorkspaceContentService
)

@tool
def write_to_workspace(identifier: str, workspace: str, content_type: str = None) -> str:
    # 1. Initialize service
    workspace_service = WorkspaceContentService(data_manager=data_manager)

    # 2. Map workspace categories to ContentType enum
    workspace_to_content_type = {
        "literature": ContentType.PUBLICATION,
        "data": ContentType.DATASET,
        "metadata": ContentType.METADATA,
    }

    # 3. Validate workspace category
    if workspace not in workspace_to_content_type:
        return f"Error: Invalid workspace '{workspace}'"

    # 4. Retrieve content from data_manager
    if identifier in data_manager.metadata_store:
        content_data = data_manager.metadata_store[identifier]
    elif identifier in data_manager.list_modalities():
        adata = data_manager.get_modality(identifier)
        content_data = {...}  # Extract metadata
    else:
        return f"Error: Identifier '{identifier}' not found"

    # 5. Create Pydantic model
    content_model = PublicationContent(
        identifier=identifier,
        # ... populate fields
        cached_at=datetime.now().isoformat()
    )

    # 6. Write using service
    cache_path = workspace_service.write_content(
        content=content_model,
        content_type=workspace_to_content_type[workspace]
    )
    return f"Cached to {cache_path}"
Naming Conventions:
- Publications: `publication_PMID12345` or `publication_DOI...`
- Datasets: `dataset_GSE12345`
- Metadata: `metadata_GSE12345_samples`
Example:
# Cache publication after reading
> "I just read PMID:35042229. Please cache it for later."
→ write_to_workspace("publication_PMID35042229", workspace="literature", content_type="publication")
# Cache dataset metadata
> "Cache GSE123456 metadata for validation."
→ write_to_workspace("dataset_GSE123456", workspace="data", content_type="dataset")
get_content_from_workspace Tool
Purpose: Retrieve cached research content with flexible detail levels.
Unified Architecture (v2.6+)
As of version 2.6, get_content_from_workspace uses a unified adapter-based architecture that provides consistent behavior across all workspace types.
Key Improvements:
- Consistent API: All workspaces support the same operations (list, filter, retrieve)
- Unified Formatting: Status emojis, titles, and details formatted consistently
- Type Safety: Internal `WorkspaceItem` TypedDict ensures defensive field access
- Error Handling: No more `KeyError` crashes on missing fields
Architecture Diagram:
User Query → Dispatcher → Adapter → WorkspaceItem[] → Formatter → Markdown
↓ ↓ ↓ ↓
5 workspaces Normalize Unified Consistent
data types structure output
Adapters:
- `_adapt_general_content()` - literature, data, metadata workspaces
- `_adapt_download_queue()` - download queue entries
- `_adapt_publication_queue()` - publication queue entries
WorkspaceItem Structure:
class WorkspaceItem(TypedDict, total=False):
    identifier: str           # Primary ID
    workspace: str            # Category
    type: str                 # Item type
    status: Optional[str]     # For queues
    priority: Optional[int]   # For queues
    title: Optional[str]      # Display title
    cached_at: Optional[str]  # ISO timestamp
    details: Optional[str]    # Summary/metadata
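A minimal adapter in this style might normalize a raw cached dict into the unified item shape via `dict.get()`, which is what makes missing fields safe. This is a trimmed-down sketch using a subset of the `WorkspaceItem` fields shown above; the real adapters presumably handle queue-specific fields and richer detail strings:

```python
from typing import Optional, TypedDict

class WorkspaceItem(TypedDict, total=False):
    identifier: str
    workspace: str
    type: str
    title: Optional[str]
    cached_at: Optional[str]
    details: Optional[str]

def adapt_general_content(raw: dict, workspace: str) -> WorkspaceItem:
    """Normalize a raw cached dict; .get() keeps missing fields from raising KeyError."""
    return WorkspaceItem(
        identifier=raw.get("identifier", "unknown"),
        workspace=workspace,
        type=raw.get("_content_type", "unknown"),
        title=raw.get("title"),
        cached_at=raw.get("cached_at"),
        details=raw.get("summary"),
    )
```

Since `total=False`, every key is optional at the type level, so the formatter downstream must also read items defensively.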
Benefits:
- Agents can use same mental model for all workspaces
- No workspace-specific error handling needed
- Easy to add new workspace types (one adapter function)
- Backward compatible (same output format)
Usage Pattern (Simplified)
@tool
def get_content_from_workspace(
    identifier: str = None,
    workspace: str = None,
    level: str = "summary"
) -> str:
    # 1. Initialize service
    workspace_service = WorkspaceContentService(data_manager=data_manager)

    # 2. Map strings to enums
    workspace_to_content_type = {
        "literature": ContentType.PUBLICATION,
        "data": ContentType.DATASET,
        "metadata": ContentType.METADATA,
    }
    level_to_retrieval = {
        "summary": RetrievalLevel.SUMMARY,
        "methods": RetrievalLevel.METHODS,
        "samples": RetrievalLevel.SAMPLES,
        "platform": RetrievalLevel.PLATFORM,
        "metadata": RetrievalLevel.FULL,
    }

    # 3. List mode (no identifier)
    if identifier is None:
        content_type_filter = workspace_to_content_type[workspace] if workspace else None
        all_cached = workspace_service.list_content(content_type=content_type_filter)
        return format_list_response(all_cached)

    # 4. Retrieve mode (with identifier)
    retrieval_level = level_to_retrieval[level]

    # Try each content type if workspace not specified
    content_types_to_try = (
        [workspace_to_content_type[workspace]] if workspace
        else list(ContentType)
    )
    for content_type in content_types_to_try:
        try:
            cached_content = workspace_service.read_content(
                identifier=identifier,
                content_type=content_type,
                level=retrieval_level
            )
            return format_response(cached_content, level)
        except FileNotFoundError:
            continue

    return f"Error: Identifier '{identifier}' not found"
Examples:
# List all cached content
> "What content do I have cached?"
→ get_content_from_workspace()
# List publications only
> "Show me cached publications."
→ get_content_from_workspace(workspace="literature")
# Get publication methods section
> "Show methods from PMID:35042229."
→ get_content_from_workspace(
    identifier="publication_PMID35042229",
    workspace="literature",
    level="methods"
)

# Get dataset samples
> "Show sample IDs for GSE123456."
→ get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="samples"
)

# Get full metadata
> "Show full metadata for my sample mapping."
→ get_content_from_workspace(
    identifier="metadata_gse12345_to_gse67890_mapping",
    workspace="metadata",
    level="metadata"
)
Common Workflows
Workflow 1: Cache Publication for Later Analysis
# 1. Search literature
search_literature("BRCA1 breast cancer", max_results=5)
# 2. Read full publication
read_full_publication("PMID:35042229")
# → Content automatically cached in metadata_store
# 3. Cache to workspace
write_to_workspace(
    identifier="publication_PMID35042229",
    workspace="literature",
    content_type="publication"
)

# 4. Later: Retrieve methods section
get_content_from_workspace(
    identifier="publication_PMID35042229",
    workspace="literature",
    level="methods"
)
Workflow 2: Cache Dataset Before Handoff to Specialist
# 1. Discover dataset
find_related_entries("PMID:35042229", entry_type="dataset")
# → Found: GSE123456
# 2. Get dataset metadata
get_dataset_metadata("GSE123456")
# → Metadata stored in metadata_store
# 3. Cache to workspace before handoff
write_to_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    content_type="dataset"
)

# 4. Hand off to metadata_assistant
handoff_to_metadata_assistant(
    instructions="Validate GSE123456 for treatment_response field. "
                 "Dataset cached in data workspace."
)
Workflow 3: Multiple Detail Levels
# Start with summary
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="summary"
)
# → Returns: title, sample_count, organism

# Need more details? Get samples
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="samples"
)
# → Returns: sample IDs and metadata

# Need platform info?
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="platform"
)
# → Returns: platform, platform_id, organism

# Need everything?
get_content_from_workspace(
    identifier="dataset_GSE123456",
    workspace="data",
    level="metadata"
)
# → Returns: all fields
Best Practices
Naming Conventions
Follow Professional Naming:
- Lowercase identifiers
- Underscores for separators
- Descriptive prefixes
# ✅ Good
"publication_PMID35042229"
"dataset_GSE123456"
"metadata_gse12345_to_gse67890_mapping"
# ❌ Bad
"PMID:35042229" # Contains colon
"GSE 123456" # Contains space
"mapping-12345" # Ambiguous prefix
Content Validation
Always Use Pydantic Models:
# ✅ Good - Validation enforced
pub_content = PublicationContent(
    identifier="PMID:35042229",
    source="PMC",
    cached_at=datetime.now().isoformat()
)
workspace_service.write_content(pub_content, ContentType.PUBLICATION)
# ❌ Bad - No validation
raw_dict = {"identifier": "PMID:35042229"} # Missing required fields
# Will fail validation
Error Handling
Handle FileNotFoundError:
from lobster.tools.workspace_content_service import ContentType, RetrievalLevel
try:
    content = workspace_service.read_content(
        identifier="publication_PMID12345",
        content_type=ContentType.PUBLICATION,
        level=RetrievalLevel.SUMMARY
    )
except FileNotFoundError as e:
    logger.warning(f"Content not found: {e}")
    # List available content
    available = workspace_service.list_content(ContentType.PUBLICATION)
    logger.info(f"Available publications: {[c['identifier'] for c in available]}")
Level Selection
Choose Appropriate Detail Level:
| Use Case | Recommended Level | Why |
|---|---|---|
| Quick overview | `SUMMARY` | Fast, minimal data transfer |
| Replication protocol | `METHODS` | Focused on procedures |
| Sample alignment | `SAMPLES` | Just sample metadata |
| Platform validation | `PLATFORM` | Technology compatibility check |
| Full export | `FULL` | Complete content for archival |
Workspace Organization
Categorize Content by Type:
# Literature review project
workspace_service.write_content(pub1, ContentType.PUBLICATION) # → literature/
workspace_service.write_content(pub2, ContentType.PUBLICATION) # → literature/
# Dataset analysis project
workspace_service.write_content(dataset1, ContentType.DATASET) # → data/
workspace_service.write_content(dataset2, ContentType.DATASET) # → data/
# Metadata operations
workspace_service.write_content(mapping, ContentType.METADATA) # → metadata/
Backward Compatibility
Maintain Tool Signatures:
- Both tools (`write_to_workspace`, `get_content_from_workspace`) keep their original signatures
- String-based parameters at the tool level
- Enum conversion happens internally
- Same response formats as before the refactoring
Performance Considerations
Caching Strategy
When to Cache:
- ✅ After expensive operations (PDF parsing, full-text extraction)
- ✅ Before handing off to other agents (context preservation)
- ✅ When content will be reused (literature reviews, multi-step workflows)
When NOT to Cache:
- ❌ Temporary scratch data
- ❌ Duplicates of in-memory modalities
- ❌ Large binary files (use modalities storage instead)
File Size Management
Monitor Workspace Size:
stats = workspace_service.get_workspace_stats()
if stats["total_size_mb"] > 100:
    logger.warning("Workspace size exceeding 100MB")
    # Consider cleaning old cached content
Delete Old Content:
# Remove cached content no longer needed
workspace_service.delete_content(
    identifier="old_publication_PMID12345",
    content_type=ContentType.PUBLICATION
)
Troubleshooting
Common Issues
Issue: "File has been modified since read"
- Cause: Auto-formatter/linter running between Read and Edit
- Solution: Read larger context window (400+ lines) before editing
Issue: "Invalid workspace 'xyz'"
- Cause: Typo in workspace parameter
- Solution: Use a supported workspace category: `"literature"`, `"data"`, or `"metadata"`
Issue: "Invalid detail level 'abc'"
- Cause: Unsupported level string
- Solution: Use a valid level: `"summary"`, `"methods"`, `"samples"`, `"platform"`, or `"metadata"`
Issue: "ValidationError: Field required"
- Cause: Missing required Pydantic fields
- Solution: Check schema requirements (identifier, source, cached_at)
Debugging
Enable Debug Logging:
import logging
logging.basicConfig(level=logging.DEBUG)
# Service operations will log:
# - File paths created
# - Content validated
# - Errors encountered
Inspect Workspace Contents:
# Check cached files
ls -lh ~/.lobster_workspace/literature/
ls -lh ~/.lobster_workspace/data/
ls -lh ~/.lobster_workspace/metadata/
# View JSON content
cat ~/.lobster_workspace/literature/pmid_35042229.json | jq .
Migration from Manual JSON Handling
Before (Manual Implementation)
# Old approach - manual file operations
import json
from pathlib import Path
cache_dir = Path(workspace_path) / "literature"
cache_file = cache_dir / f"{identifier.lower()}.json"
# Write
with open(cache_file, "w") as f:
    json.dump({"identifier": identifier, ...}, f)

# Read
with open(cache_file, "r") as f:
    content = json.load(f)
# List
cached_files = list(cache_dir.glob("*.json"))
After (WorkspaceContentService)
# New approach - service-based
from lobster.tools.workspace_content_service import (
    WorkspaceContentService,
    PublicationContent,
    ContentType,
    RetrievalLevel
)
workspace_service = WorkspaceContentService(data_manager=data_manager)
# Write
pub_content = PublicationContent(identifier=identifier, ...)
workspace_service.write_content(pub_content, ContentType.PUBLICATION)
# Read
content = workspace_service.read_content(
    identifier, ContentType.PUBLICATION, RetrievalLevel.SUMMARY
)
# List
cached_list = workspace_service.list_content(ContentType.PUBLICATION)
Benefits:
- ✅ Pydantic validation (catch errors early)
- ✅ Enum type safety (no string typos)
- ✅ Automatic directory management
- ✅ Level-based filtering (no manual if/elif chains)
- ✅ Professional naming (automatic sanitization)
Version History
| Version | Changes |
|---|---|
| v0.2+ | Initial implementation with Pydantic schemas, enum-based validation, two-tier architecture |
| v1.0+ | Centralized `exports/` directory for user-facing CSV/TSV/Excel exports |
| v2.6+ | Unified adapter-based architecture for `get_content_from_workspace` |
Related Documentation
- Data Management (DataManagerV2) - Multi-modal data orchestration
- Services API Reference - Service design patterns
- Creating Services - Service development guidelines
- Agent API Reference - research_agent tool integration
See Also
- WorkspaceContentService Source: `lobster/tools/workspace_content_service.py` (714 lines)
- Pydantic Schemas: PublicationContent, DatasetContent, MetadataContent
- Integration: research_agent tools (write_to_workspace, get_content_from_workspace)
- Testing: `tests/integration/test_workspace_content_service.py`