37 publication intelligence deep dive - the-omics-os/lobster-local GitHub Wiki

Publication Content Access & Provider Architecture

Version: 2.4.0+ (Phase 1-6 Refactoring Complete) Status: Production-ready Implementation: ContentAccessService with Provider Infrastructure (January 2025)

Overview

The ContentAccessService provides intelligent publication and dataset access through a capability-based provider architecture. This system replaced the legacy PublicationService and UnifiedContentService, delivering modular provider infrastructure, three-tier content cascade, and comprehensive literature mining capabilities.

What Changed?

Before (UnifiedContentService - Phase 3, Archived):

❌ Direct provider delegation without capability routing
❌ Manual provider selection logic in service code
❌ Limited to 3 providers (Abstract, PMC, Webpage)
❌ No dataset discovery capabilities
❌ No validation or metadata extraction tools

After (ContentAccessService - Phase 2+, Current):

✅ Provider Registry: Capability-based routing with priority system
✅ 5 Specialized Providers: Abstract, PubMed, GEO, PMC, Webpage (with Docling)
✅ 10 Core Methods: Discovery (3), Metadata (2), Content (3), System (1), Validation (1)
✅ Three-Tier Cascade: PMC XML → Webpage → PDF with automatic fallback
✅ Dataset Integration: GEO/SRA/PRIDE dataset discovery and validation
✅ Session Caching: DataManager-first with W3C-PROV provenance

Performance Impact

Metric	UnifiedContentService	ContentAccessService	Improvement
Abstract Retrieval	200-500ms (AbstractProvider)	200-500ms (AbstractProvider)	Same (optimized path)
PMC Full-Text	500ms-2s (PMCProvider)	500ms-2s (PMCProvider priority)	Same (10x faster than HTML)
Dataset Discovery	N/A (not available)	2-5s (GEOProvider)	New capability
Literature Search	N/A (not available)	1-3s (PubMedProvider)	New capability
Provider Selection	Manual logic	Automatic routing	Better maintainability
Extensibility	Hard-coded providers	Registry-based	Easy to add providers

Architecture

Capability-Based Provider System

┌─────────────────────────────────────────────────────────────┐
│                    ContentAccessService                     │
│                   (Coordination Layer)                      │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  10 Core Methods:                                           │
│  ┌───────────────────────────────────────────────────┐     │
│  │ Discovery (3):                                    │     │
│  │  - search_literature                              │     │
│  │  - discover_datasets                              │     │
│  │  - find_linked_datasets                           │     │
│  │                                                    │     │
│  │ Metadata (2):                                     │     │
│  │  - extract_metadata                               │     │
│  │  - validate_metadata                              │     │
│  │                                                    │     │
│  │ Content (3):                                      │     │
│  │  - get_abstract                                   │     │
│  │  - get_full_content                               │     │
│  │  - extract_methods                                │     │
│  │                                                    │     │
│  │ System (1):                                       │     │
│  │  - query_capabilities                             │     │
│  └───────────────────────────────────────────────────┘     │
│                         ↓                                   │
│                  ProviderRegistry                           │
│              (Capability-Based Routing)                     │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                    Provider Layer                           │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  Provider 1: AbstractProvider (Priority: 10)               │
│  └─ Capability: GET_ABSTRACT                               │
│     Performance: 200-500ms                                  │
│                                                             │
│  Provider 2: PubMedProvider (Priority: 10)                 │
│  └─ Capabilities: SEARCH_LITERATURE, FIND_LINKED_DATASETS, │
│                   EXTRACT_METADATA                          │
│     Performance: 1-3s                                       │
│                                                             │
│  Provider 3: GEOProvider (Priority: 10)                    │
│  └─ Capabilities: DISCOVER_DATASETS, EXTRACT_METADATA,     │
│                   VALIDATE_METADATA                         │
│     Performance: 2-5s                                       │
│                                                             │
│  Provider 4: PMCProvider (Priority: 10)                    │
│  └─ Capability: GET_FULL_CONTENT (PMC XML API)            │
│     Performance: 500ms-2s (PRIORITY PATH)                  │
│                                                             │
│  Provider 5: WebpageProvider (Priority: 50)                │
│  └─ Capabilities: GET_FULL_CONTENT (Webpage + PDF)        │
│     Performance: 2-8s (FALLBACK)                           │
│     Uses: DoclingService (internal composition)            │
│                                                             │
└─────────────────────────────────────────────────────────────┘
                         ↓
┌─────────────────────────────────────────────────────────────┐
│                   DataManagerV2                             │
│              (Session Caching + Provenance)                 │
└─────────────────────────────────────────────────────────────┘

System Design

User → research_agent (10 tools)
           ↓
    ContentAccessService (10 methods)
           ↓
    ProviderRegistry (capability routing)
           ↓
    ┌──────┴───────────────────┐
    ↓         ↓         ↓       ↓         ↓
Abstract  PubMed    GEO     PMC    Webpage
Provider  Provider  Provider Provider Provider
    ↓         ↓         ↓       ↓         ↓
 NCBI     PubMed   GEO API  PMC XML  Docling
E-utils    API              API      Service
                                        ↓
                                  (Webpage + PDF)

Key Components

1. ContentAccessService (Coordination Layer)

Location: lobster/tools/content_access_service.py

Responsibilities:

Method routing to appropriate providers via ProviderRegistry
Capability-based provider selection
DataManager-first caching coordination
Error handling and fallback orchestration
W3C-PROV provenance tracking
Lightweight IR (Intermediate Representation) for non-exportable research operations

Public API (10 Methods):

Discovery (3 methods):

def search_literature(
    self,
    query: str,
    max_results: int = 5,
    sources: Optional[list[str]] = None,
    filters: Optional[dict[str, any]] = None
) -> Tuple[str, Dict[str, Any], AnalysisStep]:
    """Search PubMed, bioRxiv, medRxiv for literature."""

def discover_datasets(
    self,
    query: str,
    dataset_type: "DatasetType",
    max_results: int = 5,
    filters: Optional[dict[str, str]] = None
) -> Tuple[str, Dict[str, Any], AnalysisStep]:
    """Search GEO, SRA, PRIDE for omics datasets."""

def find_linked_datasets(
    self,
    identifier: str,
    dataset_types: Optional[list["DatasetType"]] = None,
    include_related: bool = True
) -> str:
    """Find datasets linked to a publication."""

Metadata (2 methods):

def extract_metadata(
    self,
    identifier: str,
    source: Optional[str] = None
) -> Union["PublicationMetadata", str]:
    """Extract publication/dataset metadata."""

def validate_metadata(
    self,
    dataset_id: str,
    required_fields: Optional[List[str]] = None,
    required_values: Optional[Dict[str, List[str]]] = None,
    threshold: float = 0.8
) -> str:
    """Validate dataset metadata completeness."""

Content (3 methods):

def get_abstract(
    self,
    identifier: str,
    force_refresh: bool = False
) -> dict[str, any]:
    """Tier 1: Fast abstract retrieval (200-500ms)."""

def get_full_content(
    self,
    source: str,
    prefer_webpage: bool = True,
    keywords: Optional[list[str]] = None,
    max_paragraphs: int = 100,
    max_retries: int = 2
) -> dict[str, any]:
    """Tier 2: Full content with PMC-first cascade."""

def extract_methods(
    self,
    content_result: dict[str, any],
    llm: Optional[any] = None,
    include_tables: bool = True
) -> dict[str, any]:
    """Extract structured methods from content."""

System (1 method):

def query_capabilities(self) -> str:
    """Query available providers and capabilities."""

2. ProviderRegistry (Routing Layer)

Location: lobster/tools/providers/provider_registry.py

Responsibilities:

Provider registration and lifecycle management
Capability-based routing to best-fit provider
Priority-based provider ordering
Dataset type mapping to providers
Capability matrix generation for debugging

Key Methods:

def register_provider(self, provider: BaseProvider) -> None:
    """Register a provider with its capabilities."""

def get_providers_for_capability(
    self,
    capability: ProviderCapability
) -> List[BaseProvider]:
    """Get all providers supporting a capability (sorted by priority)."""

def get_provider_for_dataset_type(
    self,
    dataset_type: DatasetType
) -> Optional[BaseProvider]:
    """Get provider for specific dataset type."""

def get_capability_matrix(self) -> str:
    """Generate debug matrix of providers and capabilities."""

3. Provider Layer (Specialized Data Access)

Provider Architecture:

# Base provider interface
class BaseProvider(ABC):
    name: str
    priority: int  # Lower = higher priority (10 = high, 50 = low)
    capabilities: Set[ProviderCapability]
    supported_dataset_types: Set[DatasetType]

    @abstractmethod
    def search_publications(
        self,
        query: str,
        max_results: int = 5,
        filters: Optional[dict] = None
    ) -> str:
        """Search for publications/datasets."""

5 Registered Providers:

Provider	Priority	Capabilities	Performance	Coverage
AbstractProvider	10 (high)	GET_ABSTRACT	200-500ms	All PubMed
PubMedProvider	10 (high)	SEARCH_LITERATURE, FIND_LINKED_DATASETS, EXTRACT_METADATA	1-3s	All PubMed indexed
GEOProvider	10 (high)	DISCOVER_DATASETS, EXTRACT_METADATA, VALIDATE_METADATA	2-5s	All GEO/SRA datasets
PMCProvider	10 (high)	GET_FULL_CONTENT	500ms-2s	30-40% (NIH-funded + open access)
WebpageProvider	50 (low)	GET_FULL_CONTENT	2-8s	Major publishers + PDFs

Provider Details:

AbstractProvider (Fast Path):

# Location: lobster/tools/providers/abstract_provider.py
class AbstractProvider(BaseProvider):
    """Fast abstract retrieval via NCBI E-utilities."""

    capabilities = {ProviderCapability.GET_ABSTRACT}
    priority = 10  # High priority (fast)

    def get_abstract(self, identifier: str) -> PublicationMetadata:
        """Retrieve abstract metadata without full-text download."""

PubMedProvider (Literature & Linking):

# Location: lobster/tools/providers/pubmed_provider.py
class PubMedProvider(BaseProvider):
    """PubMed literature search and dataset linking."""

    capabilities = {
        ProviderCapability.SEARCH_LITERATURE,
        ProviderCapability.FIND_LINKED_DATASETS,
        ProviderCapability.EXTRACT_METADATA,
    }
    priority = 10

    def search_publications(self, query: str, **kwargs) -> str:
        """Search PubMed with E-utilities."""

    def find_datasets_from_publication(self, identifier: str) -> str:
        """Find GEO/SRA datasets linked via PubMed."""

GEOProvider (Dataset Discovery):

# Location: lobster/tools/providers/geo_provider.py
class GEOProvider(BaseProvider):
    """GEO dataset discovery and validation."""

    capabilities = {
        ProviderCapability.DISCOVER_DATASETS,
        ProviderCapability.EXTRACT_METADATA,
        ProviderCapability.VALIDATE_METADATA,
    }
    supported_dataset_types = {DatasetType.GEO}
    priority = 10

    def search_publications(self, query: str, **kwargs) -> str:
        """Search GEO datasets."""

    def search_by_accession(
        self,
        accession: str,
        include_parent_series: bool = False
    ) -> str:
        """Direct accession lookup with enhanced GSM handling."""

PMCProvider (Priority Full-Text):

# Location: lobster/tools/providers/pmc_provider.py
class PMCProvider(BaseProvider):
    """PMC full-text extraction via XML API (PRIORITY PATH)."""

    capabilities = {ProviderCapability.GET_FULL_CONTENT}
    priority = 10  # High priority (10x faster than webpage)

    def extract_full_text(self, identifier: str) -> PMCFullTextResult:
        """
        Extract full-text from PMC XML with semantic tags.

        Benefits:
        - 10x faster (500ms vs 2-5s HTML scraping)
        - 95% accuracy for methods extraction
        - 100% table parsing success
        - Structured sections with <sec sec-type=\"methods\">
        - 30-40% coverage (NIH-funded + open access)
        """

WebpageProvider (Fallback Path):

# Location: lobster/tools/providers/webpage_provider.py
class WebpageProvider(BaseProvider):
    """Webpage scraping and PDF extraction (FALLBACK)."""

    capabilities = {ProviderCapability.GET_FULL_CONTENT}
    priority = 50  # Low priority (slower fallback)

    def __init__(self, data_manager: DataManagerV2):
        self.docling_service = DoclingService(data_manager)  # Composition

    def extract_content(
        self,
        url: str,
        keywords: Optional[List[str]] = None,
        max_paragraphs: int = 100
    ) -> dict:
        """
        Extract content via webpage or PDF (uses DoclingService).

        Automatically detects format and routes to appropriate parser.
        """

DoclingService (Internal, Not Registered):

Used internally by WebpageProvider via composition
Not registered as separate provider
Handles both webpage HTML and PDF parsing
Structure-aware parsing with table extraction

Three-Tier Content Cascade

The system implements intelligent fallback for full-text retrieval:

Cascade Flow

User Request: get_full_content("PMID:35042229")
    ↓
Step 1: Check DataManager cache
    ├─ Cache hit? → Return immediately (<100ms)
    └─ Cache miss → Continue to Tier 1
    ↓
Tier 1: PMC XML API (Priority 10)
    ├─ Provider: PMCProvider
    ├─ Duration: 500ms-2s
    ├─ Coverage: 30-40% of biomedical literature
    ├─ Success? → Cache + Return ✅
    └─ PMCNotAvailableError → Continue to Tier 2
    ↓
Tier 2: Resolve to URL (if identifier)
    ├─ Use PublicationResolver
    ├─ PMID/DOI → Accessible URL
    ├─ Check accessibility
    └─ If paywalled → Return error with suggestions
    ↓
Tier 3: Webpage/PDF Extraction (Priority 50)
    ├─ Provider: WebpageProvider
    ├─ Auto-detect: Webpage HTML or PDF
    ├─ Duration: 2-8s
    ├─ Uses: DoclingService internally
    ├─ Success? → Cache + Return ✅
    └─ Failure → Return error

Performance Characteristics

Tier	Path	Duration	Success Rate	Coverage
Cache	DataManager lookup	<100ms	100% (if cached)	Previously accessed
Tier 1	PMC XML API	500ms-2s	95%	30-40% (open access)
Tier 2	URL Resolution	Variable	70-80%	Depends on accessibility
Tier 3	Webpage/PDF	2-8s	70%	Major publishers + preprints

Code Example

from lobster.tools.content_access_service import ContentAccessService

service = ContentAccessService(data_manager)

# Automatic three-tier cascade
content = service.get_full_content("PMID:35042229")

# Check which tier was used
print(f"Tier used: {content['tier_used']}")
# Possible values:
# - 'full_cached' (cache hit)
# - 'full_pmc_xml' (Tier 1: PMC)
# - 'full_webpage' (Tier 3: webpage HTML)
# - 'full_pdf' (Tier 3: PDF via Docling)

print(f"Source type: {content['source_type']}")
print(f"Extraction time: {content['extraction_time']:.2f}s")
print(f"Content length: {len(content['content'])} characters")

Method Categories & Usage

Discovery Methods (3)

search_literature()

Search PubMed, bioRxiv, medRxiv for publications.

Example:

results, stats, ir = service.search_literature(
    query="BRCA1 breast cancer",
    max_results=10,
    sources=["pubmed"],  # Optional: filter to specific sources
    filters={"publication_year": "2023"}  # Optional: date filters
)

print(f"Found {stats['results_count']} papers")
print(f"Provider: {stats['provider_used']}")  # PubMedProvider
print(f"Time: {stats['execution_time_ms']}ms")

discover_datasets()

Search for omics datasets with automatic accession detection.

Example:

# Direct accession (auto-detected)
results, stats, ir = service.discover_datasets(
    query="GSM6204600",  # GEO sample ID
    dataset_type=DatasetType.GEO
)

# Text search
results, stats, ir = service.discover_datasets(
    query="single-cell RNA-seq breast cancer",
    dataset_type=DatasetType.GEO,
    max_results=5
)

print(f"Found {stats['results_count']} datasets")
print(f"Accession detected: {stats.get('accession_detected', False)}")

find_linked_datasets()

Find datasets associated with a publication.

Example:

results = service.find_linked_datasets(
    identifier="PMID:35042229",
    dataset_types=[DatasetType.GEO, DatasetType.SRA]
)

print(results)  # Formatted list of linked datasets

Metadata Methods (2)

extract_metadata()

Extract publication or dataset metadata.

Example:

# Publication metadata
metadata = service.extract_metadata("PMID:35042229")

print(f"Title: {metadata.title}")
print(f"Authors: {metadata.authors}")
print(f"Abstract: {metadata.abstract[:200]}...")

# Dataset metadata
metadata = service.extract_metadata("GSE180759", source="geo")

validate_metadata()

Validate dataset metadata completeness before download.

Example:

report = service.validate_metadata(
    dataset_id="GSE180759",
    required_fields=["smoking_status", "treatment_response"],
    threshold=0.8  # 80% of samples must have fields
)

print(report)
# Formatted validation report with:
# - Completeness scores
# - Missing fields
# - Sample coverage
# - Recommendations (PROCEED/COHORT/SKIP)

Content Methods (3)

get_abstract()

Fast abstract retrieval (Tier 1: 200-500ms).

Example:

abstract = service.get_abstract("PMID:35042229")

print(f"Title: {abstract['title']}")
print(f"Authors: {abstract['authors']}")
print(f"Abstract: {abstract['abstract']}")
print(f"Keywords: {abstract['keywords']}")

get_full_content()

Full-text extraction with three-tier cascade.

Example:

# Automatic cascade: PMC → Webpage → PDF
content = service.get_full_content("PMID:35042229")

print(f"Tier used: {content['tier_used']}")
print(f"Methods section: {content.get('methods_text', 'N/A')[:200]}...")
print(f"Tables: {content['metadata']['tables']}")
print(f"Software detected: {content['metadata']['software']}")

extract_methods()

Extract structured methods from full content.

Example:

# Get full content first
content = service.get_full_content("PMID:35042229")

# Extract methods
methods = service.extract_methods(content, include_tables=True)

print(f"Software: {methods['software_used']}")
print(f"GitHub repos: {methods['github_repos']}")

System Methods (1)

query_capabilities()

Query available providers and their capabilities.

Example:

capabilities = service.query_capabilities()

print(capabilities)
# Returns formatted matrix showing:
# - Available operations
# - Registered providers
# - Supported dataset types
# - Performance tiers
# - Cascade logic

Integration with Research Agent

The research_agent uses ContentAccessService through 10 tools:

Tool Mapping

Agent Tool	ContentAccessService Method	Category
`search_literature`	`search_literature()`	Discovery
`fast_dataset_search`	`discover_datasets()`	Discovery
`find_related_entries`	`find_linked_datasets()`	Discovery
`get_dataset_metadata`	`extract_metadata()`	Metadata
`fast_abstract_search`	`get_abstract()`	Content
`read_full_publication`	`get_full_content()`	Content
`extract_methods`	`extract_methods()`	Content
`validate_dataset_metadata`	`validate_metadata()`	Metadata

Example Agent Workflow

# User: "Find breast cancer datasets with smoking status"

# Step 1: Literature search (PubMedProvider)
results, stats, ir = service.search_literature("breast cancer smoking")

# Step 2: Discover datasets (GEOProvider)
datasets, stats, ir = service.discover_datasets(
    "breast cancer",
    DatasetType.GEO,
    filters={"organism": "human"}
)

# Step 3: Validate metadata (GEOProvider)
report = service.validate_metadata(
    "GSE180759",
    required_fields=["smoking_status"]
)

# Step 4: Get full publication (PMC → Webpage → PDF cascade)
content = service.get_full_content("PMID:35042229")

# All operations tracked in W3C-PROV provenance

Performance Benchmarks

Benchmark Metadata:

Date Measured: 2025-01-15
Lobster Version: v0.2.0
Network: Residential broadband (100 Mbps)
Sample Size: 100 operations per provider
Test Conditions: Mixed cache hit/miss scenarios

Provider Performance

Provider	Operation	Mean Duration	P95	P99	Success Rate
AbstractProvider	`get_abstract()`	350ms	450ms	500ms	95%+
PubMedProvider	`search_literature()`	2.1s	3.5s	5s	99%+
GEOProvider	`discover_datasets()`	3.2s	4.8s	6s	95%+
PMCProvider	`get_full_content()`	1.2s	2s	2.5s	95% (of eligible)
WebpageProvider	`get_full_content()`	4.5s	7s	10s	70-80%

Note: Performance varies with network conditions and external API load. P95/P99 represent 95th and 99th percentile latencies.

Cascade Performance

Scenario	Tier Used	Duration	Frequency
Cache hit	Cache	<100ms	High (repeated access)
PMC available	Tier 1	500ms-2s	30-40% of requests
PMC unavailable	Tier 3	2-8s	60-70% of requests
Paywalled	Error	Variable	10-15% of requests

Optimization Strategies

DataManager-first caching - All operations check cache before API calls
Capability-based routing - Optimal provider selected automatically
Priority ordering - Fast providers tried first (Priority 10 before 50)
Graceful degradation - Automatic fallback on provider failures
Session persistence - Workspace caching for handoffs

DataManager-First Caching

All caching goes through DataManagerV2 (architectural requirement).

Cache Flow

Service Method Call
    ↓
1. Check DataManager cache
    ├─ Cache hit? → Return immediately
    └─ Cache miss → Continue
    ↓
2. Execute provider operation
    ├─ Success? → Store in DataManager + Return
    └─ Error? → Return error (no cache)
    ↓
3. DataManager stores:
    ├─ In-memory cache (session-scoped)
    ├─ Workspace filesystem (persistent)
    └─ W3C-PROV provenance log

Cache Methods

# ContentAccessService automatically caches all operations

# Cache publication content
data_manager.cache_publication_content(
    identifier="PMID:38448586",
    content=content_result,
    format="json"
)

# Retrieve cached content
cached = data_manager.get_cached_publication("PMID:38448586")

# Cache location
# ~/.lobster/literature_cache/{identifier}.json

Troubleshooting

Issue: "No providers available for capability"

Symptom:

ERROR: No available providers for literature search.

Cause: Provider not registered or capability not declared.

Solution:

# Check capability matrix
capabilities = service.query_capabilities()
print(capabilities)

# Verify provider registration
providers = service.registry.get_all_providers()
print(f"Registered providers: {len(providers)}")

Issue: PMC Full-Text Not Available

Symptom:

INFO: PMC full text not available for PMID:12345, falling back...

Cause: Paper not in PMC open access collection (70% of papers).

Expected: Automatic fallback to Tier 3 (Webpage/PDF).

Verification:

content = service.get_full_content("PMID:12345")
print(f"Tier used: {content['tier_used']}")  # Should be 'full_webpage' or 'full_pdf'

Issue: Dataset Validation Failed

Symptom:

WARNING: Dataset GSE12345 missing required metadata

Solution:

# Check validation report
report = service.validate_metadata(
    "GSE12345",
    required_fields=["condition", "sample_id"]
)
print(report)

# Review recommendations:
# - PROCEED: Full integration possible
# - COHORT: Cohort-level only
# - SKIP: Insufficient metadata

Best Practices

1. Use Capability-Based Routing

✅ GOOD: Let the registry route

# System automatically selects PubMedProvider
results, stats, ir = service.search_literature("BRCA1")

❌ BAD: Manual provider selection

# Don't access providers directly
provider = service.registry.get_provider_for_capability(...)

2. Leverage Three-Tier Cascade

✅ GOOD: Trust the cascade

# Automatically tries PMC → Webpage → PDF
content = service.get_full_content("PMID:35042229")

❌ BAD: Force specific tier

# Don't try to manually control cascade

3. Validate Before Download

✅ GOOD: Pre-download validation

# Check metadata first
report = service.validate_metadata("GSE180759", required_fields=["condition"])

if "PROCEED" in report:
    # Then download dataset
    pass

4. Check Capabilities

✅ GOOD: Query capabilities first

# Check what's available
capabilities = service.query_capabilities()
print(capabilities)

Version History

v0.2.0 (January 2025) - Phase 1-6 Complete:

✅ Phase 1: Provider infrastructure (5 providers)
✅ Phase 2: ContentAccessService consolidation (10 methods)
✅ Phase 3: metadata_assistant agent (4 tools)
✅ Phase 4: research_agent enhancements (10 tools)
✅ Phase 5: Multi-agent handoff patterns (3 workflows)
✅ Phase 6: Integration testing (127 tests, 3988 lines)
Added: ProviderRegistry with capability-based routing
Added: GEOProvider for dataset discovery
Added: Validation and metadata standardization
Enhanced: Three-tier cascade with PMC priority
Deprecated: UnifiedContentService (archived)
Deprecated: PublicationService (replaced)

v0.2.0 (January 2025) - Phase 3:

✅ UnifiedContentService (coordination layer)
✅ PMC-first access strategy
✅ DoclingService integration
✅ PublicationIntelligenceService deletion

v0.2.0 (November 2024):

Initial: PublicationIntelligenceService with Docling

References

ContentAccessService API: See 16-services-api.md
Provider Architecture: Source code in lobster/tools/providers/
Research Agent: See 15-agents-api.md
Metadata Assistant: Phase 3 documentation in code
Integration Tests: tests/integration/test_*_real_api.py (127 tests)

Next Steps:

Review 16-services-api.md for detailed API documentation
See 15-agents-api.md for Research Agent integration
Check 28-troubleshooting.md for common issues
Explore Phase 7 test suite for usage examples