30_LLD_Intelligent_Storage_Manager - asavschaeffer/globule GitHub Wiki
Version: 1.0
Date: 2025-07-12
Status: Draft for Review
This document provides the detailed low-level design for Globule's Storage Manager, the component responsible for persisting all data, generating semantic file paths, and enabling both traditional and AI-powered retrieval. The Storage Manager bridges the gap between conventional filesystem organization and modern semantic understanding, creating what we call a "semantic filesystem."
This LLD covers:
- SQLite database schema and optimization strategies
- Semantic path generation algorithms
- Cross-platform filesystem handling
- Transaction management and failure recovery
- Search implementation (hybrid semantic + keyword)
- Performance optimization and caching strategies
Key constraints carried over from the High Level Design document:
- Local-first architecture with single-user focus for MVP
- Support for 100-200 daily inputs (notes, photos, ideas)
- Semantic filesystem that's human-navigable
- Integration with Embedding and Parsing services
- Future scalability to multi-user scenarios
Decision: SQLite is selected as the MVP database.
Rationale:
- Zero configuration requirement aligns with local-first philosophy
- Sub-millisecond latency for single-user workload
- Single file portability for backup and sync
- Proven track record (Obsidian handles 10,000+ ops/second)
- Write-Ahead Logging (WAL) provides adequate concurrency
Future Migration Path:
- Data Access Layer (DAL) abstraction allows a later swap to PostgreSQL (see the interface sketch after this list)
- All SQL will use portable syntax where possible
- Connection pooling interface ready for client-server model
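A minimal sketch of that DAL boundary, assuming an async interface so a PostgreSQL driver could later be slotted in behind it (the StorageBackend name and method set are illustrative, not part of the current codebase):

from typing import Any, Optional, Protocol, Sequence

class StorageBackend(Protocol):
    """Hypothetical DAL interface; the SQLite and future PostgreSQL drivers would both satisfy it."""

    async def execute(self, sql: str, params: Sequence[Any] = ()) -> None:
        """Run a statement that returns no rows."""
        ...

    async def fetch_one(self, sql: str, params: Sequence[Any] = ()) -> Optional[tuple]:
        """Return the first row of a query, or None."""
        ...

    async def fetch_all(self, sql: str, params: Sequence[Any] = ()) -> list[tuple]:
        """Return all rows of a query."""
        ...

    def transaction(self):
        """Async context manager that makes a group of statements atomic."""
        ...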
-- Main content table
CREATE TABLE globules (
id INTEGER PRIMARY KEY AUTOINCREMENT,
uuid TEXT NOT NULL UNIQUE DEFAULT (lower(hex(randomblob(16)))),
title TEXT NOT NULL,
content TEXT NOT NULL,
file_path TEXT NOT NULL,
file_hash TEXT UNIQUE, -- SHA-256 for deduplication
file_size INTEGER NOT NULL,
mime_type TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
modified_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
accessed_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
source_type TEXT NOT NULL CHECK (source_type IN ('note', 'photo', 'audio', 'document')),
processing_status TEXT DEFAULT 'pending' CHECK (processing_status IN ('pending', 'processing', 'completed', 'failed')),
    metadata BLOB, -- JSON document (text, or binary JSONB via jsonb())
embedding BLOB, -- Binary vector data (4096 bytes for 1024-D float32)
embedding_version INTEGER DEFAULT 1,
embedding_updated_at TIMESTAMP
);
-- Indexes for performance
CREATE INDEX idx_globules_created_at ON globules(created_at DESC);
CREATE INDEX idx_globules_source_type ON globules(source_type);
CREATE INDEX idx_globules_processing_status ON globules(processing_status) WHERE processing_status != 'completed';
CREATE INDEX idx_globules_file_path ON globules(file_path);
CREATE UNIQUE INDEX idx_globules_uuid ON globules(uuid);
-- Generated column for category extraction from metadata
-- (ALTER TABLE in SQLite can only add VIRTUAL generated columns; a STORED
-- column would have to be declared in the original CREATE TABLE)
ALTER TABLE globules ADD COLUMN category TEXT
    GENERATED ALWAYS AS (json_extract(metadata, '$.category')) VIRTUAL;
CREATE INDEX idx_globules_category ON globules(category);
-- Vector similarity search (requires sqlite-vec extension)
CREATE VIRTUAL TABLE vss_globules USING vec0(
item_id TEXT PRIMARY KEY,
vector FLOAT32[1024]
);
-- Full-text search
CREATE VIRTUAL TABLE fts_globules USING fts5(
title,
content,
tags,
content=globules,
content_rowid=id,
tokenize='porter unicode61'
);
-- File system tracking
CREATE TABLE file_metadata (
file_path TEXT PRIMARY KEY,
globule_id INTEGER NOT NULL,
last_seen TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
checksum TEXT NOT NULL,
FOREIGN KEY(globule_id) REFERENCES globules(id) ON DELETE CASCADE
);
-- Tag management
CREATE TABLE tags (
id INTEGER PRIMARY KEY AUTOINCREMENT,
name TEXT UNIQUE NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
);
CREATE TABLE globule_tags (
globule_id INTEGER NOT NULL,
tag_id INTEGER NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
PRIMARY KEY (globule_id, tag_id),
FOREIGN KEY(globule_id) REFERENCES globules(id) ON DELETE CASCADE,
FOREIGN KEY(tag_id) REFERENCES tags(id) ON DELETE CASCADE
);
-- Processing queue for async operations
CREATE TABLE processing_queue (
id INTEGER PRIMARY KEY AUTOINCREMENT,
globule_id INTEGER NOT NULL,
operation TEXT NOT NULL CHECK (operation IN ('embed', 'parse', 'index', 'move')),
priority INTEGER DEFAULT 5,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
started_at TIMESTAMP,
completed_at TIMESTAMP,
error_message TEXT,
retry_count INTEGER DEFAULT 0,
FOREIGN KEY(globule_id) REFERENCES globules(id) ON DELETE CASCADE
);
CREATE INDEX idx_processing_queue_priority ON processing_queue(priority DESC, created_at ASC)
WHERE completed_at IS NULL;
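The vss_globules table above depends on the sqlite-vec loadable extension (FTS5 ships with the standard sqlite3 module). A minimal sketch of opening a connection with the extension available, assuming the sqlite-vec Python package's load() helper:

import sqlite3

import sqlite_vec  # pip install sqlite-vec

def open_database(db_path: str) -> sqlite3.Connection:
    """Open the Globule database with the sqlite-vec extension loaded."""
    conn = sqlite3.connect(db_path)
    # Loadable extensions are disabled by default in Python's sqlite3 module
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    conn.enable_load_extension(False)
    return conn

conn = open_database("globules.db")
print(conn.execute("SELECT vec_version()").fetchone())  # sanity check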
Embeddings are stored as binary BLOBs for optimal performance:
def store_embedding(self, globule_id: int, embedding: np.ndarray) -> None:
"""Store embedding as binary BLOB with version tracking"""
# Convert numpy array to binary format
embedding_blob = embedding.astype(np.float32).tobytes()
# Update both main table and vector index atomically
with self.db.transaction():
self.db.execute("""
UPDATE globules
SET embedding = ?,
embedding_version = ?,
embedding_updated_at = CURRENT_TIMESTAMP
WHERE id = ?
""", (embedding_blob, CURRENT_EMBEDDING_VERSION, globule_id))
# Update vector search index
self.db.execute("""
INSERT OR REPLACE INTO vss_globules (item_id, vector)
VALUES (?, ?)
""", (str(globule_id), embedding_blob))
We use a hybrid approach combining JSONB with generated columns:
# Example metadata structure
metadata = {
"category": "writing",
"subcategory": "fantasy",
"tags": ["dragons", "worldbuilding"],
"parsed_entities": {
"characters": ["Aldric", "Morwen"],
"locations": ["Dragon's Peak"]
},
"source_metadata": {
# Photo EXIF, audio duration, etc.
"camera": "Nikon D850",
"iso": 400
}
}
# Store as JSON text; SQLite's json_* functions treat raw BLOB values as the
# binary JSONB format, so bind the text directly (or wrap it with jsonb() on insert)
metadata_json = json.dumps(metadata)
Hot fields are extracted into generated columns for indexing, while detailed metadata remains in flexible JSONB storage.
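For example, filtering on the hot category field uses the indexed generated column, while one-off lookups can still reach into the nested JSON at query time (a sketch with the standard sqlite3 module, assuming metadata is stored as JSON text as noted above):

import sqlite3

conn = sqlite3.connect("globules.db")

# Indexed filter on the generated column (served by idx_globules_category)
writing_notes = conn.execute(
    "SELECT id, title FROM globules WHERE category = ?",
    ("writing",),
).fetchall()

# Ad-hoc filter on a nested field, no dedicated column or index required
nikon_photos = conn.execute(
    """
    SELECT id, title FROM globules
    WHERE json_extract(metadata, '$.source_metadata.camera') = ?
    """,
    ("Nikon D850",),
).fetchall()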
-- Enable WAL mode for concurrent reads
PRAGMA journal_mode = WAL;
-- Optimize for our workload
PRAGMA synchronous = NORMAL; -- Faster writes, still crash-safe
PRAGMA cache_size = 10000; -- ~40MB cache
PRAGMA temp_store = MEMORY; -- Temp tables in RAM
PRAGMA mmap_size = 268435456; -- 256MB memory-mapped I/O
-- Analyze tables periodically for query optimization
ANALYZE;
We implement a two-phase pattern for robustness while maintaining performance:
class TransactionManager:
"""Manages complex multi-step operations with failure recovery"""
async def create_globule(self, content: str, file_path: Path) -> Globule:
"""Two-phase globule creation with compensation logic"""
# Phase 1: Persistent state changes
async with self.db.transaction() as tx:
# 1. Write file to staging area
staging_path = self._stage_file(content, file_path)
# 2. Create database record with pending status
globule_id = await tx.execute("""
INSERT INTO globules (
title, content, file_path, file_hash, file_size,
mime_type, source_type, processing_status
) VALUES (?, ?, ?, ?, ?, ?, ?, 'pending')
RETURNING id
""", (...))
# 3. Move file to final location
final_path = self._move_to_semantic_path(staging_path, globule_id)
# 4. Update file path and commit
await tx.execute("""
UPDATE globules SET file_path = ? WHERE id = ?
""", (str(final_path), globule_id))
# Phase 2: Async processing (outside transaction)
await self.queue_processor.enqueue([
ProcessingTask(globule_id, 'embed', priority=5),
ProcessingTask(globule_id, 'parse', priority=5),
ProcessingTask(globule_id, 'index', priority=3)
])
return await self.get_globule(globule_id)
    def _stage_file(self, content: str, target_path: Path) -> Path:
        """Write to a temporary staging location; the atomic move to the final
        semantic path happens later in _move_to_semantic_path"""
        staging_dir = self.storage_root / '.staging'
        staging_dir.mkdir(parents=True, exist_ok=True)
        staging_path = staging_dir / f"{uuid.uuid4()}.tmp"
# Write with fsync for durability
with open(staging_path, 'w', encoding='utf-8') as f:
f.write(content)
f.flush()
os.fsync(f.fileno())
return staging_path
class RecoveryManager:
"""Handles partial failures and orphaned resources"""
async def recover_on_startup(self):
"""Clean up any incomplete operations from last run"""
# Find orphaned staging files
staging_files = list((self.storage_root / '.staging').glob('*.tmp'))
for file in staging_files:
if file.stat().st_mtime < time.time() - 3600: # 1 hour old
file.unlink()
# Reset stuck processing tasks
await self.db.execute("""
UPDATE processing_queue
SET started_at = NULL, retry_count = retry_count + 1
WHERE started_at < datetime('now', '-10 minutes')
AND completed_at IS NULL
""")
# Verify file-database consistency
await self._verify_consistency()
async def _verify_consistency(self):
"""Ensure files and database are in sync"""
# Check for files without DB entries
all_files = set(self._scan_content_files())
db_files = set(await self._get_db_file_paths())
orphaned_files = all_files - db_files
for file_path in orphaned_files:
await self._handle_orphaned_file(file_path)
# Check for DB entries without files
missing_files = db_files - all_files
for file_path in missing_files:
await self._handle_missing_file(file_path)
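The processing_queue filled in Phase 2 above needs a consumer. A hedged sketch of a worker loop that claims the highest-priority pending task, dispatches it, and records failures for retry (the handler registry and the db wrapper's fetch_one/execute methods are assumptions, not part of the current codebase):

class QueueWorker:
    """Drains processing_queue one task at a time (illustrative sketch)."""

    MAX_RETRIES = 3

    def __init__(self, db, handlers: dict):
        self.db = db
        # e.g. {'embed': embed_fn, 'parse': parse_fn, 'index': index_fn, 'move': move_fn}
        self.handlers = handlers

    async def run_once(self) -> bool:
        """Process a single task; returns False when nothing is pending."""
        row = await self.db.fetch_one("""
            SELECT id, globule_id, operation FROM processing_queue
            WHERE completed_at IS NULL AND started_at IS NULL
              AND retry_count < ?
            ORDER BY priority DESC, created_at ASC
            LIMIT 1
        """, (self.MAX_RETRIES,))
        if row is None:
            return False
        task_id, globule_id, operation = row
        await self.db.execute(
            "UPDATE processing_queue SET started_at = CURRENT_TIMESTAMP WHERE id = ?",
            (task_id,))
        try:
            await self.handlers[operation](globule_id)
            await self.db.execute(
                "UPDATE processing_queue SET completed_at = CURRENT_TIMESTAMP WHERE id = ?",
                (task_id,))
        except Exception as exc:
            # Clear started_at so RecoveryManager or the next pass can retry the task
            await self.db.execute("""
                UPDATE processing_queue
                SET started_at = NULL, retry_count = retry_count + 1, error_message = ?
                WHERE id = ?
            """, (str(exc), task_id))
        return True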
The system generates intuitive paths based on content analysis:
class SemanticPathGenerator:
"""Generates human-readable paths from content analysis"""
def generate_path(self,
parsed_data: dict,
embedding: np.ndarray,
config: StorageConfig) -> Path:
"""Multi-strategy path generation"""
if config.organization == 'semantic':
return self._semantic_path(parsed_data, embedding)
elif config.organization == 'chronological':
return self._chronological_path()
elif config.organization == 'hybrid':
return self._hybrid_path(parsed_data)
def _semantic_path(self, parsed_data: dict, embedding: np.ndarray) -> Path:
"""Generate path from semantic analysis"""
# Start with parsed categories
path_components = []
# Primary category (from parsing service)
if category := parsed_data.get('category'):
path_components.append(self._sanitize_component(category))
# Subcategory or theme
if subcategory := parsed_data.get('subcategory'):
path_components.append(self._sanitize_component(subcategory))
# Keyword extraction for deeper hierarchy
if len(path_components) < 3 and parsed_data.get('keywords'):
keywords = self._extract_path_keywords(parsed_data['keywords'])
path_components.extend(keywords[:3 - len(path_components)])
        # Limit depth
        path_components = path_components[:MAX_PATH_DEPTH]
        # Generate filename, checking uniqueness against the target directory
        # (storage_root is assumed to be set on the generator)
        relative_dir = Path(*path_components)
        filename = self._generate_filename(parsed_data, self.storage_root / relative_dir)
        return relative_dir / filename
def _extract_path_keywords(self, keywords: List[str]) -> List[str]:
"""Extract hierarchical keywords using NLP"""
# Use KeyBERT or similar for extraction
# Group by semantic similarity
# Return hierarchical order
# Simplified example:
keyword_groups = self._cluster_keywords(keywords)
return [group.representative for group in keyword_groups]
    def _generate_filename(self, parsed_data: dict, directory: Path) -> str:
"""Create descriptive filename without dates"""
# Start with title or first line
base_name = parsed_data.get('title', 'untitled')
base_name = self._sanitize_component(base_name)
# Add discriminator for uniqueness
discriminator = parsed_data.get('key_phrase', '')
if discriminator:
base_name = f"{base_name}_{self._sanitize_component(discriminator)}"
        # Ensure uniqueness with a counter if needed; `directory` is the resolved
        # target folder supplied by the caller
        return self._ensure_unique_filename(base_name, directory)
def _ensure_unique_filename(self, base_name: str, directory: Path) -> str:
"""Handle filename collisions gracefully"""
# Try original name first
if not (directory / f"{base_name}.md").exists():
return f"{base_name}.md"
# Add content-based discriminator
for i in range(1, 100):
candidate = f"{base_name}_{i:03d}.md"
if not (directory / candidate).exists():
return candidate
# Fallback to UUID suffix
return f"{base_name}_{uuid.uuid4().hex[:8]}.md"
def _sanitize_component(self, component: str) -> str:
"""Make path component filesystem-safe across platforms"""
# Normalize unicode to NFC
component = unicodedata.normalize('NFC', component)
# Convert to lowercase for consistency
component = component.lower()
# Replace problematic characters
replacements = {
'/': '_', '\\': '_', ':': '-', '*': '_',
'?': '', '<': '', '>': '', '|': '_',
'"': '', '\0': '', '.': '_'
}
for old, new in replacements.items():
component = component.replace(old, new)
# Strip leading/trailing dots and spaces
component = component.strip('. ')
# Limit length (leaving room for full path)
component = component[:50]
# Handle Windows reserved names
if component.upper() in WINDOWS_RESERVED_NAMES:
component = f"_{component}"
return component or 'unnamed'
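A quick illustration of the sanitization rules, assuming example values for the module-level constants referenced above (MAX_PATH_DEPTH and WINDOWS_RESERVED_NAMES are not defined in this snippet):

# Illustrative values for the constants assumed by SemanticPathGenerator
MAX_PATH_DEPTH = 3
WINDOWS_RESERVED_NAMES = {"CON", "PRN", "AUX", "NUL", "COM1", "LPT1"}

gen = SemanticPathGenerator()
print(gen._sanitize_component("Dragon's Peak: Chapter 1?"))  # dragon's peak- chapter 1
print(gen._sanitize_component("CON"))                        # _con (Windows reserved name)
print(gen._sanitize_component("notes/2025: draft <v2>"))     # notes_2025- draft v2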
class FileSystemMonitor:
"""Monitors filesystem for external changes"""
def __init__(self, storage_root: Path, storage_manager: StorageManager):
self.storage_root = storage_root
self.storage_manager = storage_manager
self.observer = Observer() # watchdog observer
self.pending_events = {} # For debouncing
    def start(self):
        """Begin monitoring with debounced event handling"""
        handler = GlobuleFileHandler(self)
        self.observer.schedule(
            handler,
            str(self.storage_root),
            recursive=True
        )
        try:
            self.observer.start()
        except OSError:
            # Fall back to the polling observer when native watchers fail to start
            # (e.g. network filesystems or inotify limits)
            self.observer = PollingObserver()
            self.observer.schedule(handler, str(self.storage_root), recursive=True)
            self.observer.start()
class GlobuleFileHandler(FileSystemEventHandler):
"""Handles file system events with debouncing"""
def __init__(self, monitor: FileSystemMonitor):
self.monitor = monitor
self.debounce_delay = 0.3 # 300ms
def on_moved(self, event):
if not event.is_directory and self._is_content_file(event.dest_path):
self._debounce_event('move', event.src_path, event.dest_path)
def on_modified(self, event):
if not event.is_directory and self._is_content_file(event.src_path):
self._debounce_event('modify', event.src_path)
def _debounce_event(self, event_type: str, *args):
"""Debounce rapid events"""
key = (event_type, args[0]) # Use source path as key
# Cancel existing timer
if key in self.monitor.pending_events:
self.monitor.pending_events[key].cancel()
        # Schedule a new timer; watchdog callbacks run on a worker thread, so the
        # coroutine is handed back to the asyncio event loop (assumed to be stored
        # on the monitor as `loop`)
        timer = threading.Timer(
            self.debounce_delay,
            lambda: asyncio.run_coroutine_threadsafe(
                self._process_event(event_type, *args),
                self.monitor.loop
            )
        )
self.monitor.pending_events[key] = timer
timer.start()
async def _process_event(self, event_type: str, *args):
"""Process debounced event"""
try:
if event_type == 'move':
await self._handle_move(args[0], args[1])
elif event_type == 'modify':
await self._handle_modify(args[0])
except Exception as e:
logger.error(f"Error processing {event_type} event: {e}")
class FileLockManager:
"""Prevents concurrent access to files during processing"""
def __init__(self):
self.locks = {}
self.lock = threading.Lock()
@contextmanager
def acquire_file_lock(self, file_path: Path):
"""Acquire exclusive lock for file operations"""
lock_path = file_path.with_suffix('.lock')
# Try to create lock file atomically
try:
fd = os.open(str(lock_path), os.O_CREAT | os.O_EXCL | os.O_RDWR)
# Write PID for debugging
os.write(fd, str(os.getpid()).encode())
try:
yield
finally:
os.close(fd)
lock_path.unlink(missing_ok=True)
except FileExistsError:
# Lock held by another process
raise FileLockedError(f"File {file_path} is being processed")
class EmbeddingManager:
"""Manages embedding generation and updates"""
def should_regenerate_embedding(self,
old_content: str,
new_content: str,
old_metadata: dict) -> bool:
"""Determine if embedding needs regeneration"""
# Always regenerate if no existing embedding
if not old_metadata.get('embedding_version'):
return True
# Check version mismatch
if old_metadata['embedding_version'] < CURRENT_EMBEDDING_VERSION:
return True
# Check significant content change
old_size = len(old_content)
new_size = len(new_content)
if old_size == 0:
return True
size_change_ratio = abs(new_size - old_size) / old_size
# Regenerate if >20% size change
if size_change_ratio > 0.2:
return True
# Check line count change for structured content
old_lines = old_content.count('\n')
new_lines = new_content.count('\n')
if old_lines > 10: # Only for substantial content
line_change_ratio = abs(new_lines - old_lines) / old_lines
if line_change_ratio > 0.15:
return True
# Sample content similarity (for small changes)
if size_change_ratio < 0.05:
# Use simple hash comparison for tiny edits
return self._content_hash(old_content) != self._content_hash(new_content)
return False
async def update_embedding_batch(self, globule_ids: List[int]):
"""Batch update embeddings for efficiency"""
# Fetch content in batch
contents = await self.storage.get_contents_batch(globule_ids)
# Generate embeddings in batch (more efficient)
embeddings = await self.embedding_service.embed_batch(contents)
# Update in transaction
async with self.storage.transaction():
for globule_id, embedding in zip(globule_ids, embeddings):
await self.storage.store_embedding(globule_id, embedding)
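The _content_hash helper used above is not shown; a minimal sketch that normalizes whitespace so trivial formatting edits do not force regeneration:

import hashlib

def _content_hash(self, content: str) -> str:
    """Stable digest of lightly normalized content for cheap change detection."""
    normalized = " ".join(content.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()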
class HybridSearchEngine:
"""Combines FTS5 keyword search with vector similarity"""
def __init__(self, storage: StorageManager):
self.storage = storage
self.cache = SearchCache(max_size=1000, ttl=600) # 10 min TTL
async def search(self,
query: str,
limit: int = 20,
filters: Optional[SearchFilters] = None) -> List[SearchResult]:
"""Perform hybrid search with RRF ranking"""
# Check cache (skip if filters present)
cache_key = self._cache_key(query, filters)
if not filters and (cached := self.cache.get(cache_key)):
return cached
# Parallel search execution
fts_task = self._fts_search(query, limit * 2, filters)
vector_task = self._vector_search(query, limit * 2, filters)
fts_results, vector_results = await asyncio.gather(fts_task, vector_task)
# Reciprocal Rank Fusion
combined_results = self._reciprocal_rank_fusion(
fts_results,
vector_results,
weights={'fts': 0.6, 'vector': 0.4}
)
# Apply final limit
final_results = combined_results[:limit]
# Cache if no filters
if not filters:
self.cache.set(cache_key, final_results)
return final_results
def _reciprocal_rank_fusion(self,
fts_results: List[tuple],
vector_results: List[tuple],
weights: dict) -> List[SearchResult]:
"""Combine results using RRF algorithm"""
k = 60 # RRF constant
scores = {}
# Process FTS results
for rank, (id, fts_score) in enumerate(fts_results):
rrf_score = weights['fts'] / (k + rank + 1)
scores[id] = scores.get(id, 0) + rrf_score
# Process vector results
for rank, (id, distance) in enumerate(vector_results):
rrf_score = weights['vector'] / (k + rank + 1)
scores[id] = scores.get(id, 0) + rrf_score
# Sort by combined score
sorted_ids = sorted(scores.items(), key=lambda x: x[1], reverse=True)
# Fetch full records
return self._fetch_results([id for id, _ in sorted_ids])
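The _fts_search and _vector_search helpers are not shown above. A hedged sketch of the vector side, assuming sqlite-vec's KNN query form (MATCH on the vector column with ORDER BY distance and a LIMIT) against the vss_globules table defined earlier, plus an embedding service that can embed the query string:

import numpy as np

async def _vector_search(self, query: str, limit: int, filters=None) -> list[tuple]:
    """Return (globule_id, distance) pairs ordered by ascending distance."""
    query_vec = await self.storage.embedding_service.embed(query)
    query_blob = np.asarray(query_vec, dtype=np.float32).tobytes()
    rows = await self.storage.db.fetch_all("""
        SELECT item_id, distance
        FROM vss_globules
        WHERE vector MATCH ?
        ORDER BY distance
        LIMIT ?
    """, (query_blob, limit))
    return [(int(item_id), distance) for item_id, distance in rows]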
class SearchCache:
"""LRU cache for search results with TTL"""
def __init__(self, max_size: int = 1000, ttl: int = 600):
self.cache = OrderedDict()
self.timestamps = {}
self.max_size = max_size
self.ttl = ttl
self.lock = threading.RLock()
def get(self, key: str) -> Optional[List[SearchResult]]:
"""Get cached results if valid"""
with self.lock:
if key not in self.cache:
return None
# Check TTL
if time.time() - self.timestamps[key] > self.ttl:
del self.cache[key]
del self.timestamps[key]
return None
# Move to end (LRU)
self.cache.move_to_end(key)
return self.cache[key]
def set(self, key: str, results: List[SearchResult]):
"""Cache search results"""
with self.lock:
# Remove oldest if at capacity
if len(self.cache) >= self.max_size:
oldest = next(iter(self.cache))
del self.cache[oldest]
del self.timestamps[oldest]
self.cache[key] = results
self.timestamps[key] = time.time()
def invalidate_all(self):
"""Clear cache on data changes"""
with self.lock:
self.cache.clear()
self.timestamps.clear()
Operation | Target Latency | Notes |
---|---|---|
Single write (with staging) | <50ms | Includes file I/O and DB insert |
Batch write (100 items) | <500ms | Using transaction batching (see sketch below)
Semantic path generation | <10ms | Cached keyword extraction |
FTS5 search | <20ms | For up to 10k documents |
Vector similarity search | <50ms | Using sqlite-vec with ANN |
Hybrid search (cached) | <5ms | LRU cache hit |
Hybrid search (uncached) | <100ms | Combined FTS + vector + RRF |
File move detection | <500ms | Debounced file system events |
Embedding generation | <200ms | Via embedding service |
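The batch-write target above relies on grouping inserts into a single transaction so the commit overhead is paid once; a sketch using executemany (synchronous sqlite3 shown for brevity, and the item dictionaries are placeholders):

import sqlite3

def batch_insert(conn: sqlite3.Connection, items: list[dict]) -> None:
    """Insert many globules inside one transaction to amortize commit cost."""
    rows = [
        (i["title"], i["content"], i["file_path"], i["file_hash"],
         i["file_size"], i["mime_type"], i["source_type"])
        for i in items
    ]
    with conn:  # opens a transaction; commits on success, rolls back on error
        conn.executemany("""
            INSERT INTO globules (
                title, content, file_path, file_hash, file_size,
                mime_type, source_type
            ) VALUES (?, ?, ?, ?, ?, ?, ?)
        """, rows)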
class PerformanceOptimizer:
"""System-wide performance optimizations"""
async def optimize_database(self):
"""Periodic database optimization"""
# Analyze tables for query planner
await self.db.execute("ANALYZE")
        # Reclaim free pages incrementally (requires the database to have been
        # created with PRAGMA auto_vacuum = INCREMENTAL)
        await self.db.execute("PRAGMA incremental_vacuum")
        # Collect row counts per source type for monitoring growth
        await self.db.execute("""
            SELECT COUNT(*), source_type
            FROM globules
            GROUP BY source_type
        """)
def configure_connection(self, conn):
"""Per-connection optimizations"""
conn.execute("PRAGMA journal_mode = WAL")
conn.execute("PRAGMA synchronous = NORMAL")
conn.execute("PRAGMA cache_size = 10000")
conn.execute("PRAGMA temp_store = MEMORY")
conn.execute("PRAGMA mmap_size = 268435456")
class BackupManager:
"""Handles atomic backups of database and files"""
async def create_backup(self, backup_path: Path) -> BackupManifest:
"""Create consistent backup of entire system"""
backup_id = datetime.now().strftime("%Y%m%d_%H%M%S")
backup_dir = backup_path / f"globule_backup_{backup_id}"
backup_dir.mkdir(parents=True)
# Phase 1: Backup SQLite database atomically
db_backup_path = backup_dir / "globules.db"
await self._backup_database(db_backup_path)
# Phase 2: Snapshot file metadata
file_manifest = await self._create_file_manifest()
# Phase 3: Copy files with verification
await self._backup_files(backup_dir / "files", file_manifest)
# Phase 4: Create and sign manifest
manifest = BackupManifest(
backup_id=backup_id,
created_at=datetime.now(),
db_checksum=self._checksum(db_backup_path),
file_count=len(file_manifest),
total_size=sum(f.size for f in file_manifest)
)
manifest.save(backup_dir / "manifest.json")
return manifest
async def _backup_database(self, target_path: Path):
"""Use SQLite backup API for consistency"""
async with aiosqlite.connect(self.db_path) as source:
async with aiosqlite.connect(target_path) as target:
await source.backup(target)
class IntegrityChecker:
"""Verifies data consistency between database and filesystem"""
async def verify_integrity(self) -> IntegrityReport:
"""Comprehensive integrity check"""
report = IntegrityReport()
# Check 1: Database integrity
result = await self.db.execute("PRAGMA integrity_check")
if result[0] != "ok":
report.add_error("Database corruption detected")
# Check 2: File-DB consistency
db_files = await self._get_all_file_paths()
fs_files = await self._scan_filesystem()
# Missing files
missing = db_files - fs_files
for path in missing:
report.add_warning(f"Missing file: {path}")
# Orphaned files
orphaned = fs_files - db_files
for path in orphaned:
report.add_warning(f"Orphaned file: {path}")
# Check 3: Checksum verification (sample)
sample_size = min(100, len(db_files))
sample = random.sample(list(db_files), sample_size)
for file_path in sample:
stored_checksum = await self._get_stored_checksum(file_path)
actual_checksum = await self._calculate_checksum(file_path)
if stored_checksum != actual_checksum:
report.add_error(f"Checksum mismatch: {file_path}")
# Check 4: Embedding consistency
missing_embeddings = await self.db.execute("""
SELECT COUNT(*) FROM globules
WHERE embedding IS NULL
AND processing_status = 'completed'
""")
if missing_embeddings[0] > 0:
report.add_warning(f"{missing_embeddings[0]} completed items missing embeddings")
return report