31_Research_Intelligent_Storage_Manager - asavschaeffer/globule GitHub Wiki
The Storage Manager is the beating heart of Globule, a semantic knowledge management system that aims to transform how you capture, organize, and retrieve personal knowledge. Imagine it as a librarian who not only shelves your books but also understands their content, groups them intuitively, and fetches them instantly when you need them—all while working seamlessly across Windows, macOS, and Linux. It bridges the gap between traditional filesystems (think folders and files) and modern AI-driven capabilities like semantic search and vector embeddings. This isn’t just about storing files—it’s about making your knowledge come alive.
For this Minimum Viable Product (MVP), we’re targeting a single-user, local-first experience with 100-200 daily inputs (notes, photos, ideas), but we’re designing with an eye toward future scalability (think multiplayer collaboration). Below, I’ll unpack every layer of the design, from the nitty-gritty of database choices to the elegance of semantic directories, with plenty of examples and insights to satisfy your appetite for detail.
The database is the backbone of the Storage Manager, holding everything from raw content to AI-generated embeddings. Let’s explore this in exhaustive detail.
Choosing a database is like picking a car for a road trip. SQLite is a zippy little hatchback—perfect for a solo journey, easy to park, and low-maintenance. PostgreSQL is a luxury SUV—great for a group, packed with features, but overkill for a quick drive. Here’s the full breakdown:
-
Why It Wins:
-
Zero Configuration: SQLite runs in-process, meaning no server setup. You drop a
.db
file on your disk, and you’re good to go. For a single user, this is a dream—no IT degree required. - Speed: With no network overhead, SQLite delivers sub-millisecond latency for reads and writes. For 200 daily inputs, that’s snappy performance you’ll feel instantly.
- Portability: One file contains everything. Move it to a USB drive, sync it with Dropbox, or email it—no fuss.
- Concurrency: Write-Ahead Logging (WAL) lets multiple processes read while one writes, which is plenty for a single-user app. Think of it like a checkout line: one cashier, but lots of browsers.
- Proven Track Record: Apps like Obsidian manage thousands of notes with SQLite, handling 10,000+ operations per second with the right setup.
-
Zero Configuration: SQLite runs in-process, meaning no server setup. You drop a
-
The Catch:
-
Write Limits: Only one writer at a time. If two processes try to write simultaneously, one waits (or fails with
SQLITE_BUSY
). For the MVP, this is fine—your app will queue writes—but it’s a red flag for future multi-user dreams. - Scale Ceiling: SQLite caps out at terabytes and billions of rows, but performance dips with heavy concurrent writes or massive datasets.
-
Write Limits: Only one writer at a time. If two processes try to write simultaneously, one waits (or fails with
-
Why It Shines:
- Concurrency Superpower: Multi-Version Concurrency Control (MVCC) lets many users read and write without stepping on each other’s toes. It’s like a multi-lane highway versus SQLite’s single track.
-
Feature Rich: Built-in JSONB indexing, replication, and extensions like
pgvector
for vector search make it a beast for advanced needs. - Scalability: It scales horizontally (multiple servers) and vertically (beefier hardware), ready for a “Globule Cloud” someday.
-
The Downside:
- Complexity: You need to install, configure, and maintain a server. For a local-first MVP, that’s like bringing a sledgehammer to crack a nut.
- Latency: Client-server communication adds milliseconds—small, but noticeable compared to SQLite’s instant access.
- MVP: SQLite, hands down. It’s lightweight, fast, and fits the single-user focus. We’ll use a Data Access Layer (DAL) to keep our options open, so switching to PostgreSQL later is just a config tweak.
- Future: If Globule goes multiplayer, PostgreSQL’s concurrency and features will take the wheel.
Imagine you jot down 10 notes in a minute. With SQLite, each write takes ~0.1ms, totaling 1ms—blink-and-you-miss-it speed. PostgreSQL might add 5-10ms of network lag, which isn’t bad, but why pay the toll for a solo trip?
Globule uses embeddings—high-dimensional vectors (e.g., 1024 floats)—to power semantic search. Storing these efficiently is key.
- BLOBs Rule: We pack embeddings into binary large objects (BLOBs) in SQLite. A 1024-D float32 vector is 4096 bytes—compact and fast to read/write.
-
Why Not Text? Storing as JSON (
[0.1, 0.2, ...]
) bloats the size and slows parsing. BLOBs are the lean, mean choice. -
Vector Search: We use
sqlite-vec
, a SQLite extension for vector indexing. It builds virtual tables (e.g.,vss_globules
) for lightning-fast K-Nearest Neighbor (KNN) queries.
CREATE TABLE globules (
id INTEGER PRIMARY KEY,
embedding BLOB -- 4096 bytes of binary goodness
);
CREATE VIRTUAL TABLE vss_globules USING vec0(
item_id TEXT,
vector FLOAT32(1024)
);
INSERT INTO vss_globules (item_id, vector) VALUES ('glob123', ?); -- Binary vector data
-
Dual Tables: The
globules
table holds the master data;vss_globules
indexes embeddings. We tie them withid
and use transactions to keep updates atomic:BEGIN; UPDATE globules SET embedding = ? WHERE id = 123; UPDATE vss_globules SET vector = ? WHERE item_id = '123'; COMMIT;
What if an embedding update fails halfway? Transactions ensure both tables stay in sync—or roll back if something goes awry. No orphaned vectors here!
Globules come in all flavors—notes, photos, PDFs—each with unique metadata. We need a storage strategy that’s adaptable yet performant.
-
What It Does: Stores metadata as JSON in TEXT columns (e.g.,
{"category": "writing", "tags": ["fantasy"]}
). -
Pros: No schema changes needed for new fields. Add
"camera": "Nikon"
to a photo globule? Done. -
Querying: Use
json_extract(metadata, '$.tags')
to pluck values. It’s like a treasure hunt without rewriting the map.
- Upgrade: SQLite 3.45.0+ offers JSONB in BLOBs—binary JSON that’s 5-10% smaller and up to 5x faster to process.
- Trade-Off: No O(1) lookups like PostgreSQL’s JSONB, but still a big win for frequent access.
-
Best of Both Worlds:
-
Columns for Speed: Pull out hot fields like
created_at
,source_type
, andtags
into dedicated columns with indexes. -
Generated Columns: Index JSON fields dynamically, e.g.,
ALTER TABLE globules ADD COLUMN category TEXT GENERATED ALWAYS AS (json_extract(metadata, '$.category')) STORED;
. - JSONB for the Rest: Stash detailed stuff (e.g., photo EXIF data) in JSONB.
-
Columns for Speed: Pull out hot fields like
CREATE TABLE globules (
id INTEGER PRIMARY KEY,
title TEXT NOT NULL,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
source_type TEXT, -- 'note', 'photo', etc.
metadata BLOB, -- JSONB for {"exif": {...}, "theme": "travel"}
embedding BLOB
);
CREATE INDEX idx_created_at ON globules(created_at);
CREATE INDEX idx_category ON globules(json_extract(metadata, '$.category'));
Think of this as a filing cabinet. The labels on the drawers (created_at
, source_type
) are indexed for quick access. Inside, a flexible folder (JSONB) holds everything else—neat, yet adaptable.
With thousands of globules, queries need to be instant. Indexes are our secret sauce.
-
Use Case: Speed up
WHERE
,JOIN
, andORDER BY
.-
Temporal:
CREATE INDEX idx_timestamps ON globules(created_at, modified_at);
-
Type:
CREATE INDEX idx_type ON globules(source_type);
-
Temporal:
- Power: FTS5 virtual tables handle keyword searches with stemming (e.g., “run” matches “running”) and Unicode support.
-
Setup:
CREATE VIRTUAL TABLE fts_globules USING fts5( title, content, tags, content=globules, content_rowid=id, tokenize='porter unicode61' );
-
Query:
SELECT * FROM fts_globules WHERE fts_globules MATCH 'semantic NEAR/5 search';
-
Tool:
sqlite-vec
indexes embeddings for similarity searches. -
Example:
SELECT item_id, distance FROM vss_globules WHERE vector MATCH ? ORDER BY distance LIMIT 10;
-
Trick: Index JSON fields with expression indexes:
CREATE INDEX idx_camera ON globules(json_extract(metadata, '$.camera.model')) WHERE source_type = 'photo';
Too many indexes slow writes. We’ll monitor with EXPLAIN QUERY PLAN
and prune unused ones.
Here’s the whole enchilada:
CREATE TABLE globules (
id INTEGER PRIMARY KEY,
uuid TEXT NOT NULL UNIQUE,
title TEXT NOT NULL,
content TEXT NOT NULL,
file_path TEXT NOT NULL,
file_hash TEXT UNIQUE,
file_size INTEGER,
mime_type TEXT,
created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
modified_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
source_type TEXT,
metadata BLOB,
embedding BLOB
);
CREATE VIRTUAL TABLE vss_globules USING vec0(item_id TEXT, vector FLOAT32(1024));
CREATE VIRTUAL TABLE fts_globules USING fts5(title, content, tags, content=globules, content_rowid=id);
CREATE TABLE file_metadata (
file_path TEXT PRIMARY KEY,
item_id TEXT,
last_modified TIMESTAMP,
checksum TEXT,
FOREIGN KEY(item_id) REFERENCES globules(id)
);
CREATE TABLE tags (
id INTEGER PRIMARY KEY,
name TEXT UNIQUE
);
CREATE TABLE globule_tags (
globule_id INTEGER,
tag_id INTEGER,
PRIMARY KEY (globule_id, tag_id),
FOREIGN KEY(globule_id) REFERENCES globules(id),
FOREIGN KEY(tag_id) REFERENCES tags(id)
);
Now, let’s make your filesystem as smart as your brain. Semantic directories organize files based on meaning, not just dates or manual folders.
We analyze content to build intuitive paths like /writing/fantasy/characters
.
-
How: Use Parsing Service outputs (e.g.,
categories = ["writing", "fantasy"]
). -
Example: A note about dragons becomes
/writing/fantasy/dragons/note.md
.
- Keyword Extraction: Tools like KeyBERT pull terms (e.g., “SQLite”, “WAL” from a database note).
-
Clustering: Hierarchical Agglomerative Clustering (HAC) builds a topic tree:
Databases > Concurrency > WAL
-
Path:
Databases/Concurrency/WAL/optimizing-wal.md
def generate_semantic_path(content):
keywords = extract_keywords(content) # ['sqlite', 'wal', 'performance']
tree = cluster_keywords(keywords) # { 'databases': { 'concurrency': ['wal'] } }
path = ['databases', 'concurrency', 'wal']
return '/'.join(path)
- Embeddings: Cluster globules with k-means for tighter groups.
- Dynamic Adjustments: Reorganize paths as new content shifts the semantic landscape.
Two files can’t share the same path. Here’s how we dodge that bullet:
-
Unique Suffixes:
dragon-lore.md
becomesdragon-lore_123.md
. -
Smart Naming: Use content snippets (e.g.,
dragon-lore-fire-breathing.md
).
/writing/fantasy/dragons/dragon-lore_001.md
/writing/fantasy/dragons/dragon-lore_002.md
-
Depth: Cap at 5 levels (
/a/b/c/d/e
) for usability. - Length: Trim to 200 chars to stay under Windows’ 260-char limit (extendable with hacks).
-
Sanitization: Replace
/
with_
, strip<|>
, normalize Unicode.
Metadata (embeddings, tags, etc.) needs a home. We’re going hybrid.
- Extended Attributes (xattr): Fast on Unix, but Windows says no. Sync tools often strip them too.
-
Companion Files:
.globule
sidecars (e.g.,note.md.globule
) are portable but cluttery. - Database-Only: SQLite centralizes everything, but file moves break links.
- SQLite Core: Store all metadata in the database for integrity and queries.
-
Fallback: Embed UUIDs in file frontmatter (text files) or tiny
.globule
files (binaries). - Watcher: A filesystem monitor updates paths when files move.
{
"uuid": "abc123",
"file_hash": "sha256:xyz",
"metadata": {"tags": ["fantasy"]}
}
If you rename note.md
to new-note.md
outside Globule, the watcher spots it, updates file_path
in SQLite, and all’s well.
Globule runs everywhere. Here’s how:
-
Tool: Python’s
pathlib
abstracts/
vs\
. -
Storage: Use
/
internally, translate at runtime.
-
Windows Quirks: No
CON.txt
, cap at 255 chars per name. -
Case: Force lowercase (
Note.md
→note.md
) for consistency. - Unicode: Normalize to NFC (e.g., “café” stays one codepoint).
My/Note?.md
becomes my_note_.md
.
Choose your flavor: semantic, chronological, or hybrid.
-
Path:
/writing/fantasy/dragons/
- Pro: Matches your mental model.
-
Path:
/2025/07/12/note.md
- Pro: Simple, unique.
-
Path:
/writing/2025/07/12/note.md
- Pro: Best of both.
A script re-maps paths based on your mode choice, using SQLite as the source of truth.
Speed is king. Here’s the playbook:
-
Setup:
PRAGMA journal_mode=WAL;
- Win: Reads during writes, no blocking.
- Example: Insert 100 notes in one transaction (~50ms vs 100ms individually).
PRAGMA synchronous = NORMAL; -- Faster, still safe
PRAGMA cache_size = 10000; -- 4MB cache
PRAGMA mmap_size = 268435456; -- 256MB memory mapping
-
Inbox: Drop files in
/inbox/2025/07/12/
instantly. - Processor: Background job adds embeddings, moves to semantic paths.
Your data’s safe with us.
- Atomic: SQLite Backup API + rsync for files.
-
Steps:
-
sqlite3_backup_init
for a.db
snapshot. - Copy files with checksums.
-
- WAL: Rolls back crashes automatically.
-
Checks:
PRAGMA integrity_check
spots issues.
Find anything, fast.
- Combo: FTS5 (keywords) + sqlite-vec (vectors).
-
Ranking: Reciprocal Rank Fusion (RRF):
score = 1/(60 + fts_rank) * 0.6 + 1/(60 + vector_distance) * 0.4
Search “dragon lore”:
- FTS5: Matches
title
,content
. - Vector: Finds semantically similar globules.
- RRF: Blends results for top 20 hits.