Digital Persona: Data Ingestion Architecture - Hackshaven/digital-persona GitHub Wiki
This document outlines the architecture of the Digital Persona’s data ingestion system. It is designed to flexibly import a wide range of personal data sources while respecting the project’s privacy-first principles.
The ingestion system aims to:
- Import personal content (emails, health logs, chat messages, calendar events, etc.)
- Normalize into semantically structured formats (e.g. JSON-LD, ActivityStreams)
- Store in user-owned, locally hosted memory vaults
- Expose to the rest of the system through optional MCP interfaces
Supported sources include:
- Email: via IMAP, Gmail APIs, or Huginn
- Calendar: Google Calendar, Apple Calendar (ICS exports, CalDAV)
- Health & Fitness: MyFitnessPal, Apple Health, Fitbit (via API or scraper)
- Chat Logs: Discord, SMS exports, WhatsApp (manual export), Limitless AI
- Journaling/Writing: Obsidian markdown vaults, Google Docs, Notion
- Media Metadata: EXIF from photos, YouTube watch history, Spotify playback
Each connector typically includes:
- Fetcher: A script or agent that downloads raw data (via API, scraper, or sync)
- Parser: Normalizes input to a memory object with metadata
- Serializer: Converts memory object to JSON-LD/ActivityStreams
- Storage Layer: Writes to `persona_memory/<domain>/<source>/<timestamp>.json`
- Log Handler: Logs success/failure per run
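The components listed above could be wired together roughly as follows. This is a minimal sketch, not the project's actual code: the `Memory` dataclass, the function names, and the choice to replace `:` with `-` in filenames are all assumptions made for illustration, and `published` is assumed to be a timezone-aware datetime.

```python
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable

logger = logging.getLogger("ingest")


@dataclass
class Memory:
    """Parsed memory object with metadata (hypothetical shape)."""
    name: str
    content: str
    published: datetime  # assumed to be timezone-aware
    tags: list[str] = field(default_factory=list)


def serialize(memory: Memory) -> dict:
    """Serializer: convert a memory object to an ActivityStreams Note."""
    published = memory.published.astimezone(timezone.utc)
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Note",
        "name": memory.name,
        "content": memory.content,
        "published": published.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tag": memory.tags,
    }


def store(doc: dict, vault: Path, domain: str, source: str) -> Path:
    """Storage layer: write to <vault>/<domain>/<source>/<timestamp>.json."""
    stamp = doc["published"].replace(":", "-")  # assumption: keep filenames portable
    path = vault / domain / source / f"{stamp}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc, indent=2), encoding="utf-8")
    return path


def run_connector(
    fetch: Callable[[], object],
    parse: Callable[[object], Iterable[Memory]],
    vault: Path,
    domain: str,
    source: str,
) -> list[Path]:
    """Run fetch -> parse -> serialize -> store and log success or failure."""
    try:
        raw = fetch()
        written = [store(serialize(m), vault, domain, source) for m in parse(raw)]
    except Exception:
        logger.exception("%s/%s: ingestion failed", domain, source)
        return []
    logger.info("%s/%s: wrote %d entries", domain, source, len(written))
    return written
```

A connector then only has to supply its own `fetch` and `parse` callables; serialization, storage, and logging stay shared across sources.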
For example, a MyFitnessPal connector:
- Authenticates using a browser cookie
- Scrapes daily entries
- Converts them to structured JSON
- Writes to `persona_memory/health/myfitnesspal/YYYY-MM-DD.json`
- Exposes the data via a local MCP server at `/mcp/health/mfp/today`
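The conversion and write steps above might look something like the sketch below. Authentication and scraping are left out on purpose, and the row shape (`meal`, `food`, `calories`) and both function names are hypothetical; only the output path pattern comes from this page.

```python
import json
from datetime import date
from pathlib import Path


def build_daily_document(day: date, rows: list[dict]) -> dict:
    """Convert scraped rows (hypothetical shape) into one structured daily record."""
    entries = [
        {"meal": r["meal"], "food": r["food"], "calories": int(r["calories"])}
        for r in rows
    ]
    return {
        "source": "myfitnesspal",
        "date": day.isoformat(),  # YYYY-MM-DD, reused as the filename
        "entries": entries,
        "total_calories": sum(e["calories"] for e in entries),
    }


def write_daily_document(vault: Path, doc: dict) -> Path:
    """Write the record to <vault>/health/myfitnesspal/YYYY-MM-DD.json."""
    path = vault / "health" / "myfitnesspal" / f"{doc['date']}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(doc, indent=2), encoding="utf-8")
    return path
```

Calling `write_daily_document(Path("persona_memory"), build_daily_document(date(2025, 7, 11), rows))` would produce one file per day, so re-running the connector for the same date overwrites that day's file rather than duplicating it.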
```mermaid
flowchart TD
    A[Personal Data Source] --> B[Connector Script]
    B --> C[Fetch Raw Data]
    C --> D[Parse and Normalize]
    D --> E[Semantic JSON Transformation]
    E --> F[Write to Local Memory Vault]
    F --> G{Expose via MCP?}
    G -->|Yes| H[Run MCP Server Endpoint]
    G -->|No| I[Archive Only]
    H --> J[Queried by Persona Core or RAG]
```
All ingested data is transformed into a semantic memory entry, such as:
```json
{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Note",
  "name": "Weight entry",
  "content": "Weight: 171.2 lbs",
  "published": "2025-07-11T08:45:00Z",
  "tag": ["weight", "health", "myfitnesspal"]
}
```

- All data pulled and stored locally by default
- No cloud upload unless explicitly opted in by user
- Use HTTPS or encryption for API connections
- Option to encrypt the `persona_memory` directory
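One way to make these defaults hard to bypass is to route every outbound connection through a small policy check. The sketch below is an illustration only: `PrivacyPolicy`, `check_api_url`, and `may_upload` are hypothetical names, not part of the project.

```python
from dataclasses import dataclass
from urllib.parse import urlparse


@dataclass(frozen=True)
class PrivacyPolicy:
    """Hypothetical settings object; the defaults mirror the local-first rules."""
    allow_cloud_upload: bool = False  # uploads stay off unless the user opts in
    require_https: bool = True        # reject plain-HTTP API connections


def check_api_url(url: str, policy: PrivacyPolicy) -> str:
    """Return the URL if it satisfies the policy, otherwise raise ValueError."""
    scheme = urlparse(url).scheme
    if policy.require_https and scheme != "https":
        raise ValueError(f"refusing non-HTTPS API connection: {url}")
    return url


def may_upload(policy: PrivacyPolicy) -> bool:
    """Cloud upload is permitted only after an explicit opt-in."""
    return policy.allow_cloud_upload
```

Because the dataclass is frozen and defaults to the restrictive values, a connector that never touches the policy gets local-only, HTTPS-only behavior.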
- For each memory domain (health, calendar, etc.), run a lightweight MCP server (FastAPI)
- Each endpoint serves filtered and structured data
- Examples: `/mcp/calendar/today`, `/mcp/health/latest`, `/mcp/logs/errors`
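The lookup behind an endpoint such as `/mcp/health/latest` could be sketched as below. To keep the example dependency-free it deliberately leaves out FastAPI and shows only a framework-neutral handler; in a FastAPI app the same `latest_entry` function would presumably sit behind a route. The function names and the assumption that filenames begin with a sortable timestamp are illustrative.

```python
import json
from pathlib import Path
from typing import Optional


def latest_entry(vault: Path, domain: str) -> Optional[dict]:
    """Return the newest stored entry for a domain, or None if there is none.

    Assumes filenames start with a sortable timestamp (e.g. YYYY-MM-DD.json),
    so the lexicographically greatest name is the most recent entry.
    """
    domain_dir = vault / domain
    if not domain_dir.is_dir():
        return None
    # Entries live one level down: <domain>/<source>/<timestamp>.json
    files = sorted(domain_dir.glob("*/*.json"), key=lambda p: p.name)
    if not files:
        return None
    return json.loads(files[-1].read_text(encoding="utf-8"))


def handle(vault: Path, path: str) -> tuple[int, Optional[dict]]:
    """Map a request path such as /mcp/health/latest to (status, body)."""
    parts = path.strip("/").split("/")
    if len(parts) == 3 and parts[0] == "mcp" and parts[2] == "latest":
        domain = parts[1]
        if domain.isalnum():  # reject traversal attempts such as ".."
            entry = latest_entry(vault, domain)
            if entry is not None:
                return 200, entry
    return 404, None
```

Keeping the filtering logic in plain functions means it can be unit-tested without starting a server, and the web layer stays a thin wrapper.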
To add a new source:
- Create a new connector script
- Follow the fetch → parse → normalize → store pattern
- Optionally expose data via MCP
- Register ingestion metadata in a `.index.json` file per domain for lookup
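The registration step might be implemented along these lines. The page does not define a schema for `.index.json`, so the layout used here (one record per source, keyed by source name, with a `last_ingested` timestamp) and the `register_source` helper are assumptions.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def register_source(domain_dir: Path, source: str, **metadata) -> dict:
    """Add or update one source's record in <domain_dir>/.index.json."""
    index_path = domain_dir / ".index.json"
    index = {}
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))

    # Hypothetical record layout: caller-supplied metadata plus a timestamp.
    index[source] = {
        **metadata,
        "last_ingested": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }

    domain_dir.mkdir(parents=True, exist_ok=True)
    index_path.write_text(
        json.dumps(index, indent=2, sort_keys=True), encoding="utf-8"
    )
    return index
```

For instance, `register_source(Path("persona_memory/health"), "myfitnesspal", format="json")` would create or update the health domain's index without disturbing records for other sources.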
Next file: memory.md – will cover short-term and long-term memory mechanics.