Digital Persona: Data Ingestion Architecture

This document outlines the architecture of the Digital Persona’s data ingestion system. It is designed to flexibly import a wide range of personal data sources while respecting the project’s privacy-first principles.

🎯 Goals

Import personal content (emails, health logs, chat messages, calendar events, etc.)
Normalize into semantically structured formats (e.g. JSON-LD, ActivityStreams)
Store in user-owned, locally hosted memory vaults
Expose to the rest of the system through optional MCP interfaces

🔗 Supported Data Sources

Email: via IMAP, Gmail APIs, or Huginn
Calendar: Google Calendar, Apple Calendar (ICS exports, CalDAV)
Health & Fitness: MyFitnessPal, Apple Health, Fitbit (via API or scraper)
Chat Logs: Discord, SMS exports, WhatsApp (manual export), Limitless AI
Journaling/Writing: Obsidian markdown vaults, Google Docs, Notion
Media Metadata: EXIF from photos, YouTube watch history, Spotify playback

⚙️ Ingestion Pipeline

Each connector typically includes:

Fetcher: A script or agent that downloads raw data (via API, scraper, or sync)
Parser: Normalizes input to a memory object with metadata
Serializer: Converts memory object to JSON-LD/ActivityStreams
Storage Layer: Writes to persona_memory/<domain>/<source>/<timestamp>.json
Log Handler: Logs success/failure per run
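
The connector stages above can be sketched as a small pipeline. This is a minimal illustration, not the project's implementation: `MemoryObject`, `serialize`, `store`, and `run_connector` are hypothetical names, and replacing colons in the timestamp file name is an assumption made so the path is valid on every operating system.

```python
import json
import logging
from dataclasses import dataclass, field
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Iterable, List

logger = logging.getLogger("ingestion")


@dataclass
class MemoryObject:
    """Hypothetical intermediate form a parser produces (not the project's schema)."""
    name: str
    content: str
    published: datetime  # expected to be timezone-aware
    tags: List[str] = field(default_factory=list)


def serialize(memory: MemoryObject) -> dict:
    """Serializer: convert a memory object to an ActivityStreams-style dict."""
    published_utc = memory.published.astimezone(timezone.utc)
    return {
        "@context": "https://www.w3.org/ns/activitystreams",
        "type": "Note",
        "name": memory.name,
        "content": memory.content,
        "published": published_utc.strftime("%Y-%m-%dT%H:%M:%SZ"),
        "tag": list(memory.tags),
    }


def store(entry: dict, vault: Path, domain: str, source: str) -> Path:
    """Storage layer: write to <vault>/<domain>/<source>/<timestamp>.json."""
    target_dir = vault / domain / source
    target_dir.mkdir(parents=True, exist_ok=True)
    # Assumption: colons are replaced so the file name is also valid on Windows.
    stamp = entry["published"].replace(":", "-")
    path = target_dir / f"{stamp}.json"
    path.write_text(json.dumps(entry, indent=2), encoding="utf-8")
    return path


def run_connector(
    fetch: Callable[[], object],
    parse: Callable[[object], Iterable[MemoryObject]],
    vault: Path,
    domain: str,
    source: str,
) -> List[Path]:
    """Run fetch -> parse -> serialize -> store and log the outcome of the run."""
    try:
        raw = fetch()
        written = [store(serialize(m), vault, domain, source) for m in parse(raw)]
    except Exception:
        # Log handler: record the failure with a traceback instead of crashing.
        logger.exception("ingestion failed for %s/%s", domain, source)
        return []
    logger.info("ingested %d entries for %s/%s", len(written), domain, source)
    return written
```

A connector would then only need to supply its own `fetch` and `parse` callables; the serializer, storage layer, and logging stay shared across sources.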

Example: MyFitnessPal Connector

Authenticates using a browser cookie
Scrapes daily entries
Converts to structured JSON
Writes to persona_memory/health/myfitnesspal/YYYY-MM-DD.json
Exposes via local MCP server at /mcp/health/mfp/today
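
The write step of this connector might look like the sketch below. It assumes the entries have already been scraped into plain dictionaries; the entry fields, the wrapper keys (`date`, `source`, `entries`), and the helper names `daily_path` and `write_daily` are all hypothetical. Only the `health/myfitnesspal/YYYY-MM-DD.json` layout comes from the description above.

```python
import json
from datetime import date
from pathlib import Path
from typing import List


def daily_path(vault: Path, day: date) -> Path:
    """Build <vault>/health/myfitnesspal/YYYY-MM-DD.json for the given day."""
    return vault / "health" / "myfitnesspal" / f"{day.isoformat()}.json"


def write_daily(vault: Path, day: date, entries: List[dict]) -> Path:
    """Write one day's entries as structured JSON and return the file path."""
    path = daily_path(vault, day)
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {
        "date": day.isoformat(),
        "source": "myfitnesspal",
        "entries": entries,  # hypothetical shape, e.g. {"meal": ..., "calories": ...}
    }
    # Write to a temporary file first, then rename it into place, so an
    # interrupted run does not leave a half-written daily file behind.
    tmp_path = path.with_name(path.name + ".tmp")
    tmp_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    tmp_path.replace(path)
    return path
```

Re-running the connector for the same day overwrites that day's file, which keeps the scrape idempotent.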

🗺️ Mermaid Diagram: Ingestion Flow

flowchart TD
    A[Personal Data Source] --> B[Connector Script]
    B --> C[Fetch Raw Data]
    C --> D[Parse and Normalize]
    D --> E[Semantic JSON Transformation]
    E --> F[Write to Local Memory Vault]
    F --> G{Expose via MCP?}
    G -->|Yes| H[Run MCP Server Endpoint]
    G -->|No| I[Archive Only]
    H --> J[Queried by Persona Core or RAG]

🧠 Memory Format

All ingested data is transformed into a semantic memory entry, such as:

{
  "@context": "https://www.w3.org/ns/activitystreams",
  "type": "Note",
  "name": "Weight entry",
  "content": "Weight: 171.2 lbs",
  "published": "2025-07-11T08:45:00Z",
  "tag": ["weight", "health", "myfitnesspal"]
}
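
Before an entry like the one above is written to the vault, a connector could run a light sanity check. The sketch below is one possible approach: the set of required keys and the exact timestamp format are assumptions drawn from the example entry, not a schema defined by the project, and `validate_entry` is a hypothetical helper.

```python
from datetime import datetime
from typing import List

AS_CONTEXT = "https://www.w3.org/ns/activitystreams"
REQUIRED_KEYS = ("@context", "type", "published")  # assumed minimum, not a project rule


def validate_entry(entry: dict) -> List[str]:
    """Return a list of problems found in a memory entry (empty if none)."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in entry]
    if entry.get("@context", AS_CONTEXT) != AS_CONTEXT:
        problems.append("unexpected @context")
    if "published" in entry:
        try:
            # Matches UTC timestamps written like 2025-07-11T08:45:00Z.
            datetime.strptime(entry["published"], "%Y-%m-%dT%H:%M:%SZ")
        except (TypeError, ValueError):
            problems.append("published is not a UTC timestamp")
    if not isinstance(entry.get("tag", []), list):
        problems.append("tag must be a list")
    return problems
```

Returning a list of problems rather than raising on the first one lets the log handler report everything wrong with an entry in a single run.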

🛡️ Privacy Enforcement

All data is pulled and stored locally by default
No data is uploaded to the cloud unless the user explicitly opts in
API connections use HTTPS or another encrypted transport
The persona_memory directory can optionally be encrypted
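
Two of these rules lend themselves to simple guards in code. The sketch below is illustrative only: `PrivacySettings`, `require_secure_url`, and `upload_allowed` are hypothetical names, and allowing plain HTTP for loopback addresses is an assumption made so a locally hosted server can still be reached.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

LOOPBACK_HOSTS = {"localhost", "127.0.0.1", "::1"}


@dataclass(frozen=True)
class PrivacySettings:
    """Hypothetical settings object; both options default to the private choice."""
    cloud_upload_opt_in: bool = False
    encrypt_vault: bool = False


def require_secure_url(url: str) -> str:
    """Return the URL unchanged if it is HTTPS or loopback HTTP, else raise."""
    parsed = urlparse(url)
    if parsed.scheme == "https":
        return url
    # Assumption: unencrypted HTTP is tolerated only for the local machine.
    if parsed.scheme == "http" and parsed.hostname in LOOPBACK_HOSTS:
        return url
    raise ValueError(f"refusing non-HTTPS endpoint: {url}")


def upload_allowed(settings: PrivacySettings) -> bool:
    """Cloud upload is permitted only after an explicit opt-in."""
    return settings.cloud_upload_opt_in
```

A fetcher could call `require_secure_url` on every endpoint before connecting, so a misconfigured connector fails early instead of sending data in the clear.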

🌐 Optional: MCP Exposure

For each memory domain (health, calendar, etc.), run a lightweight MCP server (FastAPI)
Each endpoint serves filtered and structured data
Examples:
- /mcp/calendar/today
- /mcp/health/latest
- /mcp/logs/errors
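
The lookup behind an endpoint such as /mcp/health/latest could be as small as the function below. This sketch covers only the data-access side and leaves the FastAPI routing out; `latest_entry` is a hypothetical helper, and it assumes file names are timestamps or dates that sort chronologically as strings.

```python
import json
from pathlib import Path
from typing import Optional


def latest_entry(vault: Path, domain: str) -> Optional[dict]:
    """Return the newest stored entry in <vault>/<domain>/, or None if empty."""
    domain_dir = vault / domain
    if not domain_dir.is_dir():
        return None
    # Look one level down (<source>/<timestamp>.json) and skip hidden files.
    candidates = [
        path for path in domain_dir.glob("*/*.json") if not path.name.startswith(".")
    ]
    if not candidates:
        return None
    # Assumption: ISO-style names such as 2025-07-11 sort chronologically.
    newest = max(candidates, key=lambda path: path.stem)
    return json.loads(newest.read_text(encoding="utf-8"))
```

A route handler would call `latest_entry(vault, "health")` and return the result, responding with a not-found status when the function returns `None`.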

🧩 Extending Ingestion

To add a new source:

Create a new connector script
Follow the fetch → parse → normalize → store pattern
Optionally expose data via MCP
Register ingestion metadata in a .index.json file per domain for lookup
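
The registration step could be sketched as follows. The layout of .index.json is not specified here, so the structure below (a `sources` mapping keyed by source name) and the `register_source` helper are assumptions for illustration, as is placing the file at persona_memory/<domain>/.index.json.

```python
import json
from pathlib import Path


def register_source(vault: Path, domain: str, source: str, description: str) -> dict:
    """Add or update a source in <vault>/<domain>/.index.json and return the index."""
    index_path = vault / domain / ".index.json"
    index_path.parent.mkdir(parents=True, exist_ok=True)
    if index_path.exists():
        index = json.loads(index_path.read_text(encoding="utf-8"))
    else:
        index = {"domain": domain, "sources": {}}  # hypothetical index layout
    index["sources"][source] = {
        "path": f"{domain}/{source}/",
        "description": description,
    }
    index_path.write_text(json.dumps(index, indent=2, sort_keys=True), encoding="utf-8")
    return index
```

Calling `register_source` again for the same source overwrites its metadata rather than duplicating it, so a connector can register itself on every run.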

Next file: memory.md, which will cover short-term and long-term memory mechanics.

Digital Persona: Data Ingestion Architecture - Hackshaven/digital-persona GitHub Wiki
