Architecture - robbiemu/aclarai GitHub Wiki

aclarai Deployment Architecture (Docker Compose Edition)

This document outlines a containerized architecture for aclarai, leveraging Docker Compose to manage heavy backend components such as Neo4j (graph DB), Postgres (vector DB), and scheduled automation jobs. This approach supports local and small-team deployments with strong isolation, observability, and development ergonomics.


🧱 Component Overview

Service Purpose
vault-watcher Scans the Obsidian vault for Markdown changes, emits dirty block IDs
aclarai-core Main processing engine: claim extraction, summarization, concept linking
postgres Stores sentence/chunk embeddings via pgvector for vector search
neo4j Holds the structured knowledge graph (claims, concepts, summaries, etc.)
scheduler Triggers nightly jobs (concept hygiene, vault sync, reprocessing)
reverse-proxy (Optional) For remote access, secure with HTTPS and auth

🐳 Docker Compose Services

# aclarai Docker Compose Stack
# no version: this is officially deprecated now

services:
  postgres:  # Vector DB backend for sentence embeddings and similarity checks
    image: ankane/pgvector
    restart: unless-stopped
    env_file:
      - .env
    environment:
      POSTGRES_DB: aclarai
    volumes:
      - pg_data:/var/lib/postgresql/data
    ports:
      - "5432:5432"
    networks:
      - aclarai_net

  neo4j:  # Knowledge graph for claims, summaries, and concepts
    image: neo4j:5
    restart: unless-stopped
    env_file:
      - .env
    environment:
      NEO4J_AUTH: ${NEO4J_USER}/${NEO4J_PASSWORD}
    volumes:
      - neo4j_data:/data
    ports:
      - "7474:7474"
      - "7687:7687"
    networks:
      - aclarai_net

  rabbitmq: # Message broker for inter-service communication (e.g., vault-watcher -> aclarai-core)
    image: rabbitmq:3-management # Includes management UI on port 15672
    restart: unless-stopped
    environment:
      RABBITMQ_DEFAULT_USER: user # Replace with environment variables or Docker secrets
      RABBITMQ_DEFAULT_PASS: password # Replace with environment variables or Docker secrets
    ports:
      - "5672:5672" # Standard AMQP port
      - "15672:15672" # Management UI
    networks:
      - aclarai_net
    healthcheck: # Basic health check
      test: ["CMD", "rabbitmq-diagnostics", "ping"]
      interval: 30s
      timeout: 10s
      retries: 5

  aclarai-core:  # Main processing pipeline: claim extraction, summarization, linking
    build: ./aclarai-core
    depends_on:
      - rabbitmq
      - postgres
      - neo4j
    volumes:
      - ../vault:/vault
      - ./settings:/settings
    env_file:
      - .env
    networks:
      - aclarai_net
    environment:
      - VAULT_PATH=/vault
      - SETTINGS_PATH=/settings
      - RABBITMQ_HOST=rabbitmq

  vault-watcher:  # Watches vault for Markdown edits and emits dirty blocks
    build: ./vault-watcher
    depends_on:
      - rabbitmq
      - aclarai-core
    volumes:
      - ../vault:/vault
      - ./settings:/settings
    env_file:
      - .env
    networks:
      - aclarai_net
    environment:
      - VAULT_PATH=/vault
      - SETTINGS_PATH=/settings
      - RABBITMQ_HOST=rabbitmq

  scheduler:  # Runs periodic jobs: concept hygiene, vault sync, reprocessing
    build: ./scheduler
    depends_on:
      - aclarai-core
    volumes:
      - ../vault:/vault
      - ./settings:/settings
    env_file:
      - .env
    networks:
      - aclarai_net
    environment:
      - VAULT_PATH=/vault
      - SETTINGS_PATH=/settings

volumes:
  pg_data:
  neo4j_data:

networks:
  aclarai_net:

⏱️ Scheduled Jobs (via scheduler)

Nightly or periodic jobs are handled by a dedicated container that can:

  • Refresh concept embeddings
  • Detect and merge near-duplicate concepts
  • Reprocess dirty Markdown blocks
  • Reconcile Neo4j with file changes (vault-to-graph sync)

The scheduler uses cron, celery beat, or a lightweight alternative like watchfiles + APScheduler.


🔄 Graph–Vault Synchronization

As described in [on_graph_vault_synchronization.md], the sync system must:

  • Watch for changes in .md files
  • Parse aclarai:id and ver= markers
  • Update Neo4j nodes or queue jobs if blocks have changed
  • Perform atomic writes (.tmprename) to ensure file safety

This sync logic is embedded within vault-watcher and executed by the nightly scheduler.


🧠 Concept Handling Strategy

Concept creation and drift merging (from [on_concepts.md]) runs in two modes:

  • During claim extraction batches: extract surface candidates, check similarity (via HNSW), create or update (:Concept) nodes.
  • Nightly job: refresh embeddings from Tier 3 files, detect/merge near-duplicates, handle file → graph hygiene.

📦 Build & Extend

  • Each service has its own Dockerfile.
  • Models (e.g. LLMs, embedders) can be configured via .env or mounted config files.
  • User-editable configurations, such as LLM prompt templates, are located in the ./settings directory, which is mounted into each service. This allows for direct modification of prompts without rebuilding containers.
  • Plugins for new file formats or additional graph analyses can mount as volumes or live in their own containers.

🔐 Security & Deployment Notes

  • You can override any .env variable (e.g., database URLs) by setting it in your shell before running docker compose, or by mounting your own .env file.

  • To use your own Neo4j or Postgres instance, set GRAPH_DB_URL and VECTOR_DB_URL in .env. When pointing to services outside Docker (e.g., on host), use host.docker.internal as the hostname inside containers.

  • If your Neo4j or Postgres service is hosted remotely or outside this Compose stack, you may comment out or remove the corresponding service block in docker-compose.yml to avoid conflicts.

  • Secrets (e.g. database passwords, auth tokens) should be stored in a .env file or managed with Docker secrets.

  • The Docker Compose file now references these via env_file instead of hardcoded values.

  • If deployed remotely, use a reverse proxy (e.g. Traefik or Caddy) to add HTTPS, rate limiting, and auth.

  • Optional: expose a REST/gRPC API gateway for headless interaction.


✅ Summary

This Docker Compose setup cleanly separates aclarai’s data storage (Neo4j, Postgres), processing agents (claimify, summarizer, concept linker), and scheduling logic. It offers a reproducible local deployment and a strong base for small-team or remote usage.