Idea embedding in vectordb

✅ Embed utterances and save vectors to Postgres

What it does:

For each utterance (block), run it through an embedding model and store:

  • aclarai_id
  • embedding vector
  • Optional metadata: file name, sentence index, timestamp

Stored in Postgres using pgvector (a schema sketch follows below), enabling:

  • Similarity search
  • Near-duplicate detection
  • Efficient claim retrieval
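
As a rough illustration of what that record could look like at the SQL level, here is a minimal pgvector schema plus a near-duplicate lookup. The table name, column names, and 384-dimension vector size are assumptions for this sketch, not part of aclarai.

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Placeholder DSN; assumes the pgvector extension is already installed in the database
conn = psycopg2.connect("postgresql://user:pass@localhost:5432/aclarai")
register_vector(conn)  # lets psycopg2 send/receive numpy arrays as pgvector values

with conn, conn.cursor() as cur:
    # Hypothetical table: one row per utterance block
    cur.execute("""
        CREATE TABLE IF NOT EXISTS utterance_embeddings (
            aclarai_id     TEXT PRIMARY KEY,
            embedding      VECTOR(384),   -- dimension depends on the embedding model
            file_name      TEXT,
            sentence_index INTEGER,
            created_at     TIMESTAMPTZ DEFAULT now()
        )
    """)

    # Similarity search / near-duplicate detection: nearest rows by cosine distance (<=>)
    query_vec = np.random.rand(384).astype(np.float32)  # stand-in for a real embedding
    cur.execute(
        "SELECT aclarai_id, embedding <=> %s AS dist "
        "FROM utterance_embeddings ORDER BY dist LIMIT 5",
        (query_vec,),
    )
    print(cur.fetchall())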

Implementation note:

Use a batch insert via SQLAlchemy or psycopg2, and index the vector column with ivfflat.
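
For example, a batched write with psycopg2's execute_values followed by an ivfflat index build might look like the sketch below; the embed() helper, table name, and lists value are assumptions, not fixed choices.

import psycopg2
from pgvector.psycopg2 import register_vector
from psycopg2.extras import execute_values

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/aclarai")  # placeholder DSN
register_vector(conn)

with conn, conn.cursor() as cur:
    # One round trip for the whole batch instead of one INSERT per utterance;
    # embed() is a hypothetical helper returning a numpy float32 vector
    rows = [(u.id, embed(u.text)) for u in utterances]
    execute_values(
        cur,
        "INSERT INTO utterance_embeddings (aclarai_id, embedding) VALUES %s "
        "ON CONFLICT (aclarai_id) DO NOTHING",
        rows,
    )

    # Approximate nearest-neighbor index on the vector column; build after bulk
    # loading and tune `lists` to the data size
    cur.execute(
        "CREATE INDEX IF NOT EXISTS utterance_embeddings_ivfflat "
        "ON utterance_embeddings USING ivfflat (embedding vector_cosine_ops) "
        "WITH (lists = 100)"
    )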

Ideally, this task can leverage LlamaIndex’s built-in vector store abstractions, which already support pgvector and manage:

  • Embedding generation
  • Chunk tracking
  • Storage and retrieval from the database
  • Index metadata (e.g. document ID, chunk position)

✅ How LlamaIndex Fits In

In your POC, you can:

  1. Use a VectorStoreIndex from LlamaIndex, configured for pgvector.
  2. Ingest each utterance as a Node (or TextNode) with its aclarai_id as metadata.
  3. Let LlamaIndex embed and store that node to Postgres via pgvector.
  4. Query later by Node ID, metadata, or similarity.

🔌 Example Sketch

from llama_index.core import Document, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

# Init vector store (connection parameters elided)
pg_store = PGVectorStore.from_params(...)

# Index backed by the pgvector store
index = VectorStoreIndex.from_vector_store(pg_store)

# Wrap each utterance, carrying its aclarai_id as metadata
docs = [Document(text=u.text, metadata={"aclarai_id": u.id}) for u in utterances]

# Add to index (embedding + write to Postgres)
for doc in docs:
    index.insert(doc)
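
Querying (step 4 above) can then go through the same index, for example with a similarity retriever or a metadata filter. This continues the sketch above; the query string and ID value are placeholders.

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Similarity search over the stored utterance vectors
retriever = index.as_retriever(similarity_top_k=5)
hits = retriever.retrieve("example query text")  # placeholder query

# Scope retrieval to a single block via its aclarai_id metadata
filters = MetadataFilters(filters=[ExactMatchFilter(key="aclarai_id", value="blk_abc123")])  # placeholder ID
scoped = index.as_retriever(filters=filters).retrieve("example query text")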

This saves you from:

  • Manually batching embeddings
  • Writing INSERT INTO vectors (...) SQL
  • Managing retrieval logic