Idea embedding in vectordb

✅ Embed utterances and save vectors to Postgres

What it does:

For each utterance (block), run it through an embedding model and store:

  • aclarai_id
  • embedding vector
  • Optional metadata: file name, sentence index, timestamp

Stored in Postgres using pgvector (a schema sketch follows below), enabling:

  • Similarity search
  • Near-duplicate detection
  • Efficient claim retrieval
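
As a rough illustration of what that record could look like at the SQL level, here is a minimal pgvector schema plus a near-duplicate lookup. The table name, column names, and 384-dimension vector size are assumptions for this sketch, not part of aclarai.

import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Placeholder DSN; assumes the pgvector extension is already installed in the database
conn = psycopg2.connect("postgresql://user:pass@localhost:5432/aclarai")
register_vector(conn)  # lets psycopg2 send/receive numpy arrays as pgvector values

with conn, conn.cursor() as cur:
    # Hypothetical table: one row per utterance block
    cur.execute("""
        CREATE TABLE IF NOT EXISTS utterance_embeddings (
            aclarai_id     TEXT PRIMARY KEY,
            embedding      VECTOR(384),   -- dimension depends on the embedding model
            file_name      TEXT,
            sentence_index INTEGER,
            created_at     TIMESTAMPTZ DEFAULT now()
        )
    """)

    # Similarity search / near-duplicate detection: nearest rows by cosine distance (<=>)
    query_vec = np.random.rand(384).astype(np.float32)  # stand-in for a real embedding
    cur.execute(
        "SELECT aclarai_id, embedding <=> %s AS dist "
        "FROM utterance_embeddings ORDER BY dist LIMIT 5",
        (query_vec,),
    )
    print(cur.fetchall())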

Implementation note:

Use a batch insert via SQLAlchemy or psycopg2, and index the vector column with ivfflat.
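
For example, a batched write with psycopg2's execute_values followed by an ivfflat index build might look like the sketch below; the embed() helper, table name, and lists value are assumptions, not fixed choices.

import psycopg2
from pgvector.psycopg2 import register_vector
from psycopg2.extras import execute_values

conn = psycopg2.connect("postgresql://user:pass@localhost:5432/aclarai")  # placeholder DSN
register_vector(conn)

with conn, conn.cursor() as cur:
    # One round trip for the whole batch instead of one INSERT per utterance;
    # embed() is a hypothetical helper returning a numpy float32 vector
    rows = [(u.id, embed(u.text)) for u in utterances]
    execute_values(
        cur,
        "INSERT INTO utterance_embeddings (aclarai_id, embedding) VALUES %s "
        "ON CONFLICT (aclarai_id) DO NOTHING",
        rows,
    )

    # Approximate nearest-neighbor index on the vector column; build after bulk
    # loading and tune `lists` to the data size
    cur.execute(
        "CREATE INDEX IF NOT EXISTS utterance_embeddings_ivfflat "
        "ON utterance_embeddings USING ivfflat (embedding vector_cosine_ops) "
        "WITH (lists = 100)"
    )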

Ideally, this task can leverage LlamaIndex’s built-in vector store abstractions, which already support pgvector and manage:

  • Embedding generation
  • Chunk tracking
  • Storage and retrieval from the database
  • Index metadata (e.g. document ID, chunk position)

✅ How LlamaIndex Fits In

In your POC, you can:

  1. Use a VectorStoreIndex from LlamaIndex, configured for pgvector.
  2. Ingest each utterance as a Node (or TextNode) with its aclarai_id as metadata.
  3. Let LlamaIndex embed and store that node to Postgres via pgvector.
  4. Query later by Node ID, metadata, or similarity.

🔌 Example Sketch

from llama_index.core import Document, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

# Init vector store (connection parameters elided)
pg_store = PGVectorStore.from_params(...)

# Index backed by the pgvector store
index = VectorStoreIndex.from_vector_store(pg_store)

# Wrap each utterance, carrying its aclarai_id as metadata
docs = [Document(text=u.text, metadata={"aclarai_id": u.id}) for u in utterances]

# Add to index (embedding + write to Postgres)
for doc in docs:
    index.insert(doc)
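
Querying (step 4 above) can then go through the same index, for example with a similarity retriever or a metadata filter. This continues the sketch above; the query string and ID value are placeholders.

from llama_index.core.vector_stores import ExactMatchFilter, MetadataFilters

# Similarity search over the stored utterance vectors
retriever = index.as_retriever(similarity_top_k=5)
hits = retriever.retrieve("example query text")  # placeholder query

# Scope retrieval to a single block via its aclarai_id metadata
filters = MetadataFilters(filters=[ExactMatchFilter(key="aclarai_id", value="blk_abc123")])  # placeholder ID
scoped = index.as_retriever(filters=filters).retrieve("example query text")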

This saves you from:

  • Manually batching embeddings
  • Writing INSERT INTO vectors (...) SQL
  • Managing retrieval logic