Idea embedding in vectordb - robbiemu/aclarai GitHub Wiki
✅ Embed utterances and save vectors to Postgres
What it does:
For each utterance (block), run it through an embedding model and store:
- aclarai_id
- embedding vector
- optional metadata: file name, sentence index, timestamp
The vectors are stored in Postgres using pgvector, enabling:
- Similarity search
- Near-duplicate detection
- Efficient claim retrieval
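To make the similarity and near-duplicate use cases concrete, here is a toy sketch of the comparison pgvector performs server-side: cosine similarity between embedding vectors. The vectors below are hand-made stand-ins, not real model output.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Two near-duplicate utterances vs. an unrelated one (illustrative vectors)
emb_a = [0.9, 0.1, 0.0]
emb_b = [0.89, 0.11, 0.01]  # near-duplicate of emb_a
emb_c = [0.0, 0.2, 0.95]    # unrelated

sim_dup = cosine_similarity(emb_a, emb_b)   # close to 1.0 -> flag as near-duplicate
sim_diff = cosine_similarity(emb_a, emb_c)  # much lower -> distinct utterance
```

In practice the embedding model produces these vectors and pgvector's index does the comparison at query time; a threshold on the similarity score is one simple way to flag near-duplicates.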
Implementation note:
Use a batch insert via SQLAlchemy or psycopg2, and index the vector column with ivfflat.
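A sketch of that manual path, assuming a hypothetical `utterance_vectors` table (the table name, column names, and 384-dimension embedding are illustrative, not from the source):

```python
# DDL for the table and an ivfflat index on the vector column
# (pgvector's ivfflat requires a distance opclass and a `lists` parameter).
DDL = """
CREATE TABLE IF NOT EXISTS utterance_vectors (
    aclarai_id     TEXT PRIMARY KEY,
    embedding      vector(384),
    file_name      TEXT,
    sentence_index INT,
    created_at     TIMESTAMPTZ DEFAULT now()
);
CREATE INDEX IF NOT EXISTS utterance_vectors_ivfflat
    ON utterance_vectors
    USING ivfflat (embedding vector_cosine_ops)
    WITH (lists = 100);
"""

INSERT_SQL = (
    "INSERT INTO utterance_vectors "
    "(aclarai_id, embedding, file_name, sentence_index) VALUES %s"
)

def to_rows(utterances, vectors):
    """Pair each utterance dict with its embedding for one batched INSERT,
    e.g. psycopg2.extras.execute_values(cur, INSERT_SQL, to_rows(...))."""
    return [
        (u["id"], vec, u.get("file"), u.get("index"))
        for u, vec in zip(utterances, vectors)
    ]
```

Batching all rows into a single `execute_values` call avoids one round trip per utterance; the ivfflat `lists` value should be tuned to the table size.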
Ideally, this task can leverage LlamaIndex's built-in vector store abstractions, which already support pgvector and manage:
- Embedding generation
- Chunk tracking
- Storage and retrieval from the database
- Index metadata (e.g. document ID, chunk position)
✅ How LlamaIndex Fits In
In your POC, you can:
- Use a `VectorStoreIndex` from LlamaIndex, configured for pgvector.
- Ingest each utterance as a `Node` (or `TextNode`) with its `aclarai:id` as metadata.
- Let LlamaIndex embed and store that node in Postgres via pgvector.
- Query later by node ID, metadata, or similarity.
🔌 Example Sketch
```python
from llama_index.core import Document, StorageContext, VectorStoreIndex
from llama_index.vector_stores.postgres import PGVectorStore

# Init vector store (connection params elided)
pg_store = PGVectorStore.from_params(...)
storage_context = StorageContext.from_defaults(vector_store=pg_store)

# Wrap each utterance, carrying its aclarai_id as metadata
docs = [Document(text=u.text, metadata={"aclarai_id": u.id}) for u in utterances]

# Add to index (embedding + write to Postgres)
index = VectorStoreIndex.from_documents(docs, storage_context=storage_context)

# Later: retrieve the most similar utterances
retriever = index.as_retriever(similarity_top_k=3)
results = retriever.retrieve("some query text")
```
This saves you from:
- Manually batching embeddings
- Writing `INSERT INTO vectors (...)` SQL by hand
- Managing retrieval logic
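For contrast, the retrieval logic you would otherwise hand-roll looks like this (a sketch; the table and column names are illustrative). pgvector's `<=>` operator computes cosine distance, so smaller means more similar:

```python
def nearest_neighbors_sql(top_k: int = 5) -> str:
    """Parameterized SQL for the top-k utterances nearest a query vector.
    The query embedding is passed as the %s parameter at execution time."""
    return (
        "SELECT aclarai_id, embedding <=> %s::vector AS distance "
        "FROM utterance_vectors "
        f"ORDER BY distance LIMIT {top_k}"
    )
```

LlamaIndex generates the equivalent query internally when you call its retriever, which is the main argument for using its abstraction in the POC.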