Idea sketch how to use a BERT model with llamaindex

Configuring PGVectorStore to Use a Custom BERT Model for Embeddings in LlamaIndex

This is a fantastic and very practical question! LlamaIndex's PGVectorStore (and indeed any VectorStoreIndex) doesn't directly take the embedding model as a parameter. Instead, the embedding model is configured at a higher level, typically within the LlamaIndex ServiceContext or directly when initializing the VectorStoreIndex or QueryEngine.

Here's how you'd typically set it up to use a BERT-based model (e.g., from HuggingFace) with PGVectorStore:

  1. Choose a BERT-based Embedding Model: You'll need a model that can produce embeddings. Common choices include:

    • Sentence Transformers (often built on BERT, RoBERTa, etc.): These are highly optimized for sentence-level embeddings. Examples: all-MiniLM-L6-v2, bge-small-en-v1.5.
    • Direct HuggingFace models: If you want to use a raw BERT model, you'd typically load the model and its tokenizer and define a pooling function to compute embeddings (see the short sketch just before the main example below).
  2. LlamaIndex Embedding Class: LlamaIndex provides various Embedding classes to interact with different embedding providers/models. For HuggingFace models, you'd use HuggingFaceEmbedding.

  3. LlamaIndex ServiceContext: This is the central configuration object in LlamaIndex where you specify your LLM, embedding model, node parser (which includes the TextSplitter), and other components.

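If you do go the raw-BERT route rather than a Sentence Transformers checkpoint, the usual pattern is mean pooling over the last hidden state. Below is a minimal, illustrative sketch using the transformers library; the embed helper and the choice of bert-base-uncased are just assumptions for the example, not anything aclarai or LlamaIndex provides.

import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts: list[str]) -> torch.Tensor:
    # Tokenize a batch of sentences and run them through BERT
    inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        last_hidden = model(**inputs).last_hidden_state  # (batch, seq_len, 768)
    # Mean-pool over tokens, ignoring padding positions
    mask = inputs["attention_mask"].unsqueeze(-1).float()
    return (last_hidden * mask).sum(dim=1) / mask.sum(dim=1)

vectors = embed(["Let's release v1.2 next week."])
print(vectors.shape)  # torch.Size([1, 768])

In practice you rarely need to write this yourself: the HuggingFaceEmbedding class used in the sketch below wraps the same load-tokenize-pool flow for you.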
Here's a sketch of the code:

import torch # Used below to pick a device for the embedding model

# Note: these are the legacy (pre-0.10) LlamaIndex import paths, matching the ServiceContext API used here
from llama_index.llms import OpenAI # Or any LLM you want, even None if just for embeddings
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.node_parser import SentenceSplitter # Explicitly using SentenceSplitter
from llama_index import ServiceContext, VectorStoreIndex
from llama_index.vector_stores import PGVectorStore
from sqlalchemy import create_engine, text

# --- 1. Configure your Embedding Model ---
# For a local BERT-based model (e.g., Sentence Transformers)
# Ensure you have 'sentence-transformers' package installed
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-MiniLM-L6-v2", # Your chosen BERT-based model
    device="cuda" if torch.cuda.is_available() else "cpu" # Use GPU if available
)

# --- 2. Configure Node Parser (Text Splitter) ---
# This is where you specify your SentenceSplitter and its parameters
node_parser = SentenceSplitter(
    chunk_size=300,
    chunk_overlap=30,
    # Add any post-processing rules if the SentenceSplitter itself allows,
    # otherwise, your custom post-processing would happen *before* this.
    # For now, LlamaIndex's splitter is primarily for chunking.
)

# --- 3. Create ServiceContext ---
# This bundles your chosen components together
# You can set llm=None if this ServiceContext is strictly for embedding/storage and not query generation
service_context = ServiceContext.from_defaults(
    llm=OpenAI(model="gpt-3.5-turbo"), # Or None
    embed_model=embed_model,
    node_parser=node_parser, # Your configured SentenceSplitter
)

# --- 4. Initialize PGVectorStore ---
# Connect to your Postgres database.
# You need the pgvector extension enabled first:
#   CREATE EXTENSION IF NOT EXISTS vector;
# PGVectorStore manages its own table schema (derived from table_name and embed_dim),
# so you don't have to create the table by hand. A hand-rolled equivalent would look
# roughly like:
#   CREATE TABLE IF NOT EXISTS public.aclarai_utterances (
#      id UUID PRIMARY KEY,
#      embedding vector(384), -- match your model's embedding dimension (MiniLM-L6-v2 is 384)
#      text VARCHAR,
#      metadata JSONB
#   );

engine = create_engine("postgresql+psycopg2://user:password@host:port/database")

# Ensure the pgvector extension is enabled (first run / setup)
with engine.connect() as connection:
    connection.execute(text("CREATE EXTENSION IF NOT EXISTS vector"))
    connection.commit()
    # PGVectorStore will create its table automatically based on table_name and embed_dim;
    # you only pre-create it yourself if you need custom column types or indexes.


vector_store = PGVectorStore.from_params(
    database="database",
    host="host",
    port=5432,
    user="user",
    password="password",
    table_name="aclarai_utterances", # LlamaIndex derives the actual Postgres table name from this
    embed_dim=384, # Must match your model's output size (e.g., 384 for all-MiniLM-L6-v2)
    # If you'd rather not hardcode the dimension, you can derive it at runtime:
    # embed_dim=len(embed_model.get_text_embedding("dimension probe")),
)

# --- 5. Create VectorStoreIndex ---
# Pass the configured vector store and service context
# When you insert documents into this index, LlamaIndex will use
# the embed_model and node_parser specified in the service_context.
index = VectorStoreIndex.from_vector_store(
    vector_store=vector_store,
    service_context=service_context,
)

# --- 6. Ingest Data ---
from llama_index.schema import Document

# Example: Assuming you have your Tier 1 Markdown blocks
utterance_text = "Alice: Let's release v1.2 next week.<!-- aclarai:id=blk_1a2b3c ver=1 -->^blk_1a2b3c"
# Your actual process would parse this, extract aclarai_id and the actual text.
# For simplicity, here we'll just use the text.
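# A minimal, purely illustrative way to pull the id out of that marker
# (the regex and variables below are hypothetical, not an aclarai or LlamaIndex API):
import re
id_match = re.search(r"aclarai:id=(\S+) ver=(\d+)", utterance_text)
aclarai_id = id_match.group(1) if id_match else "unknown"  # -> "blk_1a2b3c"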
document = Document(
    text=utterance_text,
    metadata={
        "aclarai_id": "blk_1a2b3c",
        "chunk_index": 0, # This would be assigned by your segmentation logic
        "original_text_full_block": utterance_text # Store original full block if needed
    }
)

# When you insert, the service_context's node_parser will split,
# and its embed_model will create embeddings, which PGVectorStore will store.
index.insert(document)

# --- 7. Query (Optional, for verification) ---
# query_engine = index.as_query_engine()
# response = query_engine.query("What was discussed about release?")
# print(response)
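# If you set llm=None in the ServiceContext, skip the query engine (it needs an LLM to
# synthesize answers) and verify the stored embeddings with a retriever instead.
# A minimal sketch (the top_k value is arbitrary):
# retriever = index.as_retriever(similarity_top_k=3)
# nodes = retriever.retrieve("What was discussed about release?")
# for node in nodes:
#     print(node.score, node.node.get_text()[:80])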

Key Takeaways for Configuration:

  • HuggingFaceEmbedding: This is the LlamaIndex class you'll use to load your BERT-based model.
  • ServiceContext: This is the glue. It's where you tell LlamaIndex which embedding model (embed_model) and which text splitter (node_parser) to use.
  • PGVectorStore: It talks to Postgres and needs to know the embed_dim (embedding dimension), which must match your embed_model's output size. It doesn't care what kind of model produced the vectors, only how long they are.
  • Database Schema: Ensure your Postgres table's vector column dimension matches the output dimension of your chosen BERT model (e.g., 384 for all-MiniLM-L6-v2, 768 for bert-base-uncased); see the quick check below.

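If you're unsure of a model's output dimension, you can probe it directly. A quick sketch, reusing the embed_model instance from the example above:

# Embed a throwaway string and inspect the vector length
probe = embed_model.get_text_embedding("dimension probe")
print(len(probe))  # 384 for all-MiniLM-L6-v2, 768 for bert-base-uncased
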
This setup gives you full control over the embedding model used for your PGVectorStore via LlamaIndex's robust configuration system.