postgres pgvector
An alternative solution:
pgvector is a PostgreSQL extension that allows you to store, query, and index vectors.
Step 1: Install and configure
https://github.com/pgvector/pgvector?tab=readme-ov-file#installation
sudo apt-get install postgresql-server-dev-all
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
postgres=# CREATE EXTENSION vector;
CREATE TABLE documents (
  id int PRIMARY KEY,
  title text NOT NULL,
  content text NOT NULL
);
CREATE TABLE document_embeddings (
  id int PRIMARY KEY,
  embedding vector(1536) NOT NULL
);
INSERT INTO documents VALUES (1, 'pgvector', 'pgvector is a PostgreSQL extension that provides support for vector similarity search and nearest neighbor search in SQL.');
INSERT INTO documents VALUES (2, 'pg_similarity', 'pg_similarity is a PostgreSQL extension that provides similarity and distance operators for vector columns.');
INSERT INTO documents VALUES (3, 'pg_trgm', 'pg_trgm is a PostgreSQL extension that provides functions and operators for determining the similarity of alphanumeric text based on trigram matching.');
INSERT INTO documents VALUES (4, 'pg_prewarm', 'pg_prewarm is a PostgreSQL extension that provides functions for prewarming relation data into the PostgreSQL buffer cache.');
There are two types of indexes available in pgvector, ivfflat and hnsw, and each serves a different purpose:
- IVFFlat (Inverted File Flat) index:
  - Divides the vector space into clusters, speeding up searches by first identifying the most relevant clusters
  - Builds faster and uses less memory than HNSW
  - A good balance of search speed and accuracy when slightly slower, approximate searches are acceptable
- HNSW (Hierarchical Navigable Small World) index:
  - Designed for approximate nearest-neighbor searches when you need fast queries
  - Creates a layered graph structure for swift navigation between vectors
  - Extremely fast, but may occasionally miss the absolute nearest neighbor
For IVFFlat indexes, it is best to create the index after the table has some data, because building one involves a training step that picks cluster centers from existing rows. HNSW has no such training step, so the index can be created on an empty table. HNSW generally offers better query performance than IVFFlat, at the cost of slower builds and higher memory use.
CREATE INDEX ON document_embeddings USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON document_embeddings USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);
CREATE INDEX ON document_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
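Each operator class binds the index to one distance operator: vector_l2_ops serves <-> (Euclidean distance), vector_cosine_ops serves <=> (cosine distance), and vector_ip_ops serves <#> (negative inner product). A quick comparison of the three operators on literal vectors, sketched with psycopg2 (connection details are placeholders):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

# The three distance operators on a pair of 3-dimensional literals
cur.execute("""
    SELECT '[1, 2, 3]'::vector <-> '[4, 5, 6]'::vector AS l2_distance,       -- ~5.196
           '[1, 2, 3]'::vector <=> '[4, 5, 6]'::vector AS cosine_distance,   -- ~0.0254
           '[1, 2, 3]'::vector <#> '[4, 5, 6]'::vector AS neg_inner_product  -- -32
""")
print(cur.fetchone())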
--find the closest 5 elements to a given query
SELECT * FROM document_embeddings ORDER BY embedding <=> '[10.5, 11.0,...]' LIMIT 5;
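At query time, each index type exposes a speed/recall knob: hnsw.ef_search (default 40) and ivfflat.probes (default 1). A minimal tuning sketch, assuming the document_embeddings table and indexes above (connection details are placeholders):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

# HNSW: scan more graph candidates per query (default 40); higher = better recall, slower
cur.execute("SET hnsw.ef_search = 100")

# IVFFlat: scan more clusters per query (default 1); higher = better recall, slower
cur.execute("SET ivfflat.probes = 10")

# A dummy 1536-dimensional query vector, matching the column's declared dimension
query_vec = "[" + ", ".join("0.1" for _ in range(1536)) + "]"
cur.execute(
    "SELECT id FROM document_embeddings ORDER BY embedding <=> %s::vector LIMIT 5",
    (query_vec,),
)
print(cur.fetchall())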
Step 2: Generate embeddings
- https://platform.openai.com/docs/guides/embeddings
- https://www.sbert.net/ - open source model
- https://bielik.ai/ - a model for the Polish language
To search documents by meaning, use an embedding model to convert them into embeddings.
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness, and large distances suggest low relatedness.
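For intuition, here is how cosine distance relates two vectors, sketched in plain numpy with toy 3-dimensional vectors:

import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means identical direction, 2 means opposite
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])     # similar direction -> small distance
c = np.array([-1.0, -2.0, -3.0])  # opposite direction -> large distance
print(cosine_distance(a, b))  # ~0.001
print(cosine_distance(a, c))  # 2.0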
pgvector integrates easily with Python via the psycopg2 library; the companion pgvector Python package registers the vector type with the driver.
!pip install psycopg2-binary numpy pgvector
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Connect to the database and register the vector type with psycopg2
conn = psycopg2.connect("dbname=your_database user=your_username")
register_vector(conn)
cur = conn.cursor()

# Insert a vector (a toy 3-dimensional example; the column's declared dimension
# must match, and id has no default, so it is supplied explicitly)
embedding = np.array([1.5, 2.5, 3.5])
cur.execute("INSERT INTO document_embeddings (id, embedding) VALUES (%s, %s)", (1, embedding))
conn.commit()

# Perform a similarity search with the Euclidean distance operator <->
query_vector = np.array([2, 3, 4])
cur.execute("SELECT * FROM document_embeddings ORDER BY embedding <-> %s LIMIT 1", (query_vector,))
result = cur.fetchone()
print(f"Nearest neighbor: {result}")

cur.close()
conn.close()
Next, use the OpenAI embedding model "text-embedding-ada-002" to embed the documents inserted earlier.
# Python code to preprocess and embed documents
import openai
import psycopg2

# Load the OpenAI API key
openai.api_key = "sk-..."  # YOUR OWN API KEY

# Pick the embedding model
model_id = "text-embedding-ada-002"

# Connect to the PostgreSQL database
conn = psycopg2.connect(database="postgres", user="aiuser", host="localhost", port="5432")

# Fetch documents from the database
cur = conn.cursor()
cur.execute("SELECT id, content FROM documents")
documents = cur.fetchall()

# Generate an embedding per document and store it back in the database
for doc_id, doc_content in documents:
    response = openai.embeddings.create(input=doc_content, model=model_id)
    embedding = response.data[0].embedding
    cur.execute("INSERT INTO document_embeddings (id, embedding) VALUES (%s, %s);",
                (doc_id, embedding))

# Commit and close the database connection
conn.commit()
cur.close()
conn.close()
This code fetches document contents from the database, uses the OpenAI API to generate embeddings, and then stores the embeddings back in the database. While this works for our small database, in a real-world scenario you would want to batch the work for existing data and use an event trigger or change-data-capture streaming to keep the vectors up to date as the database changes.
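A minimal batching sketch, assuming the schema above (the batch size and the idea of embedding several inputs per API call are choices made here, not something prescribed by pgvector):

import openai
import psycopg2

openai.api_key = "sk-..."  # YOUR OWN API KEY
conn = psycopg2.connect(database="postgres", user="aiuser", host="localhost", port="5432")
cur = conn.cursor()

BATCH_SIZE = 100  # illustrative; tune to your API rate limits

# Only embed documents that do not have an embedding yet
cur.execute("""
    SELECT d.id, d.content
    FROM documents d
    LEFT JOIN document_embeddings e ON e.id = d.id
    WHERE e.id IS NULL
""")
rows = cur.fetchall()

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    ids = [r[0] for r in batch]
    texts = [r[1] for r in batch]
    # The embeddings endpoint accepts a list of inputs in a single call,
    # and returns results in the same order
    response = openai.embeddings.create(input=texts, model="text-embedding-ada-002")
    for doc_id, item in zip(ids, response.data):
        cur.execute(
            "INSERT INTO document_embeddings (id, embedding) VALUES (%s, %s)",
            (doc_id, item.embedding),
        )
    conn.commit()  # commit per batch so progress survives interruptions

cur.close()
conn.close()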
Step 3: Querying embeddings
pgvector has a set of built-in functions to manipulate and perform operations on vector data. These functions let you calculate vector distances, perform vector arithmetic, and more:
- cosine_distance() calculates the cosine distance between two vectors (the same measure as the <=> operator)
- vector_norm() returns the Euclidean norm of a vector
- vector_dims() returns the number of dimensions a vector contains
- avg(vector) and sum(vector) aggregate functions for calculating analytics on vectors
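A quick demo of these functions on literal vectors, run via psycopg2 (the values in the comments are what Postgres returns):

import psycopg2

conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

cur.execute("""
    SELECT cosine_distance('[1, 2, 3]'::vector, '[4, 5, 6]'::vector),  -- ~0.0254
           vector_norm('[3, 4]'::vector),                              -- 5.0
           vector_dims('[1, 2, 3]'::vector)                            -- 3
""")
print(cur.fetchone())

# Aggregates work over vector columns as well, e.g. a centroid of all embeddings
cur.execute("SELECT avg(embedding) FROM document_embeddings")
print(cur.fetchone())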
Distance queries compose with ordinary SQL, e.g. joining the closest matches to other relational data (this example assumes a schema where documents carries an embedding column and an author_id referencing an author table):
WITH matching_docs AS (
  -- find the 5 closest matches
  SELECT *
  FROM documents
  ORDER BY embedding <=> '[10.5, 11.0,...]'
  LIMIT 5
)
SELECT d.content, a.first_name, a.last_name
FROM matching_docs d
INNER JOIN author a ON (a.id = d.author_id);
# Python code to find documents similar to a given one
import psycopg2

# Connect to the PostgreSQL database
conn = psycopg2.connect(database="postgres", user="gulcin.jelinek", host="localhost", port="5432")
cur = conn.cursor()

# Fetch extensions that are similar to pgvector based on their descriptions
query = """
    WITH pgv AS (
        SELECT embedding
        FROM document_embeddings JOIN documents USING (id)
        WHERE title = 'pgvector'
    )
    SELECT title, content
    FROM document_embeddings
    JOIN documents USING (id)
    WHERE embedding <-> (SELECT embedding FROM pgv) < 0.5;"""
cur.execute(query)

# Fetch results
results = cur.fetchall()

# Print results in a nice format
for doc_title, doc_content in results:
    print(f"Document title: {doc_title}")
    print(f"Document text: {doc_content}")
    print()
The query first fetches the embedding vector for the document titled "pgvector" and then uses a similarity search to find documents with similar content. Note the <-> operator: that is where the pgvector magic happens, computing the Euclidean (L2) distance between two vectors. The 0.5 value is a distance threshold that is highly use-case dependent and requires fine-tuning in real-world applications. One caveat: pgvector's indexes accelerate ORDER BY ... LIMIT nearest-neighbor queries, so a distance predicate in a WHERE clause like this one is answered with a full scan.
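If index use matters, the same idea can be phrased as a top-K query with a client-side distance filter (a sketch, reusing the connection from the example above):

query = """
    SELECT title, content,
           embedding <-> (SELECT embedding
                          FROM document_embeddings JOIN documents USING (id)
                          WHERE title = 'pgvector') AS distance
    FROM document_embeddings
    JOIN documents USING (id)
    ORDER BY distance
    LIMIT 10"""
cur.execute(query)
# Apply the 0.5 distance threshold client-side, keeping the ORDER BY ... LIMIT
# shape that the approximate index can serve
results = [row for row in cur.fetchall() if row[2] < 0.5]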
import numpy as np
import openai
import psycopg2
from pgvector.psycopg2 import register_vector

# Set up the OpenAI API (replace with your actual API key)
openai.api_key = "your_openai_api_key"

# Connect to the database and register the vector type with psycopg2
conn = psycopg2.connect("dbname=your_database user=your_username")
register_vector(conn)
cur = conn.cursor()

# Create a table for our documents
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        content TEXT,
        embedding vector(1536)
    )
""")
conn.commit()

# Function to get embeddings from OpenAI
def get_embedding(text):
    response = openai.embeddings.create(input=text, model="text-embedding-ada-002")
    return np.array(response.data[0].embedding)

# Function to add a document
def add_document(content):
    embedding = get_embedding(content)
    cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s)",
                (content, embedding))
    conn.commit()

# Function to search for similar documents
def search_documents(query, limit=5):
    query_embedding = get_embedding(query)
    cur.execute("""
        SELECT content, embedding <-> %s AS distance
        FROM documents
        ORDER BY distance
        LIMIT %s
    """, (query_embedding, limit))
    return cur.fetchall()

# Add some sample documents
sample_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a high-level programming language.",
    "Vector databases are essential for modern AI applications.",
    "PostgreSQL is a powerful open-source relational database.",
]
for doc in sample_docs:
    add_document(doc)

# Perform a search
search_query = "Tell me about programming languages"
results = search_documents(search_query)
print(f"Search results for: '{search_query}'")
for i, (content, distance) in enumerate(results, 1):
    print(f"{i}. {content} (Distance: {distance:.4f})")

# Clean up
cur.close()
conn.close()
Using LangChain
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_postgres.vectorstores import PGVector

# Set up the connection string and embedding function
# (langchain_postgres uses the psycopg 3 driver, hence the +psycopg scheme)
connection_string = "postgresql+psycopg://user:pass@localhost:5432/db_name"
embedding_function = OpenAIEmbeddings()

# Documents to index
documents = [
    Document(page_content="pgvector adds vector similarity search to PostgreSQL."),
    Document(page_content="LangChain integrates with many vector stores."),
]

# Create a PGVector instance and index the documents
vector_store = PGVector.from_documents(
    documents,
    embedding_function,
    connection=connection_string,
)

# Perform a similarity search
query = "Your query here"
results = vector_store.similarity_search(query)
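similarity_search returns only Document objects; when the distances matter too, the store also exposes similarity_search_with_score, continuing from the snippet above:

for doc, score in vector_store.similarity_search_with_score(query, k=4):
    print(f"{score:.4f}  {doc.page_content}")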