postgres pgvector
An alternative solution:
pgvector is a PostgreSQL extension that allows you to store, query, and index vectors.
Step 1: Install and configure
https://github.com/pgvector/pgvector?tab=readme-ov-file#installation
sudo apt-get install postgresql-server-dev-all
git clone https://github.com/pgvector/pgvector.git
cd pgvector
make
sudo make install
postgres=# CREATE EXTENSION vector;
CREATE TABLE documents (
  id int PRIMARY KEY,
  title text NOT NULL,
  content text NOT NULL
);
CREATE TABLE document_embeddings (
  id int PRIMARY KEY,
  embedding vector(1536) NOT NULL
);
INSERT INTO documents VALUES (1, 'pgvector', 'pgvector is a PostgreSQL extension that provides support for vector similarity search and nearest neighbor search in SQL.');
INSERT INTO documents VALUES (2, 'pg_similarity', 'pg_similarity is a PostgreSQL extension that provides similarity and distance operators for vector columns.');
INSERT INTO documents VALUES (3, 'pg_trgm', 'pg_trgm is a PostgreSQL extension that provides functions and operators for determining the similarity of alphanumeric text based on trigram matching.');
INSERT INTO documents VALUES (4, 'pg_prewarm', 'pg_prewarm is a PostgreSQL extension that provides functions for prewarming relation data into the PostgreSQL buffer cache.');
There are two types of indexes available in pgvector, ivfflat and hnsw, and each serves a different purpose:
- IVFFlat (Inverted File Flat) index:
  - Divides the vector space into clusters, speeding up searches by first identifying the most relevant clusters
  - Builds faster and uses less memory than HNSW
  - A good balance of search speed and accuracy when slightly slower, approximate searches are acceptable
- HNSW (Hierarchical Navigable Small World) index:
  - Designed for approximate nearest-neighbor searches when you need fast queries
  - Creates a layered graph structure for swift navigation between vectors
  - Extremely fast, but may occasionally miss the absolute nearest neighbor
For IVFFlat indexes, it is best to create the index after the table has some data, because building one involves a training step that picks cluster centers from existing rows. HNSW has no such training step, so the index can be created on an empty table. HNSW generally offers better query performance than IVFFlat, at the cost of slower builds and higher memory use.
CREATE INDEX ON document_embeddings USING hnsw (embedding vector_cosine_ops);
CREATE INDEX ON document_embeddings USING hnsw (embedding vector_l2_ops) WITH (m = 16, ef_construction = 64);
CREATE INDEX ON document_embeddings USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);
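Each operator class binds the index to one distance operator: vector_l2_ops serves <-> (Euclidean distance), vector_cosine_ops serves <=> (cosine distance), and vector_ip_ops serves <#> (negative inner product). A quick comparison of the three operators on literal vectors, sketched with psycopg2 (connection details are placeholders):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

# The three distance operators on a pair of 3-dimensional literals
cur.execute("""
    SELECT '[1, 2, 3]'::vector <-> '[4, 5, 6]'::vector AS l2_distance,       -- ~5.196
           '[1, 2, 3]'::vector <=> '[4, 5, 6]'::vector AS cosine_distance,   -- ~0.0254
           '[1, 2, 3]'::vector <#> '[4, 5, 6]'::vector AS neg_inner_product  -- -32
""")
print(cur.fetchone())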
--find the closest 5 elements to a given query
SELECT * FROM document_embeddings ORDER BY embedding <=> '[10.5, 11.0,...]' LIMIT 5;
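At query time, each index type exposes a speed/recall knob: hnsw.ef_search (default 40) and ivfflat.probes (default 1). A minimal tuning sketch, assuming the document_embeddings table and indexes above (connection details are placeholders):

import psycopg2

# Placeholder connection details
conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

# HNSW: scan more graph candidates per query (default 40); higher = better recall, slower
cur.execute("SET hnsw.ef_search = 100")

# IVFFlat: scan more clusters per query (default 1); higher = better recall, slower
cur.execute("SET ivfflat.probes = 10")

# A dummy 1536-dimensional query vector, matching the column's declared dimension
query_vec = "[" + ", ".join("0.1" for _ in range(1536)) + "]"
cur.execute(
    "SELECT id FROM document_embeddings ORDER BY embedding <=> %s::vector LIMIT 5",
    (query_vec,),
)
print(cur.fetchall())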
Step 2: Generate embeddings
- https://platform.openai.com/docs/guides/embeddings
- https://www.sbert.net/ - open source model
- https://bielik.ai/ - a model for the Polish language
To search documents by meaning, use an embedding model to convert them into embeddings.
An embedding is a vector (list) of floating point numbers. The distance between two vectors measures their relatedness. Small distances suggest high relatedness, and large distances suggest low relatedness.
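For intuition, here is how cosine distance relates two vectors, sketched in plain numpy with toy 3-dimensional vectors:

import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 means identical direction, 2 means opposite
    return 1 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

a = np.array([1.0, 2.0, 3.0])
b = np.array([1.1, 2.1, 2.9])     # similar direction -> small distance
c = np.array([-1.0, -2.0, -3.0])  # opposite direction -> large distance
print(cosine_distance(a, b))  # ~0.001
print(cosine_distance(a, c))  # 2.0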
pgvector integrates easily with Python via the psycopg2 library; the companion pgvector Python package registers the vector type with the driver.
!pip install psycopg2-binary numpy pgvector
import numpy as np
import psycopg2
from pgvector.psycopg2 import register_vector

# Connect to the database and register the vector type with psycopg2
conn = psycopg2.connect("dbname=your_database user=your_username")
register_vector(conn)
cur = conn.cursor()

# Insert a vector (a toy 3-dimensional example; the column's declared dimension
# must match, and id has no default, so it is supplied explicitly)
embedding = np.array([1.5, 2.5, 3.5])
cur.execute("INSERT INTO document_embeddings (id, embedding) VALUES (%s, %s)", (1, embedding))
conn.commit()

# Perform a similarity search with the Euclidean distance operator <->
query_vector = np.array([2, 3, 4])
cur.execute("SELECT * FROM document_embeddings ORDER BY embedding <-> %s LIMIT 1", (query_vector,))
result = cur.fetchone()
print(f"Nearest neighbor: {result}")

cur.close()
conn.close()
Next, use the OpenAI embedding model "text-embedding-ada-002" to embed the documents inserted earlier.
# Python code to preprocess and embed documents
import openai
import psycopg2

# Load the OpenAI API key
openai.api_key = "sk-..."  # YOUR OWN API KEY

# Pick the embedding model
model_id = "text-embedding-ada-002"

# Connect to the PostgreSQL database
conn = psycopg2.connect(database="postgres", user="aiuser", host="localhost", port="5432")

# Fetch documents from the database
cur = conn.cursor()
cur.execute("SELECT id, content FROM documents")
documents = cur.fetchall()

# Generate an embedding per document and store it back in the database
for doc_id, doc_content in documents:
    response = openai.embeddings.create(input=doc_content, model=model_id)
    embedding = response.data[0].embedding
    cur.execute("INSERT INTO document_embeddings (id, embedding) VALUES (%s, %s);",
                (doc_id, embedding))

# Commit and close the database connection
conn.commit()
cur.close()
conn.close()
This code fetches document contents from the database, uses the OpenAI API to generate embeddings, and then stores the embeddings back in the database. While this works for our small database, in a real-world scenario you would want to batch the work for existing data and use an event trigger or change-data-capture streaming to keep the vectors up to date as the database changes.
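A minimal batching sketch, assuming the schema above (the batch size and the idea of embedding several inputs per API call are choices made here, not something prescribed by pgvector):

import openai
import psycopg2

openai.api_key = "sk-..."  # YOUR OWN API KEY
conn = psycopg2.connect(database="postgres", user="aiuser", host="localhost", port="5432")
cur = conn.cursor()

BATCH_SIZE = 100  # illustrative; tune to your API rate limits

# Only embed documents that do not have an embedding yet
cur.execute("""
    SELECT d.id, d.content
    FROM documents d
    LEFT JOIN document_embeddings e ON e.id = d.id
    WHERE e.id IS NULL
""")
rows = cur.fetchall()

for start in range(0, len(rows), BATCH_SIZE):
    batch = rows[start:start + BATCH_SIZE]
    ids = [r[0] for r in batch]
    texts = [r[1] for r in batch]
    # The embeddings endpoint accepts a list of inputs in a single call,
    # and returns results in the same order
    response = openai.embeddings.create(input=texts, model="text-embedding-ada-002")
    for doc_id, item in zip(ids, response.data):
        cur.execute(
            "INSERT INTO document_embeddings (id, embedding) VALUES (%s, %s)",
            (doc_id, item.embedding),
        )
    conn.commit()  # commit per batch so progress survives interruptions

cur.close()
conn.close()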
Step 3: Querying embeddings
pgvector has a set of built-in functions to manipulate and perform operations on vector data. These functions let you calculate vector distances, perform vector arithmetic, and more:
- cosine_distance() calculates the cosine distance between two vectors (the same measure as the <=> operator)
- vector_norm() returns the Euclidean norm of a vector
- vector_dims() returns the number of dimensions a vector contains
- avg(vector) and sum(vector) aggregate functions for calculating analytics on vectors
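A quick demo of these functions on literal vectors, run via psycopg2 (the values in the comments are what Postgres returns):

import psycopg2

conn = psycopg2.connect("dbname=your_database user=your_username")
cur = conn.cursor()

cur.execute("""
    SELECT cosine_distance('[1, 2, 3]'::vector, '[4, 5, 6]'::vector),  -- ~0.0254
           vector_norm('[3, 4]'::vector),                              -- 5.0
           vector_dims('[1, 2, 3]'::vector)                            -- 3
""")
print(cur.fetchone())

# Aggregates work over vector columns as well, e.g. a centroid of all embeddings
cur.execute("SELECT avg(embedding) FROM document_embeddings")
print(cur.fetchone())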
Distance queries compose with ordinary SQL, e.g. joining the closest matches to other relational data (this example assumes a schema where documents carries an embedding column and an author_id referencing an author table):
WITH matching_docs AS (
  -- find the 5 closest matches
  SELECT *
  FROM documents
  ORDER BY embedding <=> '[10.5, 11.0,...]'
  LIMIT 5
)
SELECT d.content, a.first_name, a.last_name
FROM matching_docs d
INNER JOIN author a ON (a.id = d.author_id);
# Python code to find documents similar to a given one
import psycopg2

# Connect to the PostgreSQL database
conn = psycopg2.connect(database="postgres", user="gulcin.jelinek", host="localhost", port="5432")
cur = conn.cursor()

# Fetch extensions that are similar to pgvector based on their descriptions
query = """
    WITH pgv AS (
        SELECT embedding
        FROM document_embeddings JOIN documents USING (id)
        WHERE title = 'pgvector'
    )
    SELECT title, content
    FROM document_embeddings
    JOIN documents USING (id)
    WHERE embedding <-> (SELECT embedding FROM pgv) < 0.5;"""
cur.execute(query)

# Fetch results
results = cur.fetchall()

# Print results in a nice format
for doc_title, doc_content in results:
    print(f"Document title: {doc_title}")
    print(f"Document text: {doc_content}")
    print()
The query first fetches the embedding vector for the document titled "pgvector" and then uses a similarity search to find documents with similar content. Note the <-> operator: that is where the pgvector magic happens, computing the Euclidean (L2) distance between two vectors. The 0.5 value is a distance threshold that is highly use-case dependent and requires fine-tuning in real-world applications. One caveat: pgvector's indexes accelerate ORDER BY ... LIMIT nearest-neighbor queries, so a distance predicate in a WHERE clause like this one is answered with a full scan.
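If index use matters, the same idea can be phrased as a top-K query with a client-side distance filter (a sketch, reusing the connection from the example above):

query = """
    SELECT title, content,
           embedding <-> (SELECT embedding
                          FROM document_embeddings JOIN documents USING (id)
                          WHERE title = 'pgvector') AS distance
    FROM document_embeddings
    JOIN documents USING (id)
    ORDER BY distance
    LIMIT 10"""
cur.execute(query)
# Apply the 0.5 distance threshold client-side, keeping the ORDER BY ... LIMIT
# shape that the approximate index can serve
results = [row for row in cur.fetchall() if row[2] < 0.5]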
import numpy as np
import openai
import psycopg2
from pgvector.psycopg2 import register_vector

# Set up the OpenAI API (replace with your actual API key)
openai.api_key = "your_openai_api_key"

# Connect to the database and register the vector type with psycopg2
conn = psycopg2.connect("dbname=your_database user=your_username")
register_vector(conn)
cur = conn.cursor()

# Create a table for our documents
cur.execute("""
    CREATE TABLE IF NOT EXISTS documents (
        id SERIAL PRIMARY KEY,
        content TEXT,
        embedding vector(1536)
    )
""")
conn.commit()

# Function to get embeddings from OpenAI
def get_embedding(text):
    response = openai.embeddings.create(input=text, model="text-embedding-ada-002")
    return np.array(response.data[0].embedding)

# Function to add a document
def add_document(content):
    embedding = get_embedding(content)
    cur.execute("INSERT INTO documents (content, embedding) VALUES (%s, %s)",
                (content, embedding))
    conn.commit()

# Function to search for similar documents
def search_documents(query, limit=5):
    query_embedding = get_embedding(query)
    cur.execute("""
        SELECT content, embedding <-> %s AS distance
        FROM documents
        ORDER BY distance
        LIMIT %s
    """, (query_embedding, limit))
    return cur.fetchall()

# Add some sample documents
sample_docs = [
    "The quick brown fox jumps over the lazy dog.",
    "Python is a high-level programming language.",
    "Vector databases are essential for modern AI applications.",
    "PostgreSQL is a powerful open-source relational database.",
]
for doc in sample_docs:
    add_document(doc)

# Perform a search
search_query = "Tell me about programming languages"
results = search_documents(search_query)
print(f"Search results for: '{search_query}'")
for i, (content, distance) in enumerate(results, 1):
    print(f"{i}. {content} (Distance: {distance:.4f})")

# Clean up
cur.close()
conn.close()
Using LangChain
from langchain_core.documents import Document
from langchain_openai import OpenAIEmbeddings
from langchain_postgres.vectorstores import PGVector

# Set up the connection string and embedding function
# (langchain_postgres uses the psycopg 3 driver, hence the +psycopg scheme)
connection_string = "postgresql+psycopg://user:pass@localhost:5432/db_name"
embedding_function = OpenAIEmbeddings()

# Documents to index
documents = [
    Document(page_content="pgvector adds vector similarity search to PostgreSQL."),
    Document(page_content="LangChain integrates with many vector stores."),
]

# Create a PGVector instance and index the documents
vector_store = PGVector.from_documents(
    documents,
    embedding_function,
    connection=connection_string,
)

# Perform a similarity search
query = "Your query here"
results = vector_store.similarity_search(query)
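similarity_search returns only Document objects; when the distances matter too, the store also exposes similarity_search_with_score, continuing from the snippet above:

for doc, score in vector_store.similarity_search_with_score(query, k=4):
    print(f"{score:.4f}  {doc.page_content}")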