FAISS and KNN for Embedding Retrieval - bryanneliu/Nature-Language-Processing GitHub Wiki

FAISS

FAISS Tutorial

Calculation of distances and similarities:

  • Common metrics include L2 (Euclidean distance), L1 (Manhattan distance), and inner product.
  • The inner product is a similarity rather than a distance: larger values mean closer vectors, and on L2-normalized vectors it equals cosine similarity.
  • index = faiss.IndexFlatL2(3) # 3-dimensional vectors
  • index_ip = faiss.IndexFlatIP(3) # Index for inner product similarity

FAISS supports various search methods, including IndexFlat, IndexIVF, IndexLSH, IndexIVFPQ, IndexHNSW:

  • IndexFlat: The simplest index structure. Performs an exhaustive search over the entire dataset.

    Use Cases: Suitable for small to medium-sized datasets that fit in memory, and whenever exact results are required. Exhaustive search means comparing the query vector against every vector in the dataset and selecting the one(s) with the smallest distance (or largest similarity).

  • IndexIVF (Inverted File): Divides the dataset into multiple clusters using a clustering algorithm (e.g., k-means). Each cluster has its own sub-index (e.g., Flat index) and maintains a list of vectors assigned to that cluster.

    Use Cases: Suitable for large datasets. Enables more efficient search by narrowing down the search to a specific cluster. Requires specifying the number of clusters (nlist parameter).

  • IndexLSH (Locality-Sensitive Hashing): Uses Locality-Sensitive Hashing to map vectors to hash codes. Hash collisions group similar vectors together.

    Use Cases: Efficient for high-dimensional data. Approximate nearest neighbor search. Suitable for scenarios where approximate results are acceptable.

  • IndexIVFPQ (IVF + Product Quantization): Combines the cluster-based pruning of IVF with product quantization (PQ), which splits each vector into subvectors and quantizes each subvector with its own codebook, storing vectors as compact codes.

    Use Cases: Efficient for large datasets and high-dimensional vectors. The compressed codes greatly reduce memory usage at some cost in accuracy, a good trade-off when the raw vectors would not fit in memory.

  • IndexHNSW (Hierarchical Navigable Small World): Builds a graph structure connecting vectors in a way that optimizes both recall and speed. Efficient for approximate nearest neighbor search in high-dimensional spaces.

    Use Cases: Suitable for scenarios where both speed and accuracy are important. Designed for very large datasets and high-dimensional spaces.

Choosing the Right Index: Dataset Size, Dimensionality, Search Precision, Memory Constraints.

Parameters: in functions like index.add, index.train, and index.search, you may encounter these parameter names:

  • xb for the database vectors (training or indexed vectors)
  • xq for the query vectors
  • k for the number of nearest neighbors to return, as in index.search(xq, k)

FAISS Practice

PySpark Notebook: %pip install faiss-cpu

KNN

Performance Improvement