FAISS and KNN for Embedding Retrieval - bryanneliu/Nature-Language-Processing GitHub Wiki

FAISS

FAISS Tutorial

Calculation of distances and similarities:

  • Common metrics include L2 (Euclidean distance), L1 (Manhattan distance), and inner product.
  • The inner product is a similarity rather than a distance: larger values mean closer vectors, and on L2-normalized vectors it equals cosine similarity.
  • index = faiss.IndexFlatL2(3) # 3-dimensional vectors
  • index_ip = faiss.IndexFlatIP(3) # Index for inner product similarity

FAISS supports various search methods, including IndexFlat, IndexIVF, IndexLSH, IndexIVFPQ, IndexHNSW:

  • IndexFlat: The simplest index structure. Performs an exhaustive search over the entire dataset.

    Use Cases: Suitable for small to medium-sized datasets that fit in memory, and whenever exact results are required. Exhaustive search means comparing the query vector against every vector in the dataset and selecting the one(s) with the smallest distance (or largest similarity).

  • IndexIVF (Inverted File): Divides the dataset into multiple clusters using a clustering algorithm (e.g., k-means). Each cluster has its own sub-index (e.g., Flat index) and maintains a list of vectors assigned to that cluster.

    Use Cases: Suitable for large datasets. Enables more efficient search by narrowing down the search to a specific cluster. Requires specifying the number of clusters (nlist parameter).

  • IndexLSH (Locality-Sensitive Hashing): Uses Locality-Sensitive Hashing to map vectors to hash codes. Hash collisions group similar vectors together.

    Use Cases: Efficient for high-dimensional data. Approximate nearest neighbor search. Suitable for scenarios where approximate results are acceptable.

  • IndexIVFPQ (IVF + Product Quantization): Combines the cluster-based pruning of IVF with product quantization (PQ), which splits each vector into subvectors and quantizes each subvector with its own codebook, storing vectors as compact codes.

    Use Cases: Efficient for large datasets and high-dimensional vectors. The compressed codes greatly reduce memory usage at some cost in accuracy, a good trade-off when the raw vectors would not fit in memory.

  • IndexHNSW (Hierarchical Navigable Small World): Builds a graph structure connecting vectors in a way that optimizes both recall and speed. Efficient for approximate nearest neighbor search in high-dimensional spaces.

    Use Cases: Suitable for scenarios where both speed and accuracy are important. Designed for very large datasets and high-dimensional spaces.

Choosing the Right Index: Dataset Size, Dimensionality, Search Precision, Memory Constraints.

Parameters: in functions like index.add, index.train, and index.search, you may encounter these parameter names:

  • xb for the database vectors (training or indexed vectors)
  • xq for the query vectors
  • k for the number of nearest neighbors to return, as in index.search(xq, k)

FAISS Practice

PySpark Notebook: %pip install faiss-cpu

KNN

Performance Improvement