FAISS and KNN for Embedding Retrieval
FAISS
- https://engineering.fb.com/2017/03/29/data-infrastructure/faiss-a-library-for-efficient-similarity-search/
- https://github.com/facebookresearch/faiss
- https://faiss.ai/
FAISS Tutorial
Calculation of distances and similarities:
- Common metrics include L2 (Euclidean) distance, L1 (Manhattan) distance, and inner product.
- Inner product is a similarity rather than a distance: larger values mean more similar, and on L2-normalized vectors it equals cosine similarity.
- index = faiss.IndexFlatL2(3) # 3-dimensional vectors
- index_ip = faiss.IndexFlatIP(3) # Index for inner product similarity
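A minimal end-to-end sketch with random demo data (the dimensionality and dataset size are arbitrary demo values; later sketches reuse these arrays):

```python
import numpy as np
import faiss

d = 64                                           # vector dimensionality (demo value)
xb = np.random.rand(10000, d).astype("float32")  # database vectors
xq = np.random.rand(5, d).astype("float32")      # query vectors

index = faiss.IndexFlatL2(d)                     # exact search by L2 distance
index.add(xb)
D, I = index.search(xq, 4)                       # distances and ids of the 4 nearest neighbors

xb_n = xb.copy()
faiss.normalize_L2(xb_n)                         # L2-normalize so inner product equals cosine
index_ip = faiss.IndexFlatIP(d)                  # exact search by inner product
index_ip.add(xb_n)
```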
FAISS provides several index types, including IndexFlat, IndexIVF, IndexLSH, IndexIVFPQ, and IndexHNSW:
- IndexFlat: The simplest index structure; performs an exhaustive (brute-force) search over the entire dataset.
Use Cases: Suitable for small to medium-sized datasets, when the full dataset fits in memory. Exhaustive search means comparing a query vector to every vector in the dataset and selecting the one(s) with the smallest distance (or highest similarity).
- IndexIVF (Inverted File): Divides the dataset into multiple clusters using a clustering algorithm (e.g., k-means). Each cluster has its own sub-index (e.g., a Flat index) and maintains an inverted list of the vectors assigned to it.
Use Cases: Suitable for large datasets; enables more efficient search by narrowing the search to a few clusters instead of the whole dataset. Requires specifying the number of clusters (the nlist parameter); at query time, nprobe controls how many clusters are scanned (see the IVF sketch after this list).
- IndexLSH (Locality-Sensitive Hashing): Uses locality-sensitive hashing to map vectors to binary hash codes; hash collisions group similar vectors together.
Use Cases: Efficient for high-dimensional data; approximate nearest neighbor search, suitable for scenarios where approximate results are acceptable (see the LSH sketch after this list).
- IndexIVFPQ (IVF + Product Quantization): Combines IVF clustering with product quantization: each vector is split into subvectors, and each subvector is quantized with its own codebook, compressing the stored vectors.
Use Cases: Efficient for large datasets and high-dimensional vectors; provides a good trade-off between accuracy, speed, and memory. Useful when the full, uncompressed vectors would not fit in memory, at some cost in recall (see the IVFPQ sketch after this list).
- IndexHNSW (Hierarchical Navigable Small World): Builds a layered graph structure connecting vectors in a way that optimizes both recall and speed; efficient for approximate nearest neighbor search in high-dimensional spaces.
Use Cases: Suitable for scenarios where both speed and accuracy are important; designed for very large datasets and high-dimensional spaces (see the HNSW sketch after this list).
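A minimal IndexIVFFlat sketch, reusing the demo arrays d, xb, xq from the flat-index example above (the nlist and nprobe values are illustrative):

```python
nlist = 100                           # number of k-means clusters
quantizer = faiss.IndexFlatL2(d)      # coarse quantizer that assigns vectors to clusters
index_ivf = faiss.IndexIVFFlat(quantizer, d, nlist)
index_ivf.train(xb)                   # IVF indexes must be trained before adding vectors
index_ivf.add(xb)
index_ivf.nprobe = 10                 # clusters scanned per query: the speed/recall knob
D, I = index_ivf.search(xq, 4)
```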
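An IndexLSH sketch under the same demo setup; nbits (the hash code length) is an illustrative choice, with more bits giving finer buckets at the cost of memory:

```python
nbits = 4 * d                         # length of the binary hash code
index_lsh = faiss.IndexLSH(d, nbits)
index_lsh.add(xb)                     # no training step required
D, I = index_lsh.search(xq, 4)        # approximate results
```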
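An IndexIVFPQ sketch; m (the number of subquantizers) must divide d, and 8 bits per code is a common choice:

```python
m = 8                                 # subquantizers; d must be divisible by m
coarse = faiss.IndexFlatL2(d)         # coarse quantizer for the IVF layer
index_pq = faiss.IndexIVFPQ(coarse, d, nlist, m, 8)  # 8 bits per subvector code
index_pq.train(xb)                    # trains both the clustering and the codebooks
index_pq.add(xb)                      # vectors are stored compressed (m bytes each)
index_pq.nprobe = 10
D, I = index_pq.search(xq, 4)
```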
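An IndexHNSWFlat sketch; 32 is the graph connectivity parameter M, and efSearch trades speed for recall at query time:

```python
index_hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = M, neighbors per node in the graph
index_hnsw.hnsw.efSearch = 64            # search-time breadth: higher = better recall, slower
index_hnsw.add(xb)                       # no training step; the graph is built during add
D, I = index_hnsw.search(xq, 4)
```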
Choosing the Right Index: consider dataset size, dimensionality, required search precision, and memory constraints.
Parameters: in functions like index.search, you will encounter these conventional names:
- xb for the database vectors (the vectors trained on and added to the index)
- xq for the query vectors; queries are batched, so xq is a 2-D array with one query per row
- k for the number of nearest neighbors to return per query
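For example, with the demo arrays above, search takes the query batch and k, and returns one row of results per query:

```python
D, I = index.search(xq, 4)   # xq has shape (5, d); k = 4
print(D.shape, I.shape)      # both (5, 4): distances and ids, one row per query
print(I[0])                  # ids of the 4 nearest database vectors for xq[0]
```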
FAISS Practice
PySpark Notebook: %pip install faiss-cpu
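A minimal sketch of what typically follows the install. The DataFrame df and its embedding column are hypothetical names; FAISS runs on a single machine, so the embeddings are collected to the driver before indexing:

```python
import numpy as np
import faiss

# Hypothetical: df is a Spark DataFrame with an array<float> column "embedding".
rows = df.select("embedding").collect()
emb = np.array([r["embedding"] for r in rows], dtype="float32")

index = faiss.IndexFlatL2(emb.shape[1])
index.add(emb)
D, I = index.search(emb[:5], 10)   # sanity check: 10 neighbors for the first 5 vectors
```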