Vector Retrieval Engine: Faiss

Locality-Sensitive Hashing (LSH)

For hands-on code, see the CommonCodes/algorithms/lsh directory.

There are several versions of LSH, each using a different hash construction and distance/similarity metric. The most popular approaches are:

● Shingling, MinHashing, and banded LSH (the traditional approach)

● Random hyperplanes with dot product and Hamming distance (more commonly used, and implemented in popular libraries such as Faiss)

  • Random projection: suppose there are n hash functions, represented as a k × n matrix M. The input query q is a k-dimensional vector; hashing computes H = q · M, a 1 × n vector, which is then binarized as H1 = H > 0. The result H1 is a binary vector of length n, e.g. 0110... (see the sketch after this list)

  • Next, LSH uses these values to create buckets, each of which holds references back to our vectors.
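A minimal NumPy sketch of these two steps (hashing and bucketing); all names here are illustrative, not taken from the repo:

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 8, 4                    # vector dim k, number of hash functions n
M = rng.normal(size=(k, n))    # n random hyperplanes as a k x n matrix

def lsh_code(v):
    """Project v onto the hyperplanes and binarize: H1 = (v @ M) > 0."""
    bits = (v @ M) > 0
    return ''.join('1' if b else '0' for b in bits)

# Build buckets: binary hash code -> list of vector IDs
vectors = rng.normal(size=(3, k))
buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_code(v), []).append(i)

print(buckets)  # a dict shaped like the example below
```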

Example buckets:

{'1000': [0], '0110': [1, 2]}

The keys are the hash results and the values are vector IDs. Note that the buckets dict can contain at most 2**n keys. The next step is to balance retrieval quality against retrieval speed: with too few hash functions, retrieval quality clearly degrades. In Faiss, the default number of hash bits, nbits, is 128.
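In Faiss this pipeline is available as IndexLSH; a minimal usage sketch (nbits is passed explicitly here, matching the 128 above; the random data is only for illustration):

```python
import numpy as np
import faiss  # pip install faiss-cpu (or faiss-gpu)

d, nbits = 64, 128
xb = np.random.random((1000, d)).astype('float32')  # database vectors
xq = np.random.random((5, d)).astype('float32')     # query vectors

index = faiss.IndexLSH(d, nbits)
index.add(xb)                  # hash and store the database vectors
D, I = index.search(xq, 10)    # Hamming-distance search, top-10 per query
```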

If a bucket recalls too many candidates, they can be scanned exhaustively to obtain the true top-k, as sketched below.
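A sketch of that recall-then-rerank step, reusing the setup from the NumPy example above (repeated here so the snippet runs on its own):

```python
import numpy as np

rng = np.random.default_rng(0)
k, n = 8, 4
M = rng.normal(size=(k, n))
vectors = rng.normal(size=(100, k))

def lsh_code(v):
    return ''.join('1' if b else '0' for b in (v @ M) > 0)

buckets = {}
for i, v in enumerate(vectors):
    buckets.setdefault(lsh_code(v), []).append(i)

def search(query, topk=5):
    cand = buckets.get(lsh_code(query), [])  # recall: the query's bucket
    # rerank: exact L2 comparison within the (small) candidate set
    dists = [(np.linalg.norm(vectors[i] - query), i) for i in cand]
    return sorted(dists)[:topk]

print(search(rng.normal(size=k)))  # [(distance, vector_id), ...]
```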

As shown above, whether based on MinHash or on random projection, the approach is the same: hash the query, map it into buckets, and quickly recall the nearest buckets, drastically shrinking the candidate set; a final scan over the candidates then yields the top-k.

Faiss (Facebook AI Similarity Search)

Versions

faiss-cpu (CPU-only build) and faiss-gpu (GPU-enabled build)

For dense vectors

IndexFlatL2 measures the L2 distance between our query vector and every vector loaded into the index. It's simple and very accurate, but not fast!
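A minimal sketch of the standard usage pattern (random data, illustrative sizes):

```python
import numpy as np
import faiss

d = 64
xb = np.random.random((10000, d)).astype('float32')  # database
xq = np.random.random((5, d)).astype('float32')      # queries

index = faiss.IndexFlatL2(d)  # exact, brute-force L2 index
index.add(xb)                 # no training needed for flat indexes
D, I = index.search(xq, 4)    # D: squared L2 distances, I: vector IDs
```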

IndexFlatIP

Brute-force search based on the inner (dot) product.
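Usage mirrors IndexFlatL2; one common pattern, sketched below, is to L2-normalize the vectors first so that the inner product equals cosine similarity:

```python
import numpy as np
import faiss

d = 64
xb = np.random.random((10000, d)).astype('float32')
faiss.normalize_L2(xb)        # normalize so inner product == cosine similarity
xq = np.random.random((5, d)).astype('float32')
faiss.normalize_L2(xq)

index = faiss.IndexFlatIP(d)  # exact, brute-force inner-product index
index.add(xb)
D, I = index.search(xq, 4)    # D: inner products (higher is more similar)
```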

IndexIVFFlat

  • Clusters the vectors first, then searches only within selected clusters, which speeds up retrieval (see the sketch after this list)
  • nlist -> the number of clusters
  • nprobe -> the number of clusters to search; defaults to 1. A larger nprobe gives more accurate results but slower searches.
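A minimal sketch wiring up nlist and nprobe (values are illustrative):

```python
import numpy as np
import faiss

d, nlist = 64, 100
xb = np.random.random((10000, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

quantizer = faiss.IndexFlatL2(d)  # assigns vectors to clusters
index = faiss.IndexIVFFlat(quantizer, d, nlist)
index.train(xb)    # IVF indexes must be trained (k-means) before add
index.add(xb)

index.nprobe = 10  # search 10 clusters instead of the default 1
D, I = index.search(xq, 4)
```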

IndexIVFPQ

  • Compresses the stored vectors with product quantization (product quantizers) to save memory (see the sketch after this list)
  • m: the number of equal parts the original vector dimension is split into; d must be a multiple of m
  • bits: the number of bits used to encode each sub-vector
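A minimal sketch wiring up m and bits (here d = 64 is split into m = 8 sub-vectors, each encoded with 8 bits):

```python
import numpy as np
import faiss

d, nlist, m, bits = 64, 100, 8, 8  # d must be a multiple of m
xb = np.random.random((10000, d)).astype('float32')
xq = np.random.random((5, d)).astype('float32')

quantizer = faiss.IndexFlatL2(d)
index = faiss.IndexIVFPQ(quantizer, d, nlist, m, bits)
index.train(xb)  # trains both the coarse quantizer and the PQ codebooks
index.add(xb)    # vectors are stored compressed: m * bits = 64 bits each

index.nprobe = 10
D, I = index.search(xq, 4)
```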

How to Choose a Suitable Index