# Faiss building blocks: clustering, PCA, quantization
Faiss is built on a few basic algorithms with very efficient implementations: k-means clustering, PCA, PQ encoding/decoding.
## Clustering
Faiss provides an efficient k-means implementation. Clustering a set of vectors stored in a given 2-D tensor `x` is done as follows:
```python
import numpy as np
import faiss

# x is assumed to be an (n, d) float32 numpy array of vectors to cluster
ncentroids = 1024
niter = 20
verbose = True
d = x.shape[1]
kmeans = faiss.Kmeans(d, ncentroids, niter=niter, verbose=verbose)
kmeans.train(x)
```
The resulting centroids are in `kmeans.centroids`.
The values of the objective function (total squared error in the case of k-means) along the iterations are stored in `kmeans.obj`, and more extensive statistics in `kmeans.iteration_stats`.
To run it on GPU, add the option `gpu=True` to the `Kmeans` constructor. This will use all available GPUs on the machine.
### Additional options
The `Kmeans` object is mainly a layer on top of the C++ `Clustering` object, and all fields of that object can be set via the constructor. The fields include, as illustrated in the sketch after this list:
- `nredo`: run the clustering this number of times and keep the best centroids (selected according to the clustering objective)
- `verbose`: make clustering more verbose
- `spherical`: perform spherical k-means -- the centroids are L2-normalized after each iteration
- `int_centroids`: round centroid coordinates to integers
- `update_index`: re-train the index after each iteration
- `min_points_per_centroid` / `max_points_per_centroid`: below the minimum, a warning is issued; above the maximum, the training set is subsampled
- `seed`: seed for the random number generator
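For instance, a minimal sketch (the parameter values are illustrative, assuming `x` and `d` as above) that runs spherical k-means three times and keeps the best run:

```python
# nredo, spherical and seed are forwarded to the underlying C++ Clustering object
kmeans = faiss.Kmeans(d, 256, niter=20, nredo=3, spherical=True, seed=123)
kmeans.train(x)
```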
### Assignment
To compute the mapping from a set of vectors `x` to the cluster centroids after k-means has finished training, use:
```python
D, I = kmeans.index.search(x, 1)
```
This returns, in `I`, the nearest centroid for each row vector of `x`; `D` contains the corresponding squared L2 distances.
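As a quick follow-up sketch (plain numpy, not a Faiss API), the assignment can be flattened to count how many vectors land in each cluster:

```python
cluster_ids = I.ravel()                                  # one centroid id per input vector
sizes = np.bincount(cluster_ids, minlength=ncentroids)   # number of vectors per cluster
```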
For the reverse operation, e.g. to find the 15 nearest points in `x` to the computed centroids, a new index must be used:
```python
index = faiss.IndexFlatL2(d)
index.add(x)
D, I = index.search(kmeans.centroids, 15)
```
`I`, of size `(ncentroids, 15)`, contains the nearest neighbors of each centroid.
### Clustering on the GPU
Clustering on one or several GPUs is enabled via the `gpu=True` (use all GPUs) or `gpu=3` (use 3 GPUs) constructor option of the `Kmeans` object.
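A minimal sketch (assuming a CUDA-enabled Faiss build and `x`, `d` as above):

```python
# gpu=True spreads the computation over all visible GPUs;
# an integer value restricts it to that many GPUs
kmeans = faiss.Kmeans(d, 1024, niter=20, gpu=True)
kmeans.train(x)
```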
## Computing a PCA
Let's reduce 40D vectors to 10D.
```python
# random training data
mt = np.random.rand(1000, 40).astype('float32')
mat = faiss.PCAMatrix(40, 10)
mat.train(mt)
assert mat.is_trained
tr = mat.apply(mt)
# print this to show that the magnitude of tr's columns is decreasing
print((tr ** 2).sum(0))
```
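The transform can also be approximately inverted; a minimal sketch, assuming `mat`, `mt` and `tr` from the example above:

```python
# reverse_transform maps the 10-D vectors back to 40-D; the reconstruction
# is lossy because 30 of the 40 principal components were dropped
back = mat.reverse_transform(tr)
rel_err = ((mt - back) ** 2).sum() / (mt ** 2).sum()
```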
## Quantizers
The quantizer objects inherit from `Quantizer`, which offers three common methods (see impl/Quantizer.h):
- `train`: train the quantizer on a matrix of vectors
- `compute_codes` and `decode`: encoder and decoder. The encoder is usually lossy and returns a `uint8` matrix of codes for each input vector.
- `get_DistanceComputer`: a method returning a `DistanceComputer` object (see below).
The state of the `Quantizer` object is the result of training (codebook tables or normalization coefficients).
The field `code_size` indicates the number of bytes per code generated by the quantizer.
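For instance, a product quantizer (described below) with 4 sub-quantizers of 8 bits each produces 4-byte codes; a quick check with illustrative parameter values:

```python
pq = faiss.ProductQuantizer(32, 4, 8)  # d=32, 4 sub-quantizers, 8 bits each
assert pq.code_size == 4               # 4 sub-codes x 8 bits = 4 bytes per code
```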
### Types of quantizers
The supported quantizer types are:
- `ScalarQuantizer`: quantizes each vector component separately in a linear range.
- `ProductQuantizer`: performs vector quantization on sub-vectors.
- `AdditiveQuantizer`: encodes a vector as a sum of codebook entries, see Additive Quantizers for details. Additive quantizers can be trained in several ways, hence the sub-classes `ResidualQuantizer`, `LocalSearchQuantizer`, `ProductAdditiveQuantizer`.
Interestingly, each quantizer is a superset of the previous one.
Each quantizer has a corresponding index type that also stores a set of quantized vectors.
| Quantizer class | Flat index class | IVF index class |
|---|---|---|
| `ScalarQuantizer` | `IndexScalarQuantizer` | `IndexIVFScalarQuantizer` |
| `ProductQuantizer` | `IndexPQ` | `IndexIVFPQ` |
| `AdditiveQuantizer` | `IndexAdditiveQuantizer` | `IndexIVFAdditiveQuantizer` |
| `ResidualQuantizer` | `IndexResidualQuantizer` | `IndexIVFResidualQuantizer` |
| `LocalSearchQuantizer` | `IndexLocalSearchQuantizer` | `IndexIVFLocalSearchQuantizer` |
In addition, all of these indexes except the `ScalarQuantizer` ones exist in a `FastScan` version; see Fast accumulation of PQ and AQ codes.
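To illustrate the correspondence, a minimal sketch building the flat index that stores codes from a `ProductQuantizer` with the same parameters (assuming `xt` and `x` are float32 training and database arrays whose dimension `d` is divisible by 4):

```python
index = faiss.IndexPQ(d, 4, 8)  # same parameters as ProductQuantizer(d, 4, 8)
index.train(xt)                 # trains the internal product quantizer
index.add(x)                    # encodes and stores the vectors as PQ codes
D, I = index.search(x[:5], 10)  # search operates on the compressed codes
```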
### Distance computers
Some quantizers return a `DistanceComputer` object.
Its purpose is to compute vector-to-code distances efficiently when one vector is compared with many codes.
In that case, it is often possible to precompute a set of tables and compute the distances directly in the compressed domain.
Therefore, the `DistanceComputer` offers:
- a `set_query` method that sets the current uncompressed vector to compare with
- a `distance_to_code` method that computes the actual distance to a given code.
### Example: PQ encoding / decoding
The `ProductQuantizer` object can be used to encode vectors to codes or decode codes back to vectors:
```python
d = 32  # data dimension
cs = 4  # code size (bytes)

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

pq = faiss.ProductQuantizer(d, cs, 8)
pq.train(xt)

# encode
codes = pq.compute_codes(x)

# decode
x2 = pq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```
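The `DistanceComputer` described above can be exercised on this quantizer. A hedged sketch, reusing `pq`, `x` and `codes` from the example, and assuming the pointer-based C++ methods are reachable from Python through `faiss.swig_ptr`:

```python
# compare one uncompressed query against a stored code without decoding it
dc = pq.get_DistanceComputer()
dc.set_query(faiss.swig_ptr(x[0]))                   # uncompressed query vector
d0 = dc.distance_to_code(faiss.swig_ptr(codes[0]))   # distance in the compressed domain
```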
A scalar quantizer works similarly:
```python
d = 32  # data dimension

# train set
nt = 10000
xt = np.random.rand(nt, d).astype('float32')

# dataset to encode (could be same as train)
n = 20000
x = np.random.rand(n, d).astype('float32')

# QT_8bit allocates 8 bits per dimension (QT_4bit also works)
sq = faiss.ScalarQuantizer(d, faiss.ScalarQuantizer.QT_8bit)
sq.train(xt)

# encode
codes = sq.compute_codes(x)

# decode
x2 = sq.decode(codes)

# compute reconstruction error
avg_relative_error = ((x - x2) ** 2).sum() / (x ** 2).sum()
```