Database Creation

The IVFast index contains a self-supervised AI clustering model. Training the clustering model requires a GPU (such as a V100 or T4) and needs to be performed only once, before delivery. For better generalization, you are advised to use a large training dataset, for example, 10M features of 256 dimensions at FP32. One or more validation datasets can also be created, each consisting of a base set (for example, 25M features of 256 dimensions at FP32), a query set (N features of 256 dimensions at FP32), and a ground truth (GT) set. The data directory is organized as follows:

data_path
   |--- learn10m.npy  // "learn10m" is the name of the training dataset.
   |--- data_name1    // Name of validation dataset 1
   |       |--- base.npy
   |       |--- query.npy
   |       |--- gt.npy
   |--- data_name2    // Name of validation dataset 2
   |       |--- base.npy
   |       |--- query.npy
   |       |--- gt.npy

All data must be normalized before being saved. You can use the Faiss normalization function normalize_L2(learn). Ensure that every vector has an L2 norm of 1 and that the inner product (IP) distance metric is used.
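As a quick sanity check, the normalization requirement can be verified without Faiss: dividing each row by its L2 norm is equivalent to normalize_L2, and after normalization the IP distance equals cosine similarity. The array sizes below are small placeholders, not the real 10M x 256 data.

```python
import numpy as np

# Small placeholder features; real data would be 10M x 256 float32.
rng = np.random.default_rng(0)
feats = rng.standard_normal((1000, 256)).astype("float32")

# NumPy equivalent of faiss.normalize_L2: divide each row by its L2 norm.
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

# After normalization every vector has unit norm, so the inner product
# between two vectors equals their cosine similarity.
row_norms = np.linalg.norm(feats, axis=1)
```

If any row norm deviates noticeably from 1, the data was not normalized before saving and IP search results will be distorted.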

Table 1 Terms and definitions

| Terminology | Description |
| --- | --- |
| learn10m | Data used for AI model training. The size is 10000000 x 256 and the data type is float32. |
| base | Used for creating the base library. The size is 25000000 x 256 and the data type is float32. |
| query | Query features to be retrieved. The size is 10000 x 256 and the data type is float32. |
| gt | Offsets into the base library of the ground-truth matches for each query (features belonging to the same object in the 256-dimensional library). The size is N x 100 and the data type is int64. |
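To illustrate how the gt array is typically consumed, the sketch below computes recall@k from retrieved offsets against the ground truth. The arrays, sizes, and the retrieved results here are toy placeholders for illustration only, not part of the documented API.

```python
import numpy as np

# Toy sizes; the real gt is N x 100 int64 offsets into the base library.
n_query, k_gt, k_ret = 5, 100, 10
rng = np.random.default_rng(1)
gt = rng.integers(0, 1000, size=(n_query, k_gt), dtype=np.int64)

# Placeholder retrieval results (in practice, these come from an index
# search); here we reuse ground-truth offsets so recall is exactly 1.0.
retrieved = gt[:, :k_ret].copy()

# recall@k: fraction of retrieved offsets that appear in the ground truth.
hits = np.array([np.isin(retrieved[i], gt[i]).sum() for i in range(n_query)])
recall_at_k = float(hits.mean()) / k_ret
```

Because gt stores int64 offsets rather than feature vectors, it must never be normalized.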

Currently, the API for reading data is np.load. Therefore, the preceding four datasets must be saved in NPY format. The following is an example of saving the features:

import numpy as np
from faiss import normalize_L2

# Normalize the float32 feature arrays in place; gt holds integer
# offsets and must not be normalized.
normalize_L2(learn10m)
normalize_L2(base)
normalize_L2(query)

np.save("learn10m.npy", learn10m.astype("float32"))
np.save("base.npy", base.astype("float32"))
np.save("query.npy", query.astype("float32"))
np.save("gt.npy", gt.astype("int64"))
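Since the index reads these files back with np.load, it is worth verifying the round trip before delivery. The sketch below saves small placeholder arrays to a temporary directory and reloads them to confirm the shapes and dtypes survive; the sizes are illustrative, not the real dataset dimensions.

```python
import os
import tempfile
import numpy as np

# Placeholder arrays standing in for the real base and gt data.
base = np.random.rand(100, 256).astype("float32")
gt = np.zeros((10, 100), dtype="int64")

# Save in NPY format and reload with np.load, the API used for reading.
tmp = tempfile.mkdtemp()
np.save(os.path.join(tmp, "base.npy"), base)
np.save(os.path.join(tmp, "gt.npy"), gt)

loaded_base = np.load(os.path.join(tmp, "base.npy"))
loaded_gt = np.load(os.path.join(tmp, "gt.npy"))
```

A dtype mismatch here (for example, float64 instead of float32) typically means an .astype() call was omitted before saving.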