Codebook Training Script

Codebook training uses the vstar_train_codebook.py script, which is stored in tools/train in the installation directory. Run the script with Python 3.9.

Command Reference

python3 vstar_train_codebook.py --dataPath <data_path> --dim <dim> --codebookPath <codebook_output_dir> --nlistL1 <nlist1> --subDimL1 <sub_dim1> --device <device> --batchSize <batch_size> --sample <sample> --useOfflineCompile

Parameters

<data_path>: path to the original data used to train the codebook. Ensure that the data exists. This parameter is required.

<dim>: feature vector dimension. The value must match the one used to generate the training operator model file in VSTAR. The default value is 256.

<codebook_output_dir>: directory for storing the generated codebook file. Ensure that the directory exists and that the user running the program has write permission on it. For security hardening, no level of the directory path can be a soft link.

<nlist1>: number of level-1 cluster centroids. The value must match the one used to generate the training operator model file in VSTAR. The default value is 1024.

<sub_dim1>: dimension after L1 dimension reduction. The value must match the one used to generate the training operator model file in VSTAR. The default value is 32.

<device>: logical ID of the device on which training is performed. The default value is 1.

<batch_size>: batch size used for training. The value range is (0, 10240], and the default value is 10240.

<sample>: sampling rate of original samples used for training. The value range is (0, 1.0], and the default value is 1.0.

--useOfflineCompile: enables offline operator compilation to improve performance. This function is disabled by default. To enable it, add this option to the end of the command line. For details, see --useOfflineCompile Option description.

--help | -h: displays help information.
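
The parameters above can be assembled programmatically. The following is a minimal sketch, not part of the tool itself: the helper name build_train_command and the example paths are hypothetical, the defaults and value ranges are taken from the parameter descriptions above, and the script is assumed to be in the current working directory.

```python
import shlex


def build_train_command(data_path, codebook_dir, dim=256, nlist_l1=1024,
                        sub_dim_l1=32, device=1, batch_size=10240,
                        sample=1.0, use_offline_compile=False):
    """Return the argument list for vstar_train_codebook.py (hypothetical helper)."""
    # Ranges documented above: batch size in (0, 10240], sample in (0, 1.0].
    if not 0 < batch_size <= 10240:
        raise ValueError("batch_size must be in (0, 10240]")
    if not 0 < sample <= 1.0:
        raise ValueError("sample must be in (0, 1.0]")
    cmd = [
        "python3", "vstar_train_codebook.py",
        "--dataPath", str(data_path),
        "--dim", str(dim),
        "--codebookPath", str(codebook_dir),
        "--nlistL1", str(nlist_l1),
        "--subDimL1", str(sub_dim_l1),
        "--device", str(device),
        "--batchSize", str(batch_size),
        "--sample", str(sample),
    ]
    # The option is placed at the end of the command line, as described above.
    if use_offline_compile:
        cmd.append("--useOfflineCompile")
    return cmd


if __name__ == "__main__":
    # Placeholder paths; replace them with real locations before running.
    args = build_train_command("/path/to/base_data", "/path/to/codebooks",
                               use_offline_compile=True)
    print(shlex.join(args))
```

The returned list can be passed to subprocess.run(args, check=True) from the tools/train directory. Validating the ranges before launching avoids starting a training job that the script would presumably reject.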

Usage Notes

  • <data_path>: The original data size cannot exceed 10 million 1024-dimensional data records, that is, 10,000,000 × 1024 × 4 = 40,960,000,000 bytes.
  • After this command is executed, a new file named codebook_<dim>_<nlist1>_<sub_dim1>.bin is generated in the <codebook_output_dir> directory. This is the codebook file required by AscendIndexVStar and AscendIndexGreat.
  • If a codebook file with the same name already exists, it is overwritten. In this case, the user running the program must be the owner of the file.
  • Before running codebook training, generate the training operator model file by referring to VSTAR.
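
The constraints in the notes above can be checked before training starts. The following is a rough sketch under stated assumptions: the function names are hypothetical, the original data is assumed to be a single file whose size in bytes is compared against the limit, and the ownership check relies on os.geteuid, which is available only on Unix-like systems.

```python
import os

# Limit from the note above: 10,000,000 records x 1024 dimensions x 4 bytes.
MAX_DATA_BYTES = 10_000_000 * 1024 * 4


def expected_codebook_name(dim=256, nlist_l1=1024, sub_dim_l1=32):
    """File name the training command generates, per the naming pattern above."""
    return f"codebook_{dim}_{nlist_l1}_{sub_dim_l1}.bin"


def has_symlink_component(path):
    """Return True if the path or any of its parent directories is a soft link."""
    current = os.path.abspath(path)
    while True:
        if os.path.islink(current):
            return True
        parent = os.path.dirname(current)
        if parent == current:  # reached the filesystem root
            return False
        current = parent


def preflight(data_path, codebook_dir, dim=256, nlist_l1=1024, sub_dim_l1=32):
    """Collect human-readable problems; an empty list means no issue was found."""
    problems = []
    if not os.path.exists(data_path):
        problems.append(f"data path does not exist: {data_path}")
    elif os.path.isfile(data_path) and os.path.getsize(data_path) > MAX_DATA_BYTES:
        problems.append(f"data exceeds {MAX_DATA_BYTES} bytes: {data_path}")

    if not os.path.isdir(codebook_dir):
        problems.append(f"output directory does not exist: {codebook_dir}")
        return problems
    if has_symlink_component(codebook_dir):
        problems.append(f"output path contains a soft link: {codebook_dir}")
    if not os.access(codebook_dir, os.W_OK):
        problems.append(f"no write permission on: {codebook_dir}")

    # An existing codebook is overwritten, so the current user should own it.
    target = os.path.join(codebook_dir,
                          expected_codebook_name(dim, nlist_l1, sub_dim_l1))
    if os.path.exists(target) and os.stat(target).st_uid != os.geteuid():
        problems.append(f"existing codebook is owned by another user: {target}")
    return problems


if __name__ == "__main__":
    # Placeholder paths; replace them with real locations before running.
    for issue in preflight("/path/to/base_data", "/path/to/codebooks"):
        print("WARNING:", issue)
```

These checks only mirror the documented constraints. The training script may perform its own validation, and its exact behavior is not described here.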