Clustering Parameter Configuration

Archive/Merge/KMeans Parameters

Set ArchiveMode to ARCHIVE, MERGE, or KMEANS as described in Table 1 "Common Archive/Merge parameters". FeatureClustering performs only one operation per run: it either archives or merges archives, not both at the same time.

Table 1 Common Archive/Merge parameters

Parameter

Default Value

Possible Value

Description

ArchiveName

FeatureClustering

-

Name of the archiving or archive merging task, used to identify the operation. The name can contain a maximum of 512 characters.

ArchiveMode

ARCHIVE

ARCHIVE/MERGE/KMEANS

Archive mode.

  • ARCHIVE: archiving
  • MERGE: archive merging
  • KMEANS: archiving of massive data sets using K-Means clustering

FeatureDim

512

-

Feature dimension for archiving/archive merging. FeatureRetrieval must support feature retrieval in the specified dimension.

FeatureDataType

FP32

FP32

Data type of the feature vector. Only FP32 is supported.

FeatureSource

RANDOM

RANDOM/FILE/INTERFACE

  • Source of the feature data to be archived. Features can be randomly generated (RANDOM), read from binary feature/ID files (FILE), or passed as feature/ID memory addresses in API calls (INTERFACE). In archive merging mode, randomly generated features cannot be used as input.
  • You are advised to use RANDOM only when the value of FeatureCount is smaller than 1,000,000, because the distribution of randomly generated data, and the resulting performance, differ greatly from those of data in real application scenarios.
  • Currently, RANDOM generates data from a fixed seed. Therefore, if the other parameters (such as thresholds) remain unchanged, results on randomly generated data are reproducible across runs.

MetricType

IP

IP

Similarity measure for feature vectors. Currently, only inner product (IP) similarity is supported.

ExactSearchIndexType

FLAT

FLAT/SQ

Type of the index in the small library retrieval algorithm configured on the NPU.

ExactSearchThreshold

50000

-

Threshold for enabling the small library retrieval algorithm on the NPU. NPU retrieval consumes more resources than CPU retrieval, so when the data size is less than this threshold, the CPU is used for retrieval instead of the NPU.

ApproximateSearchIndexType

NONE

NONE/IVFSQ/IVFSQC

Type of the index in the large library retrieval algorithm configured on the NPU. The value NONE indicates that the large library retrieval algorithm is not used.

ApproximateSearchThreshold

5000000

-

Threshold for enabling the large library retrieval algorithm on the NPU. When the data size is small, the small library retrieval performs better than the large library retrieval. Therefore, when the large library retrieval algorithm is available and the data size is larger than the threshold, the large library retrieval algorithm is enabled.

Nlist

1024

-

Parameter of the large library retrieval algorithm on the NPU, which indicates the number of inverted lists (nlist) of the IVF index. FeatureRetrieval must support the large library retrieval algorithm with this value.

Nprobe

32

-

Parameter of the large library retrieval algorithm on the NPU, which indicates the number of inverted lists probed per query (nprobe) in the IVF index.

FuzzK

5

An integer greater than 1 and less than 10

FuzzK threshold used when the IVFSQC algorithm is used. This parameter affects the maximum number of features that can be stored on a single processor. For details, see the MindX SDK mxIndex FeatureRetrieval User Guide.

FuzzThreshold

1.003

A floating point number that is greater than 1 and less than 2, containing a maximum of three valid digits after the decimal point.

FuzzThreshold value used when the IVFSQC algorithm is used. This parameter affects the maximum number of features that can be stored on a single processor. For details, see the MindX SDK mxIndex FeatureRetrieval User Guide.

TrainIter

16

An integer greater than 0 and less than or equal to 512

Number of iterations during IVFSQC training. For details, see the MindX SDK mxIndex FeatureRetrieval User Guide.

DimReduction

TRUE

TRUE/FALSE

Whether dimension reduction is required for the large library retrieval algorithm on the NPU

ShortDim

64

-

Parameter of the large library retrieval algorithm on the NPU, which indicates the dimension to which input features are reduced. FeatureRetrieval must support dimension reduction to this value.

Devices

-

-

IDs of the NPU processors to use. Separate multiple processor IDs with commas (,). To view the available processor IDs, run the npu-smi info command.

ThreadNum

16

-

Number of threads used to execute jobs. Use this parameter to increase the degree of parallelism of the program.

ResourcesSize

128

-

Size of the memory pool used by the NPU, in MB
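The way ExactSearchThreshold, ApproximateSearchIndexType, and ApproximateSearchThreshold interact can be sketched in a few lines of Python. This is only an illustration of the rules stated in Table 1, not SDK code: the function name, its arguments, and the returned labels are hypothetical, and the behavior when the data size is exactly equal to a threshold is an assumption derived from the wording "less than" and "larger than".

```python
# Hypothetical helper that mirrors the threshold rules in Table 1.
# It is not part of FeatureClustering or FeatureRetrieval.

def select_search_backend(
    data_size: int,
    exact_search_threshold: int = 50000,
    approximate_search_index_type: str = "NONE",
    approximate_search_threshold: int = 5000000,
) -> str:
    """Return the retrieval path that the documented thresholds would select."""
    # Below ExactSearchThreshold, the CPU is used instead of the NPU.
    if data_size < exact_search_threshold:
        return "CPU"
    # The large library algorithm applies only when it is configured
    # (index type is not NONE) and the data size is larger than
    # ApproximateSearchThreshold.
    if (approximate_search_index_type != "NONE"
            and data_size > approximate_search_threshold):
        return "NPU_LARGE_LIBRARY"
    # Otherwise the small library algorithm on the NPU is used.
    return "NPU_SMALL_LIBRARY"


if __name__ == "__main__":
    for size in (10000, 200000, 6000000):
        print(size, select_search_backend(size, approximate_search_index_type="IVFSQ"))
```

With the default value NONE for ApproximateSearchIndexType, the sketch never selects the large library path, regardless of data size.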

Table 2 Archive parameters

Parameter

Default Value

Possible Value

Description

FeatureCount

10000

-

Number of features to be archived

When data is generated randomly, the value of FeatureCount cannot be set to 5000000.

NeedNormalization

TRUE

TRUE/FALSE

Whether features need to be normalized by their norm (modulus). Inner product similarity requires the features to be normalized first. Quantization of INT8 data is currently not supported.

PointPointThreshold

0.875

-

Threshold of the similarity between points (features) to be clustered

PointClusterThreshold

0.7

-

Threshold of the similarity between points and clusters

ClusterClusterThreshold

0.8

-

Threshold of the similarity between clusters

MinRankDistance

6

-

Minimum sorting distance

MaxRankDistance

10

-

Maximum sorting distance

MinPicNum

2

-

Minimum number of vectors in an archive. If the number of features in an archive is less than the value of this parameter after archiving, all features in the archive are set as outliers.

MaxCoverNum

1

-

Number of vectors in the cover archive. This parameter is reserved for future use and currently has no impact on MindX clustering.
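Two of the archive parameters in Table 2 lend themselves to a short illustration: NeedNormalization and MinPicNum. The sketch below assumes that normalization means scaling each feature to unit L2 norm, so that the inner product of two features equals their cosine similarity, and that archives are represented as a mapping from an archive ID to a list of feature IDs. All function names and data structures are hypothetical and are not taken from the SDK.

```python
import math
from typing import Dict, List, Sequence, Tuple


def l2_normalize(feature: Sequence[float]) -> List[float]:
    """Scale a feature vector to unit L2 norm (assumed meaning of NeedNormalization)."""
    norm = math.sqrt(sum(x * x for x in feature))
    if norm == 0.0:
        # A zero vector cannot be normalized; return it unchanged.
        return list(feature)
    return [x / norm for x in feature]


def inner_product(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product (IP), the only similarity measure supported by MetricType."""
    return sum(x * y for x, y in zip(a, b))


def split_outliers(
    archives: Dict[int, List[int]], min_pic_num: int = 2
) -> Tuple[Dict[int, List[int]], List[int]]:
    """Apply the MinPicNum rule: archives with fewer features become outliers."""
    kept: Dict[int, List[int]] = {}
    outliers: List[int] = []
    for archive_id, feature_ids in archives.items():
        if len(feature_ids) < min_pic_num:
            outliers.extend(feature_ids)
        else:
            kept[archive_id] = feature_ids
    return kept, outliers


if __name__ == "__main__":
    unit = l2_normalize([3.0, 4.0])
    print(inner_product(unit, unit))  # close to 1.0 for a normalized vector
    print(split_outliers({0: [1, 2, 3], 1: [4]}))
```

After normalization, the inner product of a feature with itself is 1 (up to floating-point rounding), which is why thresholds such as PointPointThreshold can be read as values on a fixed similarity scale.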

Table 3 Merge parameters

Parameter

Default Value

Possible Value

Description

ArchiveResultMergeThreshold

0.6

-

Threshold of the similarity between different archiving results. This parameter is used to merge archiving results in the archive merging scenario.

MergeArchivesCount

0

-

Total number of archives to be merged

Table 4 KMeans parameters

Parameter

Default Value

Possible Value

Description

FeatureCount

10000

-

Total number of feature vectors to be archived in the K-Means clustering and archiving scenarios. Generally, the K-Means clustering mode is recommended when the clustering scale is greater than 10 million. A single processor supports a maximum of 25 million feature vectors in a base library.

KMeansTimes

6

-

Number of K-Means clustering rounds. The minimum value is 1. The final clustering result is obtained by combining the results of multiple K-Means clustering rounds.

ArchiveNum

1000000

-

Estimated number of clusters when the number of feature vectors equals the value of FeatureCount. If no reference value is available, set this parameter to one tenth of the value of FeatureCount, and then tune it from there. (The final clustering result is obtained by combining multiple clustering rounds. Therefore, the final number of clusters may differ from the configured value of ArchiveNum, which is used only as a reference value within a single K-Means clustering round.)

TopK

100

-

Number of top-K retrieval results saved for each feature vector. A larger value of TopK gives higher accuracy, but may slightly degrade performance. Set this parameter as required.

MaxKMeansIterTimes

30

-

Maximum number of iterations in a single K-Means clustering round. When the number of iterations reaches this maximum, the K-Means clustering round ends immediately.

MinFreqKMeans

3

-

Used when the results of multiple K-Means clustering rounds are combined: the minimum frequency (number of rounds in which two points fall into the same class) required for the two points to belong to the same class in the final clustering result. The value of this parameter cannot exceed the value of KMeansTimes.

MaxFreqIso

3

-

Used when the results of multiple K-Means clustering rounds are combined and one or both of two points are isolated: the highest frequency with which the two points belong to the same class in the final clustering result. The value cannot exceed the value of KMeansTimes, and it should be greater than or equal to the value of MinFreqKMeans.
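The constraints among the K-Means parameters in Table 4 can be checked before a job is started. The following Python sketch encodes only the rules stated in the table: KMeansTimes is at least 1, MinFreqKMeans and MaxFreqIso do not exceed KMeansTimes, MaxFreqIso is preferably not smaller than MinFreqKMeans, and ArchiveNum starts from one tenth of FeatureCount when no reference value exists. The function names and message texts are hypothetical, and the SDK may perform its own checks differently.

```python
from typing import List


def default_archive_num(feature_count: int) -> int:
    """Starting value for ArchiveNum: one tenth of FeatureCount (at least 1)."""
    return max(1, feature_count // 10)


def check_kmeans_params(
    kmeans_times: int = 6,
    min_freq_kmeans: int = 3,
    max_freq_iso: int = 3,
) -> List[str]:
    """Return a list of violated Table 4 rules; an empty list means no problem found."""
    problems: List[str] = []
    if kmeans_times < 1:
        problems.append("KMeansTimes must be at least 1")
    if min_freq_kmeans > kmeans_times:
        problems.append("MinFreqKMeans cannot exceed KMeansTimes")
    if max_freq_iso > kmeans_times:
        problems.append("MaxFreqIso cannot exceed KMeansTimes")
    if max_freq_iso < min_freq_kmeans:
        # The table words this as a recommendation rather than a hard limit.
        problems.append("MaxFreqIso should be >= MinFreqKMeans")
    return problems


if __name__ == "__main__":
    print(default_archive_num(10000000))
    print(check_kmeans_params())         # default values from Table 4
    print(check_kmeans_params(2, 3, 3))  # too few rounds for the frequencies
```

The default values in Table 4 (KMeansTimes 6, MinFreqKMeans 3, MaxFreqIso 3) pass all four checks.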