Clustering Parameter Configuration

Archive/Merge/KMeans Parameters

Set ArchiveMode to ARCHIVE, MERGE, or KMEANS as described in Table 1 (Common Archive/Merge parameters). FeatureClustering runs only one mode at a time: it either archives or merges archives, not both.

**Table 1** Common Archive/Merge parameters
Parameter	Default Value	Possible Value	Description
ArchiveName	FeatureClustering	-	Name of an archiving/archive merging task. It is used to name and mark an archiving/archive merging operation. The name can contain a maximum of 512 characters.
ArchiveMode	ARCHIVE	ARCHIVE/MERGE/KMEANS	Archive mode. ARCHIVE: archiving. MERGE: archive merging. KMEANS: K-Means clustering for archiving massive data.
FeatureDim	512	-	Feature dimension for archiving/archive merging. FeatureRetrieval must support feature retrieval in the specified dimension.
FeatureDataType	FP32	FP32	Data type of the feature vector. Only FP32 is supported.
FeatureSource	RANDOM	RANDOM/FILE/INTERFACE	Source of the feature data to be archived. The data can be randomly generated (RANDOM), read from binary feature/ID files (FILE), or passed as feature/ID data memory addresses in API calls (INTERFACE). In the archive merging mode, randomly generated features cannot be used as the input. You are advised to set this parameter to RANDOM only when the value of FeatureCount is smaller than 1,000,000, because the distribution and performance of randomly generated data differ greatly from those of data in actual application scenarios. Currently, RANDOM generates data by using a fixed seed. Therefore, if other parameters (such as thresholds) remain unchanged, results on randomly generated data are reproducible.
MetricType	IP	IP	Similarity metric for feature vectors. Currently, only inner product (IP) similarity is supported.
ExactSearchIndexType	FLAT	FLAT/SQ	Type of the index in the small library retrieval algorithm configured on the NPU.
ExactSearchThreshold	50000	-	Threshold for enabling the small library retrieval algorithm on the NPU. NPU retrieval consumes more resources than CPU retrieval. Therefore, when the data size is less than the threshold, the CPU is used for retrieval instead of the NPU.
ApproximateSearchIndexType	NONE	NONE/IVFSQ/IVFSQC	Type of the index in the large library retrieval algorithm configured on the NPU. The value NONE indicates that the large library retrieval algorithm is not used.
ApproximateSearchThreshold	5000000	-	Threshold for enabling the large library retrieval algorithm on the NPU. When the data size is small, the small library retrieval performs better than the large library retrieval. Therefore, when the large library retrieval algorithm is available and the data size is larger than the threshold, the large library retrieval algorithm is enabled.
Nlist	1024	-	Parameter of the large library retrieval algorithm on the NPU, which indicates the number of inverted lists (nlist) of the IVF index. FeatureRetrieval must support the large library retrieval algorithm for this value.
Nprobe	32	-	Parameter of the large library retrieval algorithm on the NPU, which indicates the number of inverted lists probed per query (nprobe) in the IVF index.
FuzzK	5	An integer greater than 1 and less than 10	FuzzK threshold used when the IVFSQC algorithm is used. This parameter affects the maximum number of features that can be stored on a single processor. For details, see the MindX SDK mxIndex FeatureRetrieval User Guide.
FuzzThreshold	1.003	A floating point number that is greater than 1 and less than 2, containing a maximum of three valid digits after the decimal point.	FuzzThreshold value used when the IVFSQC algorithm is used. This parameter affects the maximum number of features that can be stored on a single processor. For details, see the MindX SDK mxIndex FeatureRetrieval User Guide.
TrainIter	16	An integer greater than 0 and less than or equal to 512	Number of iterations during IVFSQC training. For details, see the MindX SDK mxIndex FeatureRetrieval User Guide.
DimReduction	TRUE	TRUE/FALSE	Whether dimension reduction is required for the large library retrieval algorithm on the NPU
ShortDim	64	-	Parameter of the large library retrieval algorithm on the NPU, which indicates the reduced dimensionality of the input. FeatureRetrieval must support reduction of this dimensionality.
Devices	-	-	IDs of the NPU processors to be used. Use commas (,) to separate processor IDs. To query the processor IDs, run the npu-smi info command.
ThreadNum	16	-	Number of threads for executing jobs. This parameter is used to increase the degree of parallelism for program running.
ResourcesSize	128	-	Size of the memory pool used by the NPU, in MB
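
The value ranges and thresholds in Table 1 can be sketched as follows. This is a minimal illustration, not the FeatureClustering implementation: the function names `check_common_params` and `select_retrieval`, the dict-based configuration, and the returned labels are hypothetical. The behavior when the data size is exactly equal to a threshold is not stated in the table and is an assumption here.

```python
def check_common_params(params):
    """Return a list of violations of the ranges stated in Table 1.

    `params` is a hypothetical dict keyed by parameter name; missing
    keys fall back to the documented default values.
    """
    errors = []
    if len(params.get("ArchiveName", "FeatureClustering")) > 512:
        errors.append("ArchiveName exceeds 512 characters")
    mode = params.get("ArchiveMode", "ARCHIVE")
    if mode not in ("ARCHIVE", "MERGE", "KMEANS"):
        errors.append("ArchiveMode must be ARCHIVE, MERGE, or KMEANS")
    # Randomly generated features cannot be the input of archive merging.
    if mode == "MERGE" and params.get("FeatureSource", "RANDOM") == "RANDOM":
        errors.append("FeatureSource RANDOM is not allowed in MERGE mode")
    fuzz_k = params.get("FuzzK", 5)
    if not (isinstance(fuzz_k, int) and 1 < fuzz_k < 10):
        errors.append("FuzzK must be an integer in (1, 10)")
    fuzz_threshold = params.get("FuzzThreshold", 1.003)
    if not 1 < fuzz_threshold < 2 or round(fuzz_threshold, 3) != fuzz_threshold:
        errors.append("FuzzThreshold must be in (1, 2) with at most 3 decimals")
    train_iter = params.get("TrainIter", 16)
    if not (isinstance(train_iter, int) and 0 < train_iter <= 512):
        errors.append("TrainIter must be an integer in (0, 512]")
    return errors


def select_retrieval(data_size, params):
    """Pick a retrieval path from the two thresholds in Table 1."""
    exact_threshold = params.get("ExactSearchThreshold", 50000)
    approx_threshold = params.get("ApproximateSearchThreshold", 5000000)
    approx_type = params.get("ApproximateSearchIndexType", "NONE")
    if data_size < exact_threshold:
        return "CPU"  # below the small library threshold, the CPU is used
    if approx_type != "NONE" and data_size > approx_threshold:
        return "NPU:" + approx_type  # large library retrieval
    # Assumption: everything else uses the small library index on the NPU.
    return "NPU:" + params.get("ExactSearchIndexType", "FLAT")
```

With the default values, `select_retrieval(10000, {})` falls on the CPU path, and the large library path is taken only when ApproximateSearchIndexType is not NONE and the data size exceeds ApproximateSearchThreshold.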

**Table 2** Archive parameters
Parameter	Default Value	Possible Value	Description
FeatureCount	10000	-	Number of features to be archived. When data is generated randomly, the value of FeatureCount cannot be set to 5000000.
NeedNormalization	TRUE	TRUE/FALSE	Whether features need to be normalized by modulus (scaled to unit length). Features must be normalized before the inner product is used as the similarity measurement. Currently, quantization to INT8 data is not supported.
PointPointThreshold	0.875	-	Threshold of the similarity between points (features) to be clustered
PointClusterThreshold	0.7	-	Threshold of the similarity between points and clusters
ClusterClusterThreshold	0.8	-	Threshold of the similarity between clusters
MinRankDistance	6	-	Minimum sorting distance
MaxRankDistance	10	-	Maximum sorting distance
MinPicNum	2	-	Minimum number of vectors in an archive. If the number of features in an archive is less than the value of this parameter after archiving, all features in the archive are set as outliers.
MaxCoverNum	1	-	Number of vectors in the cover archive (This parameter has no impact on MindX clustering. It is a reserved parameter for future use.)
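
The role of NeedNormalization, PointPointThreshold, and MinPicNum can be illustrated with a short sketch. It assumes that normalization means scaling each feature to unit length, that two features are clustering candidates when their inner product reaches PointPointThreshold, and that archives are represented as a dict of feature ID lists. The function names are hypothetical and this is not the FeatureClustering implementation.

```python
import math


def l2_normalize(vec):
    """Scale a feature vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def inner_product(a, b):
    """Inner product; equals cosine similarity for unit-length inputs."""
    return sum(x * y for x, y in zip(a, b))


def split_outliers(archives, min_pic_num=2):
    """Apply the MinPicNum rule to a {archive_id: [feature_id, ...]} dict.

    Archives with fewer than `min_pic_num` features are dissolved and
    their features are returned as outliers.
    """
    kept = {k: v for k, v in archives.items() if len(v) >= min_pic_num}
    outliers = [fid for v in archives.values() if len(v) < min_pic_num for fid in v]
    return kept, outliers


# Usage: compare two normalized features against PointPointThreshold.
a = l2_normalize([3.0, 4.0])
b = l2_normalize([4.0, 3.0])
similar = inner_product(a, b) >= 0.875  # similarity is about 0.96

kept, outliers = split_outliers({"A": [1, 2, 3], "B": [4]})
```

Without normalization, the inner product also grows with vector length, so a fixed threshold such as 0.875 would not be comparable across features.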

**Table 3** Merge parameters
Parameter	Default Value	Possible Value	Description
ArchiveResultMergeThreshold	0.6	-	Threshold of the similarity between different archiving results. This parameter is used to merge archiving results in the archive merging scenario.
MergeArchivesCount	0	-	Total number of archives to be merged

**Table 4** K-Means parameters
Parameter	Default Value	Possible Value	Description
FeatureCount	10000	-	Total number of feature vectors to be archived in the K-Means clustering and archiving scenarios. Generally, the K-Means clustering mode is recommended when the clustering scale is greater than 10 million. A single processor supports a maximum of 25 million feature vectors in a base library.
KMeansTimes	6	-	Number of K-Means clustering rounds. The minimum value is 1. The final clustering result is obtained by combining the results of multiple K-Means clustering rounds.
ArchiveNum	1000000	-	Estimated number of clusters when the number of feature vectors is equal to the value of FeatureCount. If there is no specific reference value, set this parameter to one tenth of the value of FeatureCount, and then adjust it from that starting value. (The final clustering result is obtained from multiple rounds of clustering. Therefore, the total number of final clusters may differ from the configured value of ArchiveNum. The value of ArchiveNum is used only as a reference value in a single K-Means clustering.)
TopK	100	-	Number of top-K retrieval results saved for each feature vector. A larger value of TopK indicates higher accuracy, but may slightly deteriorate the performance. Set this parameter as required.
MaxKMeansIterTimes	30	-	Maximum number of iterations of a single K-Means clustering. If the number of iterations exceeds the maximum number, K-Means clustering ends immediately.
MinFreqKMeans	3	-	When the results of multiple rounds of K-Means clustering are combined, this parameter is the lowest frequency (number of rounds) with which two points must be assigned to the same class for them to belong to the same class in the final clustering result. The value of this parameter cannot exceed the value of KMeansTimes.
MaxFreqIso	3	-	When the results of multiple rounds of K-Means clustering are combined and one or both of two points are isolated, this parameter is the highest frequency (number of rounds) with which the two points are assigned to the same class for them to belong to the same class in the final clustering result. The value of this parameter cannot exceed the value of KMeansTimes, and it should be greater than or equal to the value of MinFreqKMeans.
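
The constraints among the K-Means parameters, and the idea of combining several rounds by counting how often two points share a class, can be sketched as follows. The function names are hypothetical, and the pair-counting step is only an assumed reading of how "frequency" is measured across rounds. It does not reproduce the actual combination logic of FeatureClustering, and it enumerates all pairs, so it is suitable only for small illustrations.

```python
from collections import Counter
from itertools import combinations


def default_archive_num(feature_count):
    """Starting value for ArchiveNum: one tenth of FeatureCount."""
    return max(1, feature_count // 10)


def check_kmeans_params(kmeans_times=6, min_freq_kmeans=3, max_freq_iso=3):
    """Return violations of the constraints stated in Table 4."""
    errors = []
    if kmeans_times < 1:
        errors.append("KMeansTimes must be at least 1")
    if min_freq_kmeans > kmeans_times:
        errors.append("MinFreqKMeans cannot exceed KMeansTimes")
    if max_freq_iso > kmeans_times:
        errors.append("MaxFreqIso cannot exceed KMeansTimes")
    if max_freq_iso < min_freq_kmeans:
        errors.append("MaxFreqIso should be >= MinFreqKMeans")
    return errors


def co_occurrence(rounds):
    """Count, for each pair of points, the rounds in which they share a class.

    `rounds` is a list of label lists, one per K-Means round, where
    labels[i] is the class assigned to point i in that round.
    """
    counts = Counter()
    for labels in rounds:
        for i, j in combinations(range(len(labels)), 2):
            if labels[i] == labels[j]:
                counts[(i, j)] += 1
    return counts


# Usage: three rounds over three points, with an assumed minimum frequency of 3.
rounds = [[0, 0, 1], [0, 0, 0], [1, 1, 0]]
counts = co_occurrence(rounds)
same_class_pairs = [pair for pair, n in counts.items() if n >= 3]
```

In this toy input, points 0 and 1 share a class in every round, while point 2 joins them only once, so only the pair (0, 1) reaches the assumed minimum frequency.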