FeatureClustering Overview

FeatureClustering runs on the Ascend platform and supports unsupervised learning scenarios that require clustering at the scale of tens of millions. It currently provides three functions: distance-based feature archiving, archive merging, and K-Means feature archiving based on the number of categories.

FeatureClustering feature archiving (distance-based): Features are archived by vector similarity. When the similarity between two features meets the given threshold, the two features are considered to share the same ID and are grouped into the same archive.
The input is n Float32 feature vectors of dimension dim, together with the corresponding feature vector indexes, an array of length n in uint64 format. The output is a FeatureClustering result. Input and output data can be passed in memory through the APIs or serialized using the predefined protobuf message.
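
The grouping rule described above can be illustrated with a small, brute-force Python sketch. It is not the FeatureClustering API: the function name archive_by_similarity, the use of cosine similarity, and the transitive (union-find) grouping are all assumptions made for illustration, and the real component may use a different similarity measure and strategy.

```python
import math
from collections import defaultdict


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def archive_by_similarity(features, indexes, threshold):
    """Group feature indexes into archives (hypothetical helper, not the real API).

    features:  list of n vectors, each of length dim
    indexes:   list of n non-negative integer IDs, one per vector
    threshold: minimum similarity for two features to share an archive
    Returns a sorted list of archives, each a sorted list of feature indexes.
    """
    if len(features) != len(indexes):
        raise ValueError("features and indexes must have the same length")
    n = len(features)
    parent = list(range(n))  # union-find forest: one tree per archive

    def find(i):
        # Follow parent links to the root, halving the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force O(n^2) comparison; fine for a demo, not for large data.
    for i in range(n):
        for j in range(i + 1, n):
            if cosine_similarity(features[i], features[j]) >= threshold:
                root_i, root_j = find(i), find(j)
                if root_i != root_j:
                    parent[root_j] = root_i

    groups = defaultdict(list)
    for i in range(n):
        groups[find(i)].append(indexes[i])
    return sorted(sorted(group) for group in groups.values())


features = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
indexes = [100, 101, 200, 201]
archives = archive_by_similarity(features, indexes, threshold=0.9)
print(archives)
```

With this toy input, the first two vectors fall into one archive and the last two into another. The pairwise comparison is quadratic in n, so the sketch only conveys the thresholding idea and says nothing about how the component reaches its target scale.
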
FeatureClustering archive merging: Merges the results obtained from distance-based feature archiving.
The input is multiple FeatureClustering results. When the similarity between two archives meets the specified conditions, the archives are merged, and a single FeatureClustering result is generated. Input and output data can be passed in memory through the APIs or serialized using the predefined protobuf message.
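
As a rough sketch of what merging several archiving results could look like, the Python example below folds archives together when their centroids are similar enough. The data layout (a dict with "indexes" and "features"), the function name merge_results, and the centroid-based cosine comparison are assumptions for illustration only. The text above does not specify which merge condition the component applies.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def merge_results(results, threshold):
    """Merge archives from several results (hypothetical helper, not the real API).

    results:   list of results; each result is a list of archives, and each
               archive is a dict {"indexes": [...], "features": [...]}
    threshold: minimum centroid similarity for two archives to be merged
    Returns one merged result in the same layout.
    """
    merged = []
    for result in results:
        for archive in result:
            center = centroid(archive["features"])
            target = None
            # Greedy: join the first existing archive that is similar enough.
            for existing in merged:
                if cosine_similarity(center, centroid(existing["features"])) >= threshold:
                    target = existing
                    break
            if target is None:
                merged.append({
                    "indexes": list(archive["indexes"]),
                    "features": list(archive["features"]),
                })
            else:
                target["indexes"].extend(archive["indexes"])
                target["features"].extend(archive["features"])
    return merged


# Two results, as if produced by archiving two halves of a data set separately.
result_a = [
    {"indexes": [1, 2], "features": [[1.0, 0.0], [0.9, 0.1]]},
    {"indexes": [3], "features": [[0.0, 1.0]]},
]
result_b = [
    {"indexes": [4], "features": [[0.95, 0.05]]},
    {"indexes": [5], "features": [[0.1, 0.9]]},
]
merged = merge_results([result_a, result_b], threshold=0.9)
print([archive["indexes"] for archive in merged])
```

Here index 4 joins the archive holding indexes 1 and 2, and index 5 joins the archive holding index 3. A greedy first-match rule like this is order-dependent, so a custom merge policy may choose a different rule.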

If the data set is too large to archive in one pass, you can split the original feature vectors and their corresponding indexes, archive each part separately, and then merge the resulting FeatureClustering results. The archive merging function covers common merging scenarios. You can also merge archives according to your own service policy or algorithm requirements.

Figure 1 FeatureClustering archiving

Figure 2 FeatureClustering archive merging

FeatureClustering K-Means clustering (based on the number of categories): Distance-based feature archiving and archive merging process data at a fine granularity and require a large amount of memory. To cluster massive volumes of data with limited memory, the new FeatureClustering component provides AscendKMeans objects, which archive and cluster feature vectors based on the total number of categories. Compared with distance-based feature clustering, K-Means clustering generally uses less memory and performs better, but its accuracy varies depending on the dataset. Whether to use K-Means clustering depends on your actual requirements.
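
To make the idea of clustering by a fixed number of categories concrete, here is a minimal sketch of the standard K-Means (Lloyd's) algorithm in plain Python. It does not use or reproduce the AscendKMeans interface. The function name kmeans, its parameters, and the evenly spaced initial centroids are assumptions chosen to keep the example deterministic. Practical implementations typically use random or k-means++ initialization.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(features, k, max_iterations=20):
    """Plain Lloyd's algorithm (illustrative sketch, not the AscendKMeans API).

    features: list of n vectors, each of length dim
    k:        number of categories (clusters), 1 <= k <= n
    Returns (centroids, labels), where labels[i] is the cluster of features[i].
    """
    n = len(features)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of features")
    dim = len(features[0])
    # Deterministic initialization: pick k evenly spaced input vectors.
    centroids = [list(features[i * n // k]) for i in range(k)]
    labels = None
    for _ in range(max_iterations):
        # Assignment step: attach each vector to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(v, centroids[c]))
            for v in features
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, label in zip(features, labels) if label == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return centroids, labels


features = [
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
    [5.0, 5.0], [5.1, 5.0], [5.0, 5.1],
]
centroids, labels = kmeans(features, k=2)
print(labels)
```

Only the k centroids and one label per vector are kept between iterations, which is one reason a K-Means approach can need less memory than comparing features pairwise.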
