Purity
Assume that 10 feature vectors from three ground-truth (GT) classes A, B, and C are clustered, and the following two clusters are obtained.
- Cluster 1: A01, A02, A03, A04, and B01 -> This cluster is marked as A.
- Cluster 2: B02, B03, B04, B05, and C01 -> This cluster is marked as B.
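Each cluster is marked with the class that most of its feature vectors belong to. Purity is the fraction of feature vectors whose GT class matches the mark of their cluster: Purity = (Number of A feature vectors in the cluster marked as A + Number of B feature vectors in the cluster marked as B)/(Total number of feature vectors in the cluster marked as A + Total number of feature vectors in the cluster marked as B). In this example, Purity = (4 + 4)/(5 + 5) = 80%.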
```
def CalPurity(allClusters, gt):
    '''
    Args
        allClusters: list of list, represents all clusters
        gt: ground-truth dict that maps each feature to its GT label
    Return
        the purity as a fraction between 0 and 1
    '''
    correctPredictions = 0
    allPredictions = 0
    for cluster in allClusters:
        # GetLabel is assumed to be defined elsewhere and to return the
        # majority GT label of the features in this cluster.
        label = GetLabel(cluster, gt)
        allPredictions += len(cluster)
        for feature in cluster:
            if gt[feature] == label:
                correctPredictions += 1
    print("Purity value is {} / {} = {}%".format(correctPredictions, allPredictions, 100 * correctPredictions / allPredictions))
    return correctPredictions / allPredictions
```
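To check the worked example end to end, the following is a minimal, self-contained sketch. It assumes that a cluster is marked with its most frequent GT class; `majority_label` is a hypothetical stand-in for `GetLabel`, whose implementation is not shown here, and the `gt` dict is derived from the first letter of each feature name purely for illustration.

```python
from collections import Counter


def majority_label(cluster, gt):
    """Return the most frequent GT label in a cluster (hypothetical stand-in for GetLabel)."""
    return Counter(gt[feature] for feature in cluster).most_common(1)[0][0]


def purity(all_clusters, gt):
    """Return the fraction of features whose GT label matches their cluster's majority label."""
    total = sum(len(cluster) for cluster in all_clusters)
    if total == 0:
        raise ValueError("no features to evaluate")
    correct = 0
    for cluster in all_clusters:
        if not cluster:
            continue  # an empty cluster contributes nothing
        label = majority_label(cluster, gt)
        correct += sum(1 for feature in cluster if gt[feature] == label)
    return correct / total


# The two clusters from the example above.
clusters = [
    ["A01", "A02", "A03", "A04", "B01"],
    ["B02", "B03", "B04", "B05", "C01"],
]
# Assumption for illustration: the first letter of each name is its GT class.
gt = {name: name[0] for cluster in clusters for name in cluster}

example_purity = purity(clusters, gt)
print(example_purity)  # 0.8
```

The result matches the manual calculation, (4 + 4)/(5 + 5) = 80%.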
Purity is biased toward splitting one class across multiple clusters. When an algorithm or model is tuned for high purity, it can simply split an existing cluster into several smaller clusters, because each smaller cluster remains highly cohesive. In the extreme case, placing every feature vector in its own cluster yields a purity of 100%. Such over-splitting, however, generally does not suit application scenarios in terms of overall performance.
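This bias can be seen with a small sketch that reuses the 10 feature vectors from the example. The `purity` helper and the alternative clusterings are illustrative assumptions, not part of the original evaluation code; the helper marks each cluster with its most frequent GT class.

```python
from collections import Counter


def purity(all_clusters, gt):
    """Sum of each cluster's majority-class count, divided by the total feature count."""
    total = sum(len(cluster) for cluster in all_clusters)
    correct = sum(
        max(Counter(gt[feature] for feature in cluster).values())
        for cluster in all_clusters
        if cluster
    )
    return correct / total


two_clusters = [
    ["A01", "A02", "A03", "A04", "B01"],
    ["B02", "B03", "B04", "B05", "C01"],
]
# Assumption for illustration: the first letter of each name is its GT class.
gt = {name: name[0] for cluster in two_clusters for name in cluster}

# Split each original cluster so that the minority feature stands alone.
# Class B is now spread over two clusters.
four_clusters = [
    ["A01", "A02", "A03", "A04"],
    ["B01"],
    ["B02", "B03", "B04", "B05"],
    ["C01"],
]

# Extreme case: every feature vector forms its own cluster.
singletons = [[name] for name in gt]

print(purity(two_clusters, gt))   # 0.8
print(purity(four_clusters, gt))  # 1.0
print(purity(singletons, gt))     # 1.0
```

Purity rises from 80% to 100% even though class B ends up divided, which is why purity alone does not reflect overall clustering quality.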