Purity

Assume that 10 feature vectors from three ground-truth (GT) classes A, B, and C are clustered, and the following two clusters are obtained.

  • Cluster 1: A01, A02, A03, A04, and B01 -> This cluster is marked as A.
  • Cluster 2: B02, B03, B04, B05, and C01 -> This cluster is marked as B.

Then we can calculate the purity using the following formula: Purity = (Number of A feature vectors in cluster A + Number of B feature vectors in cluster B)/(Total number of feature vectors in cluster A + Total number of feature vectors in cluster B). That is, (4 + 4)/(5 + 5) = 80%.

```
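def CalPurity(allClusters, gt):
    '''
    Args
        allClusters: list of lists, each inner list holds the features of one cluster
        gt: ground-truth dict that maps each feature to its class label
    Return
        the purity of the clustering result, as a percentage
    '''
    correctPredictions = 0
    allPredictions = 0
    for cluster in allClusters:
        # GetLabel is assumed to be defined elsewhere and to return the
        # majority ground-truth label of the cluster.
        label = GetLabel(cluster, gt)
        allPredictions += len(cluster)
        for feature in cluster:
            # A feature counts as correct if its class matches the cluster label.
            if gt[feature] == label:
                correctPredictions += 1
    purity = 100 * correctPredictions / allPredictions
    print("Purity value is {} / {} = {}%".format(correctPredictions, allPredictions, purity))
    return purity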
```
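
As a self-contained check of the worked example above, the following sketch recomputes the purity of the two clusters. The names `majority_label` and `purity` are hypothetical and introduced only for illustration. `majority_label` assumes that a cluster is marked with its most frequent ground-truth class, which matches how the two clusters are marked in the example, and the ground truth is derived from the first letter of each feature name.

```python
from collections import Counter


def majority_label(cluster, gt):
    """Return the most frequent ground-truth class in a cluster (assumed labeling rule)."""
    return Counter(gt[feature] for feature in cluster).most_common(1)[0][0]


def purity(all_clusters, gt):
    """Return the fraction of feature vectors whose class matches their cluster's label."""
    correct = 0
    total = 0
    for cluster in all_clusters:
        label = majority_label(cluster, gt)
        total += len(cluster)
        correct += sum(1 for feature in cluster if gt[feature] == label)
    return correct / total


# The two clusters from the example.
clusters = [
    ["A01", "A02", "A03", "A04", "B01"],
    ["B02", "B03", "B04", "B05", "C01"],
]
# Assumption for this sketch: the first letter of a feature name is its GT class.
gt = {feature: feature[0] for cluster in clusters for feature in cluster}

example_purity = purity(clusters, gt)
print("Purity = {:.0%}".format(example_purity))  # prints "Purity = 80%"
```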

Purity is biased toward splitting one class into multiple clusters. When an algorithm or model is required to reach high purity, it can do so easily by splitting an existing cluster into several smaller ones, each of which is highly cohesive. However, such over-splitting generally hurts overall performance in real application scenarios.
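
The bias toward splitting can be illustrated with the same 10 feature vectors. The sketch below is a minimal illustration, not a definitive implementation: the `purity` function is a hypothetical helper that labels each cluster with its majority ground-truth class, and the ground truth is assumed to be the first letter of each feature name. It compares the original two clusters with an extreme split in which every feature vector forms its own cluster.

```python
from collections import Counter


def purity(all_clusters, gt):
    """Purity with each cluster labeled by its majority GT class (assumed rule)."""
    # For each cluster, the size of its largest class is the number of correct features.
    correct = sum(
        Counter(gt[f] for f in cluster).most_common(1)[0][1]
        for cluster in all_clusters
    )
    total = sum(len(cluster) for cluster in all_clusters)
    return correct / total


features = ["A01", "A02", "A03", "A04", "B01", "B02", "B03", "B04", "B05", "C01"]
# Assumption for this sketch: the first letter of a feature name is its GT class.
gt = {f: f[0] for f in features}

two_clusters = [features[:5], features[5:]]  # the clustering from the example
singletons = [[f] for f in features]         # every vector in its own cluster

purity_two = purity(two_clusters, gt)
purity_singletons = purity(singletons, gt)
print(purity_two, purity_singletons)  # prints "0.8 1.0"
```

The fully split result reaches a purity of 100% even though it groups nothing together, which is why purity is usually read alongside a metric that penalizes splitting a class across clusters.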