Purity

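Consider a clustering result in which 10 feature vectors from three ground-truth (GT) classes A, B, and C are grouped into the following two clusters (archives).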

Cluster 1: A01, A02, A03, A04, B01 -> this cluster is labeled A
Cluster 2: B02, B03, B04, B05, C01 -> this cluster is labeled B

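Purity is then computed as: (number of class-A feature vectors in the cluster labeled A + number of class-B feature vectors in the cluster labeled B) / (total number of feature vectors in the cluster labeled A + total number of feature vectors in the cluster labeled B), that is: (4 + 4) / (5 + 5) = 80%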
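The same calculation can be written in general form. The notation below is an assumption introduced only for illustration: $c_k$ is the set of feature vectors in cluster $k$, $g_j$ is the set of feature vectors in GT class $j$, and $N$ is the total number of clustered feature vectors. Each cluster is assumed to be labeled with its majority GT class, as in the example above.

```latex
\mathrm{Purity} = \frac{1}{N} \sum_{k} \max_{j} \left| c_k \cap g_j \right|

% Worked example from this section:
% cluster 1 contributes 4 (A), cluster 2 contributes 4 (B), N = 10
\mathrm{Purity} = \frac{4 + 4}{5 + 5} = 0.8 = 80\%
```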

```
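from collections import Counter


def GetLabel(cluster, gt):
    # Assumed helper (not shown in this section): returns the majority GT label of a cluster.
    return Counter(gt[feature] for feature in cluster).most_common(1)[0][0]


def CalPurity(allClusters, gt):
    '''
    Args
        allClusters: list of lists; each inner list holds the features of one cluster
        gt: ground-truth dict mapping each feature to its GT label
    Return
        the Purity of the clustering as a fraction in [0, 1]
    '''
    correctPredictions = 0
    allPredictions = 0
    for cluster in allClusters:
        if not cluster:
            continue  # skip empty clusters
        label = GetLabel(cluster, gt)
        allPredictions += len(cluster)
        for feature in cluster:
            if gt[feature] == label:
                correctPredictions += 1
    # Guard against division by zero when no features were clustered.
    purity = correctPredictions / allPredictions if allPredictions else 0.0
    print("Purity value is {} / {} = {}%".format(correctPredictions, allPredictions, 100 * purity))
    return purity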
```

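Purity is inherently biased toward results in which one GT class is split across multiple clusters. An algorithm or model that optimizes too aggressively for Purity can easily split an original class into several clusters: each cluster stays highly cohesive internally, so Purity is high, but the overall result usually does not meet the needs of the actual use case.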
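This bias can be illustrated with a minimal sketch. It uses the 10 feature vectors from the example above and compares the two-cluster result with an extreme over-split in which every feature vector is its own cluster. The function name `purity` and the variable names are hypothetical and exist only for this illustration; the GT label is assumed to be the first character of each feature name.

```python
from collections import Counter


def purity(clusters, gt):
    """Fraction of features whose GT label matches the majority label of their cluster."""
    total = sum(len(cluster) for cluster in clusters)
    # For each non-empty cluster, count the features that carry its majority GT label.
    correct = sum(
        Counter(gt[f] for f in cluster).most_common(1)[0][1]
        for cluster in clusters
        if cluster
    )
    return correct / total


features = ["A01", "A02", "A03", "A04", "B01",
            "B02", "B03", "B04", "B05", "C01"]
# Assumption for this sketch: the first character of the name is the GT class.
gt = {f: f[0] for f in features}

two_clusters = [features[:5], features[5:]]   # the result shown in this section
singletons = [[f] for f in features]          # every feature in its own cluster

print(purity(two_clusters, gt))  # 0.8
print(purity(singletons, gt))    # 1.0
```

The over-split result reaches a Purity of 100% even though it groups nothing, which is why Purity is usually read together with a metric that penalizes splitting one class across many clusters.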
