Rand Index & Adjusted Rand Index

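The Rand Index (RI) is an intuitive metric for the accuracy of an archiving (clustering) result. Pick any two points p and q from the original dataset and compare how the ground truth and the clustering result treat them. There are four possible cases: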

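TP: p and q belong to the same class in the original dataset, and the clustering result puts them in the same cluster.
FN: p and q belong to the same class in the original dataset, but the clustering result puts them in different clusters.
FP: p and q belong to different classes in the original dataset, but the clustering result puts them in the same cluster.
TN: p and q belong to different classes in the original dataset, and the clustering result puts them in different clusters.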

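As with the confusion matrix used in deep learning, the accuracy of the result can be computed as (TP+TN)/(TP+TN+FP+FN), which is the Rand Index. The same pair counts can also be used to derive other metrics such as Recall.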
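The pair counting described above can be sketched in pure Python as follows. This is a minimal illustration, not a library API: the function names `pair_counts` and `rand_index` and the toy label lists are assumptions made for this example, and the quadratic loop is only practical for small datasets.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Return (TP, FN, FP, TN) counted over all unordered pairs of points."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]  # same class in the ground truth
        same_pred = pred[i] == pred[j]     # same cluster in the result
        if same_truth and same_pred:
            tp += 1
        elif same_truth:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def rand_index(truth, pred):
    """Rand Index = (TP + TN) / all pairs. Assumes at least two points."""
    tp, fn, fp, tn = pair_counts(truth, pred)
    return (tp + tn) / (tp + fn + fp + tn)


# Hypothetical toy labels: four points, two true classes, three clusters.
truth = [0, 0, 1, 1]
pred = [0, 0, 1, 2]
print(pair_counts(truth, pred))  # (1, 1, 0, 4)
print(rand_index(truth, pred))   # 5/6, about 0.833
```

With the same counts, pairwise Recall would be TP/(TP+FN), which is 0.5 for this toy input.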

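In practice, for an arbitrary pair of points p and q, the most likely case by far is that they belong to different classes in the original dataset and are also placed in different clusters. TN therefore accounts for the vast majority of both the numerator and the denominator, and it masks the differences between clustering results that we actually want to see. One remedy is to apply a weight that reduces the share of TN.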
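The effect can be seen on a small synthetic case. The sketch below assumes 100 points in 50 ground-truth classes of two points each, and a clustering that leaves every point in its own cluster, so no true pair is recovered. The helper names and the weighting formula `(TP + w*TN) / (TP + FN + FP + w*TN)` are hypothetical choices for illustration; the text above does not prescribe a specific weighting scheme.

```python
from itertools import combinations


def pair_counts(truth, pred):
    """Return (TP, FN, FP, TN) counted over all unordered pairs of points."""
    tp = fn = fp = tn = 0
    for i, j in combinations(range(len(truth)), 2):
        same_truth = truth[i] == truth[j]
        same_pred = pred[i] == pred[j]
        if same_truth and same_pred:
            tp += 1
        elif same_truth:
            fn += 1
        elif same_pred:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def weighted_rand_index(counts, tn_weight):
    """Hypothetical variant of RI that scales the TN count by tn_weight."""
    tp, fn, fp, tn = counts
    return (tp + tn_weight * tn) / (tp + fn + fp + tn_weight * tn)


truth = [i // 2 for i in range(100)]  # 50 classes, 2 points each
pred = list(range(100))               # every point is its own cluster
counts = pair_counts(truth, pred)

print(counts)                             # (0, 50, 0, 4900)
print(weighted_rand_index(counts, 1.0))   # plain RI, about 0.99
print(weighted_rand_index(counts, 0.01))  # about 0.49
```

Although the clustering finds none of the 50 true pairs, the unweighted RI is close to 1 because 4900 of the 4950 pairs are TN. Down-weighting TN makes the score reflect the missed pairs.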

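The Adjusted Rand Index (ARI) also solves the problem of TN dominating the score, because it corrects RI for the agreement expected by chance. Its maximum value is 1, and a higher value indicates a better clustering result. If the clustering result is close to a random assignment, ARI is close to 0. If the clustering algorithm is systematically wrong, that is, it tends to split feature vectors of the same GroundTruth class into different clusters and to merge feature vectors of different GroundTruth classes into the same cluster, ARI becomes negative (scikit-learn documents a lower bound of -0.5). For a clustering result obtained with a particular set of parameters on a given dataset, ARI is usually more sensitive than RI, so we recommend using ARI to evaluate clustering accuracy.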
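To make the chance correction concrete, the following sketch computes ARI from the contingency table of the two labelings, using the commonly cited form (index - expected index) / (max index - expected index). The function name `adjusted_rand_index` and the toy labels are assumptions for this example; in real use, `sklearn.metrics.adjusted_rand_score` is the safer choice.

```python
from collections import Counter
from math import comb  # requires Python 3.8+


def adjusted_rand_index(truth, pred):
    """ARI from pair counts of the contingency table (illustrative sketch)."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs to compare; treat as perfect agreement
    # Pairs that share both the true class and the predicted cluster.
    sum_cells = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    # Pairs that share the true class / the predicted cluster.
    sum_truth = sum(comb(c, 2) for c in Counter(truth).values())
    sum_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = sum_truth * sum_pred / comb(n, 2)  # agreement expected by chance
    max_index = (sum_truth + sum_pred) / 2
    if max_index == expected:
        return 1.0  # both labelings are trivial and identical
    return (sum_cells - expected) / (max_index - expected)


print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))  # 1.0, same partition
print(adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2]))  # 4/7, about 0.571
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))  # about -0.5

# 50 classes of 2 points, every point clustered alone: RI is about 0.99, ARI is 0.
truth = [i // 2 for i in range(100)]
print(adjusted_rand_index(truth, list(range(100))))     # 0.0
```

The last case is the one where plain RI looks almost perfect because of TN; ARI reports that the result is no better than chance.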

```
from sklearn import metrics


def CalcRI(allClusters, groundTruth, thresh=99999999):
    '''
    Compute the Rand Index and Adjusted Rand Index of a clustering result.

    Args
        allClusters: list of lists; each inner list holds the features of one cluster (archive)
        groundTruth: dict mapping each feature to its ground-truth class label
        thresh: black-hole archive threshold; an archive with more features than this value is ignored
    Returns
        tuple (rand score, adjusted rand score)
    '''
    # Drop oversized "black-hole" archives (more than thresh features).
    filteredClusters = [arc for arc in allClusters if len(arc) <= thresh]
    # Map every remaining feature to the index of its cluster.
    predDict = {}
    for label, cluster in enumerate(filteredClusters):
        for feat in cluster:
            predDict[feat] = label
    # Build aligned label lists in a deterministic feature order.
    featLis = sorted(predDict)
    pred = [predDict[feat] for feat in featLis]
    gt = [int(groundTruth[feat]) for feat in featLis]
    print("Start to calculate RI")
    ri = metrics.rand_score(gt, pred)
    print("Rand Score {}".format(ri))
    ari = metrics.adjusted_rand_score(gt, pred)
    print("Adjusted Rand Score {}".format(ari))
    return ri, ari
```
