Archive Rate

The archive rate depends heavily on the dataset. If the data in a dataset is closely related, the archive rate is high; otherwise, it is low. Because of this dependence, an evaluation based on the archive rate may be inaccurate, so decide whether to use it according to your requirements.

Assume a dataset contains three classes, A, B, and C, with three feature vectors of class A, one of class B, and one of class C. With an ideal clustering model, that is, one whose clustering result is always correct, the archive rate is 3/5 = 60%: only 60% of the features are archived, and the remaining features are isolated. In this case, the archive rate can be considered an accurate value. However, with a poorly performing clustering model that groups all data into one cluster, the archive rate is always 100%. Although this archive rate is higher, it does not indicate better accuracy.
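The two cases above can be reproduced with a minimal sketch. The `archive_rate` helper and the label lists are hypothetical illustrations, not part of the product API, and the sketch assumes that a cluster containing a single feature counts as isolated rather than archived.

```python
from collections import Counter


def archive_rate(cluster_ids):
    """Return the fraction of features that fall in clusters of size >= 2.

    cluster_ids: one cluster label per feature. A label that occurs only
    once is treated as an isolated (un-archived) feature (assumption).
    """
    if not cluster_ids:
        return 0.0
    sizes = Counter(cluster_ids)
    archived = sum(n for n in sizes.values() if n >= 2)
    return archived / len(cluster_ids)


# Dataset from the example: three A features, one B, one C.
ideal = ["A", "A", "A", "B", "C"]  # ideal model reproduces the classes
collapsed = [0, 0, 0, 0, 0]        # poor model puts everything in one cluster

ideal_rate = archive_rate(ideal)
collapsed_rate = archive_rate(collapsed)
print("ideal: {:.0%}, collapsed: {:.0%}".format(ideal_rate, collapsed_rate))
```

The collapsed model scores higher on this metric even though its clustering is wrong, which is why the archive rate should not be read as an accuracy measure on its own.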

```
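def RatioOfClusteredFeatures(res):
    '''
    Print the archive rate of a clustering result.

    Args:
        res: predefined protobuf-deserialized result
    '''
    # Features that were not assigned to any archive (isolated features).
    numUnarchives = len(res.unarchivedFeatures)
    # Sum the number of features across all archives.
    numArchives = sum(len(archive) for archive in res.archives)
    numTotal = numArchives + numUnarchives
    print("Total number of features to be clustered: {}".format(numTotal))
    print("Clustered result has {} archives, {} features clustered.".format(len(res.archives), numArchives))
    print("Clustered result has {} un-archived features".format(numUnarchives))
    # Guard against an empty result to avoid dividing by zero.
    if numTotal == 0:
        print("Total archive rate is undefined: no features in the result")
        return
    # 100.0 keeps the division floating-point on Python 2 as well.
    print("Total archive rate {}%".format(100.0 * numArchives / numTotal))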
```
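To try the calculation without a real protobuf message, a stand-in object can be used. In the sketch below, `SimpleNamespace` is only a hypothetical substitute for the deserialized result, and `archive_rate_percent` is an illustrative helper that mirrors the calculation above but returns the value instead of printing it. Only the `archives` and `unarchivedFeatures` field names come from the code above; the feature names are made up.

```python
from types import SimpleNamespace


def archive_rate_percent(res):
    """Return the archive rate in percent, or 0.0 for an empty result."""
    num_archived = sum(len(archive) for archive in res.archives)
    total = num_archived + len(res.unarchivedFeatures)
    return 100 * num_archived / total if total else 0.0


# Hypothetical stand-in for the deserialized result: one archive holding
# three features, plus two features left un-archived.
res = SimpleNamespace(
    archives=[["a1", "a2", "a3"]],
    unarchivedFeatures=["b1", "c1"],
)

rate = archive_rate_percent(res)
print("Total archive rate {}%".format(rate))
```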
