Dataset Construction in the Clustering Scenario

To construct a dataset in the clustering scenario, place the extracted features of each image in the directory named after its ID, and then split the dataset by performing the following operations.

  1. Folder filtering: Filters out ID folders with fewer than three samples, so that every remaining ID has enough samples to contribute a query.
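A minimal sketch of this filtering step, assuming each ID is a subdirectory of a root directory with one feature file per sample (the function name and directory layout are illustrative, not from the original):

```python
import os

def filter_id_folders(root, min_samples=3):
    """Return the ID folder names under `root` that hold at least `min_samples` files."""
    kept = []
    for id_name in sorted(os.listdir(root)):
        id_dir = os.path.join(root, id_name)
        # Keep only directories with enough samples to supply learn/base/query parts.
        if os.path.isdir(id_dir) and len(os.listdir(id_dir)) >= min_samples:
            kept.append(id_name)
    return kept
```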
  2. Learn & Base: Randomly selects 10% of each ID's samples as the learn set and 80% as the base set, records the actual IDs of the base vectors in base_dict (mapping each ID to its base indices), and randomly selects one of the remaining samples of each ID as that ID's query candidate, ensuring that the three parts of the data do not overlap.
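One way to sketch the per-ID split: reserve the query candidate first, then divide the rest between learn and base. The function name and rounding behavior are illustrative; the fractions here only approximate the 10%/80% split described above and can be adjusted:

```python
import random

def split_id(samples, learn_frac=0.1, seed=0):
    """Split one ID's samples into (learn, base, query_candidate) with no overlap.

    Sketch only: the query candidate is popped first so it always exists
    (step 1 guarantees at least three samples per ID), then roughly
    `learn_frac` of the remainder goes to learn and the rest to base.
    """
    rng = random.Random(seed)
    samples = list(samples)
    rng.shuffle(samples)
    query_candidate = samples.pop()                  # reserved query candidate
    n_learn = max(1, int(len(samples) * learn_frac)) # at least one learn sample
    learn = samples[:n_learn]
    base = samples[n_learn:]
    return learn, base, query_candidate
```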
  3. Query selection: Randomly selects 20,000 IDs as the query object categories and combines those IDs' query candidates into the query set, yielding 20,000 query vectors; records the actual ID of each query as query_dict.
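A sketch of assembling the query set, assuming query_candidates maps each ID to its candidate feature vector (the helper name is illustrative; the 20,000 default comes from the step above):

```python
import random
import numpy as np

def build_query_set(query_candidates, num_queries=20000, seed=0):
    """Sample `num_queries` IDs and stack their candidate vectors into the query set.

    Returns the float32 query matrix and query_dict: query index -> actual ID.
    """
    rng = random.Random(seed)
    chosen = rng.sample(sorted(query_candidates), min(num_queries, len(query_candidates)))
    query_features = np.stack([query_candidates[i] for i in chosen]).astype("float32")
    query_dict = {qidx: cls for qidx, cls in enumerate(chosen)}
    return query_features, query_dict
```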
  4. Normalization: Uses Faiss's normalize_L2 to L2-normalize the data, so that inner-product search is equivalent to cosine similarity.
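faiss.normalize_L2 normalizes each row to unit L2 norm in place (it expects a contiguous float32 array). An equivalent NumPy sketch makes the operation explicit:

```python
import numpy as np

def normalize_l2(x, eps=1e-12):
    """Row-wise L2 normalization, equivalent to faiss.normalize_L2 (modifies x in place)."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    x /= np.maximum(norms, eps)  # eps guards against all-zero rows
    return x
```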
  5. GT generation: Uses faiss.GpuIndexFlatIP to exhaustively compute the inner-product (IP) distance between each query and the base, and takes the base index (starting from 0) closest to each query as its label.
    Reference code:
     import faiss

     res = faiss.StandardGpuResources()
     config = faiss.GpuIndexFlatConfig()
     config.device = gpu_id                          # ID of the GPU to run the search on
     index = faiss.GpuIndexFlatIP(res, dim, config)  # exact (brute-force) inner-product index
     index.add(base)                                 # base: float32 array of shape (n_base, dim)
     dist, I = index.search(query_features, 1)       # top-1 distance and base index per query
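For reference, the same exact top-1 search can be reproduced without a GPU; the NumPy sketch below is a brute-force equivalent of the GpuIndexFlatIP search above:

```python
import numpy as np

def top1_inner_product(base, query_features):
    """Brute-force equivalent of an exact IP top-1 search (NumPy sketch)."""
    sims = query_features @ base.T              # (n_query, n_base) inner products
    I = sims.argmax(axis=1).reshape(-1, 1)      # top-1 base index per query
    dist = sims.max(axis=1).reshape(-1, 1)      # corresponding inner-product score
    return dist, I
```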
    
  6. Query filtering: After obtaining the labels of all queries, checks whether each query's actual ID matches the actual ID of the base vector selected as its GT. If it does, the query is kept; otherwise, the query and its GT are filtered out, ensuring that every remaining query has a reliable GT.

    Reference code:

     # query_cls: the actual ID of each query, in query order (the values of query_dict);
     # base_dict[cls]: the base indices whose vectors belong to ID cls.
     query_to_delete = []
     for qidx, (cls, top1) in enumerate(zip(query_cls, I)):
         nearest = top1[0]                  # base index of the top-1 (GT) neighbor
         if nearest not in base_dict[cls]:  # GT vector does not belong to the query's own ID
             query_to_delete.append(qidx)
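The flagged queries and their GT rows can then be dropped together; a sketch (the helper name is illustrative, the array names follow the steps above):

```python
import numpy as np

def drop_filtered(query_features, I, query_to_delete):
    """Remove the flagged queries and their GT rows, keeping the arrays aligned."""
    keep = np.setdiff1d(np.arange(len(query_features)),
                        np.asarray(query_to_delete, dtype=int))
    return query_features[keep], I[keep]
```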