Training Script Execution
The training is executed with run_dcnet.py. The training script is stored in tools/train in the installation directory. The Python version is 3.7.5, and mixed precision mode is used by default during training.
File

```
python3 run_dcnet.py --save_path <save path> --data_path <dataset path> --data_name <eval dataset name> --learn_name <train dataset name> --warmup <warmup epochs> --epochs <training epochs> --metric <metric type> --batch_size <batch size> --valid_batch_size <valid batch size> --start_epoch <start epoch> --lr <learning rate> --use_amp <whether use amp> --apex_mode <apex mode> --scale <scale> --ftopk <fuzzy topk> --param_list <param list> --use_drop <use drop> --drop_prop <drop prop> --nlist <coarse centroid num> --niter <iterations> --nprobes <nprobes> --percents <fuzzy percentages> --gpu <gpu id> --seed <seed> --log_interval <log interval> --eval <whether to eval> --ai_center <trained model path>
```
Parameter

- <save path>: output storage path. Left blank by default; must be specified.
- <dataset path>: path containing all data, including the training and validation datasets. Must be specified.
- <eval dataset name>: name of the validation dataset (including base, query, and gt). Must be specified.
- <train dataset name>: name of the training dataset (for example, the name of learn10m.npy is learn10m). Must be specified.
- <warmup epochs>: number of warm-up epochs in the learning rate schedule. The default value is 3.
- <training epochs>: number of training epochs. The default value is 20.
- <metric type>: distance metric. The value can be IP or L2. The default value is IP.
- <batch size>: training batch size. The default value is 2048.
- <valid batch size>: evaluation batch size. The default value is 10000.
- <start epoch>: initial epoch when resuming a PyTorch model. The default value is 0.
- <learning rate>: initial learning rate. The default value is 0.0005. Must be specified.
- <whether use amp>: whether to use mixed precision training. The default value is False; set it to True to enable mixed precision.
- <apex mode>: mixed precision mode. The value can be O1 or O2. The default value is O2.
- <scale>: loss scale for mixed precision training. The default value is 1024.
- <fuzzy topk>: training parameter. The default value is 10.
- <param list>: model structure definition. The default value is [256, 512, 1024, 2048, 8192, 16384]. The first element is the input data dimension (256 by default), and the last element is the number of nlists (16384 by default).
- <use drop>: training policy. The default value is False; True is recommended.
- <drop prop>: drop proportion when drop is enabled. The default value is 0.05.
- <coarse centroid num>: number of L1 cluster centers. The default value is 16384.
- <iterations>: training parameter. The default value is 128.
- <nprobes>: nprobe values for retrieval evaluation. The default value is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 20, 32].
- <percents>: redundancy ratios of the base library. The recommended values are [0.8, 1.0, 1.25, 1.50]. Must be specified.
- <gpu id>: training resource ID. The default value is 0. Set this parameter to -1 to use the CPU when no GPU is available. Must be specified.
- <seed>: random seed. The default value is 1.
- <log interval>: logging interval (in steps) during the training phase. The default value is 500.
- <eval>: evaluation-only mode. You can skip training and directly load a model to evaluate different validation datasets. The default value is False; set it to True to enable this function.
- <ai_center>: model path used for model evaluation.
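For illustration, the parameter handling above might be sketched with argparse as follows. This is an assumption of how run_dcnet.py could parse its flags, not its actual source; only a subset of the documented options is shown, with the documented types and defaults.

```python
import argparse

def str2bool(v):
    # argparse's plain bool() would treat the string "False" as truthy,
    # so flags such as --use_amp need an explicit converter.
    return str(v).lower() in ("true", "1", "yes")

def build_parser():
    # Hypothetical parser mirroring the documented flags (not the real
    # run_dcnet.py source); types and defaults follow the table above.
    p = argparse.ArgumentParser(description="IVFast training (sketch)")
    p.add_argument("--save_path", required=True)
    p.add_argument("--data_path", required=True)
    p.add_argument("--data_name", required=True)
    p.add_argument("--learn_name", required=True)
    p.add_argument("--warmup", type=int, default=3)
    p.add_argument("--epochs", type=int, default=20)
    p.add_argument("--metric", choices=["IP", "L2"], default="IP")
    p.add_argument("--batch_size", type=int, default=2048)
    p.add_argument("--lr", type=float, default=5e-4)
    p.add_argument("--use_amp", type=str2bool, default=False)
    p.add_argument("--param_list", type=int, nargs="+",
                   default=[256, 512, 1024, 2048, 8192, 16384])
    p.add_argument("--percents", type=float, nargs="+", required=True)
    p.add_argument("--gpu", type=int, default=0)  # -1 selects the CPU
    return p

args = build_parser().parse_args([
    "--save_path", "./save_path/",
    "--data_path", "/data1/AscendFaissTestData/",
    "--data_name", "webfaceLPS",
    "--learn_name", "learn10m",
    "--lr", "1e-4", "--use_amp", "True",
    "--percents", "0.0", "0.8", "1.0", "1.25", "1.50",
])
print(args.lr, args.use_amp, args.percents)
```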
Usage

You can run the following commands to perform an IVFast training test.

1) Model training:

```
python3 run_dcnet.py --save_path ./save_path/ --data_path /data1/AscendFaissTestData/ --data_name "webfaceLPS" --learn_name learn10m --warmup 3 --epochs 20 --lr 1e-4 --param_list 256 512 1024 2048 8192 16384 --use_drop True --drop_prop 0.05 --percents 0.0 0.8 1.0 1.25 1.50 --use_amp True --gpu 3
```

2) Model evaluation:

```
python3 run_dcnet.py --save_path ./save_path/ --data_path /data1/AscendFaissTestData/ --data_name "webfaceST" --learn_name learn10m --eval True --ai_center ./save_path/ivfast.npy --gpu 3 --percents 0.0 0.8 1.0 1.25 1.50
```
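When the training command is launched from another Python program (for example, a hyperparameter sweep), it can be assembled as an argument list. This is a minimal sketch using the example values above; the commented-out subprocess call is how one might actually start the run.

```python
import subprocess  # used only if the run is actually launched

def build_train_cmd(save_path, data_path, data_name, learn_name, lr, gpu):
    # Assemble the example training invocation shown in the Usage row.
    return [
        "python3", "run_dcnet.py",
        "--save_path", save_path,
        "--data_path", data_path,
        "--data_name", data_name,
        "--learn_name", learn_name,
        "--warmup", "3", "--epochs", "20",
        "--lr", str(lr),
        "--param_list", "256", "512", "1024", "2048", "8192", "16384",
        "--use_drop", "True", "--drop_prop", "0.05",
        "--percents", "0.0", "0.8", "1.0", "1.25", "1.50",
        "--use_amp", "True", "--gpu", str(gpu),
    ]

cmd = build_train_cmd("./save_path/", "/data1/AscendFaissTestData/",
                      "webfaceLPS", "learn10m", 1e-4, 3)
# subprocess.run(cmd, check=True)  # uncomment to actually launch training
print(" ".join(cmd))
```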
- During model training, the training log, model weights, and intermediate results are saved in save_path. The AI clustering model used in IVFast-index is /save_path/ivfast.npy.
- During model evaluation, you can directly load the trained ivfast.npy with different validation datasets and different redundancy ratios to evaluate generalization and decide whether to use the model.
- To improve training efficiency, the mixed precision O2 mode (use_amp=True) is used by default. There is no exception handling mechanism for mixed precision overflow on the GPU. If the loss becomes NaN during training, you are advised to reduce the learning rate (for example, from 5e-4 to 1e-4) or disable mixed precision (the training time will be prolonged).
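The manual NaN fallback described above can be sketched as a small guard inside a training loop. This is an illustrative helper, not part of run_dcnet.py; `guard_nan_loss` and `_DummyOptimizer` are hypothetical names, and the dummy stands in for a torch.optim optimizer (only `param_groups` is touched).

```python
import math

def guard_nan_loss(loss, optimizer, shrink=0.2, lr_floor=1e-5):
    # If the loss is NaN/Inf, shrink every learning rate (e.g. 5e-4 -> 1e-4)
    # and signal the caller to skip this optimizer step. This mirrors the
    # manual advice above; the training script has no such automatic fallback.
    if math.isfinite(loss):
        return True
    for group in optimizer.param_groups:
        group["lr"] = max(group["lr"] * shrink, lr_floor)
    return False

class _DummyOptimizer:
    # Stand-in for a torch.optim optimizer: only param_groups is used here.
    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]

opt = _DummyOptimizer(5e-4)
ok = guard_nan_loss(float("nan"), opt)  # NaN loss -> lr shrinks to ~1e-4
print(ok, opt.param_groups[0]["lr"])
```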