Training Script Execution
The training is executed with run_dcnet.py. The training script is stored in tools/train in the installation directory. The Python version is 3.7.5, and mixed precision mode is used by default during training.
File

```
python3 run_dcnet.py --save_path <save path> --data_path <dataset path> --data_name <eval dataset name> --learn_name <train dataset name> --warmup <warmup epochs> --epochs <training epochs> --metric <metric type> --batch_size <batch size> --valid_batch_size <valid batch size> --start_epoch <start epoch> --lr <learning rate> --use_amp <whether use amp> --apex_mode <apex mode> --scale <scale> --ftopk <fuzzy topk> --param_list <param list> --use_drop <use drop> --drop_prop <drop prop> --nlist <coarse centroid num> --niter <iterations> --nprobes <nprobes> --percents <fuzzy percentages> --gpu <gpu id> --seed <seed> --log_interval <log interval> --eval <whether to eval> --ai_center <trained model path>
```
Parameter

- <save path>: output storage path. Left blank by default; must be specified.
- <dataset path>: path containing all data, including the training and validation datasets. Must be specified.
- <eval dataset name>: name of the validation dataset (including base, query, and gt). Must be specified.
- <train dataset name>: name of the training dataset (for example, the name of learn10m.npy is learn10m). Must be specified.
- <warmup epochs>: number of warm-up epochs in the learning rate schedule. The default value is 3.
- <training epochs>: number of training epochs. The default value is 20.
- <metric type>: distance metric. The value can be IP or L2. The default value is IP.
- <batch size>: training batch size. The default value is 2048.
- <valid batch size>: evaluation batch size. The default value is 10000.
- <start epoch>: initial epoch when resuming a PyTorch model. The default value is 0.
- <learning rate>: initial learning rate. The default value is 0.0005. Must be specified.
- <whether use amp>: whether to use mixed precision training. The default value is False; set it to True to enable mixed precision.
- <apex mode>: mixed precision mode. The value can be O1 or O2. The default value is O2.
- <scale>: loss scale for mixed precision training. The default value is 1024.
- <fuzzy topk>: training parameter. The default value is 10.
- <param list>: model structure definition. The default value is [256, 512, 1024, 2048, 8192, 16384]. The first element is the input data dimension (256 by default), and the last element is the number of nlists (16384 by default).
- <use drop>: training policy. The default value is False; True is recommended.
- <drop prop>: drop proportion when drop is enabled. The default value is 0.05.
- <coarse centroid num>: number of L1 cluster centers. The default value is 16384.
- <iterations>: training parameter. The default value is 128.
- <nprobes>: nprobe values for retrieval evaluation. The default value is [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 12, 16, 20, 32].
- <percents>: redundancy ratios of the base library. The recommended values are [0.8, 1.0, 1.25, 1.50]. Must be specified.
- <gpu id>: training resource ID. The default value is 0. Set this parameter to -1 to use the CPU when no GPU is available. Must be specified.
- <seed>: random seed. The default value is 1.
- <log interval>: logging interval (in steps) during the training phase. The default value is 500.
- <eval>: evaluation-only mode. You can skip training and directly load a model to evaluate different validation datasets. The default value is False; set it to True to enable this function.
- <ai_center>: model path used for model evaluation.
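For illustration, the parameter handling above might be sketched with argparse as follows. This is an assumption of how run_dcnet.py could parse its flags, not its actual source; only a subset of the documented options is shown, with the documented types and defaults.

```python
import argparse

def str2bool(v):
    # argparse's plain bool() would treat the string "False" as truthy,
    # so flags such as --use_amp need an explicit converter.
    return str(v).lower() in ("true", "1", "yes")

def build_parser():
    # Hypothetical parser mirroring the documented flags (not the real
    # run_dcnet.py source); types and defaults follow the table above.
    p = argparse.ArgumentParser(description="IVFast training (sketch)")
    p.add_argument("--save_path", required=True)
    p.add_argument("--data_path", required=True)
    p.add_argument("--data_name", required=True)
    p.add_argument("--learn_name", required=True)
    p.add_argument("--warmup", type=int, default=3)
    p.add_argument("--epochs", type=int, default=20)
    p.add_argument("--metric", choices=["IP", "L2"], default="IP")
    p.add_argument("--batch_size", type=int, default=2048)
    p.add_argument("--lr", type=float, default=5e-4)
    p.add_argument("--use_amp", type=str2bool, default=False)
    p.add_argument("--param_list", type=int, nargs="+",
                   default=[256, 512, 1024, 2048, 8192, 16384])
    p.add_argument("--percents", type=float, nargs="+", required=True)
    p.add_argument("--gpu", type=int, default=0)  # -1 selects the CPU
    return p

args = build_parser().parse_args([
    "--save_path", "./save_path/",
    "--data_path", "/data1/AscendFaissTestData/",
    "--data_name", "webfaceLPS",
    "--learn_name", "learn10m",
    "--lr", "1e-4", "--use_amp", "True",
    "--percents", "0.0", "0.8", "1.0", "1.25", "1.50",
])
print(args.lr, args.use_amp, args.percents)
```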
Usage

You can run the following commands to perform an IVFast training test.

1) Model training:

```
python3 run_dcnet.py --save_path ./save_path/ --data_path /data1/AscendFaissTestData/ --data_name "webfaceLPS" --learn_name learn10m --warmup 3 --epochs 20 --lr 1e-4 --param_list 256 512 1024 2048 8192 16384 --use_drop True --drop_prop 0.05 --percents 0.0 0.8 1.0 1.25 1.50 --use_amp True --gpu 3
```

2) Model evaluation:

```
python3 run_dcnet.py --save_path ./save_path/ --data_path /data1/AscendFaissTestData/ --data_name "webfaceST" --learn_name learn10m --eval True --ai_center ./save_path/ivfast.npy --gpu 3 --percents 0.0 0.8 1.0 1.25 1.50
```
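When the training command is launched from another Python program (for example, a hyperparameter sweep), it can be assembled as an argument list. This is a minimal sketch using the example values above; the commented-out subprocess call is how one might actually start the run.

```python
import subprocess  # used only if the run is actually launched

def build_train_cmd(save_path, data_path, data_name, learn_name, lr, gpu):
    # Assemble the example training invocation shown in the Usage row.
    return [
        "python3", "run_dcnet.py",
        "--save_path", save_path,
        "--data_path", data_path,
        "--data_name", data_name,
        "--learn_name", learn_name,
        "--warmup", "3", "--epochs", "20",
        "--lr", str(lr),
        "--param_list", "256", "512", "1024", "2048", "8192", "16384",
        "--use_drop", "True", "--drop_prop", "0.05",
        "--percents", "0.0", "0.8", "1.0", "1.25", "1.50",
        "--use_amp", "True", "--gpu", str(gpu),
    ]

cmd = build_train_cmd("./save_path/", "/data1/AscendFaissTestData/",
                      "webfaceLPS", "learn10m", 1e-4, 3)
# subprocess.run(cmd, check=True)  # uncomment to actually launch training
print(" ".join(cmd))
```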
- During model training, the training log, model weights, and intermediate results are saved in save_path. The AI clustering model used in IVFast-index is /save_path/ivfast.npy.
- During model evaluation, you can directly load the trained ivfast.npy with different validation datasets and different redundancy ratios to evaluate generalization and decide whether to use the model.
- To improve training efficiency, the mixed precision O2 mode (use_amp=True) is used by default. There is no exception handling mechanism for mixed precision overflow on the GPU. If the loss becomes NaN during training, you are advised to reduce the learning rate (for example, from 5e-4 to 1e-4) or disable mixed precision (the training time will be prolonged).
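The manual NaN fallback described above can be sketched as a small guard inside a training loop. This is an illustrative helper, not part of run_dcnet.py; `guard_nan_loss` and `_DummyOptimizer` are hypothetical names, and the dummy stands in for a torch.optim optimizer (only `param_groups` is touched).

```python
import math

def guard_nan_loss(loss, optimizer, shrink=0.2, lr_floor=1e-5):
    # If the loss is NaN/Inf, shrink every learning rate (e.g. 5e-4 -> 1e-4)
    # and signal the caller to skip this optimizer step. This mirrors the
    # manual advice above; the training script has no such automatic fallback.
    if math.isfinite(loss):
        return True
    for group in optimizer.param_groups:
        group["lr"] = max(group["lr"] * shrink, lr_floor)
    return False

class _DummyOptimizer:
    # Stand-in for a torch.optim optimizer: only param_groups is used here.
    def __init__(self, lr):
        self.param_groups = [{"lr": lr}]

opt = _DummyOptimizer(5e-4)
ok = guard_nan_loss(float("nan"), opt)  # NaN loss -> lr shrinks to ~1e-4
print(ok, opt.param_groups[0]["lr"])
```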