Training Configuration
This section describes precautions for porting and training models in special scenarios.
- To improve model running speed, you are advised to use binary operators. After installing the binary OPP package as described in "Installing the CANN Package", make the following code changes to enable binary operators:
- In the single-device scenario, modify the training entry point file (for example, main.py) and add the following call after import torch_npu:
```python
import torch
import torch_npu

# Enable binary operators by disabling just-in-time compilation.
torch_npu.npu.set_compile_mode(jit_compile=False)
......
```
- In the multi-device scenario, if multi-device training is started with mp.spawn, torch_npu.npu.set_compile_mode(jit_compile=False) must be added to the main function of each spawned process to enable binary operators. Otherwise, enable them in the same way as in the single-device scenario.
```python
def main_worker(gpu, ngpus_per_node, args):
    # Add to the main function for process startup.
    torch_npu.npu.set_compile_mode(jit_compile=False)
    ......

if is_distributed:
    mp.spawn(main_worker, nprocs=ngpus_per_node, args=(ngpus_per_node, args))
else:
    main_worker(args.gpu, ngpus_per_node, args)
```
- If the user training script contains the torch.nn.DataParallel API, which is not supported on the Ascend NPU platform, manually change it to torch.nn.parallel.DistributedDataParallel for multi-device training, as in the sketch below. For details, see Migrating GPU Single-device Scripts to NPU Multi-device Scripts.
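A minimal sketch of this change, assuming the script already defines model, rank, world_size, and local_rank (placeholder names that must match your own script):

```python
import torch
import torch_npu
import torch.distributed as dist

# Before (not supported on the Ascend NPU platform):
# model = torch.nn.DataParallel(model)

# After: initialize a process group (HCCL is the collective backend
# for Ascend NPUs) and wrap the model with DistributedDataParallel.
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)
torch_npu.npu.set_device(local_rank)
model = model.npu()
model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[local_rank])
```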
- If the user training script contains the amp_C module, which is not supported on the NPU platform, manually delete the code related to import amp_C before training.
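As an alternative to deleting the code outright, one portable option is to guard the import so the same script runs on both platforms; this is a sketch, not part of the official migration steps:

```python
# Remove or guard NPU-unsupported amp_C usage before training:
# import amp_C

try:
    import amp_C  # C extension shipped with NVIDIA Apex; GPU-only
    HAS_AMP_C = True
except ImportError:
    HAS_AMP_C = False  # on NPU, take code paths that do not use amp_C
```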
- If the user training script contains the torch.cuda.get_device_capability API, the migrated script returns None when it runs on the Ascend NPU platform.
On the GPU platform, torch.cuda.get_device_capability returns the GPU compute capability as a Tuple[int, int]. The torch.npu.get_device_capability API on the NPU platform has no equivalent concept and returns None. If this causes an error, manually replace the None with a fixed value of the Tuple[int, int] type.
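A minimal sketch of the workaround; the fixed tuple (8, 0) is an arbitrary placeholder and must be chosen so that the downstream checks in your script pass:

```python
import torch
import torch_npu

cap = torch.npu.get_device_capability()  # returns None on the NPU platform
if cap is None:
    cap = (8, 0)  # hypothetical fixed value; pick one your script's checks accept
major, minor = cap
```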
- When the migrated script calls the torch.cuda.get_device_properties API on the Ascend NPU platform, the returned object does not contain the minor and major attributes. You are advised to comment out the code that accesses these two attributes.
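A sketch of the recommended change, assuming the script previously read the major and minor attributes:

```python
import torch
import torch_npu

props = torch.npu.get_device_properties(0)

# Comment out accesses to attributes that are absent on the NPU platform:
# compute_capability = (props.major, props.minor)
```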