Result Description

This section describes how to check the execution result after distributed training is complete and how to locate faults if the training fails.

Training Result Check

  1. Check your script execution result.

    The print result varies according to training scripts. If information similar to the following is displayed on each device for distributed training, the training is complete.

    When the environment variable DUMP_GE_GRAPH is enabled, GE dump graph files are generated.

    export DUMP_GE_GRAPH=2
    

    If the HcomBroadcast and HcomAllReduce operators appear in the dumped graph files, the HCCL operators for inter-NPU communication have been inserted properly.

    Figure 1 Dumped graphs from GE
  2. If your script fails to execute, analyze and locate the fault in the same way as for single-device training.

    To locate the fault, check the host log files plog-pid_*.log in $HOME/ascend/log/run/plog, where $HOME is the home directory of the user on the host.

    If the execution succeeds on a single device but fails on multiple devices, the issue is typically related to collective communication, as shown in Figure 2. For details, see section "FAQs" in the HCCL User Guide.

    Figure 2 Collective communication issue
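The operator check described in step 1 can be scripted. The following is a minimal sketch, not an official tool: `check_hccl_ops` is a hypothetical helper name, and the sketch assumes that the dumped graph files are text-readable and sit in a directory that you pass as the argument (the dump location is not specified here).

```shell
# Hypothetical helper: report whether each HCCL operator name appears in
# any file under the given dump directory (default: current directory).
# Assumption: the dumped graph files can be searched as plain text.
check_hccl_ops() {
    dump_dir="${1:-.}"
    missing=0
    for op in HcomBroadcast HcomAllReduce; do
        # -r: search recursively, -q: stop at first match, -s: hide read errors
        if grep -rqs -- "$op" "$dump_dir"; then
            echo "found: $op"
        else
            echo "missing: $op"
            missing=1
        fi
    done
    # Return non-zero if at least one operator was not found.
    return "$missing"
}
```

For example, run `check_hccl_ops /path/to/dump_dir` after the training script exits. A non-zero return value means that at least one of the two operators was not found in that directory.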

Troubleshooting

If the script execution fails, analyze and locate the fault based on the following logs:

Host-side run logs (generated while the application runs on the host): $HOME/ascend/log/run/plog/plog-pid_*.log

Device-side run logs (generated while the application runs on the device): $HOME/ascend/log/run/device-id/device-pid_*.log

$HOME is the home directory of the user on the host.

For more information, see Log Reference.

Use the ERROR-level log entries to identify the failing module and determine the cause.
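As a rough illustration of that search, the sketch below lists every ERROR line in the run logs together with its file name and line number. `show_error_logs` is a hypothetical helper name, and the sketch assumes that ERROR-level entries contain the literal text `ERROR`. Adjust the pattern if your log format differs.

```shell
# Hypothetical helper: print "file:line:text" for every log line that
# contains "ERROR" under the given log root.
# Assumption: ERROR-level entries contain the literal string "ERROR".
show_error_logs() {
    log_root="${1:-$HOME/ascend/log/run}"
    # Search host (plog) and device log files alike; -H prints the file
    # name, -n prints the line number within that file.
    find "$log_root" -type f -name '*.log' -exec grep -Hn -- 'ERROR' {} +
}
```

Calling `show_error_logs` without arguments searches $HOME/ascend/log/run. Pass another directory to search a copied log set. The module name shown in each matching entry can then be looked up in the fault location table.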

Figure 3 Error log example
Table 1 Fault location techniques

| Module Name | Error | Solution |
|---|---|---|
| System error | Environment and version mismatch | Check the version mapping and system installation. |
| GE | GE graph build or verification error | Specific error causes are provided for verification errors. You only need to modify the network script as prompted. |
| Runtime | Initialization or graph execution failure due to an environment exception | If initialization fails, check the environment configuration and whether the environment is occupied by other processes. |