Troubleshooting Procedure

  1. Locate the root cause of an error.

    Every AI Core error is reported while a task is running, but the task itself is not necessarily at fault. Other exceptions, such as abnormal termination of a parallel task, an exception in a task's dependency input, or an exception in the operating environment, can also trigger an AI Core error. Such errors cannot be located directly with the preceding locating method. Therefore, first check whether any other error information was reported before the AI Core error.
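    The check described above can be sketched as a small log filter. This is only an illustration: the log line layout ("[LEVEL] timestamp message"), the marker string "aicore error", and the function name errors_before_marker are all assumptions made for this example, not the actual log format or any tool shipped with the product.

```python
import re

# Hypothetical log layout: "[LEVEL] <timestamp> <message>".
# The real layout depends on your logging configuration.
LINE_RE = re.compile(r"^\[(?P<level>[A-Z]+)\]\s+(?P<ts>\S+)\s+(?P<msg>.*)$")


def errors_before_marker(lines, marker="aicore error"):
    """Collect ERROR-level entries logged before the first entry containing marker.

    The marker text is an assumption; replace it with the wording that
    your logs actually use for AI Core errors. If the marker never
    appears, all ERROR-level entries are returned.
    """
    earlier = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match is None or match.group("level") != "ERROR":
            continue  # skip non-matching lines and non-error levels
        if marker in match.group("msg").lower():
            return earlier  # stop at the first AI Core error
        earlier.append((match.group("ts"), match.group("msg")))
    return earlier


if __name__ == "__main__":
    sample = [
        "[INFO] 10:00:01 task 3 started",
        "[ERROR] 10:00:02 stream 5 aborted by dependency failure",
        "[ERROR] 10:00:03 aicore error reported, task 3",
    ]
    for ts, msg in errors_before_marker(sample):
        print(ts, msg)
```

    Any entry this filter returns is a candidate root cause that should be investigated before the AI Core error itself.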

  2. Locate the faulty operator.

    Locate the faulty operator based on the device ID, core ID, task ID, stream ID, node name, kernel name, and operator address listed in "1. Basic Information" in Locating AI Core Errors.

  3. Locate the faulty instruction.

    Use the difference between start pc and current pc, provided in "3. Instructions" in Locating AI Core Errors, to identify the instruction at which execution stopped abnormally. Then find that instruction in the decompilation file (.o.txt) and check the data storage operation that it performs.
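    The offset calculation above can be sketched as follows. The sketch assumes that the .o.txt file uses an objdump-style layout in which each instruction line begins with a hexadecimal offset followed by a colon; check this against your own file. The function names and the sample instruction text are placeholders, not real AI Core instructions.

```python
import re

# Assumed layout of an instruction line: "<hex offset>: <bytes and mnemonic>".
# Symbol header lines such as "0000000000000000 <kernel>:" do not match.
ADDR_RE = re.compile(r"^\s*([0-9a-fA-F]+):\s+(.*\S)\s*$")


def pc_offset(start_pc, current_pc):
    """Return current pc minus start pc; both are hexadecimal strings."""
    offset = int(current_pc, 16) - int(start_pc, 16)
    if offset < 0:
        raise ValueError("current pc is lower than start pc")
    return offset


def find_instruction(disassembly_lines, offset):
    """Return the text of the line at the given offset, or None if absent."""
    for line in disassembly_lines:
        match = ADDR_RE.match(line)
        if match and int(match.group(1), 16) == offset:
            return match.group(2)
    return None


if __name__ == "__main__":
    listing = [
        "0000000000000000 <kernel>:",
        "       0:   01 02 03 04    placeholder_op_a",
        "      40:   05 06 07 08    placeholder_op_b",
    ]
    offset = pc_offset("0x1000", "0x1040")
    print(hex(offset), find_instruction(listing, offset))
```

    If no line matches the computed offset exactly, inspect the nearest instructions in the file manually.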

  4. Locate the error type.

    Check the error code of the AI Core error to locate the error type.

  5. Locate the specific error.

    Table 1 Troubleshooting description

    | Category | Troubleshooting | Description |
    |---|---|---|
    | Operator functions | Reproduce the single-operator network. | Construct a single-operator network based on the network specifications and data to reproduce the error. If the error can be reproduced, it is caused by the current operator node. |
    | Operator functions | Check the node information. | Check the information about the error node in the computational graph, especially the shape inference result, format, and allocated memory size. For memory overwriting errors, check whether the size of the memory allocated by GE is as expected. |
    | Operator functions | Check the CCE code. | Analyze the correctness of the CCE code logic. Check whether the values of parameters such as offset, mask, and burst_length are correct. Some errors, such as divide-by-zero errors and inf/NaN results of instructions, may be caused by incorrect input data. |
    | Operator functions | Check the register data. | The register data, especially the offset and mask information, helps verify the CCE code. Obtain the register data from the black box logs and check it against the disassembled instruction to infer whether the CCE code is correct. |
    | Overflow errors | Check the input and output data. | The input and output data of a faulty node cannot be dumped, so check whether overflow data exists on the upstream nodes. If no overflow data exists on adjacent nodes, checking the entire network takes a long time. In this case, use the commissioning version that supports overflow determination to find the first overflow node on the entire network. |
    | Overflow errors | Compare the single-operator networks. | Compare the NPU and CPU calculation results based on network data. |
    | Overflow errors | Check the intermediate calculation results. | Read the in-process data at the corresponding positions, such as the UB and on-chip memory. With the commissioning version, you can quickly locate the operator that overflows or introduces inf/NaN data. |
    | Overflow errors | Perform upward source tracing. | • Unmask the interrupt and check the upstream nodes. • Enable overflow detection to locate the node where overflow occurs. |
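
    The overflow checks above, namely finding the first node that introduces inf/NaN data and comparing NPU results with CPU results, can be sketched as follows. The sketch assumes that the outputs of the nodes of interest have already been dumped and loaded as flat lists of floats in execution order. The function names and sample node names are hypothetical, and loading the dump files is out of scope here.

```python
import math


def first_non_finite_node(node_outputs):
    """Return the name of the first node whose output contains inf or NaN.

    node_outputs is a list of (node_name, values) pairs in execution
    order. Returns None if every value is finite.
    """
    for name, values in node_outputs:
        if any(not math.isfinite(v) for v in values):
            return name
    return None


def max_relative_error(npu_values, cpu_values, eps=1e-12):
    """Return the largest relative deviation of NPU results from CPU results.

    Intended for finite values only; eps guards against division by zero.
    """
    if len(npu_values) != len(cpu_values):
        raise ValueError("result lengths differ")
    return max(
        (abs(n - c) / max(abs(c), eps) for n, c in zip(npu_values, cpu_values)),
        default=0.0,
    )


if __name__ == "__main__":
    outputs = [
        ("node_a", [0.5, 1.0]),
        ("node_b", [float("inf"), 2.0]),
        ("node_c", [float("nan")]),
    ]
    print(first_non_finite_node(outputs))
    print(max_relative_error([1.0, 2.02], [1.0, 2.0]))
```

    The acceptable relative error depends on the data type and the operator, so the threshold is left to the reader.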