Error Code 0x4

Fault Locating

***********************2. AICERROR code***********************
# Gives the AI Core error code and description.
code  : 0x4

This error code indicates that the function call depth exceeds the configured value. That is, the number of call layers between CCE operator functions exceeds the threshold, which is controlled by a register. The threshold is described as follows.

Bits [5:2]: T is the call stack overflow threshold. If the number of call layers exceeds T+1, a maskable error is reported. The size of a UB call stack is limited. Assuming that each function call occupies 4 KB of stack and UB reserves 16 KB as the total stack, at most 16 KB / 4 KB = 4 call layers fit, so T+1 = 4 and the initial reset value of T is 3.

The default value of T is 3 in the V100 ISA and 2 in the V200 ISA. As long as the number of call layers is fewer than 4, the error does not occur with either default. Therefore, the following two scenarios are possible:

  • There are more than four layers of operator calls in the CCE code.
  • The CCE code has an operation that modifies the register, so that the threshold is changed to a smaller value.
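
The threshold rule above can be modeled in plain C as a rough, host-side sketch. It is not device code: the helper names and the idea of passing the raw register value as a parameter are assumptions made for illustration. Only the [5:2] field position and the "exceeds T+1" rule come from the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits [5:2] of the register hold the call stack overflow threshold T. */
#define STACK_THRESHOLD_SHIFT 2u
#define STACK_THRESHOLD_MASK  0xFu   /* four bits wide: [5:2] */

/* Hypothetical helper: extract T from a raw register value. */
static uint32_t stack_threshold(uint64_t reg_value)
{
    return (uint32_t)((reg_value >> STACK_THRESHOLD_SHIFT) & STACK_THRESHOLD_MASK);
}

/* Hypothetical helper: true if call_layers exceeds T + 1, which is the
 * condition under which error code 0x4 is reported. */
static bool call_depth_overflows(uint64_t reg_value, uint32_t call_layers)
{
    return call_layers > stack_threshold(reg_value) + 1u;
}
```

Under this model, T = 3 allows four call layers and flags the fifth, while T = 2 already flags the fourth. This is why a write that lowers the field can cause the error in code that previously ran without it.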

Troubleshooting Procedure

  1. Locate the faulty operator based on the device ID, core ID, task ID, stream ID, node name, kernel name, and operator address listed in 1. Basic information of the log.
    ***********************1. Basic information********************
    # Gives the basic information about the device occurred with the AI Core error.
    # kernel name: operator kernel name
    # op address: address of the operator code in the DDR
    # args address: address of the operator arguments in the DDR
    error time   : 2020-08-26-11:24:07
    device id    : 0
    core id      : 0
    task id      : 60
    stream id    : 517
    node name    : trans_TransData_167
    kernel name  : te_transdata_16b6e15e2a5cc7f70_33e5fb7ae8478ddb
    op address   : 0x101000120000
    args address : 0X101000053000
  2. Check whether the corresponding operator has more than four call layers.
    • If yes, the error has been located.
    • If no, go to 3.
  3. Check whether the register is directly set or modified in the CCE code.
    • If yes, the error has been located.
    • If no, further locate the error.
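
The check in step 3 amounts to asking whether any register write touches bits [5:2]. A minimal sketch in plain C follows, assuming the register values before and after a write are known (for example, from reviewing the constants used in the CCE code). The helper names are hypothetical and are not part of any toolkit API.

```c
#include <stdbool.h>
#include <stdint.h>

/* Mask covering bits [5:2], the call stack overflow threshold field. */
#define STACK_THRESHOLD_FIELD ((uint64_t)0xF << 2)

/* Hypothetical helper: true if the write changes the threshold field at all. */
static bool write_changes_threshold(uint64_t old_value, uint64_t new_value)
{
    return ((old_value ^ new_value) & STACK_THRESHOLD_FIELD) != 0;
}

/* Hypothetical helper: true if the write lowers T, the case that can
 * make an otherwise valid call depth report error code 0x4. */
static bool write_lowers_threshold(uint64_t old_value, uint64_t new_value)
{
    return (new_value & STACK_THRESHOLD_FIELD) < (old_value & STACK_THRESHOLD_FIELD);
}
```

A write that only sets or clears bits outside [5:2] returns false from both helpers, so it can be ruled out as the cause.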

Setting the CTRL Instruction

In the scenario where atomic_add needs to be implemented, set the CTRL instruction as follows (FP32 is used as an example).

uint64_t ctrl_reg = get_ctrl();                             // Save the current CTRL value
uint64_t config_reg = (ctrl_reg | ((uint64_t)1 << 60));     // Set bit 60
// Stricter alternative that also clears bit 61:
// uint64_t config_reg = (ctrl_reg | ((uint64_t)1 << 60)) & (~((uint64_t)1 << 61));
set_ctrl(config_reg);
// ... operations that rely on atomic_add go here ...
set_ctrl(ctrl_reg);   // Instruction restoration
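
The bit manipulation in this listing can be isolated in a pure function and checked off-device. The sketch below is a host-side illustration only. The macro and function names are hypothetical, and it assumes that the intent of the comment in the listing is to set bit 60 and clear bit 61. The meaning of bit 61 is not stated in this section.

```c
#include <stdint.h>

/* Bit positions taken from the listing above; the names are hypothetical. */
#define CTRL_BIT60 ((uint64_t)1 << 60)
#define CTRL_BIT61 ((uint64_t)1 << 61)

/* Hypothetical helper: returns the CTRL value with bit 60 set and
 * bit 61 cleared, leaving every other bit unchanged. */
static uint64_t ctrl_with_fp32_atomic_add(uint64_t ctrl_reg)
{
    return (ctrl_reg | CTRL_BIT60) & ~CTRL_BIT61;
}
```

Because the function modifies only bits 60 and 61, passing its result to set_ctrl and later restoring the saved value leaves all other fields as they were.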