BERT Training (Error on Other Nodes, Upward Source Tracing)

  1. Locate the faulty operator and instruction.

    The error is reported for the sum operator. Analysis of the operator's decompilation file and CCE file reveals no logic error. Because the sum operator uses the atomic add operation, the data is suspected to contain NAN, which would cause the atomic add to overflow.

  2. Run the network again using the version with the interrupt mask enabled. The divide-by-zero error "AIC_ERROR:0X10000000000000" is reported, and the failing operator is truediv. The NAN is therefore inferred to come from division by zero.
  3. Locate the cause of the divide-by-zero error.

    The UB data shows that the second input of truediv contains some zeros. According to the algorithm of the original model, division by zero can occur. However, the great + select operators are meant to discard the result of any division by zero, so inf/NAN should not be passed downstream even when a divide-by-zero error occurs. The next step is to find out why inf/NAN is nevertheless passed downstream.

  4. The implementation of the great + select operators is suspected to be incorrect, so that data containing NAN is not filtered out. Inspection of the operator implementation shows that the select operator implements selection indirectly, through a sequence of combined vmin and vmul instructions. When the input data contains NAN, the output is NAN regardless of whether the condition is true or false, so the result required by the algorithm is not obtained.
  5. Locate the root cause.

    The select operator is implemented incorrectly. As a result, the network's divide-by-zero protection is ineffective: the NAN produced by division by zero is passed downstream, and the atomic add operation overflows.
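
The failure chain above can be illustrated with a minimal Python sketch. It is not the operator's actual code: the source does not give the real vmin/vmul instruction sequence, so the sketch assumes a simple mask-and-multiply blend as a stand-in for an arithmetic select. The helper names (raw_divide, true_select, arithmetic_select) and the sample data are hypothetical. The point it shows is that multiplying an unwanted NAN or inf by a zero mask still yields NAN, whereas a branching select never touches the unselected value.

```python
import math


def raw_divide(num, den):
    """Mimic hardware-style division: x/0 yields inf or NaN instead of raising.

    Assumption: zero denominators are +0.0, so the sign follows the numerator.
    """
    out = []
    for n, d in zip(num, den):
        if d != 0.0:
            out.append(n / d)
        elif n == 0.0 or math.isnan(n):
            out.append(math.nan)                    # 0/0 -> NaN
        else:
            out.append(math.copysign(math.inf, n))  # x/0 -> +/-inf
    return out


def true_select(cond, a, b):
    """Select by branching: the unselected operand is never read arithmetically."""
    return [x if c else y for c, x, y in zip(cond, a, b)]


def arithmetic_select(cond, a, b):
    """Hypothetical select built from multiplies: mask * a + (1 - mask) * b.

    0.0 * NaN and 0.0 * inf are both NaN, so a bad value in the
    unselected operand leaks into the output.
    """
    out = []
    for c, x, y in zip(cond, a, b):
        mask = 1.0 if c else 0.0
        out.append(mask * x + (1.0 - mask) * y)
    return out


numerator = [1.0, 0.0, 3.0]
denominator = [2.0, 0.0, 0.0]            # second input of the division contains zeros
quotient = raw_divide(numerator, denominator)

# Guard: keep the quotient only where the denominator is greater than zero.
cond = [d > 0.0 for d in denominator]
fallback = [0.0] * len(quotient)

safe = true_select(cond, quotient, fallback)
leaky = arithmetic_select(cond, quotient, fallback)

# A downstream reduction stays finite only when the guard really filters NaN/inf.
safe_total = sum(safe)
leaky_total = sum(leaky)
print(safe, safe_total)
print(leaky, leaky_total)
```

With the branching select, the guarded positions fall back to 0.0 and the sum stays finite. With the arithmetic blend, the same guard condition is evaluated correctly, yet the NAN still reaches the reduction, which matches the symptom reported on the sum operator.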