Problem Source Demarcation

After you locate the dump data file of the operator where NaN first occurs, determine the root cause according to which of the following cases applies.

  1. If NaN occurs in the output of an operator, take the input of the operator and run the same operator logic on the CPU to obtain a reference output. If the CPU output does not match the NPU output, the NaN is caused by incorrect execution of the operator on the NPU. If the CPU output matches the NPU output, the operator executed normally and the NaN is the expected result for that input. In this case, trace back to the upstream operator and check whether it has execution problems or other problems.
  2. If NaN is found in the input of an operator, trace back to the upstream operator to locate the source of the NaN. If the output of the upstream operator is normal but the input of the current operator is abnormal, the fault is caused by memory corruption between the execution of the upstream operator and the execution of the current operator. Otherwise, check whether the upstream operator has other problems.
  3. If NaN is found in the embedding variable, the fault lies in the Rec SDK TensorFlow table lookup.
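
The decision logic in cases 1 and 2 can be sketched in Python with NumPy. This is a minimal illustration, not part of the Rec SDK: it assumes the dump data has already been converted to NumPy arrays, and the names `demarcate`, `cpu_op`, and the returned labels are hypothetical. The tolerances are also assumptions and should be adjusted to the data type in use.

```python
import numpy as np


def has_nan(arr):
    """Return True if the array contains at least one NaN."""
    return bool(np.isnan(np.asarray(arr, dtype=np.float64)).any())


def demarcate(op_input, npu_output, cpu_op, upstream_output=None,
              rtol=1e-3, atol=1e-3):
    """Classify the likely source of a NaN for one operator (hypothetical helper).

    op_input        -- operator input taken from the dump data
    npu_output      -- operator output taken from the dump data
    cpu_op          -- callable that applies the same operator logic on the CPU
    upstream_output -- output of the upstream operator, if available
    """
    # Case 2: the NaN is already present in the input.
    if has_nan(op_input):
        if upstream_output is not None and not has_nan(upstream_output):
            # Upstream output is clean but this input is not.
            return "memory_corruption_between_operators"
        return "check_upstream_operator"

    if not has_nan(npu_output):
        return "no_nan_in_this_operator"

    # Case 1: the NaN appears in the output; recompute on the CPU.
    with np.errstate(invalid="ignore", divide="ignore"):
        cpu_output = np.asarray(cpu_op(op_input))

    npu_output = np.asarray(npu_output)
    if cpu_output.shape == npu_output.shape and np.allclose(
            cpu_output, npu_output, rtol=rtol, atol=atol, equal_nan=True):
        # CPU reproduces the NaN: the operator executed normally.
        return "operator_normal_check_upstream"
    return "npu_operator_execution_error"
```

For example, if the CPU recomputation produces a finite result while the NPU dump contains NaN, the helper reports `npu_operator_execution_error`. If both produce NaN in the same positions, it reports `operator_normal_check_upstream`.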