ResNet-50 Error on the Cloud (Computation Overflow)
- Locate the abnormal operator.
Use the version with the interrupt mask enabled to locate the operator that introduces the inf/NaN data. The error code 0x40000000000000 is reported on the MaxPoolGrad operator.
- Locate the faulty instruction.
Use the decompiled .o file, the PC Start and Current PC values, and the CCE file to locate the CCE code line that triggers the error. The faulty instruction is found to be vsub.
- Analyze the root cause of the error.
Check the parameter values of the vsub instruction and compare them with the amount of data moved in. The amount of moved data is 49 x 32 bytes, but the amount of vector data computed is 8 x 128 bytes. No mask is applied, and the unaligned part is not filled with default values, so the data in the unaligned part is unpredictable. A possible cause is that the data computed by the operator is of type int32. As a result, the data type does not match, the data is parsed as NaN, and the AI Core reports an error.
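
The root cause described above can be illustrated with a small host-side sketch. It is not the CCE kernel and does not reproduce the real buffer sizes. It assumes an 8-lane buffer in which only 6 lanes were moved in, it assumes the vector lanes are read as float16, and the stale value `0x7C007E00` and the helper names (`int32_as_float16_lanes`, `vector_sub`, `non_finite`) are hypothetical. The sketch shows that an int32 bit pattern left in the unaligned tail can decode as NaN or inf, and that limiting the computation to the valid lanes (standing in for a vector mask) avoids it.

```python
import math
import struct


def int32_as_float16_lanes(value):
    """Reinterpret the 4 bytes of an int32 bit pattern as two little-endian float16 lanes."""
    return list(struct.unpack("<2e", struct.pack("<I", value & 0xFFFFFFFF)))


def vector_sub(a, b, valid_lanes=None):
    """Lane-wise a - b. If valid_lanes is given, lanes at or past it are skipped
    and left as 0.0, which stands in for applying a vector mask."""
    n = len(a) if valid_lanes is None else valid_lanes
    out = [0.0] * len(a)
    for i in range(n):
        out[i] = a[i] - b[i]
    return out


def non_finite(lanes):
    """Return the indices of lanes holding NaN or inf."""
    return [i for i, x in enumerate(lanes) if not math.isfinite(x)]


# Hypothetical leftover int32 bits in the tail: the low half (0x7E00) decodes
# as a float16 NaN, the high half (0x7C00) as float16 +inf.
STALE_INT32 = 0x7C007E00

moved = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]               # lanes actually moved in
src0 = moved + int32_as_float16_lanes(STALE_INT32)   # tail was never initialized
src1 = [0.5] * 8

unmasked = vector_sub(src0, src1)                          # computes all 8 lanes
masked = vector_sub(src0, src1, valid_lanes=len(moved))    # computes 6 valid lanes

print(non_finite(unmasked))  # indices of the poisoned tail lanes
print(non_finite(masked))    # empty when the tail is masked off
```

Filling the tail with a default value before the computation would have the same effect as the mask in this sketch. Which of the two fits the actual MaxPoolGrad operator depends on the kernel code.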