Accuracy Overflow/Underflow During Atomic Add

Analysis Result

If the info.txt file contains the following conclusion, the AI Core error is caused by an accuracy overflow/underflow during atomic accumulation.

Analysis result: success.
"**********************Root cause conclusion******************"
"dha status 1" found in log. It means Atomic accumulation exception, please check the input data and network accuracy.
Attention please,  if multiple tasks are running on the same device at the same time, false positives may be generated. You are advised to pull up only one task and collect it .
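The check for this conclusion can be scripted. The following is a minimal sketch, assuming info.txt is a plain-text file and that the quoted marker string appears exactly as in the sample above. The function names and the file path passed to them are hypothetical and are not part of any Ascend tool.

```python
from pathlib import Path

# Marker copied from the sample conclusion above.
DHA_MARKER = '"dha status 1" found in log'


def has_atomic_accumulation_conclusion(info_text: str) -> bool:
    """Return True if the analysis text reports the atomic accumulation exception."""
    return DHA_MARKER in info_text


def check_info_file(path: str) -> bool:
    """Read an info.txt file (hypothetical path) and look for the marker."""
    text = Path(path).read_text(encoding="utf-8", errors="replace")
    return has_atomic_accumulation_conclusion(text)
```

As the sample output warns, a positive match may be a false positive when several tasks share the same device, so the result is only a first indication.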

Next, in the device slog files (report/*/slog/dev-os-id/[run|debug]/device-os/device-os_*.log), check whether the keyword Vm fault failed exists. If the keyword does not exist, the AI Core error is caused by atomic accuracy overflow/underflow. If it exists, the error is caused by memory overwriting instead. The following is an example of a log entry that contains the keyword:

2024-02-20-17-07-25/slog/dev-os-0/debug/device-os/device-os_20240122045443091.log:461:[EVENT] KERNEL(4128,sklogd):2024-01-22-05:03:02.259.649 [klogd.c:253][2572550.901383] [ascend] [ERROR] [devmm] [devmm_svm_device_fault 438] <kworker/u16:186:9871,9871> Vm fault failed. (hostpid=1885445; devid=0; vfid=0; ret=64; fault_addr=0x1240f1fa0000; start=0x1240f1fa0000)
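The keyword search described above can be sketched as a small script. It assumes the directory passed in contains the slog/dev-os-*/[run|debug]/device-os/ layout seen in the sample path. The function names and the returned labels are hypothetical, chosen only for illustration.

```python
from pathlib import Path

KEYWORD = "Vm fault failed"


def find_vm_faults(report_dir):
    """Return (file, line number, text) for every log line containing the keyword."""
    hits = []
    root = Path(report_dir)
    for sub in ("run", "debug"):
        pattern = f"slog/dev-os-*/{sub}/device-os/device-os_*.log"
        for log in sorted(root.glob(pattern)):
            with log.open(encoding="utf-8", errors="replace") as fh:
                for lineno, line in enumerate(fh, start=1):
                    if KEYWORD in line:
                        hits.append((str(log), lineno, line.rstrip("\n")))
    return hits


def classify(report_dir):
    """Apply the rule from the text: keyword present means memory overwriting."""
    if find_vm_faults(report_dir):
        return "memory overwrite"
    return "atomic accumulation overflow/underflow"
```

The classification is meaningful only when info.txt already reports the "dha status 1" conclusion. An empty result for a directory with no matching log files should not be read as a diagnosis.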

Fault Root Causes

During operator computation, extreme values in the data are fed to an atomic accumulation instruction. If the accumulation overflows, the 0x800000 error is reported.
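To illustrate how accumulation can overflow even when each operand is in range, the following sketch emulates half-precision (fp16) addition on the host. The choice of fp16 and the saturate-to-infinity behavior are assumptions made for illustration. The source does not state the data type involved or how the hardware represents the overflowed result.

```python
import struct


def to_fp16(x: float) -> float:
    """Round x to the nearest IEEE 754 half-precision value.

    Finite values outside the fp16 range become +/-inf (illustrative choice).
    """
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")


def fp16_accumulate(values):
    """Add values one by one, rounding the running total to fp16 after each step."""
    total = 0.0
    for v in values:
        total = to_fp16(total + to_fp16(v))
    return total
```

For example, 40000.0 is representable in fp16, but fp16_accumulate([40000.0, 40000.0]) exceeds the largest finite fp16 value (65504) and yields infinity.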

Solution

This problem is generally caused by incorrect input data. To locate it, follow the precision optimization procedure for your scenario.

In the inference scenario, optimize the precision by referring to section Precision/Performance Optimization in the CANN AscendCL Application Software Development Guide (C&C++).

In the training scenario, for the TensorFlow framework, optimize the precision by referring to section Precision Optimization in TensorFlow 2.6.5 Model Porting Guide.

In the training scenario, for the PyTorch framework, optimize the precision by referring to section Precision Commissioning in the .
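Before starting a full precision analysis, it can help to screen the input data for values that are likely to overflow. The following is a minimal sketch, assuming the input has been dumped to a flat sequence of Python floats and that the fp16 maximum (65504) is a relevant limit. Both are assumptions, and summarize_input is a hypothetical helper rather than part of CANN.

```python
import math

FP16_MAX = 65504.0  # largest finite half-precision value; assumes fp16 data


def summarize_input(values, limit=FP16_MAX):
    """Count NaN, infinite, and out-of-range entries in a flat sequence of floats."""
    nan = sum(1 for v in values if math.isnan(v))
    inf = sum(1 for v in values if math.isinf(v))
    too_large = sum(1 for v in values if math.isfinite(v) and abs(v) > limit)
    return {"nan": nan, "inf": inf, "too_large": too_large}
```

A nonzero count in any category suggests that the input data should be examined before the operator or network precision is investigated.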