The Operator Input Arg Error
Symptom
On the
[ERROR] RUNTIME(85483,python):2024-04-19-18:13:17.186.318 [task_info.cc:1678]85658 PrintErrorInfoForDavinciTask:Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=57783, flip_num=3, fault kernel_name=RealDiv_ee98c6628030785f610b924ab1557b31_high_precision_210000000, fault kernel info ext=none, program id=9, hash=10612039229658031084.
[ERROR] RUNTIME(85483,python):2024-04-19-18:13:17.186.336 [task_info.cc:1617]85658 GetArgsInfo:[AIC_INFO] args(0 to 9) after execute:0x4f453840, 0x124201ea7400, 0x12420240cc00, 0x1241c006dc28, 0x124100011000, 0x1, 0x1, 0x1, 0,
Fault Root Causes
In the plog error output, the args(xxxx) after execute entry is the key clue. Check whether each parameter address listed there is plausible. If 0 or an otherwise unusual value appears where an address is expected, the problem is caused by incorrect address allocation. For example, on other products the address starts with 0x1240; an address that does not start with 0x1240 may indicate an exception and needs to be checked.
In the preceding error log, 0x4f453840, 0x124201ea7400, 0x12420240cc00 and 0x1241c006dc28 are displayed in args after execute on the
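The address check described above can be partly automated. The following is a minimal sketch, not an official tool: the helper names parse_args and suspicious_addresses are hypothetical, and the prefix "0x124" is an assumption chosen only so that the 0x1241... and 0x1242... values in the sample log count as device addresses. The real prefix depends on the product. Small values such as 0x1 and 0 are skipped because the script cannot tell them apart from scalar arguments, so a 0 in an address position still has to be checked by hand.

```python
import re

# Matches the "args(0 to 9) after execute:..." part of a plog line.
ARGS_RE = re.compile(r"args\(\d+ to \d+\) after execute:(.*)")


def parse_args(log_line):
    """Return the integer values listed after 'args(...) after execute:'."""
    match = ARGS_RE.search(log_line)
    if match is None:
        return []
    values = []
    for token in match.group(1).split(","):
        token = token.strip()
        if token:  # the log may be truncated and end with a trailing comma
            values.append(int(token, 16))
    return values


def suspicious_addresses(values, device_prefix, min_pointer=0x10000):
    """Return pointer-like values that lack the expected device prefix.

    device_prefix is product specific (assumption: supplied by the caller).
    Values below min_pointer are treated as scalars and skipped (assumption).
    """
    flagged = []
    for value in values:
        if value < min_pointer:
            continue
        if not hex(value).startswith(device_prefix):
            flagged.append(value)
    return flagged


line = ("GetArgsInfo:[AIC_INFO] args(0 to 9) after execute:0x4f453840, "
        "0x124201ea7400, 0x12420240cc00, 0x1241c006dc28, 0x124100011000, "
        "0x1, 0x1, 0x1, 0,")
# "0x124" is an illustrative prefix, not a documented constant.
print([hex(v) for v in suspicious_addresses(parse_args(line), "0x124")])
# prints ['0x4f453840']
```

With these assumptions, only 0x4f453840 is reported, which is the value that does not look like the other device addresses in the log.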
Solution
A review of the training script with the user showed that the division was written as cpu_tensor/npu_tensor. cpu_tensor refers to a host memory address, and npu_tensor refers to a device memory address. As a result, the AI Core fails to read the host-side data. To fix the problem, move the host tensor to the device so that both operands use device memory addresses.
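One way to apply this fix in a script is to move the host operand to the device of the other operand before dividing. The sketch below is an illustration under assumptions, not the user's actual code: the helper names to_same_device and safe_div are hypothetical, and the functions rely only on the .device attribute and .to() method that PyTorch tensors provide.

```python
def to_same_device(tensor, reference):
    """Return tensor placed on the same device as reference."""
    if tensor.device != reference.device:
        # Copies the data to the device that holds reference.
        return tensor.to(reference.device)
    return tensor


def safe_div(numerator, denominator):
    """Divide after making sure both operands live on the same device."""
    numerator = to_same_device(numerator, denominator)
    return numerator / denominator


# Usage sketch (assumes torch and torch_npu are installed and an NPU is present):
#   import torch
#   import torch_npu
#   cpu_tensor = torch.ones(4)
#   npu_tensor = torch.ones(4).to("npu")
#   result = safe_div(cpu_tensor, npu_tensor)   # result is on the NPU
```

In a short script it is usually simpler to call cpu_tensor.to(npu_tensor.device) once, at the point where the tensor is created, rather than wrapping every division.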