The plog log contains the following error:
[ERROR] RUNTIME(85483,python):2024-04-19-18:13:17.186.318 [task_info.cc:1678]85658 PrintErrorInfoForDavinciTask:Aicore kernel execute failed, device_id=0, stream_id=2, report_stream_id=2, task_id=57783, flip_num=3, fault kernel_name=RealDiv_ee98c6628030785f610b924ab1557b31_high_precision_210000000, fault kernel info ext=none, program id=9, hash=10612039229658031084.
[ERROR] RUNTIME(85483,python):2024-04-19-18:13:17.186.336 [task_info.cc:1617]85658 GetArgsInfo:[AIC_INFO] args(0 to 9) after execute:0x4f453840, 0x124201ea7400, 0x12420240cc00, 0x1241c006dc28, 0x124100011000, 0x1, 0x1, 0x1, 0,
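In the plog error output, the "args(xxxx) after execute" entry is the key piece of information. Check whether the argument addresses listed after "args ... after execute" are plausible. If one of them is 0 or some other clearly abnormal value, the failure can be attributed to an address allocation error. The error log above is an example of this case.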
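The inspection of the "args ... after execute" entry can be partly automated. The following is a minimal sketch, not an official tool: the helper names `parse_args_after_execute` and `zero_positions` are hypothetical, and the regular expression only assumes the line layout seen in the log above. Which arguments are addresses and which are scalar parameters depends on the kernel, so the output is a prompt for manual review rather than a verdict.

```python
import re

# Matches the "args(M to N) after execute:" entry of a plog line.
ARGS_RE = re.compile(r"args\((\d+) to (\d+)\) after execute:(.*)")


def parse_args_after_execute(line):
    """Return the integer values printed after 'after execute:' (empty list if absent)."""
    match = ARGS_RE.search(line)
    if match is None:
        return []
    values = []
    for token in match.group(3).split(","):
        token = token.strip()
        if not token:  # tolerate a truncated trailing comma
            continue
        try:
            values.append(int(token, 0))  # base 0 accepts "0x..." and plain "0"
        except ValueError:
            break  # stop at the first token that is not a number
    return values


def zero_positions(values):
    """Return the indexes of arguments whose value is 0."""
    return [i for i, v in enumerate(values) if v == 0]


# The GetArgsInfo line from the log above (it is truncated after the last comma).
sample = (
    "[task_info.cc:1617]85658 GetArgsInfo:[AIC_INFO] args(0 to 9) after execute:"
    "0x4f453840, 0x124201ea7400, 0x12420240cc00, 0x1241c006dc28, "
    "0x124100011000, 0x1, 0x1, 0x1, 0,"
)
values = parse_args_after_execute(sample)
for index, value in enumerate(values):
    print(f"arg {index}: {value:#x}")
print("zero-valued args at:", zero_positions(values))
```

Printing the values side by side makes it easier to spot an entry that does not fit the range of the others.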
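Reviewing the training script together with the user showed that the division was written as cpu_tensor/npu_tensor. cpu_tensor refers to a Host memory address, while npu_tensor refers to a Device memory address, so the AI Core read from an invalid address. The fix is to make both operands refer to Device memory.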
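A guard of the following kind can surface such a mix-up as a readable Python exception before the kernel is launched. This is a sketch under assumptions: the source does not name the framework, so it assumes PyTorch-style tensors that expose a `.device` attribute and a `.to()` method, and `check_same_device` is a hypothetical helper. The stand-in objects exist only so the example runs without an NPU.

```python
from types import SimpleNamespace


def check_same_device(first, *rest):
    """Raise ValueError unless all operands report the same device; return that device."""
    devices = [str(t.device) for t in (first, *rest)]
    if len(set(devices)) > 1:
        raise ValueError(f"operands live on different devices: {devices}")
    return devices[0]


# Stand-ins that only carry a .device attribute (assumption: real tensors do too).
cpu_tensor = SimpleNamespace(device="cpu")
npu_tensor = SimpleNamespace(device="npu:0")

try:
    check_same_device(cpu_tensor, npu_tensor)
except ValueError as err:
    print("caught:", err)

# With real PyTorch tensors the division would then be written as (not executed here):
#   result = cpu_tensor.to(npu_tensor.device) / npu_tensor
```

Moving the Host operand to the device of the other operand, rather than hard-coding a device index, keeps the expression correct when the script runs on a different card.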