SDMA ERROR(EI0012)
Symptom
The CANN log contains the keyword "fftsplus sdma error", as shown below:
[ERROR] RUNTIME(57096,python3.10):2025-05-12-20:55:44.705.025 [task_info.cc:1170]57288 PrintSdmaErrorInfoForFftsPlusTask:fftsplus task execute failed, dev_id=0, stream_id=50, task_id=21, context_id=18, thread_id=0, err_type=4[fftsplus sdma error]
[ERROR] RUNTIME(57096,python3.10):2025-05-12-20:55:44.705.031 [task_info.cc:1270]57288 TaskFailCallBackForFftsPlusTask:fftsplus streamId=50, taskId=21, context_id=18, expandtype=1, rtCode=0x715006c,[fftsplus task exception], psStart=0x0, kernel_name=not found kernel name, binHandle=(nil), binSize=0.
[ERROR] HCCL(57096,python3.10):2025-05-12-20:55:44.706.132 [task_exception_handler.cc:947] [57288][TaskExceptionHandler][DealExceptionOp]FFTS+ run failed, base information is streamID:[32], taskID[21], tag[AllGather_group_name_0], AlgType(level 0-1-2):[fullmesh-ring-H-D].
[ERROR] HCCL(57096,python3.10):2025-05-12-20:55:44.706.140 [task_exception_handler.cc:810] [57288][TaskExceptionHandler][Callback]FFTS+ run failed, groupRank information is group:[group_name_0], user define information[Unspecified], rankSize[8], rankId[0].
[ERROR] HCCL(57096,python3.10):2025-05-12-20:55:44.706.163 [task_exception_handler.cc:737] [57288][TaskExceptionHandler][Callback]FFTS+ run failed, opData information is timeStamp:[2025-05-12-20:54:51.268.778], deviceId[0], index[4], count[3397632], src[0x12c25487ac00], dst[0x12c255000000], dataType[uint8].
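Root Cause
A page table translation failure occurred while executing an SDMA memory copy task. In other words, the source or destination address of the copy has no memory allocated, the allocated memory is smaller than the copy size, or the allocated memory has already been freed.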
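When checking the addresses involved, it can help to extract the copy parameters from the HCCL "opData information" log line and work out which address ranges the copy touched. The following is a minimal sketch, not an official tool. The function name parse_op_data and the DTYPE_SIZES table are hypothetical, only the "uint8" type name is confirmed by the sample log, and it assumes that count is expressed in elements of dataType.

```python
import re

# Hypothetical element sizes in bytes, keyed by the dataType string in the log.
# Only "uint8" appears in the sample log; the other names are assumptions.
DTYPE_SIZES = {"uint8": 1, "int8": 1, "int32": 4, "int64": 8}

# Matches the tail of the HCCL "opData information" line shown in the symptom.
OP_DATA_RE = re.compile(
    r"count\[(?P<count>\d+)\],\s*"
    r"src\[(?P<src>0x[0-9a-fA-F]+)\],\s*"
    r"dst\[(?P<dst>0x[0-9a-fA-F]+)\],\s*"
    r"dataType\[(?P<dtype>\w+)\]"
)


def parse_op_data(line):
    """Return the source and destination byte ranges of the failed copy.

    Returns None if the line does not match or the data type is unknown.
    Assumes `count` is a number of elements, not a number of bytes.
    """
    m = OP_DATA_RE.search(line)
    if m is None:
        return None
    size = DTYPE_SIZES.get(m["dtype"])
    if size is None:
        return None
    nbytes = int(m["count"]) * size
    src = int(m["src"], 16)
    dst = int(m["dst"], 16)
    # Ranges are half-open: [start, end)
    return {"nbytes": nbytes, "src": (src, src + nbytes), "dst": (dst, dst + nbytes)}


SAMPLE = (
    "opData information is timeStamp:[2025-05-12-20:54:51.268.778], deviceId[0], "
    "index[4], count[3397632], src[0x12c25487ac00], dst[0x12c255000000], dataType[uint8]."
)

if __name__ == "__main__":
    info = parse_op_data(SAMPLE)
    print("bytes copied:", info["nbytes"])
    print("src range: [%#x, %#x)" % info["src"])
    print("dst range: [%#x, %#x)" % info["dst"])
```

The resulting ranges can then be compared against the buffers the application actually allocated for this operator, to see whether either range extends beyond an allocation or refers to memory that has already been freed.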
Common root causes include the following scenarios: