Runtime执行报错,在plog日志中Runtime打印了Memory async copy failed的错误信息。
plog日志在/root/ascend/log/debug/plog路径下,日志格式为plog-pid_yyymmddhhmmss.log。
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.095 [engine.cc:1103]420723 ReportExceptProc:Task exception! device_id=0, stream_id=0, task_id=1, type=13, retCode=0x91, [the model stream execute failed]. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.955 [device_error_proc.cc:648]420723 ProcessSdmaErrorInfo:report error module_type=3, module_name=EE8888 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.075 [task.cc:1582]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.079 [task.cc:1589]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096, src_addr=0x124080015000, dst_addr=0x124080016000. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.082 [task.cc:3276]420723 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed]. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.085 [task.cc:3248]420723 PrintErrorInfo:model execute task failed, device_id=0, model stream_id=0, model task_id=1, flip_num=0, model_id=0, first_task_id=65535 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.091 [callback.cc:91]420723 Notify:notify [_DEFAULT_MODEL_NAME_] task fail start.notify taskid:2 streamid:3 retcode:507011 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.122 [stream.cc:929]420723 GetError:Stream Synchronize failed, stream_id=0, retCode=0x91, [the model stream execute failed]. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.126 [stream.cc:932]420723 GetError:report error module_type=2, module_name=EI9999 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.129 [stream.cc:932]420723 GetError:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.138 [logger.cc:305]420723 StreamSynchronize:Stream synchronize failed, stream = 0x557647caed50 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.154 [api_c.cc:661]420723 rtStreamSynchronize:ErrCode=507011, desc=[the model stream execute failed], InnerCode=0x7150050 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.157 [error_message_manage.cc:49]420723 FuncErrorReason:report error module_type=3, module_name=EE8888 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.161 [error_message_manage.cc:49]420723 FuncErrorReason:rtStreamSynchronize execute failed, reason=[the model stream execute failed] [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.341.491 [engine.cc:1103]420723 ReportExceptProc:Task exception! device_id=0, stream_id=0, task_id=3, type=13, retCode=0x91, [the model stream execute failed]. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.327 [device_error_proc.cc:648]420723 ProcessSdmaErrorInfo:report error module_type=3, module_name=EE8888 [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.370 [task.cc:1582]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=6, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=1, length=4096. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.373 [task.cc:1589]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=6, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=1, length=4096, src_addr=0x12408001c000, dst_addr=0x12408001d000. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.376 [task.cc:3276]420723 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed]. [ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.379 [task.cc:3248]420723 PrintErrorInfo:model execute task failed, device_id=0, model stream_id=0, model task_id=3, flip_num=0, model_id=1, first_task_id=65535
计算溢出、拷贝地址错误、多P训练时进程退出等。
原因确认:
1、使用/usr/local/Ascend/driver/tools/msnpureport命令收集device日志,并在当前目录下生成文件夹yyyy-mm-dd-hh-mm-ss。
2、查阅device的slog日志,在slog日志中搜索对应的last_cqe_status。device的slog日志在yyyy-mm-dd-hh-mm-ss/slog/dev-os-x/device-x路径下,日志格式为device-x_yyymmddhhmmss.log。
<exception_print>HWTS_SLOT_END: slot_id=22 <exception_print>sdma error, stream_id=3 [ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.435 1893 (dieid:0,cpuid:0) hwts_plat.c:3013 int_process_hwts_sdma_error: SDMA ERROR: slot_id=22, coresponding_channel=0, last_cqe_status=0xb [ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.455 1894 (dieid:0,cpuid:0) sdma_driver.c:216 proc_err_sdma_int_proc: after write sdmam_chl_int_reg_t_val=20000 [ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.478 1895 (dieid:0,cpuid:0) hwts_plat.c:1378 process_hwts_cq_msg_by_sq: stream_id=3, error info: SDMA copy error. <exception_print>TIME:2022-08-17-04:49:39.825.493, device_id=0, stream_id=3, task_id=2, exception_code=0xb6060008, magic_type=0xac79F [ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.520 1897 (dieid:0,cpuid:0) task_scheduler_engine.c:627 set_model_exec_fail: RTS_HWTS: model_id=0, stream_id=3, result=0x91, fault task_id=2, sqe_type=9, task_type=5 [ERROR] TSCH(-1,null):2022-08-17-04:49:39.827.741 1898 (dieid:0,cpuid:0) hwts_plat.c:2950 int_process_hwts_sdma_error: hwts sdma_error_exception,bitmap=0x1000 [ERROR] TSCH(-1,null):2022-08-17-04:49:39.827.769 1899 (dieid:0,cpuid:0) hwts_plat.c:2964 int_process_hwts_sdma_error: Get sq id by slot id, addr=0xba7119f8, stream_id=6, value=0x100000006 <exception_print>HWTS_SLOT_END: slot_id=12 <exception_print>sdma error, stream_id=6 [ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.822 2049 (dieid:0,cpuid:0) hwts_plat.c:3013 int_process_hwts_sdma_error: SDMA ERROR: slot_id=12, coresponding_channel=0, last_cqe_status=0xf [ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.842 2050 (dieid:0,cpuid:0) sdma_driver.c:216 proc_err_sdma_int_proc: after write sdmam_chl_int_reg_t_val=20000 [ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.865 2051 (dieid:0,cpuid:0) hwts_plat.c:1378 process_hwts_cq_msg_by_sq: stream_id=6, error info: SDMA copy error. <exception_print>TIME:2022-08-17-04:49:39.828.881, device_id=0, stream_id=6, task_id=2, exception_code=0xb6060008, magic_type=0xac79F [ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.908 2053 (dieid:0,cpuid:0) task_scheduler_engine.c:627 set_model_exec_fail: RTS_HWTS: model_id=1, stream_id=6, result=0x91, fault task_id=2, sqe_type=9, task_type=5
3、计算出真正的错误码值,错误码值为last_cqe_status值右移1位(Atlas 推理系列产品,不需要右移last_cqe_status,直接使用即可)
例如:上述日志产生于Atlas 推理系列产品,last_cqe_status值可以直接使用,两次Memcpy异步拷贝错误的错误码值分别为0xb和0xf。
4、查阅last_cqe_status对应错误码值发现,0xb代表出现上溢,0xf代表出现源和目的的地址不符合要求。
Value |
Description |
---|---|
0000h |
Successful completion: The command completed successfully |
0001h |
表示可能出现如下错误中的一种:
|
0002h |
SQE进行页表翻译时,SMMU返回了Terminate错误。 |
0003h |
Reserved |
0004h |
SDMAA上报的错误,表示使用了非安全属性访问安全DDR空间; |
0005h |
SDMAA上报的错误,表示SDMAM下发的搬运地址在DAW没有映射。 |
0006h |
SDMAA上报的错误,表示出现操作类型错误; |
0007h |
SDMAA上报的错误,表示SDMAA搬运过程中出现DDRC错误 |
0008h |
SDMAA上报的错误,表示SDMAA搬运过程中出现ECC错误 |
0009h |
SDMAA上报的错误,表示SDMAA搬运过程中出现COMPERR |
000Ah |
SDMAA上报的错误,表示SDMAA搬运过程中出现COMPDATAERR |
000Bh |
SDMAA上报的错误,表示reduce操作出现上溢错误 |
000Ch |
SDMAA上报的错误,表示reduce操作出现下溢错误 |
000Dh |
SDMAA上报的错误,表示reduce源数据格式不符合要求; |
000Eh |
SDMAA上报的错误,表示reduce目的数据格式不符合要求; |
000Fh |
SDMAA上报的错误,表示reduce源和目的数据格式不符合要求; |
else |
Reserved |
这三种错误,由调用者分析具体错误情况。
例如、错误码:0x0009、0x000A、0x000B、0x000C、0x000D、0x000E、0x000F。
该问题会导致ACL报错Execute model failed,并打印在plog日志中。
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.187 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[0] [ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.193 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[0] execute failed, result[500002] ... [ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.397 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[1] [ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.399 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[1] execute failed, result[500002]