Memcpy异步拷贝算子执行报错

适用场景

现象描述

Runtime执行报错,在plog日志中Runtime打印了Memory async copy failed的错误信息。

plog日志在/root/ascend/log/debug/plog路径下,日志格式为plog-pid_yyymmddhhmmss.log。

[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.095 [engine.cc:1103]420723 ReportExceptProc:Task exception! device_id=0, stream_id=0, task_id=1, type=13, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.338.955 [device_error_proc.cc:648]420723 ProcessSdmaErrorInfo:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.075 [task.cc:1582]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.079 [task.cc:1589]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096, src_addr=0x124080015000, dst_addr=0x124080016000.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.082 [task.cc:3276]420723 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.085 [task.cc:3248]420723 PrintErrorInfo:model execute task failed, device_id=0, model stream_id=0, model task_id=1, flip_num=0, model_id=0, first_task_id=65535
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.091 [callback.cc:91]420723 Notify:notify [_DEFAULT_MODEL_NAME_] task fail start.notify taskid:2 streamid:3 retcode:507011
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.122 [stream.cc:929]420723 GetError:Stream Synchronize failed, stream_id=0, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.126 [stream.cc:932]420723 GetError:report error module_type=2, module_name=EI9999
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.129 [stream.cc:932]420723 GetError:Memory async copy failed, device_id=0, stream_id=3, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=0, length=4096
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.138 [logger.cc:305]420723 StreamSynchronize:Stream synchronize failed, stream = 0x557647caed50
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.154 [api_c.cc:661]420723 rtStreamSynchronize:ErrCode=507011, desc=[the model stream execute failed], InnerCode=0x7150050
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.157 [error_message_manage.cc:49]420723 FuncErrorReason:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.339.161 [error_message_manage.cc:49]420723 FuncErrorReason:rtStreamSynchronize execute failed, reason=[the model stream execute failed]
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.341.491 [engine.cc:1103]420723 ReportExceptProc:Task exception! device_id=0, stream_id=0, task_id=3, type=13, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.327 [device_error_proc.cc:648]420723 ProcessSdmaErrorInfo:report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.370 [task.cc:1582]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=6, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=1, length=4096.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.373 [task.cc:1589]420723 PrintErrorInfo:Memory async copy failed, device_id=0, stream_id=6, task_id=2, flip_num=0, copy_type=10, memcpy_type=0, copy_data_type=1, length=4096, src_addr=0x12408001c000, dst_addr=0x12408001d000.
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.376 [task.cc:3276]420723 ReportErrorInfo:model execute error, retCode=0x91, [the model stream execute failed].
[ERROR] RUNTIME(420723,msame):2022-08-17-04:49:06.342.379 [task.cc:3248]420723 PrintErrorInfo:model execute task failed, device_id=0, model stream_id=0, model task_id=3, flip_num=0, model_id=1, first_task_id=65535

可能原因

计算溢出、拷贝地址错误、多P训练时进程退出等。

原因确认:

1、使用/usr/local/Ascend/driver/tools/msnpureport命令收集device日志,并在当前目录下生成文件夹yyyy-mm-dd-hh-mm-ss。

2、查阅device的slog日志,在slog日志中搜索对应的last_cqe_status。device的slog日志在yyyy-mm-dd-hh-mm-ss/slog/dev-os-x/device-x路径下,日志格式为device-x_yyymmddhhmmss.log。

<exception_print>HWTS_SLOT_END: slot_id=22
<exception_print>sdma error, stream_id=3
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.435 1893 (dieid:0,cpuid:0) hwts_plat.c:3013 int_process_hwts_sdma_error: SDMA ERROR: slot_id=22, coresponding_channel=0, last_cqe_status=0xb
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.455 1894 (dieid:0,cpuid:0) sdma_driver.c:216 proc_err_sdma_int_proc: after write sdmam_chl_int_reg_t_val=20000
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.478 1895 (dieid:0,cpuid:0) hwts_plat.c:1378 process_hwts_cq_msg_by_sq: stream_id=3, error info: SDMA copy error.
<exception_print>TIME:2022-08-17-04:49:39.825.493, device_id=0, stream_id=3, task_id=2, exception_code=0xb6060008, magic_type=0xac79F
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.825.520 1897 (dieid:0,cpuid:0) task_scheduler_engine.c:627 set_model_exec_fail: RTS_HWTS: model_id=0, stream_id=3, result=0x91, fault task_id=2, sqe_type=9, task_type=5
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.827.741 1898 (dieid:0,cpuid:0) hwts_plat.c:2950 int_process_hwts_sdma_error: hwts sdma_error_exception,bitmap=0x1000
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.827.769 1899 (dieid:0,cpuid:0) hwts_plat.c:2964 int_process_hwts_sdma_error: Get sq id by slot id, addr=0xba7119f8, stream_id=6, value=0x100000006

<exception_print>HWTS_SLOT_END: slot_id=12
<exception_print>sdma error, stream_id=6
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.822 2049 (dieid:0,cpuid:0) hwts_plat.c:3013 int_process_hwts_sdma_error: SDMA ERROR: slot_id=12, coresponding_channel=0, last_cqe_status=0xf
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.842 2050 (dieid:0,cpuid:0) sdma_driver.c:216 proc_err_sdma_int_proc: after write sdmam_chl_int_reg_t_val=20000
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.865 2051 (dieid:0,cpuid:0) hwts_plat.c:1378 process_hwts_cq_msg_by_sq: stream_id=6, error info: SDMA copy error.
<exception_print>TIME:2022-08-17-04:49:39.828.881, device_id=0, stream_id=6, task_id=2, exception_code=0xb6060008, magic_type=0xac79F
[ERROR] TSCH(-1,null):2022-08-17-04:49:39.828.908 2053 (dieid:0,cpuid:0) task_scheduler_engine.c:627 set_model_exec_fail: RTS_HWTS: model_id=1, stream_id=6, result=0x91, fault task_id=2, sqe_type=9, task_type=5

3、计算出真正的错误码值,错误码值为last_cqe_status值右移1位(Atlas 推理系列产品,不需要右移last_cqe_status,直接使用即可)

例如:上述日志产生于Atlas 推理系列产品,last_cqe_status值可以直接使用,两次Memcpy异步拷贝错误的错误码值分别为0xb和0xf。

4、查阅last_cqe_status对应错误码值发现,0xb代表出现上溢,0xf代表出现源和目的的地址不符合要求。

表1 last_cqe_status对应错误码值

Value

Description

0000h

Successful completion: The command completed successfully

0001h

表示可能出现如下错误中的一种:

  1. Submission Descriptor read response error;
  2. SQE中出现非法OPCODE;

0002h

SQE进行页表翻译时,SMMU返回了Terminate错误。

0003h

Reserved

0004h

SDMAA上报的错误,表示使用了非安全属性访问安全DDR空间;

0005h

SDMAA上报的错误,表示SDMAM下发的搬运地址在DAW没有映射。

0006h

SDMAA上报的错误,表示出现操作类型错误;

0007h

SDMAA上报的错误,表示SDMAA搬运过程中出现DDRC错误

0008h

SDMAA上报的错误,表示SDMAA搬运过程中出现ECC错误

0009h

SDMAA上报的错误,表示SDMAA搬运过程中出现COMPERR

000Ah

SDMAA上报的错误,表示SDMAA搬运过程中出现COMPDATAERR

000Bh

SDMAA上报的错误,表示reduce操作出现上溢错误

000Ch

SDMAA上报的错误,表示reduce操作出现下溢错误

000Dh

SDMAA上报的错误,表示reduce源数据格式不符合要求;

000Eh

SDMAA上报的错误,表示reduce目的数据格式不符合要求;

000Fh

SDMAA上报的错误,表示reduce源和目的数据格式不符合要求;

else

Reserved

处理步骤

  1. 根据错误码提示,进行后续处理。

    例如,错误码0x0002代表缺少页表

    • 地址错误,使用了未申请地址或者使用时地址已经释放。
    • 拷贝长度错误,拷贝越界到未申请的地址上。
    • 进程结束,地址被系统释放,导致错误。

    这三种错误,由调用者分析具体错误情况。

    例如、错误码:0x0009、0x000A、0x000B、0x000C、0x000D、0x000E、0x000F。

    • 拷贝数据错误,仅HCCL使用,一般是算子溢出,或者即将溢出出现了拷贝报错。通过开启溢出dump功能进行检测定位,发现溢出问题按照算子溢出进行深入定位。
    • 如果通过算子溢出未定位到故障原因,则使用普通dump功能进行精度比对,进行网络精度分析。
    • 如果仍然存在问题,需要联系技术支持定位排查。 您可以获取日志后单击Link联系技术支持。

可能导致的故障

该问题会导致ACL报错Execute model failed,并打印在plog日志中。

[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.187 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[0]
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.339.193 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[0] execute failed, result[500002]
...
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.397 [model.cpp:599]21674 RuntimeV2ModelExecute: [EXEC][DEFAULT][Exec][Model]Execute model failed, ge result[4294967295], modelId[1]
[ERROR] ASCENDCL(16243,msame):2022-08-17-04:49:06.342.399 [model.cpp:1547]21674 aclmdlExecute: [EXEC][DEFAULT][Exec][Model]modelId[1] execute failed, result[500002]