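Error Reported When Calling the Query Interface for an Asynchronous Copy

Symptom

When an event is used to wait for an asynchronous H2D or D2H copy task, the following error messages may be printed if, after aclrtQueryEventStatus confirms that the task has completed, aclrtFreeHost is called to free the host memory before aclrtDestroyEvent is called: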

[Event] DRV(78295,python):2023-01-10-11:21:48.757.930 [ascend] [curpid: 78295, 78295][drv][common][share_log_read 544][ascend] [ERROR] [devmm] <python:3960,3960> Set free error. (ref_lock=1; ref_free=0; ref_count=3)
[ascend] [ERROR] [devmm] <python:3960,3960> Oper address failed. (va=0x120043200000; ref_flag=0x108; ref_lock=0; ref_free=0; ref_count=3; convert=1; async=0)
[ascend] [ERROR] [devmm] <python:3960,3960> Vaddress can not oper. (cmd=0x42204d04; cmd_id=0x4; ret=-22)
[ERROR] DRV(78295,python):2023-01-10-11:21:48.757.946 [ascend][curpid: 78295, 78295][drv][devmm][devmm_ioctl_free_pages 138]<errno:26, 17> Ioctl device error. (ret=17)
[ERROR] DRV(78295,python):2023-01-10-11:21:48.757.951 [ascend][curpid: 78295, 78295][drv][devmm][devmm_virt_heap_free_pages 294]<errno:26, 17> Devmm_iotcl_free failed. (ptr=0x120043200000; heap_type=4025417729)
[ERROR] DRV(78295,python):2023-01-10-11:21:48.757.956 [ascend][curpid: 78295, 78295][drv][devmm][devmm_free_phymem_heap_oper 323]<errno:26, 17> Heap ops failed. (ret=17; va=0x120043200000)
[ERROR] DRV(78295,python):2023-01-10-11:21:48.757.962 [ascend][curpid: 78295, 78295][drv][devmm][devmm_free_nocache_mem_process 992]<errno:26, 17> Free error. (va=0x120043200000; size=123731968; total=123731968; ret=17)
[ERROR] DRV(78295,python):2023-01-10-11:21:48.757.970 [ascend][curpid: 78295, 78295][drv][devmm][devmm_free_to_normal_heap 704]<errno:26, 17> Virt_heap_free_mem failed. (ret=17; va=0x120043200000)
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.013 [npu_driver.cc:1531]78295 HostMemFree:[FINAL][FINAL]report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.021 [npu_driver.cc:1531]78295 HostMemFree:[FINAL][FINAL] [drv api] halMemFree failed: drvRetCode=17!
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.073 [logger.cc:388]78295 HostFree:[FINAL][FINAL]Free host memory failed.
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.097 [api_c.cc:1077]78295 rtFreeHost:[FINAL][FINAL]ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020022
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.102 [error_message_manage.cc:49]78295 FuncErrorReason:[FINAL][FINAL]report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.110 [error_message_manage.cc:49]78295 FuncErrorReason:[FINAL][FINAL]rtFreeHost execute failed, reason=[driver error:internal error]
[ERROR] ASCENDCL(78295,python):2023-01-10-11:21:48.758.133 [memory.cpp:242]78295 aclrtFreeHost: [FINAL][FINAL]free host memory failed, runtime result = 507899

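Possible Cause

The application issues an event record task after the asynchronous copy task and uses aclrtQueryEventStatus to check whether the event record task has completed. It treats that result as proof that the asynchronous copy has completed, and then calls aclrtFreeHost to free the memory.

In fact, aclrtQueryEventStatus only reports that the device has finished executing the task; that completion has not yet been propagated to the host side. Freeing the memory at this point, without destroying the event first, creates a timing problem that causes the error.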

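Solution

Either of the following options can be used to handle the problem:

Option 1: Use aclrtSynchronizeStream to determine whether the task has completed.

Option 2: When using aclrtQueryEventStatus, call aclrtDestroyEvent first and then call aclrtFreeHost, so that no timing problem occurs.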
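The call order of the two options can be sketched in C++ as follows. This is a minimal illustration under stated assumptions, not an official sample: the helper names FreeAfterStreamSync and FreeAfterEventQuery are hypothetical, the copy and the event record are assumed to have been issued on the stream already, and the stand-in declarations in the #else branch are assumed to mirror the AscendCL signatures and exist only so the sketch compiles where CANN is not installed.

```cpp
#include <string>
#include <vector>

#if __has_include("acl/acl.h")
#include "acl/acl.h"  // real AscendCL declarations when CANN is installed
#else
// Stand-ins (assumption: they mirror the AscendCL signatures). Each stub only
// records its own name so the call order can be inspected without a device.
typedef int aclError;
typedef void *aclrtStream;
typedef void *aclrtEvent;
typedef enum aclrtEventRecordedStatus {
    ACL_EVENT_RECORDED_STATUS_NOT_READY = 0,
    ACL_EVENT_RECORDED_STATUS_COMPLETE = 1
} aclrtEventRecordedStatus;
static const aclError ACL_SUCCESS = 0;
static std::vector<std::string> g_calls;  // hypothetical call log, stub mode only

static inline aclError aclrtSynchronizeStream(aclrtStream) {
    g_calls.push_back("aclrtSynchronizeStream");
    return ACL_SUCCESS;
}
static inline aclError aclrtQueryEventStatus(aclrtEvent, aclrtEventRecordedStatus *status) {
    g_calls.push_back("aclrtQueryEventStatus");
    *status = ACL_EVENT_RECORDED_STATUS_COMPLETE;
    return ACL_SUCCESS;
}
static inline aclError aclrtDestroyEvent(aclrtEvent) {
    g_calls.push_back("aclrtDestroyEvent");
    return ACL_SUCCESS;
}
static inline aclError aclrtFreeHost(void *) {
    g_calls.push_back("aclrtFreeHost");
    return ACL_SUCCESS;
}
#endif

// Option 1 (hypothetical helper): block until all tasks on the stream have
// finished, then free the host buffer used by the asynchronous copy.
aclError FreeAfterStreamSync(aclrtStream stream, void *hostBuf) {
    aclError ret = aclrtSynchronizeStream(stream);
    if (ret != ACL_SUCCESS) {
        return ret;
    }
    return aclrtFreeHost(hostBuf);
}

// Option 2 (hypothetical helper): poll the event, destroy it once it reports
// completion, and only then free the host buffer.
aclError FreeAfterEventQuery(aclrtEvent event, void *hostBuf) {
    aclrtEventRecordedStatus status = ACL_EVENT_RECORDED_STATUS_NOT_READY;
    // Busy-polls for brevity; a real application would usually yield or sleep.
    while (status != ACL_EVENT_RECORDED_STATUS_COMPLETE) {
        aclError ret = aclrtQueryEventStatus(event, &status);
        if (ret != ACL_SUCCESS) {
            return ret;
        }
    }
    aclError ret = aclrtDestroyEvent(event);  // destroy the event first
    if (ret != ACL_SUCCESS) {
        return ret;
    }
    return aclrtFreeHost(hostBuf);            // then free the host memory
}
```

In both helpers aclrtFreeHost is the last call. The sequence that triggers the error in this article is the reverse of Option 2: aclrtFreeHost immediately after the query and before aclrtDestroyEvent.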