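Error Reported When a Query Interface Is Used with Asynchronous Copy
Symptom
When an event is used to wait for an asynchronous H2D or D2H copy task to complete, the following errors may be printed if, after aclrtQueryEventStatus confirms that the task is complete, aclrtFreeHost is called to release the host memory before aclrtDestroyEvent is called: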
[Event] DRV(78295,python):2023-01-10-11:21:48.757.930 [ascend] [curpid: 78295, 78295][drv][common][share_log_read 544][ascend] [ERROR] [devmm] <python:3960,3960> Set free error. (ref_lock=1; ref_free=0; ref_count=3)
[ascend] [ERROR] [devmm] <python:3960,3960> Oper address failed. (va=0x120043200000; ref_flag=0x108; ref_lock=0; ref_free=0; ref_count=3; convert=1; async=0)
[ascend] [ERROR] [devmm] <python:3960,3960> Vaddress can not oper. (cmd=0x42204d04; cmd_id=0x4; ret=-22)
[ERROR] DRV(78295,python):2023-01-10-11:21:48,757,946 [ascend][curpid: 78295, 78295][drv][devmm][devmm_ioctl_free_pages 138]<errno:26, 17> Ioctl device error. (ret=17)
[ERROR] DRV(78295,python):2023-01-10-11:21:48,757,951 [ascend][curpid: 78295, 78295][drv][devmm][devmm_virt_heap_free_pages 294]<errno:26, 17> Devmm_iotcl_free failed. (ptr=0x120043200000; heap_type=4025417729)
[ERROR] DRV(78295,python):2023-01-10-11:21:48,757,956 [ascend][curpid: 78295, 78295][drv][devmm][devmm_free_phymem_heap_oper 323]<errno:26, 17> Heap ops failed. (ret=17; va=0x120043200000)
[ERROR] DRV(78295,python):2023-01-10-11:21:48,757,962 [ascend][curpid: 78295, 78295][drv][devmm][devmm_free_nocache_mem_process 992]<errno:26, 17> Free error. (va=0x120043200000; size=123731968; total=123731968; ret=17)
[ERROR] DRV(78295,python):2023-01-10-11:21:48,757,970 [ascend][curpid: 78295, 78295][drv][devmm][devmm_free_to_normal_heap 704]<errno:26, 17> Virt_heap_free_mem failed. (ret=17; va=0x120043200000)
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.013 [npu_driver.cc:1531]78295 HostMemFree:[FINAL][FINAL]report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.021 [npu_driver.cc:1531]78295 HostMemFree:[FINAL][FINAL] [drv api] halMemFree failed: drvRetCode=17!
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.073 [logger.cc:388]78295 HostFree:[FINAL][FINAL]Free host memory failed.
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.097 [api_c.cc:1077]78295 rtFreeHost:[FINAL][FINAL]ErrCode=507899, desc=[driver error:internal error], InnerCode=0x7020022
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.102 [error_message_manage.cc:49]78295 FuncErrorReason:[FINAL][FINAL]report error module_type=3, module_name=EE8888
[ERROR] RUNTIME(78295,python):2023-01-10-11:21:48.758.110 [error_message_manage.cc:49]78295 FuncErrorReason:[FINAL][FINAL]rtFreeHost execute failed, reason=[driver error:internal error]
[ERROR] ASCENDCL(78295,python):2023-01-10-11:21:48.758.133 [memory.cpp:242]78295 aclrtFreeHost: [FINAL][FINAL]free host memory failed, runtime result = 507899
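Possible Cause
The error occurs in the following usage pattern: an event record task is issued after the asynchronous copy task, aclrtQueryEventStatus is used to check whether the event record task has completed, the result is taken to mean that the asynchronous copy has completed, and aclrtFreeHost is then called to release the memory.
In fact, aclrtQueryEventStatus only reports that the device has finished executing the task. That completion has not yet been passed through to the host side, so releasing the memory at this point, without destroying the event first, creates a timing problem that causes the error.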
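Solution
Use either of the following options to resolve the problem:
Option 1: Use the aclrtSynchronizeStream interface to determine whether the task has finished executing.
Option 2: When using the aclrtQueryEventStatus interface, call aclrtDestroyEvent first and then call aclrtFreeHost, so that no timing problem can occur.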
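The call orders discussed here can be sketched with a small toy model. This is not AscendCL code: ToyRuntime and its methods are hypothetical stand-ins, each annotated with the real interface it loosely imitates, and the single hostSafeToFree flag is an assumption used only to encode the ordering rule stated above. The real driver behaviour is more involved, and the real error may not appear on every run, whereas the toy fails deterministically.

```cpp
#include <cassert>

// Toy stand-in for the runtime, NOT AscendCL. The toy device finishes every
// task instantly; only the host-side ordering rule is modelled.
struct ToyRuntime {
    bool eventExists = false;     // an event was recorded behind the copy
    bool hostSafeToFree = true;   // assumption: stands for host-side bookkeeping

    void memcpyAsync()       { hostSafeToFree = false; }  // ~ aclrtMemcpyAsync (H2D or D2H)
    void recordEvent()       { eventExists = true; }      // ~ aclrtRecordEvent
    // ~ aclrtQueryEventStatus: reports device-side completion only,
    // so it does not touch hostSafeToFree.
    bool queryEventDone() const { return eventExists; }
    void synchronizeStream() { hostSafeToFree = true; }   // ~ aclrtSynchronizeStream
    void destroyEvent() {                                  // ~ aclrtDestroyEvent
        assert(eventExists);
        eventExists = false;
        hostSafeToFree = true;
    }
    // ~ aclrtFreeHost: a false return stands for the error log shown above.
    bool freeHost() const { return hostSafeToFree; }
};

// Order from the symptom: query, free, then destroy. The free comes too early.
bool problematicOrder() {
    ToyRuntime rt;
    rt.memcpyAsync();
    rt.recordEvent();
    if (!rt.queryEventDone()) return false;
    bool freed = rt.freeHost();
    rt.destroyEvent();
    return freed;
}

// Option 1: wait on the stream, then free. No event is involved.
bool optionOne() {
    ToyRuntime rt;
    rt.memcpyAsync();
    rt.synchronizeStream();
    return rt.freeHost();
}

// Option 2: query, destroy the event, and only then free.
bool optionTwo() {
    ToyRuntime rt;
    rt.memcpyAsync();
    rt.recordEvent();
    if (!rt.queryEventDone()) return false;
    rt.destroyEvent();
    return rt.freeHost();
}
```

In a real program, the same order would apply to the actual AscendCL calls: either aclrtSynchronizeStream before aclrtFreeHost, or aclrtDestroyEvent before aclrtFreeHost once aclrtQueryEventStatus reports completion.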