Training Exceptions Caused by Network Memory Allocation Failure

Symptom

Memory allocation fails while the network is running. For example:

[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.222.597 [ascend][curpid: 41852, 50983][drv][devmm][devmm_ioctl_advise 190]<errno:12, 6> Ioctl device error! ptr=0x108800000000, count=15069901824, advise=0x8c, device=7.
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.817.448 [ascend][curpid: 41852, 50983][drv][devmm][devmm_ioctl_alloc_dev 247]<errno:12, 6> advise mem error! ret=6
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.817.465 [ascend][curpid: 41852, 50983][drv][devmm][devmm_virt_heap_alloc_chunk_device 461]<errno:12, 6> devmm_ioctl_alloc error. ptr=0x108800000000.
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.817.475 [ascend][curpid: 41852, 50983][drv][devmm][devmm_virt_set_alloced_mem_struct 101]<errno:12, 6> alloc ptr err, ptr=0x1.
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.908.424 [ascend][curpid: 41852, 50983][drv][devmm][devmm_alloc_from_base_heap 137]<errno:12, 6> alloc phy mem from base heap err=0x1, va:0x108800000000, size:15069901824,15069901824.
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.908.437 [ascend][curpid: 41852, 50983][drv][devmm][devmm_virt_free_check_and_get_pg 365]<errno:12, 6> va(0x108800000000) is not alloced, pg is already in buddy,pfn(544),order(4),flags(1)
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.908.446 [ascend][curpid: 41852, 50983][drv][devmm][devmm_virt_heap_free 625]<errno:12, 6> addr not alloced, addr=0x108800000000,start=0x100000000000,end=0x17ffffffffff
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:04.908.459 [ascend][curpid: 41852, 50983][drv][devmm][devmm_alloc_managed 126]<errno:12, 6> heap_alloc_managed out of memory, pp=0x1, bytesize=15069901824.
[ERROR] RUNTIME(41852,python3.7):2021-11-11-11:19:04.908.481 [npu_driver.cc:691]50983 DevMemAllocHugePageManaged:[LOAD][LOAD][driver interface] halMemAlloc failed: device_id=7, size=15069901824, type=2, env_type=3, drvRetCode=6!
[ERROR] DRV(41852,python3.7):2021-11-11-11:19:07.106.816 [ascend][curpid: 41852, 50983][drv][devmm][devmm_ioctl_advise 190]<errno:12, 6> Ioctl device error! ptr=0x108800000000, count=15069901824, advise=0x88, device=7.
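When triaging logs like these, it can help to pull the failed request size out of the RUNTIME line and convert it to a readable unit. The following is a minimal sketch, not part of any Ascend tooling: the helper name `failed_alloc` is hypothetical, and converting with 2**30 assumes the sizes are plain byte counts.

```python
import re

# Matches the RUNTIME line that reports the failed halMemAlloc call.
_ALLOC_FAIL = re.compile(r"halMemAlloc failed: device_id=(\d+), size=(\d+)")


def failed_alloc(log_line):
    """Return (device_id, size_in_bytes) from a halMemAlloc failure line, or None."""
    match = _ALLOC_FAIL.search(log_line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


line = (
    "[ERROR] RUNTIME(41852,python3.7):2021-11-11-11:19:04.908.481 "
    "[npu_driver.cc:691]50983 DevMemAllocHugePageManaged:[LOAD][LOAD][driver interface] "
    "halMemAlloc failed: device_id=7, size=15069901824, type=2, env_type=3, drvRetCode=6!"
)
device_id, size_bytes = failed_alloc(line)
print(device_id, round(size_bytes / 2**30, 2))  # 7 14.03
```

In this log, a single request of roughly 14 GiB on device 7 could not be satisfied.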

Cause Analysis

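By default, the network runs with dynamic memory allocation, in which each graph requests its own memory dynamically. The system also isolates graph memory from variable memory by default: graph memory defaults to 26G and variable memory defaults to 5G, and their sum cannot exceed 31G.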

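When the network model has a very large number of layers, the combined memory of all graphs in the network can easily exceed 26G, and memory allocation may fail.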

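In this case, static memory allocation is recommended. Multiple graphs then share a single memory block, so the network runs normally as long as the largest graph needs no more than 26G.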
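The difference between the two modes can be illustrated with a simplified model: under dynamic allocation the per-graph demands add up, while under static allocation only the largest graph counts. The sketch below is an illustration only, not how the runtime actually accounts for memory. The function name `fits_graph_memory` and the per-graph sizes are made up, and reading "26G" as 26 GiB is an assumption.

```python
GIB = 2**30
GRAPH_LIMIT = 26 * GIB  # default graph memory limit from the text, read as GiB (assumption)


def fits_graph_memory(graph_sizes, static, limit=GRAPH_LIMIT):
    """Simplified check of the graph memory requirement under each allocation mode."""
    if not graph_sizes:
        return True
    # Dynamic: every graph holds its own memory, so the demands add up.
    # Static: graphs share one block, so only the largest graph matters.
    needed = max(graph_sizes) if static else sum(graph_sizes)
    return needed <= limit


graphs = [14 * GIB, 10 * GIB, 8 * GIB]  # made-up per-graph sizes
print(fits_graph_memory(graphs, static=False))  # False: 32 GiB in total
print(fits_graph_memory(graphs, static=True))   # True: the largest graph is 14 GiB
```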

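If variable memory still exceeds its limit under static memory allocation, increase the variable memory size and reduce the graph (network) memory size accordingly, keeping their sum within 31G.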

Solution

Set the following environment variable to switch to static memory allocation:

export GE_USE_STATIC_MEMORY=1
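If training is started from a Python launcher rather than a shell, the same variable can be placed in the environment of the child process. This is a minimal sketch under that assumption: `env_with_static_memory` is a hypothetical helper, and `train.py` is a placeholder script name.

```python
import os


def env_with_static_memory(base_env=None):
    """Return a copy of the environment with static memory allocation enabled."""
    env = dict(os.environ if base_env is None else base_env)
    env["GE_USE_STATIC_MEMORY"] = "1"
    return env


env = env_with_static_memory({"PATH": "/usr/bin"})
print(env["GE_USE_STATIC_MEMORY"])  # 1

# A launcher could then start training with this environment, for example:
#   subprocess.run(["python3", "train.py"], env=env_with_static_memory())
```

Returning a copy keeps the launcher's own environment unchanged, so only the training process sees the setting.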

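If variable memory still exceeds its limit under static memory allocation, adjust the limits by configuring graph_memory_max_size and variable_memory_max_size, provided that the total memory for weights and feature maps does not exceed 31G.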
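Before changing the two options, it may be worth checking that a candidate split stays within the combined limit. The sketch below only computes and validates the values. `memory_limits` is a hypothetical helper, reading "31G" as 31 GiB is an assumption, and so is expressing the values as byte counts in string form. How the options are actually passed depends on the framework adapter in use, so check its documentation.

```python
GIB = 2**30
TOTAL_LIMIT = 31 * GIB  # combined limit from the text, read as GiB (assumption)


def memory_limits(graph_gib, variable_gib):
    """Return both limits as byte-count strings keyed by option name."""
    graph_bytes = graph_gib * GIB
    variable_bytes = variable_gib * GIB
    if graph_bytes + variable_bytes > TOTAL_LIMIT:
        raise ValueError("graph memory + variable memory exceeds 31G")
    return {
        "graph_memory_max_size": str(graph_bytes),
        "variable_memory_max_size": str(variable_bytes),
    }


# Example split: give variables 10 GiB and shrink graph memory to 21 GiB.
limits = memory_limits(graph_gib=21, variable_gib=10)
for name, value in limits.items():
    print(name, value)
```

A split such as 26 + 6 would raise `ValueError` here, because it adds up to more than 31.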