OOM Error Due to Memory Allocation Failure
Symptom
The memory allocation fails, and the EL9999 return code is displayed in the host log:
[ERROR] DRV(2936187,python3):2022-04-21-14:19:39.429.481 [ascend][curpid: 2936187, 2969960][drv][devmm][devmm_alloc_managed 182]<errno:12, 6> Heap_alloc_managed out of memory. (temp_ptr=0x1; bytesize=8592031776) [ERROR] RUNTIME(2936187,python3):2022-04-21-14:19:39.429.491 [npu_driver.cc:780]2969960 DevMemAllocHugePageManaged:report error module_type=1, module_name=EL9999 [ERROR] RUNTIME(2936187,python3):2022-04-21-14:19:39.429.495 [npu_driver.cc:780]2969960 DevMemAllocHugePageManaged:[driver interface] halMemAlloc failed: device_id=1, size=8592031776, type=0, env_type=3, drvRetCode=6!
Possible Cause
According to the log information, the memory allocation fails. The possible causes are as follows:
- The network is running concurrently, leading to out of memory (OOM).
- The network running requires too large memory, leading to memory allocation failure.
Solution
- Check whether concurrency exists when the network is running.
- Query the memory size required for network running or decrease the value of batchsize to check whether the network is running properly.
Parent topic: Abnormal Resources at Runtime