OOM Error Due to Memory Allocation Failure

Symptom

Memory allocation fails, and the host log reports the EL9999 error code:

[ERROR] DRV(2936187,python3):2022-04-21-14:19:39.429.481 [ascend][curpid: 2936187, 2969960][drv][devmm][devmm_alloc_managed 182]<errno:12, 6> Heap_alloc_managed out of memory. (temp_ptr=0x1; bytesize=8592031776)
[ERROR] RUNTIME(2936187,python3):2022-04-21-14:19:39.429.491 [npu_driver.cc:780]2969960 DevMemAllocHugePageManaged:report error module_type=1, module_name=EL9999
[ERROR] RUNTIME(2936187,python3):2022-04-21-14:19:39.429.495 [npu_driver.cc:780]2969960 DevMemAllocHugePageManaged:[driver interface] halMemAlloc failed: device_id=1, size=8592031776, type=0, env_type=3, drvRetCode=6!

Possible Cause

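The log shows that a device memory allocation failed: a single request of 8592031776 bytes (about 8 GiB) on device 1 was rejected as out of memory (errno 12). The possible causes are as follows:

  1. Multiple network instances or processes are running concurrently on the same device and together exhaust its memory, leading to out of memory (OOM).
  2. The network requires more memory than the device can provide, so the allocation fails.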
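
To judge which cause is more likely, it helps to know how large the failed request was. The following is a minimal sketch, not an official tool, that extracts the device ID and request size from a "halMemAlloc failed" line in the format shown in the Symptom section and converts the size to GiB. The function names parse_alloc_failure and to_gib are illustrative, and the pattern assumes the log line format does not differ between driver versions.

```python
import re

# Matches the "halMemAlloc failed" line format shown in the Symptom section.
# Assumption: other driver versions print the same field names and order.
_ALLOC_FAIL = re.compile(r"halMemAlloc failed: device_id=(\d+), size=(\d+)")


def parse_alloc_failure(line):
    """Return (device_id, size_bytes) from a failure line, or None if no match."""
    match = _ALLOC_FAIL.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def to_gib(size_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return size_bytes / 2**30


sample = (
    "[ERROR] RUNTIME(2936187,python3):2022-04-21-14:19:39.429.495 "
    "[npu_driver.cc:780]2969960 DevMemAllocHugePageManaged:[driver interface] "
    "halMemAlloc failed: device_id=1, size=8592031776, type=0, env_type=3, "
    "drvRetCode=6!"
)

device_id, size_bytes = parse_alloc_failure(sample)
print(f"device {device_id}: failed request of {to_gib(size_bytes):.2f} GiB")
# prints: device 1: failed request of 8.00 GiB
```

A single request of this size points toward cause 2 if it is close to or larger than the total memory of the device, and toward cause 1 if the device would normally have that much memory free.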

Solution

  1. Check whether other networks or processes are running on the same device at the same time. If so, stop them or run the networks one at a time, and then retry.
  2. Query the amount of memory the network requires, or reduce the batch size, and check whether the network then runs properly.
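
The batch size reduction in step 2 can be automated with a simple back-off loop. The sketch below is only an illustration under stated assumptions: run_network is a hypothetical caller-supplied function that runs the network with a given batch size, and it is assumed to raise RuntimeError when device memory allocation fails. How the failure actually surfaces depends on the framework in use, so adjust the exception type accordingly.

```python
def find_working_batch_size(run_network, batch_size, min_batch_size=1):
    """Halve batch_size until run_network(batch_size) succeeds.

    run_network is a hypothetical callable supplied by the caller. It is
    assumed to raise RuntimeError when device memory allocation fails.
    Returns the first batch size that runs, or None if none does.
    """
    if min_batch_size < 1:
        raise ValueError("min_batch_size must be at least 1")
    while batch_size >= min_batch_size:
        try:
            run_network(batch_size)
            return batch_size
        except RuntimeError:
            # Allocation failed: retry with half the batch size.
            batch_size //= 2
    return None


def fake_run(batch_size):
    """Stand-in for a real network run: fails above a batch size of 16."""
    if batch_size > 16:
        raise RuntimeError("simulated allocation failure")


print(find_working_batch_size(fake_run, 64))  # tries 64, 32, then 16
```

If even the minimum batch size fails, the function returns None, which suggests that the memory is being consumed elsewhere (for example, by concurrent processes as described in step 1) rather than by the batch itself.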