Method for Handling Insufficient Memory
Symptom
During tuning in the TensorFlow training scenario, errors similar to the following may be reported:
- Error 1
[ERROR] GE(685297,python3):2022-04-06-07:15:09.383.996 [graph_var_manager.cc:402]687614 AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) [COMP][MEM_ALLOC][Check][Param] Out of memory: current var size[13962468864] exceeds total var size[13958643712]
- Error 2
[ERROR] GE(685297,python3):2022-04-06-07:15:09.383.996 [graph_var_manager.cc:402]687614 AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) [COMP][MEM_ALLOC][Check][Param] Out of memory: current graph size[13962468864] exceeds total graph size[13958643712]
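
To tell the two errors apart programmatically, the "Out of memory" tail of the log line can be matched with a regular expression. The following is a minimal sketch, not part of any Ascend tooling: the helper name `parse_oom` and the pattern are hypothetical and assume the log keeps the wording shown in the samples above.

```python
import re

# Hypothetical pattern; assumes the log wording matches the samples above.
OOM_PATTERN = re.compile(
    r"current (var|graph) size\[(\d+)\] exceeds total (?:var|graph) size\[(\d+)\]"
)


def parse_oom(line):
    """Return (kind, current_bytes, total_bytes), or None if the line does not match."""
    match = OOM_PATTERN.search(line)
    if match is None:
        return None
    return match.group(1), int(match.group(2)), int(match.group(3))


sample = (
    "[COMP][MEM_ALLOC][Check][Param] Out of memory: "
    "current var size[13962468864] exceeds total var size[13958643712]"
)
kind, current, total = parse_oom(sample)
print(kind, current - total)  # var 3825152
```

Here `kind` is `"var"` for Error 1 and `"graph"` for Error 2, and the difference shows how far the request overshoots the configured limit.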
Solution
Modify the session configuration options in sess.run mode, or the options in npu_bridge.estimator.npu.npu_config in Estimator mode, as follows:
- Error 1: Set variable_memory_max_size to the current var size reported in the error message plus 1 GB, and set graph_memory_max_size to 31 GB minus variable_memory_max_size. For details about the configuration options, see Table 1.
- Error 2: Set graph_memory_max_size to the current graph size reported in the error message plus 1 GB, and set variable_memory_max_size to 31 GB minus graph_memory_max_size. For details about the configuration options, see Table 1.
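
The arithmetic in the two rules above can be sketched as follows. This is an illustrative helper only: the function name `resize_for_oom` is hypothetical, and the sketch assumes that "GB" means 1024 x 1024 x 1024 bytes, in line with the value range given in Table 1.

```python
GIB = 1024 ** 3          # assumption: "GB" in this section means 1024^3 bytes
TOTAL_BYTES = 31 * GIB   # required sum of the two options


def resize_for_oom(kind, current_size):
    """Return (graph_memory_max_size, variable_memory_max_size) in bytes.

    kind is "var" for Error 1 and "graph" for Error 2; current_size is the
    "current ... size" value taken from the error message.
    """
    grown = current_size + GIB          # reported size plus 1 GB of headroom
    if grown > TOTAL_BYTES:
        raise ValueError("requested size does not fit into the 31 GB total")
    remainder = TOTAL_BYTES - grown     # the other option gets what is left
    if kind == "var":
        return remainder, grown
    if kind == "graph":
        return grown, remainder
    raise ValueError("kind must be 'var' or 'graph'")


# Error 1 from the Symptom section: current var size[13962468864]
print(resize_for_oom("var", 13962468864))  # (18249785856, 15036210688)
```

For the sample Error 1, this yields variable_memory_max_size = 15036210688 and graph_memory_max_size = 18249785856, which add up to 31 GB.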
| Configuration Option | Description |
|---|---|
| graph_memory_max_size | Network static memory size and maximum dynamic memory size. Varies according to the network size. The value is in bytes and ranges from 0 to 256 x 1024 x 1024 x 1024 (274877906944). The Ascend AI Processor hardware requires that the sum of graph_memory_max_size and variable_memory_max_size be 31 GB. If not set, it uses 26 GB by default. Example: |
| variable_memory_max_size | Variable memory size. Varies according to the network size. The value is in bytes and ranges from 0 to 256 x 1024 x 1024 x 1024 (274877906944). The Ascend AI Processor hardware requires that the sum of graph_memory_max_size and variable_memory_max_size be 31 GB. If not set, it uses 5 GB by default. Example: |
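
Before the values are written into the session or Estimator configuration, they can be checked against the constraints in the table. The sketch below is a plain Python check, not an Ascend API: `check_memory_options` is a hypothetical helper, "GB" is assumed to mean 1024 x 1024 x 1024 bytes, and how the values are passed to the configuration is deliberately left out.

```python
GIB = 1024 ** 3                 # assumption: 1 GB = 1024^3 bytes
MAX_OPTION_BYTES = 256 * GIB    # upper bound of the documented value range
REQUIRED_SUM = 31 * GIB         # documented sum of the two options


def check_memory_options(graph_memory_max_size=26 * GIB,
                         variable_memory_max_size=5 * GIB):
    """Validate both options (in bytes) and return them as a dict.

    The default arguments are the documented defaults (26 GB and 5 GB).
    """
    options = {
        "graph_memory_max_size": graph_memory_max_size,
        "variable_memory_max_size": variable_memory_max_size,
    }
    for name, value in options.items():
        if not 0 <= value <= MAX_OPTION_BYTES:
            raise ValueError(f"{name}={value} is outside [0, {MAX_OPTION_BYTES}]")
    if graph_memory_max_size + variable_memory_max_size != REQUIRED_SUM:
        raise ValueError("the two options must add up to 31 GB")
    return options


print(check_memory_options())                    # documented defaults pass
print(check_memory_options(17 * GIB, 14 * GIB))  # another valid split
```

A split such as 26 GB and 6 GB would raise `ValueError`, because the two values no longer add up to 31 GB.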