Method for Handling Insufficient Memory

Symptom

During tuning in the TensorFlow training scenario, errors similar to the following may be reported:

  • Error 1
    [ERROR] GE(685297,python3):2022-04-06-07:15:09.383.996 [graph_var_manager.cc:402]687614 AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) [COMP][MEM_ALLOC][Check][Param] Out of memory: current var size[13962468864] exceeds total var size[13958643712]
  • Error 2
    [ERROR] GE(685297,python3):2022-04-06-07:15:09.383.996 [graph_var_manager.cc:402]687614 AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) [COMP][MEM_ALLOC][Check][Param] Out of memory: current graph size[13962468864] exceeds total graph size[13958643712]

Solution

Adjust the memory options through the session configuration in sess.run mode, or through NPURunConfig (npu_bridge.estimator.npu.npu_config) in Estimator mode. Which values to change depends on the error reported:

  • Error 1: Set variable_memory_max_size to the current var size reported in the error message plus 1 GB, and set graph_memory_max_size to 31 GB minus the new variable_memory_max_size. For details about the configuration options, see Table 1.
  • Error 2: Set graph_memory_max_size to the current graph size reported in the error message plus 1 GB, and set variable_memory_max_size to 31 GB minus the new graph_memory_max_size. For details about the configuration options, see Table 1.
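
The arithmetic in the two rules above can be sketched in Python. This is only an illustration, not part of the NPU tooling: the helper name suggest_memory_sizes and the regular expression are hypothetical, and 1 GB is assumed to be 1024 x 1024 x 1024 bytes, as in the examples in Table 1.

```python
import re

GIB = 1024 * 1024 * 1024   # assumption: 1 GB = 1024^3 bytes, as in the Table 1 examples
TOTAL_BYTES = 31 * GIB     # required sum of the two options

# Matches the tail of both error messages and captures "var" or "graph"
# together with the current size in bytes.
_OOM_PATTERN = re.compile(
    r"current (var|graph) size\[(\d+)\] exceeds total (?:var|graph) size\[\d+\]"
)


def suggest_memory_sizes(error_line):
    """Return suggested option values (as strings) for one out-of-memory log line."""
    match = _OOM_PATTERN.search(error_line)
    if match is None:
        raise ValueError("not a recognized out-of-memory message")
    kind, current = match.group(1), int(match.group(2))

    enlarged = current + GIB            # current size + 1 GB
    if enlarged > TOTAL_BYTES:
        raise ValueError("current size + 1 GB exceeds the 31 GB total")
    remainder = TOTAL_BYTES - enlarged  # 31 GB minus the enlarged option

    if kind == "var":    # Error 1
        return {"variable_memory_max_size": str(enlarged),
                "graph_memory_max_size": str(remainder)}
    # Error 2
    return {"graph_memory_max_size": str(enlarged),
            "variable_memory_max_size": str(remainder)}


line = ("Out of memory: current var size[13962468864] "
        "exceeds total var size[13958643712]")
print(suggest_memory_sizes(line))
```

For the Error 1 message shown under Symptom, this yields 15036210688 bytes for variable_memory_max_size and 18249785856 bytes for graph_memory_max_size, which together equal 31 GB. The values are returned as strings because both options are passed as strings in the examples in Table 1.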

Table 1 Configuration options

Configuration Option

Description

graph_memory_max_size

Size of the network's static memory and maximum size of its dynamic memory, in bytes. The appropriate value depends on the network size. Value range: [0, 256 x 1024 x 1024 x 1024], that is, [0, 274877906944]. The Ascend AI Processor hardware requires that the sum of graph_memory_max_size and variable_memory_max_size be 31 GB. If this option is not set, the default is 26 GB.

Example:

  • In sess.run mode:
    custom_op.parameter_map["graph_memory_max_size"].s = tf.compat.as_bytes(str(26 * 1024 * 1024 * 1024))
  • In Estimator mode:
    config = NPURunConfig(graph_memory_max_size=str(26 * 1024 * 1024 * 1024))

variable_memory_max_size

Size of the variable memory, in bytes. The appropriate value depends on the network size. Value range: [0, 256 x 1024 x 1024 x 1024], that is, [0, 274877906944]. The Ascend AI Processor hardware requires that the sum of graph_memory_max_size and variable_memory_max_size be 31 GB. If this option is not set, the default is 5 GB.

Example:

  • In sess.run mode:
    custom_op.parameter_map["variable_memory_max_size"].s = tf.compat.as_bytes(str(5 * 1024 * 1024 * 1024))
  • In Estimator mode:
    config = NPURunConfig(variable_memory_max_size=str(5 * 1024 * 1024 * 1024))
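
Because both options must stay within the documented range and together total 31 GB, a quick check before starting training can catch a mismatched pair. The following is a minimal sketch in plain Python. The helper name check_memory_options is hypothetical and not part of npu_bridge, and 1 GB is assumed to be 1024 x 1024 x 1024 bytes, as in the examples above.

```python
GIB = 1024 * 1024 * 1024        # assumption: 1 GB = 1024^3 bytes, as in the examples
MAX_OPTION_BYTES = 256 * GIB    # upper bound of the documented value range
REQUIRED_SUM_BYTES = 31 * GIB   # documented hardware requirement


def check_memory_options(graph_memory_max_size, variable_memory_max_size):
    """Return a list of problems; an empty list means the pair is consistent."""
    problems = []
    # Accept either integers or the string form passed to the options.
    options = {
        "graph_memory_max_size": int(graph_memory_max_size),
        "variable_memory_max_size": int(variable_memory_max_size),
    }
    for name, value in options.items():
        if not 0 <= value <= MAX_OPTION_BYTES:
            problems.append(f"{name}={value} is outside [0, {MAX_OPTION_BYTES}]")
    total = sum(options.values())
    if total != REQUIRED_SUM_BYTES:
        problems.append(f"sum is {total} bytes, expected {REQUIRED_SUM_BYTES}")
    return problems


# The documented defaults (26 GB and 5 GB) satisfy both constraints.
print(check_memory_options(str(26 * GIB), str(5 * GIB)))  # prints []
```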