Training Fails Because Variable Memory Exceeds the Limit

Last updated: 2022/07/26

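Problem Information

Problem source: Official
Product category: Model training
Product subcategory: TensorFlow network
Keywords: batch size, scale, memory limit exceeded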

Symptom

When the batch size or the scale of the network is set too large, training fails with a memory-limit error:

[ERROR] GE(179560,python3.7):2020-10-31-11:06:40.656.258 [graphengine/ge/graph/manager/graph_var_manager.cc:285]182539 AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) Out of memory : current var size[5382237696] exceeds total var size[5368709120]
[ERROR] GE(179560,python3.7):2020-10-31-11:06:40.656.374 [graphengine/ge/graph/manager/graph_var_manager.cc:504]182539 AssignVarMem: ErrorNo: 1343225860(Internal errors) AssignVarMem by offset failed.
[ERROR] GE(179560,python3.7):2020-10-31-11:06:40.656.420 [graphengine/ge/graph/build/memory/var_mem_assign_util.cc:65]182539 AssignStaticMemory2Node: ErrorNo: -1(failed)
[ERROR] GE(179560,python3.7):2020-10-31-11:06:40.669.315 [graphengine/ge/graph/build/memory/memory_assigner.cc:27]182539 AssignMemory: ErrorNo: -1(failed) Memory assigner failed
[ERROR] GE(179560,python3.7):2020-10-31-11:06:40.669.401 [graphengine/ge/graph/build/model_builder.cc:722]182539 BuildModelForGetTask: ErrorNo: -1(failed) Assign Memory Failed!
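
To see how far the request overshoots the limit, the two byte counts can be pulled out of the `AssignVarMem` message. The following is a minimal sketch; the helper name `parse_var_oom` is hypothetical, and the pattern is written only against the message format shown in the log above, which may differ in other versions.

```python
import re

# Pattern for the "Out of memory" message shown in the log above.
# Assumption: other versions may word this message differently.
VAR_OOM = re.compile(
    r"current var size\[(\d+)\] exceeds total var size\[(\d+)\]"
)

def parse_var_oom(log_text):
    """Return (requested_bytes, limit_bytes) from the first match, or None."""
    match = VAR_OOM.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))

line = (
    "AssignVarMem: ErrorNo: 1343225857(Parameter's invalid!) Out of memory : "
    "current var size[5382237696] exceeds total var size[5368709120]"
)
requested, limit = parse_var_oom(line)
overshoot = requested - limit
print(f"requested={requested} limit={limit} overshoot={overshoot} bytes")
```

For the log above, the overshoot is 13528576 bytes, roughly 12.9 MB beyond the limit.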

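Cause Analysis

By default, the framework manages variable memory and graph memory separately: 5 GB is reserved for variables and 26 GB for the graph. In the log above, the requested variable memory (5382237696 bytes) exceeds the 5 GB default (5368709120 bytes). When variable memory exceeds its limit, both sizes can be adjusted manually, but the combined variable and graph memory must not exceed 31 GB.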
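
The budget rule can be checked with a few lines of arithmetic. This is only a sketch: the function name `fits_budget` is hypothetical, and treating 1 GB as 1024**3 bytes is an assumption, supported by the 5368709120-byte limit in the error log.

```python
GIB = 1024 ** 3  # assumption: "GB" here means 1024**3 bytes

# Defaults and total limit as stated in the cause analysis.
DEFAULT_VARIABLE_BYTES = 5 * GIB
DEFAULT_GRAPH_BYTES = 26 * GIB
TOTAL_LIMIT_BYTES = 31 * GIB

def fits_budget(variable_bytes, graph_bytes, total_limit=TOTAL_LIMIT_BYTES):
    """True if variable plus graph memory stays within the combined limit."""
    return variable_bytes + graph_bytes <= total_limit

defaults_ok = fits_budget(DEFAULT_VARIABLE_BYTES, DEFAULT_GRAPH_BYTES)
raise_only_ok = fits_budget(6 * GIB, 26 * GIB)   # raise variables, keep graph
rebalanced_ok = fits_budget(6 * GIB, 25 * GIB)   # raise variables, shrink graph
print(defaults_ok, raise_only_ok, rebalanced_ok)
```

Because the defaults already use the full 31 GB, raising variable memory only works if graph memory is reduced by at least the same amount.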

Solution

Set graph_memory_max_size and variable_memory_max_size explicitly to adjust the variable and graph memory sizes. Both values are given in bytes. For example:
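from npu_bridge.npu_init import *
# Explicit imports so the snippet does not rely on names re-exported by npu_init.
import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

GIB = 1024 ** 3  # bytes per GB

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
# Byte counts are passed as strings; 16 GB + 15 GB stays within the 31 GB total.
custom_op.parameter_map["graph_memory_max_size"].s = tf.compat.as_bytes(str(16 * GIB))
custom_op.parameter_map["variable_memory_max_size"].s = tf.compat.as_bytes(str(15 * GIB))
# Turn off TensorFlow's remapping and memory-optimization passes.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    sess.run(...)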
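
One way to pick the two values is to start from the variable size reported in the error log, round it up, and give the remainder of the budget to the graph. The sketch below is illustrative only: `split_memory` is a hypothetical helper, rounding up to a whole GB is an arbitrary choice, and it assumes 1 GB = 1024**3 bytes and the 31 GB total from the cause analysis. Whether the network still fits in the reduced graph memory has to be verified by running it.

```python
GIB = 1024 ** 3  # assumption: "GB" means 1024**3 bytes
TOTAL_LIMIT_BYTES = 31 * GIB  # combined limit stated in the cause analysis

def split_memory(required_variable_bytes, total_limit=TOTAL_LIMIT_BYTES):
    """Hypothetical helper: round variable memory up to a whole GB and
    assign the rest of the budget to graph memory."""
    variable_gib = (required_variable_bytes + GIB - 1) // GIB  # integer ceiling
    variable_bytes = variable_gib * GIB
    graph_bytes = total_limit - variable_bytes
    if graph_bytes <= 0:
        raise ValueError("variable memory leaves no room for graph memory")
    # Values are returned as strings, the form used with parameter_map[...].s
    return {
        "variable_memory_max_size": str(variable_bytes),
        "graph_memory_max_size": str(graph_bytes),
    }

params = split_memory(5382237696)  # variable size reported in the error log
for name, value in params.items():
    print(name, value)
```

Each returned string can then be passed through `tf.compat.as_bytes` and assigned to the corresponding `parameter_map` entry, as in the example above.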
