Collective Communication Initialization

If HCCL interfaces such as get_local_rank_id, get_rank_size, or get_rank_id are called before sess.run or estimator.train, you must first start a separate session and run initialize_system to perform collective communication initialization, and after training completes, run shutdown_system and then close that session.

Note that if collective communication interfaces are called again after sess.run or estimator.train, collective communication must be initialized again, because the system automatically closes the collective communication initialization session once sess.run or estimator.train has run.

import tensorflow as tf
from npu_bridge.npu_init import *

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

# If specific features need to be enabled during training, they must be passed in here; see the initialize_system API description for details.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

init_sess = tf.Session(config=config)
init_sess.run(npu_init)

# Call HCCL interfaces...
# Run training... If training is run in a separate session, pass in the same run configuration as above.
# If HCCL interfaces are called again after training, collective communication must be initialized again beforehand.

init_sess.run(npu_shutdown)
init_sess.close()
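
As a reference for the "Call HCCL interfaces" placeholder above, the following minimal sketch queries rank information once initialization has completed. It assumes the HCCL Python APIs are imported from hccl.manage.api; adjust the import to match your installation.

from hccl.manage.api import get_rank_size, get_rank_id, get_local_rank_id  # assumed import path

# Query collective communication information; valid only after initialize_system has run.
rank_size = get_rank_size()          # total number of devices participating in the collective
rank_id = get_rank_id()              # global rank of the current device
local_rank_id = get_local_rank_id()  # rank of the current device within its server
print("rank_size=%d, rank_id=%d, local_rank_id=%d" % (rank_size, rank_id, local_rank_id))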

Alternatively, the initialization and shutdown can be performed in a single with block so that the session is closed automatically:

import tensorflow as tf
from npu_bridge.npu_init import *

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

# If specific features need to be enabled during training, they must be passed in here; see the initialize_system API description for details.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Call HCCL interfaces...
    # Run training...
    sess.run(npu_shutdown)
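
As noted at the beginning of this section, the initialization session is closed automatically after sess.run or estimator.train, so calling HCCL interfaces afterwards requires another round of initialization. A minimal sketch of that flow, assuming a hypothetical estimator (my_estimator) and input function (my_input_fn) built elsewhere and reusing the config defined above:

# Run training; afterwards the collective communication initialization session is closed automatically.
my_estimator.train(input_fn=my_input_fn, max_steps=1000)  # hypothetical estimator and input_fn

# Re-initialize collective communication before calling HCCL interfaces again.
with tf.Session(config=config) as reinit_sess:
    reinit_sess.run(npu_ops.initialize_system())
    # Call HCCL interfaces...
    reinit_sess.run(npu_ops.shutdown_system())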