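Collective Communication Initialization
Collective communication must be initialized before any collective communication API is used. Initialization is currently performed inside the initialize_system API. If you call collective communication APIs such as get_local_rank_id, get_rank_size, or get_rank_id before sess.run or estimator.train, first start a separate session and run initialize_system in it to initialize collective communication. After training finishes, run shutdown_system and close that session.
Note: if you call a collective communication API again after sess.run or estimator.train, you must initialize collective communication again, because the system automatically closes the collective communication initialization session once sess.run or estimator.train has finished.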
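import tensorflow as tf
from npu_bridge.npu_init import *
# Imported explicitly so that RewriterConfig does not depend on the star import above
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

# Define the collective communication initialization and shutdown ops
npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

# Any feature that must be enabled for training has to be passed in here;
# see the initialize_system API description for details
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

init_sess = tf.Session(config=config)
# Initialize collective communication
init_sess.run(npu_init)

# Call collective communication APIs...

# Run training... If training runs in a separate session, pass the same run options as above

# If HCCL APIs are called again after training, initialize collective communication again before calling them

init_sess.run(npu_shutdown)
init_sess.close()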
Alternatively:
import tensorflow as tf
from npu_bridge.npu_init import *
# Imported explicitly so that RewriterConfig does not depend on the star import above
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

# Any feature that must be enabled for training has to be passed in here;
# see the initialize_system API description for details
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    # Initialize collective communication
    sess.run(npu_init)
    # Call collective communication APIs...
    # Run training...
    sess.run(npu_shutdown)
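In both patterns above, shutdown_system and the session close are skipped if the training code raises an exception. One possible way to guard against that is a small context manager, sketched below. The helper name collective_comm_session and the RecordingSession stand-in are hypothetical and are not part of npu_bridge. The sketch assumes only that the session object exposes run() and close(), as tf.Session does, and uses the stand-in so that it runs without an NPU.

```python
from contextlib import contextmanager


@contextmanager
def collective_comm_session(session, init_op, shutdown_op):
    """Hypothetical helper (not part of npu_bridge).

    Runs init_op on entry, runs shutdown_op on exit even if the body
    raises, and always closes the session.
    """
    try:
        session.run(init_op)          # collective communication initialization
        try:
            yield session             # caller uses collective APIs / trains here
        finally:
            session.run(shutdown_op)  # runs even when the body raised
    finally:
        session.close()               # runs even when init or shutdown raised


class RecordingSession:
    """Stand-in for tf.Session that only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def run(self, op):
        self.calls.append(("run", op))

    def close(self):
        self.calls.append(("close", None))


# Normal path: init, user code, shutdown, close.
demo = RecordingSession()
with collective_comm_session(demo, "npu_init", "npu_shutdown") as sess:
    sess.calls.append(("user", "collective APIs and training"))

# Failure path: shutdown and close still happen when training raises.
failed = RecordingSession()
try:
    with collective_comm_session(failed, "npu_init", "npu_shutdown"):
        raise RuntimeError("training failed")
except RuntimeError:
    pass
```

With a real setup, the stand-in would presumably be replaced by tf.Session(config=config) and the two strings by the npu_init and npu_shutdown ops defined earlier. That usage has not been verified on NPU hardware here.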