Initializing Collective Communication

Collective communication must be initialized before any collective communication API is used; the initialization itself is encapsulated in the initialize_system API. If you need to call an HCCL API such as get_local_rank_id, get_rank_size, or get_rank_id before sess.run() or estimator.train(), start a separate session and execute initialize_system in it to initialize collective communication. After training is complete, execute shutdown_system and close that session.
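
For instance, the following sketch queries rank information before training in a standalone initialization session. The hccl.manage.api import path is an assumption based on common Ascend TensorFlow adapter installations and may differ across CANN versions:

import tensorflow as tf
from npu_bridge.npu_init import *
# Assumed import path for the HCCL query APIs; verify it against your installation.
from hccl.manage.api import get_rank_size, get_rank_id

# Build the same NPU session config as in the full example below.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

# A separate session whose only job is to initialize collective communication.
init_sess = tf.Session(config=config)
init_sess.run(npu_ops.initialize_system())

# HCCL APIs are now valid before any sess.run() or estimator.train() call.
rank_size = get_rank_size()  # total number of devices in the job
rank_id = get_rank_id()      # global rank of the current device

# ... run training here with sess.run() or estimator.train() ...

# After training completes, shut down collective communication and close the session.
init_sess.run(npu_ops.shutdown_system())
init_sess.close()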

Note that if these HCCL APIs are called again after the sess.run() or estimator.train() call returns, initialization must be performed anew, because the session used for collective communication initialization is closed automatically.
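
A minimal sketch of such re-initialization, reusing the config object and HCCL imports from the sketch above:

with tf.Session(config=config) as sess:
    sess.run(npu_ops.initialize_system())  # initialize again after training
    rank_id = get_rank_id()                # HCCL APIs are valid once more
    sess.run(npu_ops.shutdown_system())    # shut down before the session closes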

Code snippet example:

import tensorflow as tf
from npu_bridge.npu_init import *

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
# Offload training to the Ascend AI Processor.
custom_op.parameter_map["use_off_line"].b = True
# Disable the remapping and memory optimization passes, which must be off for the NPU optimizer.
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

# Example of graph execution logic
a = tf.placeholder(tf.int32, (None, None))
b = tf.placeholder(tf.int32, (None, None))
add = tf.add(a, b)

with tf.Session(config=config) as sess:
    # Initialize collective communication.
    sess.run(npu_init)

    # Call collective communication APIs here as required.

    # Example of training
    result = sess.run(add, feed_dict={a: [[-20, 2], [1, 3]], b: [[1], [-21]]})

    # Shut down collective communication.
    sess.run(npu_shutdown)
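
Note that npu_shutdown only triggers shutdown_system; the session itself is closed automatically when the with block exits, so no explicit sess.close() is needed here. The same initialize/shutdown bracketing applies when training through estimator.train(), in which case initialize_system runs in a standalone session as described above.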