initialize_system

Description

Initializes GE in a separate step so that the GE initialization time is excluded from the training time statistics. Generally, this API is not required for training. However, before using a collective communication API, call this API to initialize collective communication.

Prototype

def initialize_system(name=None)

Options

Option    Input/Output    Description

name      Input           Operator name

Returns

An operator for the user to initialize GE by using sess.run(op).

Restrictions

If the initialize_system API is called and any of the following functions needs to be enabled during training, configure the function in the session that executes initialize_system.

Table 1 Session configuration options in initialize_system

Configuration Option

Description

profiling_mode

Whether to enable profiling.

  • True: enabled. The profiling options are determined by enable_options.
  • False (default): disabled.

profiling_options

Profiling item to trace (or multiple items separated by colons).

  • training_trace: iteration tracing. Collects software profile data of a training job and the AI Software Stack to profile the training job, with focuses on data augmentation, forward and backward propagation, and gradient aggregation and update.
  • task_trace: task tracing. Collects the HWTS and AI Core hardware information of the Ascend AI Processor and the start and end of each task.
  • op_trace: single-operator tracing. To profile a single operator, construct a single-operator network and train it using a training script. Mutually exclusive with training_trace and task_trace.

You can collect multiple items, which must be separated by colons (:), for example, training_trace:task_trace.

NOTE:
  • If training_trace is selected, fp_point and bp_point need to be configured.
  • If task_trace is selected, training_trace is enabled automatically.

fp_point

Required if training_trace is selected.

Start point of forward propagation in iteration traces, used to record the start timestamp of forward propagation.

Set the value to the name of the top operator in forward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain this name.

bp_point

Required if training_trace is selected.

End point of backward propagation in iteration traces, used to record the end timestamp of backward propagation. fp_point and bp_point are used to compute the time taken by forward and backward propagation.

Set the value to the name of the bottom operator in backward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain this name.
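As an illustrative sketch (not normative), the profiling options above might be configured in the session that executes initialize_system, following the NpuOptimizer pattern shown in the Example section. The fp_point and bp_point operator names below are placeholders; obtain the real names from a graph saved with tf.io.write_graph.

```python
import tensorflow as tf
from npu_bridge.npu_init import *

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True

# Enable profiling and collect both iteration and task traces.
custom_op.parameter_map["profiling_mode"].b = True
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes(
    "training_trace:task_trace")

# training_trace requires fp_point and bp_point. The names below are
# placeholders; read the real operator names from your own graph.pbtxt.
custom_op.parameter_map["fp_point"].s = tf.compat.as_bytes("conv2d/Conv2D")
custom_op.parameter_map["bp_point"].s = tf.compat.as_bytes("gradients/AddN")
```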

enable_dump

Whether to enable data dump.

  • True: enabled. The dump file path is read from dump_path. If dump_path is set to None, an exception occurs.
  • False (default): disabled.

dump_path

Dump path. Required when enable_dump or enable_dump_debug is set to True.

Create the specified path in advance in the environment (either in a container or on the host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The path can be an absolute path or a relative path relative to the path where the training script is executed.

  • An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
  • A relative path starts with a directory name, for example, output.

dump_step

Iterations to dump. Defaults to None, indicating that all iterations are dumped.

Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10.
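The dump_step syntax above can be illustrated with a small helper. This function is not part of the Ascend API; it is only a sketch of how the documented string format is interpreted.

```python
def parse_dump_step(spec):
    """Expand a dump_step string such as "0|3-5|10" into a sorted list
    of iteration numbers: items are separated by vertical bars (|), and
    a hyphen (-) denotes an inclusive range."""
    steps = set()
    for item in spec.split("|"):
        if "-" in item:
            start, end = item.split("-")
            steps.update(range(int(start), int(end) + 1))
        else:
            steps.add(int(item))
    return sorted(steps)

print(parse_dump_step("0|3-5|10"))  # → [0, 3, 4, 5, 10]
```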

dump_mode

Dump mode. The values are as follows:

  • input: dumps only operator inputs.
  • output (default): dumps only operator outputs.
  • all: dumps both operator inputs and outputs.
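For illustration, the dump options above might be combined as follows in the same NpuOptimizer configuration (a sketch; the dump path must already exist and be writable by the running user):

```python
import tensorflow as tf
from npu_bridge.npu_init import *

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True

# Dump operator outputs for iterations 0, 3 through 5, and 10.
custom_op.parameter_map["enable_dump"].b = True
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
custom_op.parameter_map["dump_step"].s = tf.compat.as_bytes("0|3-5|10")
custom_op.parameter_map["dump_mode"].s = tf.compat.as_bytes("output")
```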

enable_dump_debug

Whether to enable overflow/underflow detection.

  • True: enabled. The dump file path is read from dump_path. If dump_path is set to None, an exception occurs.
  • False (default): disabled.

dump_debug_mode

Overflow/Underflow detection mode.

  • aicore_overflow: detects AI Core operator overflow, that is, cases where the operator inputs are normal but the outputs are abnormal extreme values (such as float16 values 65500, 38400, and 51200). Once such a fault is detected, analyze the cause of the overflow and modify the operator implementation based on the network requirements and operator logic.
  • atomic_overflow: detects Atomic Add overflow, for checking modules involved in floating-point computing (such as SDMA) in addition to AI Core.
  • all: detects both AI Core operator overflow and Atomic Add overflow.
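Continuing the same NpuOptimizer configuration pattern (custom_op as in the Example section), overflow detection might be enabled as a sketch:

```python
# Detect both AI Core operator overflow and Atomic Add overflow.
# Dump files go to dump_path, which must be set when enable_dump_debug is True.
custom_op.parameter_map["enable_dump_debug"].b = True
custom_op.parameter_map["dump_path"].s = tf.compat.as_bytes("/home/HwHiAiUser/output")
custom_op.parameter_map["dump_debug_mode"].s = tf.compat.as_bytes("all")
```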

precision_mode

A string for the operator precision mode.

  • allow_fp32_to_fp16: For cube operators, float16 is used. For vector operators, the original precision float32 is preserved if the operators support float32; if they do not support float32, the precision is reduced to float16.
  • force_fp16: forces float16 for operators supporting both float16 and float32. This parameter applies only to online inference scenarios.
  • cube_fp16in_fp32out/force_fp32: For operators supporting both float16 and float32, the system selects a proper processing mode based on the operator type. The force_fp32 and cube_fp16in_fp32out configurations deliver the same effect; cube_fp16in_fp32out is new in this version and has clearer semantics for cube operators.
    • For cube operators, the system processes the computing based on the operator implementation.
      1. The preferred input data type is float16 and the output data type is float32.
      2. If the scenario where the input data type is float16 and the output data type is float32 is not supported, set both the input and output data types to float32.
      3. If the scenario where both the input and output data types are float32 is not supported, set both the input and output data types to float16.
      4. If none of the preceding scenarios is supported, an error is reported.
    • For vector operators, float32 is forcibly selected for operators supporting both float16 and float32, even if the original precision is float16. This argument is invalid if your network model contains operators not supporting float32, for example, operators that support only float16. In this case, float16 is preserved. If the operators do not support float32 and are configured to the blocklist for mixed precision (by setting precision_reduce to false), the counterpart AI CPU operators supporting float32 are used.
  • must_keep_origin_dtype: preserves the original precision.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the NPU does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported.
    • If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported.
  • allow_mix_precision_fp16/allow_mix_precision: enables automatic mixed precision, allowing both float16 and float32 to be used in neural network processing.

    The allow_mix_precision and allow_mix_precision_fp16 configurations deliver the same effect; allow_mix_precision_fp16 is new in this version and has clearer semantics. For certain float32 operators on a network, the system automatically reduces their precision to float16 based on the built-in tuning policy. This improves performance and reduces memory footprint with minimal accuracy degradation. Use mixed precision together with loss scaling to compensate for the accuracy loss caused by precision reduction.

For the Atlas Training Series Product, the default value is allow_fp32_to_fp16.
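As a sketch, automatic mixed precision might be selected in the same way (custom_op as in the Example section):

```python
# Automatically cast eligible float32 operators to float16; combine with
# loss scaling to compensate for the reduced precision.
custom_op.parameter_map["precision_mode"].s = tf.compat.as_bytes("allow_mix_precision")
```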

graph_run_mode

Graph run mode.

  • 0: online inference.
  • 1 (default): training.

op_debug_level

Whether to enable operator debug.

  • 0 (default): disables operator debug.
  • 1: enables operator debug and generates a TBE instruction mapping file. In this case, an operator CCE file (.cce), a Python-CCE mapping file (_loc.json), and operator .o and .json files are generated in the kernel_meta folder in the training script execution directory. You can locate the AI Core error by using the line numbers in the CCE code and TBE code of the error operator.
  • 2: enables operator debug and generates a TBE instruction mapping file. In this case, an operator CCE file (.cce), a Python-CCE mapping file (_loc.json), and operator .o and .json files are generated in the kernel_meta folder in the training script execution directory, and build optimization is disabled by setting the CCE compiler options -O0 -g. You can locate the AI Core error by using the line numbers in the CCE code and TBE code of the error operator.
  • 3: disables operator debug. The operator .o and .json files are retained in the kernel_meta folder in the training script execution directory.
  • 4: disables operator debug. The operator binary (.o) and operator description file (.json) are retained, and a TBE instruction mapping file (.cce) and a UB fusion description file ({$kernel_name}_compute.json) are generated in the kernel_meta folder under the training script execution directory.
    NOTICE:

    You are advised to set this option to 0 or 3 for training. To locate AI Core errors, set this option to 1 or 2, which might compromise the network performance.

enable_exception_dump

Whether to enable dumping the inputs and outputs of abnormal operators. The dump information is generated in the current script execution directory.

  • 0 (default): disabled.
  • 1: enabled.

op_select_implmode

Operator implementation mode. Some operators built into the Ascend AI Processor can be implemented in either high-precision or high-performance mode.

  • high_precision: high precision implementation mode. In high-precision mode, Newton's Method or Taylor's Formula is used to improve operator precision with fp16 input.
  • high_performance (default): high performance implementation mode. The high-performance implementation mode refers to the optimal performance implementation without affecting the network precision with fp16 input.

optypelist_for_implmode

List of operator types. The operators in the list use the mode specified by op_select_implmode. Currently, only the Pooling operator is supported.

This option is used in a pair with op_select_implmode, for example:

Set op_select_implmode to high_precision.

Set optypelist_for_implmode to Pooling.
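The pairing described above might look as follows in the session configuration (a sketch; custom_op as in the Example section):

```python
# Run Pooling operators in high-precision mode; other operators keep the
# default high_performance implementation.
custom_op.parameter_map["op_select_implmode"].s = tf.compat.as_bytes("high_precision")
custom_op.parameter_map["optypelist_for_implmode"].s = tf.compat.as_bytes("Pooling")
```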

Example

If you use an HCCL API such as get_local_rank_id, get_rank_size, or get_rank_id before sess.run() or estimator.train(), you need to start another session and execute initialize_system to initialize collective communication. After the training is complete, execute shutdown_system and close the session.

import tensorflow as tf
from npu_bridge.npu_init import *

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

init_sess = tf.Session(config=config)
init_sess.run(npu_init)

# Call an HCCL API...
# Perform training...

init_sess.run(npu_shutdown)
init_sess.close()

Or:

import tensorflow as tf
from npu_bridge.npu_init import *

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Call an HCCL API...
    # Perform training...
    sess.run(npu_shutdown)