initialize_system

Applicability

Product

Supported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference products

Atlas training products

Description

Initializes GE before training so that the GE initialization time is excluded from the training time statistics. Training generally does not require this API. However, before using a collective communication API, call this API to initialize collective communication.

Prototype

def initialize_system(name = None)

Parameters

Parameter

Input/Output

Description

name

Input

Operator name

Returns

An operator that initializes GE when it is run with sess.run(op).

Restrictions

If you call initialize_system and also need any of the following functions during training, configure them in the session that runs initialize_system.

Table 1 Session configuration options in initialize_system

Configuration Option

Description

profiling_mode

Whether to enable profiling.

  • True: enabled. The profiling options are determined by profiling_options.
  • False (default): disabled.

profiling_options

Option (or options separated by colons) to be traced in profiling.

  • output: path for storing profiling result files. Both absolute path and relative path (relative to the path where the command is run) are supported. The path cannot contain the following special characters: "\n", "\f", "\r", "\b", "\t", "\v", and "\u007F".
    • An absolute path starts with a slash (/), for example, /home/output.
    • A relative path starts with a directory name, for example, output.
    • It takes precedence over ASCEND_WORK_PATH.
    • This path does not need to be created in advance because it is automatically created during collection.
  • storage_limit: maximum size of files that can be stored in a specified disk directory. If the size of profile data files in the disk is about to use up the maximum storage space specified by this option or the total remaining disk space is about to be used up (remaining space ≤ 20 MB), the earliest files in the disk are aged and deleted.

    The value range is [200, 4294967295], and the unit is MB. The unit must be included when you set this parameter, for example, 200 MB.

    If this parameter is not set, the default value is 90% of the available space of the disk where the directory for storing profile data files is located.

  • training_trace: iteration tracing switch. Collects software profile data of a training job and the AI Software Stack to profile the training job. Focuses on forward and backward propagation, and gradient aggregation and update. This option must be set to on when the forward and backward propagation operator data is collected.
  • task_trace and task_time: switches that control collection of the operator delivery and execution durations. Related duration data must be output to the task_time, op_summary, and op_statistic files. Possible configuration values are as follows:
    • on: switch on. This is the default value, delivering the same effect as l1.
    • off: switch off.
    • l0: collects operator delivery and execution duration data. Compared with l1, l0 does not collect basic operator information, so the performance overhead during collection is smaller, and this enables more accurate collection of statistics on time duration data.
    • l1: collects operator delivery and execution duration data, as well as basic operator information, to provide more comprehensive performance analysis data.

    When Profiling is enabled to collect training data, task_trace and training_trace must be set to on.

  • ge_api: switch that controls collection of the time consumption data of dynamic-shape operators in the host scheduling phase. Possible values are:
    • off: switch off. The default value is off.
    • l0: collects the time consumption data of dynamic-shape operators in the main host scheduling phase to facilitate accurate statistics.
    • l1: collects finer-grained time consumption data of dynamic-shape operators in the host scheduling phase to provide more comprehensive performance analysis data.
  • hccl: communication data collection switch, either on or off (default).
    NOTE:

    This switch will be deprecated in later versions. To control data collection, use task_time.

  • aicpu: whether to collect details about the AI CPU operator, such as the operator execution time and data copy time. The value can be on or off (default).
  • fp_point: start point of the forward propagated operator in iteration traces, to record the start timestamp of forward propagation. Set the value to the name of the top operator in forward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "fp_point":""), and the system will automatically identify the start point of the forward propagated operator.
  • bp_point: end point of the backward propagated operator in iteration traces, to record the end timestamp of backward propagation. bp_point and fp_point are used to compute the time used by forward and backward propagation. Set the value to the name of the bottom operator in backward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "bp_point":""), and the system will automatically identify the end point of the backward propagated operator.
  • aic_metrics: AI Core metric to profile. The options are as follows:
    • ArithmeticUtilization: arithmetic utilization ratio.
    • PipeUtilization (default): ratio of time taken by the compute units to that of MTEs.
    • Memory: ratio of external memory read/write instructions.
    • MemoryL0: ratio of internal memory L0 read/write instructions.
    • MemoryUB: ratio of internal memory UB read/write instructions.
    • ResourceConflictRatio: ratio of pipeline queue instructions.
    • L2Cache: read/write L2 cache hits and re-allocations after cache misses.

      Atlas inference products: This parameter is not supported.

      Atlas training products: This parameter is not supported.

    • MemoryAccess: bandwidth of the operator's memory access on cores.

      Atlas inference products: This parameter is not supported.

      Atlas training products: This parameter is not supported.

    NOTE:
    The registers whose data is to be collected can be customized, for example, "aic_metrics":"Custom:0x49,0x8,0x15,0x1b,0x64,0x10".
    • The Custom field indicates the customization type. It is set to specific register values in the range of [0x1, 0x6E].
    • A maximum of eight registers can be configured, which are separated with commas (,).
    • The register value can be in hexadecimal or decimal format.
  • l2: switch that controls L2 cache and TLB page table cache hit ratio, either on or off (default).
    • Atlas inference products: supports collection of the L2 cache hit ratio.
    • Atlas training products: supports collection of the L2 cache hit ratio.
    • Atlas A2 training products/Atlas A2 inference products: supports collection of L2 cache and TLB page table cache hit ratio. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
    • Atlas A3 training products/Atlas A3 inference products: supports collection of L2 cache and TLB page table cache hit ratio. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
  • msproftx: switch that controls the msproftx user and upper-layer framework program to output profile data, either on or off (default).

    Add the following mstx API or msproftx API to the application script. The mstx API is recommended.

  • runtime_api: runtime API data collection switch, either on or off (default). You can collect runtime API profile data, including the synchronous/asynchronous memory replication latencies between the host and device and between devices.
  • sys_hardware_mem_freq: switch that controls the collection of the on-chip memory, QoS transmission bandwidth, LLC L3 cache bandwidth, accelerator bandwidth, SoC transmission bandwidth, and component memory usage. The collected content varies depending on the product. The actual result prevails. The value range is [1, 100], in Hz.

    Sampling memory data in the environment where glibc (2.34 or earlier) is installed may trigger a known Bug 19329. This problem can be solved by upgrading the glibc version.

    NOTE:

    For the following products, you are advised not to increase the profiling frequency after the profiling task is complete. Otherwise, SoC transmission bandwidth data may be lost.

    Atlas 200I/500 A2 inference products

    Atlas A2 training products/Atlas A2 inference products

    Atlas A3 training products/Atlas A3 inference products

  • llc_profiling: LLC events to profile. Possible values are as follows:
    • read (default): read events, that is, the L3 cache read rate.
    • write: write events, that is, the L3 cache write rate.
  • sys_io_sampling_freq: NIC and RoCE data collection frequency. The value range is [1, 100], in Hz.

    Atlas inference products: This parameter is not supported.

    Atlas A2 training products/Atlas A2 inference products: supports NIC and RoCE collection.

    Atlas A3 training products/Atlas A3 inference products: supports NIC and RoCE collection.

  • sys_interconnection_freq: frequency of collecting collective communication bandwidth data (HCCS), SIO data, PCIe data and inter-chip transmission bandwidth information. The value range is [1, 50], in Hz.
    • Atlas training products: supports HCCS and PCIe data collection.
    • Atlas A2 training products/Atlas A2 inference products: supports HCCS, PCIe data, and inter-chip transmission bandwidth information collection.
    • Atlas A3 training products/Atlas A3 inference products: supports HCCS, PCIe data, inter-chip transmission bandwidth information, and SIO data collection.
  • dvpp_freq: DVPP collection frequency. The value range is [1, 100], in Hz.
  • instr_profiling: AI Core and AI Vector bandwidth and latency collection switch. The value can be on or off (default).
    • Atlas training products: This function is not supported.
    • Atlas A2 training products/Atlas A2 inference products: This switch is not supported. This function is controlled through instr_profiling_freq.
    • Atlas A3 training products/Atlas A3 inference products: This switch is not supported. This function is controlled through instr_profiling_freq.
  • instr_profiling_freq: AI Core and AI Vector bandwidth and latency collection switch. If the collection frequency is configured, the related collection capability is enabled. The value range is [300, 30000]. The unit is Hz.
    • Atlas training products: This function is not supported.
    • Atlas A2 training products/Atlas A2 inference products: supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
    • Atlas A3 training products/Atlas A3 inference products: supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
  • host_sys: switch for collecting host profile data. You can select one or more options and separate them with commas (,), for example, "host_sys": "cpu,mem".
    • cpu: process CPU utilization
    • mem: process memory utilization
  • host_sys_usage: Host-side system and process CPU and memory data collection option, selected from cpu and mem. You can select one or more options and separate them with commas (,).
  • host_sys_usage_freq: Host-side system and process CPU and memory data collection frequency. The value range is [1, 50] and the default value is 50. The unit is Hz.
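The Custom register rules given for aic_metrics (values in [0x1, 0x6E], at most eight registers, hexadecimal or decimal) can be checked before a job is launched. The following is a minimal sketch of such a check. The function name parse_custom_aic_metrics is hypothetical and is not part of the API.

```python
def parse_custom_aic_metrics(value):
    """Parse a value such as 'Custom:0x49,0x8' into a list of register numbers.

    Hypothetical helper: it only mirrors the rules stated in the text.
    """
    prefix = "Custom:"
    if not value.startswith(prefix):
        raise ValueError("expected a value starting with 'Custom:'")

    registers = []
    for token in value[len(prefix):].split(","):
        # int(text, 0) accepts both hexadecimal (0x49) and decimal (73) text.
        registers.append(int(token.strip(), 0))

    if len(registers) > 8:
        raise ValueError("a maximum of eight registers can be configured")
    for register in registers:
        if not 0x1 <= register <= 0x6E:
            raise ValueError(f"register {register:#x} is outside [0x1, 0x6E]")
    return registers
```

An empty register list or a non-numeric token raises ValueError from int(), so malformed strings are rejected as well.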
NOTE:
  • fp_point and bp_point must be configured manually in the dynamic shape scenario. In other scenarios, manual configuration is not required.
  • Online inference supports task_trace and aicpu but does not support training_trace.
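The "key":"value" examples above (such as "fp_point":"" and "host_sys": "cpu,mem") suggest that profiling_options is passed as a JSON-formatted string. Under that assumption, the following sketch assembles the string and rejects the combinations that the text lists as mutually exclusive with instr_profiling_freq. The helper name build_profiling_options is hypothetical. The presence of an option key is treated as a conflict, which is a simplification.

```python
import json

# Options the text lists as mutually exclusive with instr_profiling_freq.
EXCLUSIVE_WITH_INSTR_FREQ = frozenset({
    "training_trace", "task_trace", "hccl", "aicpu", "fp_point",
    "bp_point", "aic_metrics", "l2", "task_time", "runtime_api",
})


def build_profiling_options(output, **options):
    """Return a JSON string for profiling_options (hypothetical helper)."""
    merged = {"output": output, **options}
    if "instr_profiling_freq" in merged:
        clash = sorted(EXCLUSIVE_WITH_INSTR_FREQ.intersection(merged))
        if clash:
            raise ValueError(
                "instr_profiling_freq cannot be combined with: " + ", ".join(clash)
            )
    return json.dumps(merged)


# Typical training setup: both trace switches on, fp/bp points left empty
# so that the system identifies them automatically.
profiling_options = build_profiling_options(
    "/home/output",
    training_trace="on",
    task_trace="on",
    fp_point="",
    bp_point="",
    aic_metrics="PipeUtilization",
)
```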

enable_dump

Whether to enable the data dump function.

  • True: enabled. The dump file path is read from dump_path. If dump_path is set to None, an exception occurs.
  • False (default): disabled.
NOTE:
  • Data dump and overflow/underflow data collection cannot be enabled at the same time. That is, enable_dump and enable_dump_debug cannot be both set to True.
  • If either enable_dump or enable_dump_debug is set to True and enable_exception_dump is set to 1 (indicating that the common ExceptionDump function is enabled): For dynamic-shape networks, only enable_exception_dump takes effect. For static-shape networks, enable_exception_dump and either enable_dump or enable_dump_debug take effect.

dump_path

Dump path. Required when enable_dump or enable_dump_debug is set to True.

Create the specified path in advance in the environment (either in a container or on the host) where training is performed. The running user configured during installation must have the read and write permissions on this path. The path can be an absolute path or a relative path relative to the path where the training script is executed.

  • An absolute path starts with a slash (/), for example, /home/test/output.
  • A relative path starts with a directory name, for example, output.

dump_step

Iterations to dump. Defaults to None, indicating that all iterations are dumped.

Separate multiple iterations using vertical bars (|), for example, 0|5|10. You can also use hyphens (-) to specify the iteration range, for example, 0|3-5|10.
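To make the dump_step syntax concrete, the sketch below expands a value such as 0|3-5|10 into the individual iterations it selects. It assumes that a hyphenated range includes both endpoints. The function name expand_dump_step is hypothetical and only illustrates the format.

```python
def expand_dump_step(dump_step):
    """Expand a dump_step string into a sorted list of iteration numbers.

    Returns None when dump_step is None, which means all iterations are dumped.
    Assumption: a range such as 3-5 includes both 3 and 5.
    """
    if dump_step is None:
        return None

    steps = set()
    for part in dump_step.split("|"):
        if "-" in part:
            start, end = part.split("-")
            steps.update(range(int(start), int(end) + 1))
        else:
            steps.add(int(part))
    return sorted(steps)
```

Under the inclusive-range assumption, expand_dump_step("0|3-5|10") selects iterations 0, 3, 4, 5, and 10.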

dump_mode

Dump mode. The values are as follows:

  • input: dumps only operator inputs.
  • output (default): dumps only operator outputs.
  • all: dumps both operator inputs and outputs.
NOTE:

If this option is set to all, the input data of some operators, such as collective communication operators HcomAllGather and HcomAllReduce, will be modified during operator execution. Therefore, the system dumps the operator input before operator execution and dumps the operator output after execution. In this way, the dumped input and output data of the same operator is flushed to disks separately, and multiple dump files are generated. After parsing the dump files, you can determine whether the data is an input or output based on the file content.

enable_dump_debug

Whether to enable overflow/underflow detection.

  • True: enabled. The dump file path is read from dump_path. If dump_path is set to None, an exception occurs.
  • False (default): disabled.
    NOTE:
    • Data dump and overflow/underflow data collection cannot be enabled at the same time. That is, enable_dump and enable_dump_debug cannot be both set to True.
    • If either enable_dump or enable_dump_debug is set to True and enable_exception_dump is set to 1 (indicating that the common ExceptionDump function is enabled): For dynamic-shape networks, only enable_exception_dump takes effect. For static-shape networks, enable_exception_dump and either enable_dump or enable_dump_debug take effect.
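The two dump constraints stated so far (enable_dump and enable_dump_debug are mutually exclusive, and dump_path is required when either is enabled) can be verified in the training script before the session is created. This is a minimal sketch. check_dump_config is a hypothetical helper, not an API of the framework.

```python
def check_dump_config(enable_dump=False, enable_dump_debug=False, dump_path=None):
    """Raise ValueError if the dump options violate the documented rules."""
    if enable_dump and enable_dump_debug:
        raise ValueError("enable_dump and enable_dump_debug cannot both be True")
    if (enable_dump or enable_dump_debug) and dump_path is None:
        raise ValueError(
            "dump_path is required when enable_dump or enable_dump_debug is True"
        )
    return True
```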

dump_debug_mode

Overflow/Underflow detection mode. The values are as follows:
  • aicore_overflow: detects AI Core operator overflow/underflow, that is, detecting whether abnormal extreme values (such as 65500, 38400, and 51200 in float16) are output with normal inputs. Once such a fault is detected, analyze the cause of the overflow/underflow and modify the operator implementation based on the network requirements and operator logic.
  • atomic_overflow: detects Atomic Add overflow/underflow. Atomic Add overflow/underflow is detected when data is transferred from the UB to OUT after AI Core computation.
  • all: detects overflow/underflow of both AI Core operators and Atomic Add. The default value is all.
    NOTE:

    For Atlas A2 training products/Atlas A2 inference products, only the default value all can be used.

precision_mode

A string for the operator precision mode.

  • allow_fp32_to_fp16
    • For matrix operators:
      • If the operator precision in the original graph is float32, the precision is preferably reduced to float16. If the operator in the AI Core does not support float16, float32 is used. If the operator in the AI Core does not support float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support float32, an error is reported during execution.
      • If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution.
    • For vector operators, the precision of the original graph is retained preferably.
      • If the operator precision in the original graph is float32, the precision of the original graph is preferably used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution.
      • If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the precision is directly reduced to float16. If the operator in the AI Core does not support float16, the AI CPU operator is used for computation. If the AI CPU operator also does not support float16, an error is reported during execution.
  • force_fp16

    Forces float16 for operators supporting float16, bfloat16, and float32. This parameter applies only to online inference scenarios.

  • force_fp32/cube_fp16in_fp32out
    force_fp32 and cube_fp16in_fp32out have the same effect: when an operator in the AI Core supports both the float32 and float16 data types, the system selects a processing mode based on the operator type. cube_fp16in_fp32out was added in the new version and has clearer semantics for cube operators.
    • For cube operators, the system processes the computation based on the operator implementation.
      1. The preferred input data type is float16 and the output data type is float32.
      2. If the float16 input data and float32 output data types are not supported, set both the input and output data types to float32.
      3. If the float32 input and output data types are not supported, set both the input and output data types to float16.
      4. If the float16 input and output data types are not supported, an error is reported.
    • For vector compute operators whose precision in the original graph is float16 or bfloat16, float32 is forcibly selected.

      This option is invalid if the original graph contains operators not supporting float32 in the AI Core, for example, an operator that supports only float16. In this case, float16 is retained. If the operator in the AI Core does not support float32 and it is configured to the blocklist of precision reduction (by setting precision_reduce to false), the counterpart AI CPU operator supporting float32 is used. If the AI CPU operator does not support float32, an error is reported.

  • must_keep_origin_dtype
    Retains the original precision.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only float32 and bfloat16, the system automatically uses high-precision float32.
    • If the precision of an operator in the original graph is float16, and the implementation of the operator in the AI Core does not support float16 but supports only bfloat16, the AI CPU operator of float16 is used. If the AI CPU operator is not supported, an error is reported.
    • If the precision of an operator in the original graph is float32, and the implementation of the operator in the AI Core does not support float32 but supports only float16, the AI CPU operator of float32 is used. If the AI CPU operator is not supported, an error is reported.
  • allow_mix_precision_fp16/allow_mix_precision

    allow_mix_precision has the same effect as that of allow_mix_precision_fp16, indicating that mixed precision of float16, bfloat16, and float32 is used for neural network processing. allow_mix_precision_fp16 is newly added to the new version, which has clearer semantics for easy understanding.

    For float32 and bfloat16 operators in the original model, float16 is automatically used for certain operators based on the built-in tuning policy. This improves system performance and reduces memory usage with minimal precision degradation.

  • allow_mix_precision_bf16

    Mixed precision of bfloat16 and float32 is used for neural network processing. In this mode, bfloat16 is automatically used for certain float32 operators in the original model based on the built-in tuning policy. This improves system performance and reduces memory usage with minimal precision degradation. If the operator in the AI Core does not support bfloat16 and float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support bfloat16 and float32, an error is reported during execution.

    Note: This configuration is supported only by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products.

  • allow_fp32_to_bf16
    • If the operator precision in the original graph is float32, the precision of the original graph is preferably used. If the operator in the AI Core does not support float32, the precision is reduced to bfloat16. If the operator in the AI Core does not support bfloat16, the AI CPU operator is used for computation. If the AI CPU operator also does not support bfloat16, an error is reported during execution.
    • If the operator precision in the original graph is bfloat16, the precision of the original graph is preferably used. If the operator in the AI Core does not support bfloat16, float32 is used. If the operator in the AI Core does not support float32, the AI CPU operator is used for computation. If the AI CPU operator also does not support float32, an error is reported during execution.

    Note: This configuration is supported by Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products.

For the Atlas training products, the default value is allow_fp32_to_fp16.

For the Atlas A2 training products/Atlas A2 inference products, the default value is must_keep_origin_dtype.
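The fallback chains above all follow the same pattern: try candidate data types in a fixed order and report an error when none is supported. As an illustration, the sketch below models the four-step order listed for cube operators under force_fp32/cube_fp16in_fp32out. The function name select_cube_dtypes and the set-of-pairs input are hypothetical. The real selection happens inside the framework and is not exposed this way.

```python
# Candidate (input dtype, output dtype) pairs, in the order the text lists them.
CUBE_FALLBACK_ORDER = [
    ("float16", "float32"),  # 1. preferred: float16 in, float32 out
    ("float32", "float32"),  # 2. float32 for both input and output
    ("float16", "float16"),  # 3. float16 for both input and output
]


def select_cube_dtypes(supported_pairs):
    """Return the first candidate pair that the operator supports.

    supported_pairs is a collection of (input dtype, output dtype) tuples.
    Raises RuntimeError when no candidate is supported (step 4 in the text).
    """
    for pair in CUBE_FALLBACK_ORDER:
        if pair in supported_pairs:
            return pair
    raise RuntimeError("no supported dtype combination for the cube operator")
```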

graph_run_mode

Graph run mode.

  • 0: online inference.
  • 1 (default): training.

op_debug_level

Whether to enable operator debugging.

  • 0: disables operator debug.
  • 1: Enables operator debug. TBE instruction mapping files are generated in the kernel_meta directory under the training script execution path, including operator CCE files (.cce), Python-CCE mapping files (_loc.json), .o files, and .json files. These files are used for AI Core error analysis with related tools.
  • 2: Enables operator debug. TBE instruction mapping files are generated in the kernel_meta directory under the training script execution path, including operator CCE files (.cce), Python-CCE mapping files (_loc.json), .o files, and .json files. The compilation optimization of the CCE compiler is disabled and the CCE compiler debugging function is enabled (by setting the compiler options to -O0 -g). These files are used for AI Core error analysis with related tools.
  • 3: disables operator debug. The operator .o and .json files are retained in the kernel_meta folder in the training script execution directory.
  • 4: disables operator debug. The operator binary (.o) and operator description file (.json) are retained, and a TBE instruction mapping file (.cce) and a UB fusion description file ({$kernel_name}_compute.json) are generated in the kernel_meta folder under the training script execution directory.
    NOTICE:
    • If this option is set to 0 and op_debug_config is configured, the operator compilation directory kernel_meta is still generated in the current execution path during training. The content generated in the directory is subject to op_debug_config.
    • You are advised to set this option to 0 or 3 for training. To locate AI Core errors, set this parameter to 1 or 2, which might compromise the network performance.
    • If this option is set to 2 (the CCE compiler is enabled), it cannot be used together with the oom option in op_debug_config. Otherwise, an AI Core error is reported. The following is an example of the error message:
      ...there is an aivec error exception, core id is 49, error code = 0x4 ...
    • If this parameter is set to 2 (the CCE compiler is enabled), the size of the operator kernel file (*.o file) increases. In dynamic shape scenarios, all possible scenarios are traversed during operator build, which may cause operator build failures due to large operator kernel files. In this case, 2 is not recommended.

      If the build failure is caused by the large operator kernel file, the following log is displayed:

      message:link error ld.lld: error: InputSection too large for range extension thunk ./kernel_meta_xxxxx.o:(xxxx)
    • If the value of this option is not 0, you can use the debug_dir option to specify the path for storing debugging-related process files.
    • If this option is set to 0 and NPU_COLLECT_PATH is set, the operator compilation directory kernel_meta is generated in the current path after the command is executed. If ASCEND_WORK_PATH is set, kernel_meta is generated in the path specified by the environment variable. For details about the environment variable, see Environment Variables.
    • When the debug function is enabled, if the model contains the following merged compute and communication (MC2) operators, the *.o, *.json, and *.cce files of the operators are not generated in the operator build folder kernel_meta.

      MatMulAllReduce

      MatMulAllReduceAddRmsNorm

      AllGatherMatMul

      MatMulReduceScatter

      AlltoAllAllGatherBatchMatMul

      BatchMatMulReduceScatterAlltoAll

enable_exception_dump

Whether to dump data of exception operators.
  • 0: Disables the exception operator data dump function.
  • 1: Enables the common ExceptionDump function to dump the input and output data, tensor description information (such as shape, dtype, and format), and workspace information of exception operators.

    The dump data is stored in the following directories in descending order of priority: NPU_COLLECT_PATH > ASCEND_WORK_PATH > default directory (extra-info in the script execution directory).

  • 2 (default): Enables the LiteExceptionDump function to dump the input and output data, workspace information, and tiling information of exception operators. The exported data is used to analyze AI Core errors. For details about how to collect and locate AI Core errors, see "Typical Faults > AI Core Error Locating" in Troubleshooting.

    The dump data is stored in the following directories in descending order of priority: ASCEND_WORK_PATH > default directory (extra-info/data-dump/<device_id> in the script execution directory).

NOTE:

If the environment variable NPU_COLLECT_PATH is configured, exception operator data is dumped in accordance with mode 1 (common ExceptionDump) regardless of the value of enable_exception_dump, and the dump data is stored in the directory specified by NPU_COLLECT_PATH.
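The directory priorities for the two dump modes, together with the NPU_COLLECT_PATH override in the NOTE, can be summarized as a small lookup. The sketch below returns only the root location named in the text. It does not model any subdirectories that may be created under the environment-variable paths, and resolve_exception_dump_root is a hypothetical name.

```python
import os


def resolve_exception_dump_root(mode, environ=None, device_id=0):
    """Return the root directory for exception dump data, or None if disabled.

    mode is the enable_exception_dump value (0, 1, or 2). Default directories
    are returned relative to the script execution directory.
    """
    if mode not in (0, 1, 2):
        raise ValueError("enable_exception_dump must be 0, 1, or 2")
    environ = os.environ if environ is None else environ

    collect_path = environ.get("NPU_COLLECT_PATH")
    if collect_path:
        # Per the NOTE: mode 1 behavior applies regardless of the option value.
        return collect_path
    if mode == 0:
        return None

    work_path = environ.get("ASCEND_WORK_PATH")
    if work_path:
        return work_path
    if mode == 1:
        return "extra-info"
    return f"extra-info/data-dump/{device_id}"
```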

op_select_implmode

Operator implementation mode selection. Some operators built in the Ascend AI Processor can be implemented in either high-precision or high-performance mode.

  • high_precision: high precision implementation mode. In high-precision mode, Newton's Method or Taylor's Formula is used to improve operator precision with fp16 input.
  • high_performance (default): high performance implementation mode. The high-performance implementation mode refers to the optimal performance implementation without affecting the network precision with fp16 input.

optypelist_for_implmode

List of operator types (separated by commas) that use the mode specified by the op_select_implmode parameter. Currently, Pooling, SoftmaxV2, LRN, and ROIAlign operators are supported.

Use this parameter in conjunction with op_select_implmode, for example:

Set op_select_implmode to high_precision.

Set optypelist_for_implmode to Pooling.

This parameter is left empty by default, indicating that the configuration is disabled.
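The pairing of op_select_implmode and optypelist_for_implmode can be prepared as a small dictionary before it is written to the session configuration. The following sketch validates the values against the modes and operator types named above. build_implmode_options is a hypothetical helper, and the supported operator list is copied from this section and may differ between versions.

```python
IMPL_MODES = ("high_precision", "high_performance")
SUPPORTED_OP_TYPES = ("Pooling", "SoftmaxV2", "LRN", "ROIAlign")


def build_implmode_options(mode="high_performance", op_types=()):
    """Return a dict holding the two implementation-mode options."""
    if mode not in IMPL_MODES:
        raise ValueError(f"unknown op_select_implmode: {mode}")
    unsupported = [name for name in op_types if name not in SUPPORTED_OP_TYPES]
    if unsupported:
        raise ValueError("unsupported operator types: " + ", ".join(unsupported))

    options = {"op_select_implmode": mode}
    if op_types:
        # Operator types are separated by commas, as described above.
        options["optypelist_for_implmode"] = ",".join(op_types)
    return options
```

For the example in this section, build_implmode_options("high_precision", ["Pooling"]) yields both options, with optypelist_for_implmode set to Pooling.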

Example

If you use an HCCL API such as get_local_rank_id, get_rank_size, or get_rank_id before sess.run() or estimator.train(), you need to start another session and execute initialize_system to initialize collective communication. After the training is complete, execute shutdown_system and close the session.

import tensorflow as tf
from npu_bridge.npu_init import *

# Build the GE initialization and shutdown operators.
npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

# Session configuration for the NPU.
config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

# Start a separate session and initialize collective communication.
init_sess = tf.Session(config=config)
init_sess.run(npu_init)

# Call an HCCL API...
# Perform training...

# After training, shut down the system and close the session.
init_sess.run(npu_shutdown)
init_sess.close()

Or:

import tensorflow as tf
from npu_bridge.npu_init import *

npu_init = npu_ops.initialize_system()
npu_shutdown = npu_ops.shutdown_system()

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

with tf.Session(config=config) as sess:
    sess.run(npu_init)
    # Call an HCCL API...
    # Perform training...
    sess.run(npu_shutdown)
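The examples above set only use_off_line on custom_op.parameter_map. The options in Table 1 would presumably be written to the same map before the session is created. The sketch below shows one way to do that from a plain dictionary. It assumes that Boolean options use the b field, integer options use the i field, and string options use the s field as bytes. The source confirms only the b field for use_off_line, so check the field type of each option against the product documentation. The helper apply_session_options and the stand-in map are hypothetical and let the sketch run without an NPU environment.

```python
from collections import defaultdict
from types import SimpleNamespace


def apply_session_options(parameter_map, options):
    """Copy options into a parameter_map-like mapping of attribute holders."""
    for key, value in options.items():
        if isinstance(value, bool):
            # Check bool first because bool is a subclass of int in Python.
            parameter_map[key].b = value
        elif isinstance(value, int):
            parameter_map[key].i = value
        else:
            parameter_map[key].s = str(value).encode("utf-8")


# Stand-in for custom_op.parameter_map; in a real script, pass the real map.
fake_parameter_map = defaultdict(SimpleNamespace)
apply_session_options(
    fake_parameter_map,
    {
        "enable_dump": True,
        "dump_path": "/home/test/output",
        "dump_step": "0|5|10",
        "dump_mode": "all",
        "op_debug_level": 0,
    },
)
```

In a training script, the call would take custom_op.parameter_map as its first argument and would run before tf.Session(config=config) is created.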