Using TensorFlow APIs for Data Profiling

Overview

For TensorFlow training or online inference, simply use TensorFlow APIs in your training script to activate profiling. The following features are available:

  • Global profiling: Profiles all operations executed by the graphs. The resulting data volume is large.

    You can either modify the training script and configure profiling_mode (as elaborated in this section), or set the environment variable PROFILING_MODE (see Profiling with Environment Variables). If both are used, profiling_mode takes precedence over PROFILING_MODE.

  • Local profiling: Profiles only the specified subgraphs or steps. Use a with statement to instantiate the profiler class, and place the operations to be profiled inside its scope.

This section describes how to enable global profiling. For more information, see TensorFlow 1.15 Model Porting Guide and TensorFlow 2.6.5 Model Porting Guide.
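
The precedence rule for global profiling described above can be sketched in plain Python. This is an illustration only, not code from the NPU bridge: resolve_profiling_mode is a hypothetical helper, and the "true"/"false" string values for PROFILING_MODE are an assumption.

```python
import os


def resolve_profiling_mode(script_value=None, environ=None):
    """Return the effective global-profiling switch.

    script_value: the profiling_mode value set in the training script,
                  or None if the script does not set it.
    environ:      mapping to read PROFILING_MODE from (defaults to os.environ).
    """
    if environ is None:
        environ = os.environ
    if script_value is not None:
        # The script setting wins whenever it is present.
        return bool(script_value)
    # Assumption: the environment variable carries the string "true" or "false".
    return environ.get("PROFILING_MODE", "false").strip().lower() == "true"
```

With this sketch, resolve_profiling_mode(False, {"PROFILING_MODE": "true"}) evaluates to False, because the script setting overrides the environment variable.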

Prerequisites

Before enabling profiling, ensure that the training or online inference script can be executed properly.

Procedure

  1. Configure the following information in the training script. The samples below use a manually ported TensorFlow 1.15 script.
    • In Estimator mode, you can enable task_trace to profile task trace data. The sample is as follows:
      import tensorflow as tf
      from npu_bridge.estimator.npu.npu_config import NPURunConfig
      from npu_bridge.estimator.npu.npu_config import ProfilingConfig
      from npu_bridge.npu_init import *

      # enable_profiling: enables profiling.
      # output: path for storing profile data. Create the directory in the training
      #         environment (container or host) in advance. The running user configured
      #         during installation must have read and write permissions on it.
      #         Both absolute and relative paths are accepted.
      # task_trace: enables task trace collection.
      profiling_options = '{"output":"/home/HwHiAiUser/output","task_trace":"on"}'
      profiling_config = ProfilingConfig(enable_profiling=True, profiling_options=profiling_options)
      session_config = tf.ConfigProto()

      config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)
      

      If the task trace data is not enough to locate the problem, also enable training_trace to profile iteration traces. The sample is as follows:

      import tensorflow as tf
      from npu_bridge.estimator.npu.npu_config import NPURunConfig
      from npu_bridge.estimator.npu.npu_config import ProfilingConfig
      from npu_bridge.npu_init import *

      # enable_profiling: enables profiling.
      # output: path for storing profile data.
      # task_trace: enables task trace collection.
      # training_trace: enables iteration trace collection.
      # fp_point: first forward-propagation operator in the iteration trace. Its start
      #           timestamp marks the start of forward propagation.
      # bp_point: last backward-propagation operator in the iteration trace. Its end
      #           timestamp marks the end of backward propagation.
      # fp_point and bp_point are used to compute the time spent on forward and backward
      # propagation. Leave them empty to use the automatic search algorithm.
      # aicpu, aic_metrics: see Profiling Options.
      profiling_options = '{"output":"/home/HwHiAiUser/output","task_trace":"on","training_trace":"on","aicpu":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}'
      profiling_config = ProfilingConfig(enable_profiling=True, profiling_options=profiling_options)
      session_config = tf.ConfigProto(allow_soft_placement=True)

      config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)
      
    • In sess.run mode, you can enable task_trace to profile task trace data. The sample is as follows:
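      import tensorflow as tf
      from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
      from npu_bridge.npu_init import *

      config = tf.ConfigProto()
      custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
      custom_op.name = "NpuOptimizer"
      custom_op.parameter_map["use_off_line"].b = True
      # profiling_mode: enables profiling.
      custom_op.parameter_map["profiling_mode"].b = True
      custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/output","task_trace":"on"}')
      config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
      config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

      with tf.Session(config=config) as sess:
          # train_op is a placeholder for the operation defined in your training script.
          sess.run(train_op)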
      

      If the task trace data is not enough to locate the problem, also enable training_trace to profile iteration traces. The sample is as follows:

      import tensorflow as tf
      from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
      from npu_bridge.npu_init import *

      config = tf.ConfigProto()
      custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
      custom_op.name = "NpuOptimizer"
      custom_op.parameter_map["use_off_line"].b = True
      # profiling_mode: enables profiling.
      custom_op.parameter_map["profiling_mode"].b = True
      custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/output","task_trace":"on","training_trace":"on","aicpu":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}')
      config.graph_options.rewrite_options.remapping = RewriterConfig.OFF
      config.graph_options.rewrite_options.memory_optimization = RewriterConfig.OFF

      with tf.Session(config=config) as sess:
          # train_op is a placeholder for the operation defined in your training script.
          sess.run(train_op)
      
    • For details about the profiling configuration, see Profiling Options.
    • If profiling is enabled (profiling_mode set to true in sess.run mode, or enable_profiling set to true in Estimator mode) but profiling_options is not configured, the training_trace, task_trace, hccl, aicpu, and aic_metrics (PipeUtilization) data is profiled by default and saved in the current AI task directory.
    • Whether you specify operators for fp_point and bp_point or leave both empty to use the automatic search algorithm, the operators may fail to be matched. In that case, the FP_BP, Grad_refresh Bound, and Data_aug Bound values are empty in the parsed iteration trace data.
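
Both modes above take profiling_options as a JSON string, which is easy to mistype by hand. A minimal sketch of building it from keyword arguments follows. build_profiling_options is a hypothetical helper, not part of npu_bridge. The option names are those used in the samples, and "off" as the counterpart of "on" is an assumption.

```python
import json


def build_profiling_options(output, **switches):
    """Serialize profiling options into a compact JSON string."""
    options = {"output": output}
    for name, value in switches.items():
        # Booleans become the "on"/"off" strings used in the samples ("off" is assumed).
        # Other values, such as aic_metrics or fp_point, are kept as given.
        if isinstance(value, bool):
            value = "on" if value else "off"
        options[name] = value
    return json.dumps(options, separators=(",", ":"))


options = build_profiling_options(
    "/home/HwHiAiUser/output",
    task_trace=True,
    training_trace=True,
    aicpu=True,
    fp_point="",
    bp_point="",
    aic_metrics="PipeUtilization",
)
```

The resulting string can be passed as is to ProfilingConfig in Estimator mode, or wrapped with tf.compat.as_bytes(options) in sess.run mode.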
  2. Re-execute the training script.

    After the training is complete, the PROF_XXX folder is generated in the directory specified by the output parameter to store the raw profile data.

  3. Run the msprof command to parse the profile data. For details, see Offline Parsing.
    msprof --export=on --output=/home/HwHiAiUser/output/PROF_XXX

    After the parsing is complete, you can find the mindstudio_profiler_output directory generated in the PROF_XXX folder.

    Each profiling option that you enable produces specific result files. For details, see Profiling Results.
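
Steps 2 and 3 can be scripted. The sketch below looks up the PROF_* directories under the output path and assembles the msprof command shown in step 3. The helper names are hypothetical, the assumption is that raw data folders always start with PROF_, and export_latest only works where msprof is on the PATH.

```python
import subprocess
from pathlib import Path


def find_prof_dirs(output_dir):
    """Return the PROF_* directories under output_dir, oldest first."""
    candidates = [p for p in Path(output_dir).glob("PROF_*") if p.is_dir()]
    return sorted(candidates, key=lambda p: p.stat().st_mtime)


def build_export_command(prof_dir):
    """Build the msprof export command from step 3 as an argument list."""
    return ["msprof", "--export=on", f"--output={prof_dir}"]


def export_latest(output_dir):
    """Parse the most recent PROF_* directory and return its result directory."""
    prof_dirs = find_prof_dirs(output_dir)
    if not prof_dirs:
        raise FileNotFoundError(f"no PROF_* directory found in {output_dir}")
    latest = prof_dirs[-1]
    subprocess.run(build_export_command(latest), check=True)
    return latest / "mindstudio_profiler_output"
```

Passing the command as a list rather than a shell string avoids quoting problems when the output path contains spaces.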

Profiling Results

Table 1 Profiling result files

Argument                            Result File

Automatically generated by default  msprof (Timeline Report)
                                    op_summary_*.csv
                                    op_statistic_*.csv
                                    fusion_op_*.csv
                                    step_trace (iteration trace data)

task_trace, task_time               The CANN layer in msprof_*.json and the api_statistic_*.csv file
                                    The Ascend Hardware layer in msprof_*.json and the task_time_*.csv file
                                    The Communication layer in msprof_*.json and the communication_statistic_*.csv file
                                    step_trace_*.json

runtime_api                         The CANN_Runtime layer in msprof_*.json and the api_statistic_*.csv file

hccl                                The Communication layer in msprof_*.json and the communication_statistic_*.csv file
                                    api_statistic_*.csv

aicpu                               aicpu_*.csv
                                    dp_*.csv

aic_metrics                         op_summary_*.csv

l2                                  l2_cache_*.csv

msproftx                            msproftx data

sys_hardware_mem_freq               On-chip memory read/write rate file
                                    The LLC layer in msprof_*.json and the llc_read_write_*.csv file
                                    The acc_pmu layer in msprof_*.json
                                    The Stars Soc Info layer in msprof_*.json
                                    The NPU MEM layer in msprof_*.json and the npu_mem_*.csv file
                                    npu_module_mem_*.csv

llc_profiling                       -

sys_io_sampling_freq                The NIC layer in msprof_*.json and the nic_*.csv file
                                    The RoCE layer in msprof_*.json and the roce_*.csv file

sys_interconnection_freq            The PCIe layer in msprof_*.json and the pcie_*.csv file
                                    The HCCS layer in msprof_*.json and the hccs_*.csv file
                                    The Stars Chip Trans layer in msprof_*.json

dvpp_freq                           dvpp_*.csv

instr_profiling_freq                biu_group, aic_core_group, and aiv_core_group levels in msprof_*.json

host_sys                            The CPU Usage layer in msprof_*.json and the host_cpu_usage_*.csv file
                                    The Memory Usage layer in msprof_*.json and the host_mem_usage_*.csv

host_sys_usage                      System CPU usage on the host
                                    CPU usage of processes on the host
                                    System memory usage on the host
                                    Memory usage of processes on the host

host_sys_usage_freq                 -
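
To check which of the result files in Table 1 were actually produced, the exported files can be grouped by name prefix. This is a rough sketch under two assumptions: the files sit directly in the mindstudio_profiler_output directory, and the * in names such as op_summary_*.csv stands for a trailing run of digits and underscores (for example a timestamp). group_result_files is a hypothetical helper.

```python
import re
from collections import defaultdict
from pathlib import Path


def group_result_files(prof_dir):
    """Group exported .csv and .json files by prefix, e.g. 'op_summary'."""
    out_dir = Path(prof_dir) / "mindstudio_profiler_output"
    groups = defaultdict(list)
    for path in sorted(out_dir.iterdir()):
        if path.suffix not in (".csv", ".json"):
            continue
        # Assumption: strip the trailing digits/underscores that replace '*'.
        prefix = re.sub(r"[_\d]+$", "", path.stem)
        groups[prefix].append(path.name)
    return dict(groups)
```

A missing key in the returned dictionary, for example "l2_cache", suggests that the corresponding option was not enabled or that no data was collected for it.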