Getting Started with Performance Analysis in TensorFlow Training Scenarios

In TensorFlow training scenarios, you are advised to enable profiling through the TensorFlow Adapter API. Then upload the result files to a development environment where the CANN Toolkit package and the ops operator package are installed, and parse the data there to analyze performance bottlenecks.

Prerequisites

- You have installed the CANN Toolkit package and the ops operator package. For details, see CANN Software Installation Guide.
- The training/online inference script has been executed successfully on the Ascend AI Processor.

Collecting, Parsing, and Exporting Profile Data

Modify the training script and enable profiling.

The following uses a script in TensorFlow 1.15 session_run mode as an example.

Use the session configuration options profiling_mode and profiling_options to enable data collection of the profiling tool. The sample code is as follows:

import tensorflow as tf
from tensorflow.core.protobuf.rewriter_config_pb2 import RewriterConfig
from npu_bridge.npu_init import *  # TF Adapter import, normally already present in a migrated script.

config = tf.ConfigProto()
custom_op = config.graph_options.rewrite_options.custom_optimizers.add()
custom_op.name = "NpuOptimizer"
custom_op.parameter_map["use_off_line"].b = True
# Enable profile data collection.
custom_op.parameter_map["profiling_mode"].b = True
# Profile data collection items:
# output: output path of the collection result.
# task_trace: enables task trace collection.
# training_trace: enables iteration trace collection. Iteration traces require
#   fp_point (start point of the forward propagated operator in iteration traces) and
#   bp_point (end point of the backward propagated operator in iteration traces).
#   Leave both empty to let the system obtain the values.
#   Configure them manually only when data collection is abnormal.
custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes('{"output":"/home/HwHiAiUser/profiling_output","training_trace":"on","task_trace":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}')
config.graph_options.rewrite_options.remapping = RewriterConfig.OFF  # Disable remapping.
with tf.Session(config=config) as sess:
    sess.run(train_op)  # train_op is a placeholder for the op(s) that your script runs.

The preceding items are the most basic collection items. For other collection requirements, see Using TensorFlow APIs for Data Profiling.
The methods of modifying the Estimator and Keras scripts are slightly different. The methods of modifying the manual migration script and automatic migration script are also slightly different. For details, see Using TensorFlow APIs for Data Profiling.
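The profiling_options value is a JSON string, so it can be easier to build it from a dictionary than to write it by hand. The following is a minimal sketch; build_profiling_options is a hypothetical helper name, not part of the TensorFlow Adapter API, and the keys are only those shown in the sample above.

```python
import json


def build_profiling_options(output_dir, fp_point="", bp_point="",
                            aic_metrics="PipeUtilization"):
    """Return the profiling_options value as a compact JSON string.

    Hypothetical helper: only the keys from the sample script are included.
    """
    options = {
        "output": output_dir,
        "training_trace": "on",
        "task_trace": "on",
        "fp_point": fp_point,   # Empty string: let the system obtain the value.
        "bp_point": bp_point,   # Empty string: let the system obtain the value.
        "aic_metrics": aic_metrics,
    }
    # Compact separators reproduce the form used in the sample script.
    return json.dumps(options, separators=(",", ":"))


options_json = build_profiling_options("/home/HwHiAiUser/profiling_output")
print(options_json)
```

In the training script, the result would then be assigned with custom_op.parameter_map["profiling_options"].s = tf.compat.as_bytes(options_json). Building the string with json.dumps avoids quoting mistakes when paths or operator names are filled in.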

Re-execute the training script.
After the training/online inference is complete, the PROF_XXX folder is generated in the directory specified by output to store the collected raw profile data. The data can be viewed only after being parsed by the msprof parsing tool.
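If the output directory holds results from several runs, a short script can locate the most recent PROF_XXX folder. This is a sketch only: it assumes that every result folder name starts with PROF_ and that the folder modification time reflects the run order. The function names are hypothetical.

```python
from pathlib import Path


def find_prof_dirs(output_dir):
    """Return the PROF_* directories under output_dir, oldest first."""
    root = Path(output_dir)
    if not root.is_dir():
        return []
    dirs = [p for p in root.glob("PROF_*") if p.is_dir()]
    # Assumption: modification time reflects the order of the runs.
    return sorted(dirs, key=lambda p: p.stat().st_mtime)


def latest_prof_dir(output_dir):
    """Return the newest PROF_* directory, or None if there is none."""
    dirs = find_prof_dirs(output_dir)
    return dirs[-1] if dirs else None


print(latest_prof_dir("/home/HwHiAiUser/profiling_output"))
```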

Run the msprof command to parse and export the profile data.

msprof --export=on --output=/home/HwHiAiUser/profiling_output/PROF_XXX

--output indicates the path for storing the profile data files, which is set during profile data collection.
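When the export is part of an automated workflow, the same command can be launched from Python. The sketch below uses only the two options shown above (--export and --output); build_export_cmd and run_export are hypothetical helper names, and the script assumes that msprof is on PATH.

```python
import shutil
import subprocess


def build_export_cmd(prof_dir):
    """Build the msprof export command shown above as an argument list."""
    return ["msprof", "--export=on", f"--output={prof_dir}"]


def run_export(prof_dir):
    """Run the export and raise if msprof is missing or exits with an error."""
    if shutil.which("msprof") is None:
        raise FileNotFoundError("msprof was not found on PATH")
    subprocess.run(build_export_cmd(prof_dir), check=True)


# Only print the command here; call run_export() in an environment with msprof.
print(" ".join(build_export_cmd("/home/HwHiAiUser/profiling_output/PROF_XXX")))
```

Passing the command as a list instead of a shell string avoids quoting problems if the path contains spaces.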

After the command is executed, find the PROF_XXX directory generated in the directory specified by --output. This directory stores the collected and automatically parsed profile data. The directory structure is as follows (only profile data is displayed):

├── host   // Stores the raw data. You can ignore this directory.
│    ...
│    └── data
├── device_{id}   // Stores the raw data. You can ignore this directory.
│    ...
│    └── data
...
├── msprof_*.db
└── mindstudio_profiler_output
      ├── msprof_{timestamp}.json
      ├── step_trace_{timestamp}.json
      ├── xx_*.csv
      ...
      └── README.txt

Access the mindstudio_profiler_output directory to view corresponding profile data files.

For details about the files collected by default, see Table 1.

**Table 1** Profile data files collected by msprof by default
File Name	Description
msprof_*.db	.db file that aggregates all profile data. This file is exported by default only for the Atlas A3 Training Series Product and Atlas A2 Training Series Product/Atlas 800I A2 Inference Product.
msprof_*.json	Timeline report.
step_trace_*.json	Iteration trace data, which records the time required for each iteration. This profile data file does not exist in single-operator scenarios.
op_summary_*.csv	AI Core and AI CPU operator data.
op_statistic_*.csv	Number of calls and time consumption of each AI Core and AI CPU operator type.
step_trace_*.csv	Iteration trace data. This profile data file does not exist in single-operator scenarios.
task_time_*.csv	Task Scheduler data.
fusion_op_*.csv	Operator fusion summary in a model. This profile data file does not exist in single-operator scenarios.
api_statistic_*.csv	Time spent by API execution at the CANN layer.
Note: The asterisk (*) indicates the timestamp.
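To check quickly which of the files in Table 1 an export actually produced, the file names can be matched against the patterns in the table. This is a sketch under the assumption that the exported names follow those patterns exactly; describe_outputs is a hypothetical helper, and the descriptions are shortened from Table 1.

```python
from fnmatch import fnmatch
from pathlib import Path

# File name patterns and shortened descriptions taken from Table 1.
TABLE_1 = {
    "msprof_*.json": "Timeline report",
    "step_trace_*.json": "Iteration trace data (timeline)",
    "op_summary_*.csv": "AI Core and AI CPU operator data",
    "op_statistic_*.csv": "Operator call counts and time consumption",
    "step_trace_*.csv": "Iteration trace data (summary)",
    "task_time_*.csv": "Task Scheduler data",
    "fusion_op_*.csv": "Operator fusion summary",
    "api_statistic_*.csv": "Time spent by API execution at the CANN layer",
}


def describe_outputs(output_dir):
    """Map each recognized file name in output_dir to its description."""
    found = {}
    for path in sorted(Path(output_dir).iterdir()):
        for pattern, description in TABLE_1.items():
            if fnmatch(path.name, pattern):
                found[path.name] = description
                break
    return found
```

For example, describe_outputs("mindstudio_profiler_output") returns a dictionary whose keys are the recognized file names; files that match no pattern, such as README.txt, are skipped.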

You are advised to use MindStudio Insight to analyze the .db file. For details, see MindStudio Insight User Guide.
To open a timeline .json file, enter chrome://tracing in the address bar of Google Chrome and drag the file into the blank area. Use the keyboard shortcuts to navigate: w (zoom in), s (zoom out), a (move left), and d (move right). The file shows the running timeline of the current AI task, such as the API call timeline during task running, as shown in Figure 1.
Figure 1 Viewing a .json file
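Besides viewing the timeline in the browser, you can also summarize it with a script. The sketch below assumes that the timeline file follows the Trace Event Format accepted by chrome://tracing, that is, a JSON array of events (or an object with a traceEvents array), and that operator executions are recorded as complete events ("ph": "X") with name and dur fields. The event names in the sample are invented for illustration; check a real file before relying on these assumptions.

```python
import json
from collections import defaultdict


def total_duration_by_name(trace):
    """Sum the dur field of complete ("X") events per event name, largest first."""
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for event in events:
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += float(event["dur"])
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Invented sample data; a real file would be read with json.load(open(path)).
sample = json.loads(
    '[{"name": "MatMul", "ph": "X", "ts": 0, "dur": 30},'
    ' {"name": "Add", "ph": "X", "ts": 30, "dur": 5},'
    ' {"name": "MatMul", "ph": "X", "ts": 40, "dur": 20},'
    ' {"name": "process_name", "ph": "M"}]'
)
print(total_duration_by_name(sample))
```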
You can open a summary .csv file directly. It contains the software and hardware data of the running AI task, such as the time each operator takes to run on the AI processor. Sort the fields to quickly find the information you need, as shown in Figure 2.
Figure 2 Viewing a .csv file

Performance Analysis

As shown above, msprof exports many profile data files, and they can be analyzed in various ways. The following describes several important files and how to analyze them.

Analyze the step_trace_*.csv file to obtain the iteration trace data. This file records the duration of each iteration.
Figure 3 Example of the step_trace_*.csv file
The main fields are as follows:
- Iteration Time: computation time of an iteration, including the time of the FP/BP and Grad Refresh phases.
- FP to BP Time: computation time of forward and backward propagation on the network.
- Iteration Refresh: iteration trailing time.
- Data Aug Bound: interval between two adjacent iterations.
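The fields above can be averaged across iterations to get a quick overview. The following sketch assumes that the column headers start with the field names listed above (possibly followed by a unit suffix such as "(us)") and that non-numeric cells such as N/A should be skipped. The sample values are invented for illustration.

```python
import csv
import io

FIELDS = ("Iteration Time", "FP to BP Time", "Iteration Refresh", "Data Aug Bound")


def mean_step_trace(csv_text):
    """Return the mean of each known field, skipping non-numeric cells."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    # Match each field to the first header that starts with its name.
    columns = {}
    for field in FIELDS:
        match = next((h for h in headers if h.strip().startswith(field)), None)
        if match is not None:
            columns[field] = match
    sums = {field: 0.0 for field in columns}
    counts = {field: 0 for field in columns}
    for row in reader:
        for field, header in columns.items():
            try:
                value = float(row[header])
            except (TypeError, ValueError):
                continue  # For example, an N/A cell.
            sums[field] += value
            counts[field] += 1
    return {field: sums[field] / counts[field] for field in columns if counts[field]}


# Invented sample data; read a real step_trace_*.csv file with open(path).read().
sample = """Iteration ID,FP to BP Time(us),Iteration Refresh(us),Data Aug Bound(us),Iteration Time(us)
1,800,100,N/A,900
2,820,120,60,940
"""
print(mean_step_trace(sample))
```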
Analyze the op_statistic_*.csv file to obtain the total number of calls and the total duration of each operator type. Check whether any operator type takes a long time to execute and whether those operators can be optimized.
Figure 4 Example of the op_statistic_*.csv file

You can sort the operators by Total Time to find out which type of operators takes a long time.
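The same sorting can be scripted, which also makes it easy to see what share of the total time each operator type accounts for. The sketch below assumes a Total Time column (possibly with a unit suffix) and an operator type column whose header starts with "OP Type"; the latter name and the sample rows are assumptions for illustration.

```python
import csv
import io


def find_column(headers, prefix):
    """Return the first header that starts with prefix (case-insensitive)."""
    for header in headers:
        if header.strip().lower().startswith(prefix.lower()):
            return header
    raise KeyError(prefix)


def rank_by_total_time(csv_text, type_col="OP Type", time_col="Total Time"):
    """Return (operator type, total time, percent of all time), largest first."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []
    type_header = find_column(headers, type_col)
    time_header = find_column(headers, time_col)
    rows = [(row[type_header], float(row[time_header])) for row in reader]
    grand_total = sum(total for _, total in rows)
    rows.sort(key=lambda item: item[1], reverse=True)
    return [
        (name, total, 100.0 * total / grand_total if grand_total else 0.0)
        for name, total in rows
    ]


# Invented sample data for illustration.
sample = """OP Type,Core Type,Count,Total Time(us)
Cast,AI_CORE,40,300
MatMul,AI_CORE,10,600
TopK,AI_CPU,2,100
"""
for name, total, percent in rank_by_total_time(sample):
    print(f"{name}: {total} ({percent:.1f}%)")
```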
Analyze the op_summary_*.csv file to obtain the basic information and time consumption of each operator. Find the operators with high time consumption and check whether they can be optimized.
Figure 5 Example of the op_summary_*.csv file

The Task Duration field specifies the operator time consumption. You can sort operators by Task Duration to find time-consuming operators, or sort them by Task Type to view the time-consuming operators running on different cores (such as AI Cores and AI CPUs).
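Both sorting steps can be combined in a script that groups operators by Task Type and keeps the slowest ones in each group. This sketch assumes columns whose headers start with "Task Type" and "Task Duration", as named above, plus an operator name column starting with "Op Name"; the latter name and the sample rows are assumptions for illustration.

```python
import csv
import io
from collections import defaultdict


def top_ops_by_task_type(csv_text, k=2, name_col="Op Name"):
    """Return the k longest-running operators for each Task Type."""
    reader = csv.DictReader(io.StringIO(csv_text))
    headers = reader.fieldnames or []

    def column(prefix):
        # First header that starts with the given prefix.
        return next(h for h in headers if h.strip().startswith(prefix))

    name_header = column(name_col)
    type_header = column("Task Type")
    duration_header = column("Task Duration")

    groups = defaultdict(list)
    for row in reader:
        groups[row[type_header]].append((row[name_header], float(row[duration_header])))
    return {
        task_type: sorted(ops, key=lambda op: op[1], reverse=True)[:k]
        for task_type, ops in groups.items()
    }


# Invented sample data for illustration.
sample = """Op Name,OP Type,Task Type,Task Duration(us)
conv1,Conv2D,AI_CORE,120.5
fc1,MatMul,AI_CORE,300.0
relu1,Relu,AI_CORE,15.2
topk1,TopK,AI_CPU,80.0
"""
print(top_ops_by_task_type(sample))
```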