Profile Data Collection with Environment Variables

Profile data collection with environment variables applies to training and online inference on the TensorFlow framework. Unlike collection through the TensorFlow framework API, this mode configures the profile data collection items by setting the PROFILING_OPTIONS environment variable directly in the training or online inference script.

Prerequisites

  • Training scenario:
    • Prepare a model trained on TensorFlow 1.x and a matching dataset, then port the model to the Ascend AI Processor. For details, see "Manual Porting" or "Automated Porting" in the TensorFlow 1.15 Model Porting Guide.
    • Prepare a model trained on TensorFlow 2.x and a matching dataset, then port the model to the Ascend AI Processor. For details, see "Manual Porting" in the TensorFlow 2.6.5 Model Porting Guide.
  • Online inference scenario: Download a pre-trained model and prepare the online inference script.

Profile Data Collection

The following example enables profiling and configures the collection items:
export PROFILING_MODE=true
export PROFILING_OPTIONS='{"output":"/tmp/profiling","training_trace":"on","task_trace":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}'

For details about PROFILING_OPTIONS, see Profiling Options.
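
When the output directory differs between runs, it can help to build the option string from a variable instead of editing the JSON by hand. The following is a minimal sketch, not a documented procedure: PROF_OUTPUT_DIR is a hypothetical helper variable, and the path is assumed to contain no double quotes or backslashes.

```shell
# Hypothetical helper variable holding the collection output path.
PROF_OUTPUT_DIR="/tmp/profiling"

export PROFILING_MODE=true

# Build the JSON string with printf so that only the output path varies.
# The option names and values are the ones used in the example above.
PROFILING_OPTIONS=$(printf '{"output":"%s","training_trace":"on","task_trace":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}' "$PROF_OUTPUT_DIR")
export PROFILING_OPTIONS

# Print the value so it can be checked before the job is started.
echo "$PROFILING_OPTIONS"
```

Printing the value before launching the training or inference script makes quoting mistakes in the JSON string easy to spot.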

If PROFILING_MODE is set to true but PROFILING_OPTIONS is not set, training_trace, task_trace, hccl, aicpu, and aic_metrics (PipeUtilization) are collected by default, and the collected data is saved to the directory where the current AI job is located. If PROFILING_MODE is set to true and at least one option in PROFILING_OPTIONS is set, the options that are not set take the default values described in Profiling Options.
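
The two cases described above can be told apart before a job starts with a small check. The sketch below is illustrative only: profiling_case is a hypothetical helper, it treats an empty PROFILING_OPTIONS the same as an unset one (an assumption), and the branch for a PROFILING_MODE other than true is not described in this section.

```shell
# Report which case the current environment falls into.
profiling_case() {
  if [ "${PROFILING_MODE:-}" != "true" ]; then
    # Not covered by the description above.
    echo "profiling switch not set"
  elif [ -z "${PROFILING_OPTIONS:-}" ]; then
    # Default items are collected; data goes to the AI job directory.
    echo "default collection items"
  else
    # Unset options take the defaults listed in Profiling Options.
    echo "custom options"
  fi
}

# Case 1: only the switch is set.
export PROFILING_MODE=true
unset PROFILING_OPTIONS
profiling_case

# Case 2: one option is set ("output" is taken from the example in this section).
export PROFILING_OPTIONS='{"output":"/tmp/profiling"}'
profiling_case
```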

Data Collection Description

After a job has run with PROFILING_OPTIONS set, parse the collected raw data and export the results as visualized profile data files. The exported files are saved in the PROF_XXX/mindstudio_profiler_output directory. For details, see Profile Data Parsing and Export (msprof Command).
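
Once parsing and export have finished, the result files can be located with standard tools. The following sketch assumes that the PROF_XXX directories sit somewhere below the configured output path; list_profiler_outputs is a hypothetical helper, and the parse and export step itself is not shown here.

```shell
# List exported .csv and .json files under every
# PROF_*/mindstudio_profiler_output directory below the given root.
list_profiler_outputs() {
  root="${1:-.}"
  # Nothing to list if the root directory does not exist yet.
  [ -d "$root" ] || return 0
  find "$root" -type f -path '*/PROF_*/mindstudio_profiler_output/*' \
    \( -name '*.csv' -o -name '*.json' \) | sort
}

# Example call with the output path used earlier in this section.
list_profiler_outputs /tmp/profiling
```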

The generated profile data is shown in Table 1.

Table 1 Introduction to profile data files

Argument | Profile Data File
Automatically generated by default | msprof (Timeline Report); op_summary_*.csv; op_statistic_*.csv; fusion_op_*.csv; step_trace (iteration trace data)
task_trace, task_time | The CANN level in msprof_*.json and the api_statistic_*.csv file; the Ascend Hardware level in msprof_*.json and the task_time_*.csv file; the HCCL level in msprof_*.json and the hccl_statistic_*.csv file; step_trace_*.json
runtime_api | The CANN_Runtime level in msprof_*.json and the api_statistic_*.csv file
hccl | The HCCL level in msprof_*.json and the hccl_statistic_*.csv file; api_statistic_*.csv
aicpu | aicpu_*.csv; dp_*.csv
aic_metrics | op_summary_*.csv
l2 | l2_cache_*.csv
msproftx | msproftx data
sys_hardware_mem_freq | On-chip memory read/write rate file; the LLC level in msprof_*.json and the llc_read_write_*.csv file; the NPU MEM level in msprof_*.json and the npu_mem_*.csv file; npu_module_mem_*.csv
llc_profiling | -
sys_io_sampling_freq | The NIC level in msprof_*.json and the nic_*.csv file; the RoCE level in msprof_*.json and the roce_*.csv file
sys_interconnection_freq | The PCIe level in msprof_*.json and the pcie_*.csv file; the HCCS level in msprof_*.json and the hccs_*.csv file
dvpp_freq | dvpp_*.csv
host_sys | The CPU Usage level in msprof_*.json and the host_cpu_usage_*.csv file; the Memory Usage level in msprof_*.json and the host_mem_usage_*.csv file
host_sys_usage | System CPU usage on the host; CPU usage of processes on the host; system memory usage on the host; memory usage of processes on the host
host_sys_usage_freq | -
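
For scripted checks, the mapping in Table 1 can be turned into a small lookup. The sketch below covers only a subset of rows whose CSV file names are stated explicitly in the table; csv_patterns_for is a hypothetical helper, not part of any toolkit.

```shell
# Map a collection option to the CSV file patterns listed for it in Table 1.
# Only a subset of rows is covered; unknown options print nothing and return 1.
csv_patterns_for() {
  case "$1" in
    runtime_api) echo "api_statistic_*.csv" ;;
    hccl)        echo "hccl_statistic_*.csv" ;;
    aicpu)       echo "aicpu_*.csv dp_*.csv" ;;
    aic_metrics) echo "op_summary_*.csv" ;;
    l2)          echo "l2_cache_*.csv" ;;
    dvpp_freq)   echo "dvpp_*.csv" ;;
    *)           return 1 ;;
  esac
}

# Example: print the patterns expected when aicpu collection is enabled.
csv_patterns_for aicpu
```

The patterns are printed as literal strings, so they can be passed to a tool such as find with -name to check whether the expected files were exported.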