ProfilingConfig Constructor

Applicability

Product	Supported
Atlas A3 training products/Atlas A3 inference products	√
Atlas A2 training products/Atlas A2 inference products	√
Atlas 200I/500 A2 inference products	☓
Atlas inference products	☓
Atlas training products	√

Function Description

Constructs a ProfilingConfig object that holds the profiling configuration.

Function Prototype

def __init__(self,
             enable_profiling=False,
             profiling_options=None)

Parameters

Option

Input/Output

Description

enable_profiling

Input

Profiling enable.

True: enabled. The profiling options are determined by profiling_options.
False (default): disabled.

profiling_options

Input

Profiling options.

output: path for storing profiling result files. Both absolute path and relative path (relative to the path where the command is run) are supported. The path cannot contain the following special characters: "\n", "\f", "\r", "\b", "\t", "\v", and "\u007F".
- An absolute path starts with a slash (/), for example, /home/output.
- A relative path starts with a directory name, for example, output.
- It takes precedence over ASCEND_WORK_PATH.
- This path does not need to be created in advance because it is automatically created during collection.
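The character restriction on output can be checked before the options are passed on. The following is a minimal sketch, not part of npu_bridge: the helper name check_output_path is hypothetical, and only the forbidden characters listed above are taken from this page.

```python
# Characters that the output path must not contain, as listed above.
FORBIDDEN_PATH_CHARS = ("\n", "\f", "\r", "\b", "\t", "\v", "\u007f")


def check_output_path(path):
    """Return the forbidden characters found in path (empty list if none).

    Hypothetical helper for illustration; not part of npu_bridge.
    """
    return [ch for ch in FORBIDDEN_PATH_CHARS if ch in path]


# An absolute path without forbidden characters yields an empty list.
print(check_output_path("/home/output"))
# A relative path containing a tab yields a list holding the tab character.
print(check_output_path("output\tprofiling"))
```

An empty result means the path passes this check only; whether the directory can actually be created is still decided at collection time.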
storage_limit: maximum size of files that can be stored in a specified disk directory. If the size of profile data files in the disk is about to use up the maximum storage space specified by this option or the total remaining disk space is about to be used up (remaining space ≤ 20 MB), the earliest files in the disk are aged and deleted.
The value range is [200, 4294967295], and the unit is MB. The unit must be included when you set this parameter, for example, 200 MB.

If this parameter is not set, the default value is 90% of the available space of the disk where the directory for storing profile data files is located.
training_trace: iteration tracing switch. When set to on, it collects software profile data of the training job and the AI Software Stack, focusing on forward and backward propagation and on gradient aggregation and update. This option must be set to on when forward and backward propagation operator data is collected.
task_trace and task_time: switches that control collection of the operator delivery and execution durations. Related duration data must be output to the task_time, op_summary, and op_statistic files. Possible configuration values are as follows:
- on: switch on. This is the default value, delivering the same effect as l1.
- off: switch off.
- l0: collects operator delivery and execution duration data. Compared with l1, l0 does not collect basic operator information, so the performance overhead during collection is smaller, and this enables more accurate collection of statistics on time duration data.
- l1: collects operator delivery and execution duration data, as well as basic operator information, to provide more comprehensive performance analysis data.
When profiling is enabled to collect training data, task_trace and training_trace must be set to on.
ge_api: switch that controls collection of the time consumption data of dynamic-shape operators in the host scheduling phase. Possible values are:
- off: switch off. The default value is off.
- l0: collects the time consumption data of dynamic-shape operators in the main host scheduling phase to facilitate accurate statistics.
- l1: collects finer-grained time consumption data of dynamic-shape operators in the host scheduling phase to provide more comprehensive performance analysis data.
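Because these switches accept only a small set of values, a typo such as "L1" or "true" is easy to catch up front. The sketch below is illustrative only: the table ALLOWED_VALUES and the helper invalid_switches are hypothetical names, and the allowed values are copied from the descriptions of task_trace, task_time, and ge_api above.

```python
# Allowed values for the enumerated switches described above.
ALLOWED_VALUES = {
    "task_trace": {"on", "off", "l0", "l1"},
    "task_time": {"on", "off", "l0", "l1"},
    "ge_api": {"off", "l0", "l1"},
}


def invalid_switches(options):
    """Return {name: value} for each listed switch whose value is not allowed.

    Hypothetical helper for illustration; options not in ALLOWED_VALUES
    are ignored rather than rejected.
    """
    return {
        name: value
        for name, value in options.items()
        if name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]
    }


# ge_api has no "on" value, so it is reported; task_trace "l0" is accepted.
print(invalid_switches({"task_trace": "l0", "ge_api": "on"}))
```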
hccl: communication data collection switch, either on or off (default).
NOTE:
This switch will be deprecated in later versions. To control data collection, use task_time.
aicpu: whether to collect details about the AI CPU operator, such as the operator execution time and data copy time. The value can be on or off (default).
fp_point: start point of the forward propagated operator in iteration traces, to record the start timestamp of forward propagation. Set the value to the name of the top operator in forward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "fp_point":""), and the system will automatically identify the start point of the forward propagated operator.
bp_point: end point of the backward propagated operator in iteration traces, to record the end timestamp of backward propagation. bp_point and fp_point are used to compute the time used by forward and backward propagation. Set the value to the name of the bottom operator in backward propagation. You can save the graph as a .pbtxt file by using tf.io.write_graph in the training script to obtain the name. Alternatively, you can leave this option empty (for example, "bp_point":""), and the system will automatically identify the end point of the backward propagated operator.
aic_metrics: AI Core metric to profile. The options are as follows:
- ArithmeticUtilization: arithmetic utilization ratio.
- PipeUtilization (default): ratio of time taken by the compute units to that of MTEs.
- Memory: ratio of external memory read/write instructions.
- MemoryL0: ratio of internal memory L0 read/write instructions.
- MemoryUB: ratio of internal memory UB read/write instructions.
- ResourceConflictRatio: ratio of pipeline queue instructions.
- L2Cache: read/write L2 cache hits and re-allocations after cache misses.
  Atlas inference products: This parameter is not supported.
  Atlas training products: This parameter is not supported.
- MemoryAccess: bandwidth of the operator's memory access on cores.
  Atlas inference products: This parameter is not supported.
  Atlas training products: This parameter is not supported.
NOTE:
The registers whose data is to be collected can be customized, for example, "aic_metrics":"Custom:0x49,0x8,0x15,0x1b,0x64,0x10".
- The Custom field indicates the customization type. It is set to specific register values in the range of [0x1, 0x6E].
- A maximum of eight registers can be configured, which are separated with commas (,).
- The register value can be in hexadecimal or decimal format.
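The rules for a custom register list can be expressed as a small builder. This is a sketch under stated assumptions: build_custom_aic_metrics is a hypothetical helper, it always emits hexadecimal values, and it assumes that at least one register is required (the text above only states the maximum of eight).

```python
def build_custom_aic_metrics(registers):
    """Build an aic_metrics value of the form "Custom:0x49,0x8,...".

    Hypothetical helper for illustration; not part of npu_bridge.
    """
    registers = list(registers)
    # Assumption: an empty list is rejected; the documented limit is eight.
    if not 1 <= len(registers) <= 8:
        raise ValueError("between one and eight registers are required")
    for reg in registers:
        if not 0x1 <= reg <= 0x6E:
            raise ValueError(f"register {reg:#x} is outside [0x1, 0x6E]")
    return "Custom:" + ",".join(hex(reg) for reg in registers)


# Reproduces the value used in the example above.
print(build_custom_aic_metrics([0x49, 0x8, 0x15, 0x1B, 0x64, 0x10]))
```

Decimal input works as well, because the helper takes integers and only the output is rendered in hexadecimal.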
l2: switch that controls collection of the L2 cache and TLB page table cache hit ratios, either on or off (default).
- Atlas inference products: supports collection of the L2 cache hit ratio.
- Atlas training products: supports collection of the L2 cache hit ratio.
- Atlas A2 training products/Atlas A2 inference products: supports collection of the L2 cache and TLB page table cache hit ratios. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
- Atlas A3 training products/Atlas A3 inference products: supports collection of the L2 cache and TLB page table cache hit ratios. Setting aic_metrics to L2Cache is recommended for analyzing the number of hits on L2 cache from the AI Core.
msproftx: switch that controls whether the msproftx user program and upper-layer framework program output profile data, either on or off (default).
To use this option, add mstx API or msproftx API calls to the application script. The mstx API is recommended.
runtime_api: runtime API data collection switch, either on or off (default). You can collect runtime API profile data, including the synchronous/asynchronous memory replication latencies between the host and device and between devices.
sys_hardware_mem_freq: switch that controls the collection of the on-chip memory, QoS transmission bandwidth, LLC L3 cache bandwidth, accelerator bandwidth, SoC transmission bandwidth, and component memory usage. The collected content varies depending on the product. The actual result prevails. The value range is [1, 100], in Hz.
Sampling memory data in an environment where glibc 2.34 or earlier is installed may trigger known Bug 19329. Upgrading glibc resolves this problem.

NOTE:
For the following products, you are advised not to increase the profiling frequency after the profiling task is complete. Otherwise, SoC transmission bandwidth data may be lost.
- Atlas 200I/500 A2 inference products
- Atlas A2 training products/Atlas A2 inference products
- Atlas A3 training products/Atlas A3 inference products
llc_profiling: LLC events to profile. Possible values are as follows:
- read (default): read events, that is, the L3 cache read rate.
- write: write events, that is, the L3 cache write rate.
sys_io_sampling_freq: NIC and RoCE data collection frequency. The value range is [1, 100], in Hz.
- Atlas inference products: This parameter is not supported.
- Atlas A2 training products/Atlas A2 inference products: supports NIC and RoCE collection.
- Atlas A3 training products/Atlas A3 inference products: supports NIC and RoCE collection.
sys_interconnection_freq: frequency of collecting collective communication bandwidth data (HCCS), SIO data, PCIe data, and inter-chip transmission bandwidth information. The value range is [1, 50], in Hz.
- Atlas training products: supports HCCS and PCIe data collection.
- Atlas A2 training products/Atlas A2 inference products: supports HCCS, PCIe data, and inter-chip transmission bandwidth information collection.
- Atlas A3 training products/Atlas A3 inference products: supports HCCS, PCIe data, inter-chip transmission bandwidth information, and SIO data collection.
dvpp_freq: DVPP collection frequency. The value range is [1, 100], in Hz.
instr_profiling: AI Core and AI Vector bandwidth and latency collection switch. The value can be on or off (default).
- Atlas training products: This function is not supported.
- Atlas A2 training products/Atlas A2 inference products: This switch is not supported. This function is controlled through instr_profiling_freq.
- Atlas A3 training products/Atlas A3 inference products: This switch is not supported. This function is controlled through instr_profiling_freq.
instr_profiling_freq: AI Core and AI Vector bandwidth and latency collection frequency. Configuring the collection frequency enables the related collection capability. The value range is [300, 30000], in Hz.
- Atlas training products: This function is not supported.
- Atlas A2 training products/Atlas A2 inference products: supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
- Atlas A3 training products/Atlas A3 inference products: supported. However, instr_profiling_freq is mutually exclusive with training_trace, task_trace, hccl, aicpu, fp_point, bp_point, aic_metrics, l2, task_time, and runtime_api, so they cannot be executed at the same time.
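The mutual-exclusion rule for instr_profiling_freq can be checked on an options dictionary before it is serialized. The sketch below is illustrative: conflicting_options is a hypothetical helper, and it assumes that the mere presence of an excluded key counts as a conflict, regardless of its value.

```python
# Options listed above as mutually exclusive with instr_profiling_freq.
EXCLUSIVE_WITH_INSTR_PROFILING_FREQ = (
    "training_trace", "task_trace", "hccl", "aicpu", "fp_point", "bp_point",
    "aic_metrics", "l2", "task_time", "runtime_api",
)


def conflicting_options(options):
    """Return the option names that conflict with instr_profiling_freq.

    Hypothetical helper for illustration. Assumption: any excluded key
    that is present is reported, even if its value is "off".
    """
    if "instr_profiling_freq" not in options:
        return []
    return [name for name in EXCLUSIVE_WITH_INSTR_PROFILING_FREQ if name in options]


# task_trace and l2 are reported; output is unrelated to the rule.
print(conflicting_options({
    "output": "/tmp/profiling",
    "instr_profiling_freq": "1000",
    "task_trace": "on",
    "l2": "on",
}))
```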
host_sys: switch for collecting host profile data. You can select one or more options and separate them with commas (,), for example, "host_sys": "cpu,mem".
- cpu: process CPU utilization
- mem: process memory utilization
host_sys_usage: Host-side system and process CPU and memory data collection option, selected from cpu and mem. You can select one or more options and separate them with commas (,).
host_sys_usage_freq: Host-side system and process CPU and memory data collection frequency. The value range is [1, 50] and the default value is 50. The unit is Hz.
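The frequency options above each have a documented range, which can be collected in one table and checked together. This is a minimal sketch, assuming values may be given either as integers or as numeric strings; FREQ_RANGES_HZ and out_of_range are hypothetical names, and the ranges are copied from the descriptions above.

```python
# Documented value ranges, in Hz, for the frequency options described above.
FREQ_RANGES_HZ = {
    "sys_hardware_mem_freq": (1, 100),
    "sys_io_sampling_freq": (1, 100),
    "sys_interconnection_freq": (1, 50),
    "dvpp_freq": (1, 100),
    "instr_profiling_freq": (300, 30000),
    "host_sys_usage_freq": (1, 50),
}


def out_of_range(options):
    """Return {name: (low, high)} for each frequency option outside its range.

    Hypothetical helper for illustration; not part of npu_bridge.
    """
    errors = {}
    for name, (low, high) in FREQ_RANGES_HZ.items():
        if name in options:
            value = int(options[name])  # accepts 50 as well as "50"
            if not low <= value <= high:
                errors[name] = (low, high)
    return errors


# dvpp_freq is within [1, 100]; sys_interconnection_freq exceeds [1, 50].
print(out_of_range({"dvpp_freq": "100", "sys_interconnection_freq": 60}))
```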

NOTE:

- fp_point and bp_point need to be configured manually only in the dynamic shape scenario.
- Online inference supports task_trace and aicpu but does not support training_trace.

Example:

profiling_options = '{"output":"/tmp/profiling","training_trace":"on","task_trace":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}'
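Since profiling_options is a JSON string, one way to avoid hand-written quoting mistakes is to build it from a dictionary with the standard json module. The sketch below only produces the string; passing it to ProfilingConfig is left out so that the snippet runs without npu_bridge installed.

```python
import json

# Same options as in the example above, written as a Python dictionary.
options = {
    "output": "/tmp/profiling",
    "training_trace": "on",
    "task_trace": "on",
    "fp_point": "",
    "bp_point": "",
    "aic_metrics": "PipeUtilization",
}

# Compact separators give a string without extra spaces.
profiling_options = json.dumps(options, separators=(",", ":"))
print(profiling_options)
```

The resulting string can then be supplied as the profiling_options argument in place of a hand-typed literal.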

Returns

A ProfilingConfig object, which is passed as the profiling_config argument to NPURunConfig.

Constraints

None

Example

import tensorflow as tf
from npu_bridge.npu_init import *
...
# Profiling options are passed as a JSON string.
profiling_options = '{"output":"/home/test/output","task_trace":"on"}'
profiling_config = ProfilingConfig(enable_profiling=True, profiling_options=profiling_options)
session_config = tf.ConfigProto(allow_soft_placement=True)
# The ProfilingConfig object is handed to NPURunConfig.
config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)
