ProfilingConfig Constructor

Description

Constructs an object of class ProfilingConfig as the profiling configuration.

Prototype

def __init__(self,
             enable_profiling=False,
             profiling_options=None)

Options

Option

Input/Output

Description

enable_profiling

Input

Whether to enable profiling.

  • True: enabled. The profiling options are determined by profiling_options.
  • False (default): disabled.

profiling_options

Input

Profiling options.

  • output: path for storing profiling result files. Both absolute path and relative path (relative to the path where the command is run) are supported. The path cannot contain the following special characters: "\n", "\f", "\r", "\b", "\t", "\v", and "\u007F".
    • An absolute path starts with a slash (/), for example, /home/HwHiAiUser/output.
    • A relative path starts with a directory name, for example, output.
    • This parameter takes precedence over ASCEND_WORK_PATH.
    • This path does not need to be created in advance because it is automatically created during collection.
  • storage_limit: maximum amount of disk space, in the specified directory, that profile data files may occupy. When the profile data files are about to reach this limit, or the remaining disk space is about to run out (remaining space ≤ 20 MB), the oldest files in the directory are aged out and deleted.

    The value range is [200, 4294967295], and the unit is MB. The unit must be included when you set this parameter, for example, 200 MB.

    If this parameter is not set, the default value is 90% of the available space of the disk where the directory for storing profile data files is located.

  • training_trace: iteration trace switch. Collects software profile data of the training job across the AI software stack, focusing on forward and backward propagation and on gradient aggregation and update. This option must be set to on when forward and backward propagation operator data is collected.
  • task_trace and task_time: switches that control collection of operator delivery and execution durations. The duration data is output to the task_time, op_summary, and op_statistic files. Possible values are as follows:
    • on: switch on. This is the default value, delivering the same effect as l1.
    • off: switch off.
    • l0: collects operator delivery and execution duration data only. Compared with l1, l0 does not collect basic operator information, so the collection overhead is lower and the duration statistics are more accurate.
    • l1: collects operator delivery and execution duration data, as well as basic operator information, to provide more comprehensive performance analysis data.

    When Profiling is enabled to collect training data, task_trace and training_trace must be set to on.

  • hccl (optional): HCCL tracing switch, either on or off (default).
    NOTE:

    This switch is deprecated and will be removed in later versions. To control data collection, use task_trace and task_time.

  • aicpu: whether to collect details about the AI CPU operator, such as the operator execution time and data copy time. The value can be on or off (default). A value other than on or off is equivalent to off.
  • fp_point: start point of the forward propagation operator in the iteration trace, used to record the start timestamp of forward propagation. Set the value to the name of the first operator in forward propagation. You can obtain the name by saving the graph as a .pbtxt file with tf.io.write_graph in the training script. Alternatively, leave this option empty (for example, "fp_point":""), and the system automatically identifies the start point of forward propagation.
  • bp_point: end point of the backward propagation operator in the iteration trace, used to record the end timestamp of backward propagation. fp_point and bp_point together are used to compute the time spent in forward and backward propagation. Set the value to the name of the last operator in backward propagation. You can obtain the name by saving the graph as a .pbtxt file with tf.io.write_graph in the training script. Alternatively, leave this option empty (for example, "bp_point":""), and the system automatically identifies the end point of backward propagation.
  • aic_metrics: AI Core metrics to profile.
    • ArithmeticUtilization: arithmetic utilization ratio.
    • PipeUtilization (default): ratio of time taken by the compute units to that of MTEs.
    • Memory: ratio of external memory read/write instructions.
    • MemoryL0: ratio of internal memory L0 read/write instructions.
    • MemoryUB: ratio of internal memory UB read/write instructions.
    • ResourceConflictRatio: ratio of pipeline queue instructions.

    Atlas Training Series Product: AI Core collection is supported, but AI Vector Core and L2 cache parameters are not supported.

    NOTE:
    The registers whose data is to be collected can be customized, for example, "aic_metrics":"Custom:0x49,0x8,0x15,0x1b,0x64,0x10".
    • The Custom field indicates the custom type and is set to a specific register value. The value range is [0x1, 0x6E].
    • A maximum of eight registers can be configured, which are separated with commas (,).
    • The register value can be in hexadecimal or decimal format.
  • l2: L2 cache profiling switch, either on or off (default).
  • msproftx: switch that allows user programs and upper-layer frameworks to output profile data through msproftx, either on or off (default).
  • runtime_api: Runtime API data collection switch, either on or off (default). Collects Runtime API profile data, including synchronous/asynchronous memory copy latencies between the host and device and between devices.
  • sys_hardware_mem_freq: frequency for collecting on-chip memory data, QoS bandwidth and memory information, LLC read/write bandwidth data, Acc PMU data, SoC transmission bandwidth data, and component memory information. The value range is [1, 100]. The unit is Hz.

    The support for different products varies.

    NOTE:

    Sampling memory data in the environment where glibc (2.34 or an earlier version) is installed may trigger a known Bug 19329. This problem can be solved by upgrading the glibc version.

  • llc_profiling: LLC events to profile. Possible values are as follows:
    • Atlas Training Series Product: read (read event, L3 cache read rate) or write (write event, L3 cache write rate). Defaults to read.
  • sys_io_sampling_freq: NIC and RoCE collection frequency. The value range is [1,100]. The unit is Hz.
    • Atlas Training Series Product: supports NIC and RoCE collection.
  • sys_interconnection_freq: collection frequency for HCCS bandwidth data, PCIe data, and inter-chip transmission bandwidth data. The value range is [1, 50]. The unit is Hz.
    • Atlas Training Series Product: supports HCCS and PCIe data collection.
  • dvpp_freq: DVPP collection frequency. The value range is [1,100]. The unit is Hz.
  • instr_profiling_freq: AI Core and AI Vector bandwidth and latency collection frequency. The value range is [300, 30000]. The unit is cycle.
    • Atlas Training Series Product: Not supported.
  • host_sys: switch for collecting host profile data. You can select one or more options and separate them with commas (,), for example, "host_sys": "cpu,mem".
    • cpu: process CPU utilization
    • mem: process memory utilization
  • host_sys_usage: CPU and memory data of the system and all processes on the host, selected from cpu and mem. You can select one or more options and separate them with commas (,).
  • host_sys_usage_freq: collection frequency of CPU and memory data of the system and all processes on the host. The value range is [1, 50] and the default value is 50. The unit is Hz.
NOTE:
  • fp_point and bp_point require manual configuration only in the dynamic shape scenario.
  • Online inference supports task_trace and aicpu but does not support training_trace.
Example:
profiling_options = '{"output":"/tmp/profiling","training_trace":"on","task_trace":"on","fp_point":"","bp_point":"","aic_metrics":"PipeUtilization"}'
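Because the option string is plain JSON, it can also be assembled from a dict and serialized, which avoids quoting mistakes. The sketch below is illustrative only: the custom_aic_metrics helper is not part of the npu_bridge API, and it simply enforces the Custom register constraints described above (at most eight registers, each in [0x1, 0x6E]).

```python
import json

def custom_aic_metrics(registers):
    """Build an aic_metrics "Custom" value from register numbers.

    Illustrative helper (not part of the npu_bridge API); it applies
    the documented constraints: 1 to 8 registers, each in [0x1, 0x6E].
    """
    if not 1 <= len(registers) <= 8:
        raise ValueError("1 to 8 registers allowed")
    if any(not 0x1 <= r <= 0x6E for r in registers):
        raise ValueError("register value out of range [0x1, 0x6E]")
    return "Custom:" + ",".join(hex(r) for r in registers)

# Assemble the options as a dict, then serialize to the JSON string
# that ProfilingConfig expects in profiling_options.
options = {
    "output": "/tmp/profiling",
    "training_trace": "on",
    "task_trace": "on",
    "fp_point": "",
    "bp_point": "",
    "aic_metrics": custom_aic_metrics([0x49, 0x8, 0x15, 0x1b]),
}
profiling_options = json.dumps(options)
print(profiling_options)
```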

Returns

An object of the ProfilingConfig class, passed to NPURunConfig as the profiling_config argument.

Restrictions

None

Examples

from npu_bridge.npu_init import *
...
profiling_options = '{"output":"/home/HwHiAiUser/output","task_trace":"on"}'
profiling_config = ProfilingConfig(enable_profiling=True, profiling_options=profiling_options)
session_config = tf.ConfigProto(allow_soft_placement=True)
config = NPURunConfig(profiling_config=profiling_config, session_config=session_config)
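When fp_point and bp_point must be filled in manually (the dynamic shape scenario), the operator names can be read from a graph dumped with tf.io.write_graph, as described under those options. A minimal sketch of pulling candidate node names out of such a .pbtxt dump; the node_names helper and the sample graph text are illustrative, not part of any Ascend tooling:

```python
import re

# In the training script (TF 1.x style), the graph can be dumped with:
#   tf.io.write_graph(sess.graph_def, "/tmp", "graph.pbtxt", as_text=True)

def node_names(pbtxt_text):
    """Return the node names found in a GraphDef .pbtxt dump."""
    return re.findall(r'^\s*name:\s*"([^"]+)"', pbtxt_text, flags=re.M)

# Hypothetical fragment of a dumped graph, for demonstration only.
sample = '''
node {
  name: "conv1/Conv2D"
  op: "Conv2D"
}
node {
  name: "gradients/conv1/Conv2D_grad/Conv2DBackpropFilter"
  op: "Conv2DBackpropFilter"
}
'''
print(node_names(sample))
```

Forward-propagation names (candidates for fp_point) typically appear near the top of the dump, while backward-propagation names (candidates for bp_point) usually sit under a gradients/ prefix.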