msprof (Timeline Report)
Availability:
- Atlas 200/500 A2 Inference Product
- Atlas Inference Series Product
- Atlas Training Series Product
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
- Atlas A3 Training Series Product
Timeline report: msprof_*.json.
The following figure shows a sample msprof_*.json file opened in chrome://tracing.
As shown in Figure 1, the timeline summary data is displayed in the following areas:
- Area 1: data at the application layer, including the time consumption data of upper-layer applications. This data is collected only in msproftx or PyTorch scenarios.
- Area 2: data at the CANN layer, including the time consumption data of components (such as Runtime) and nodes (operators).
- Area 3: bottom-layer NPU data, including the time consumption data and iteration trace data of each task stream under Ascend Hardware, the communication data under Communication and Overlap Analysis, and other Ascend AI Processor system data.
- Area 4: details about each operator and API in a timeline (displayed when you click a timeline).
- Data of the timeline report is described in detail in Profile Data File References.
- The data in each area of the figure above depends on the profiling scenario. Area 1 is generated only when data is profiled in msproftx or PyTorch scenarios, and the communication data under Communication and Overlap Analysis is available only in multi-rank, multi-node, and cluster communication scenarios. The data actually collected prevails.
- The msprof_*.json file displays data within iterations. Data outside iterations is not displayed.
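Because the file can be opened in chrome://tracing, it can also be inspected with a script. The following is a minimal sketch that totals the duration of complete events per layer, assuming the file follows the Chrome trace event format (`ph`, `pid`, `ts`, `dur`, and `process_name` metadata events). The layer labels and operator names in the sample are hypothetical, and the exact fields written by msprof are not confirmed here.

```python
import json
from collections import defaultdict


def load_events(path):
    """Read a trace file; the format allows a bare list or a {"traceEvents": [...]} object."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    return data["traceEvents"] if isinstance(data, dict) else data


def summarize_by_process(events):
    """Sum the duration of complete ("X") events per process label."""
    # Metadata events (ph == "M", name == "process_name") map a pid to a label.
    names = {
        e["pid"]: e["args"]["name"]
        for e in events
        if e.get("ph") == "M" and e.get("name") == "process_name"
    }
    totals = defaultdict(float)
    for e in events:
        if e.get("ph") == "X":
            totals[names.get(e["pid"], str(e["pid"]))] += float(e.get("dur", 0))
    return dict(totals)


# Hypothetical miniature trace standing in for a real msprof_*.json file.
sample = [
    {"ph": "M", "name": "process_name", "pid": 1, "args": {"name": "CANN"}},
    {"ph": "M", "name": "process_name", "pid": 2, "args": {"name": "Ascend Hardware"}},
    {"ph": "X", "name": "MatMul", "pid": 1, "tid": 10, "ts": 100.0, "dur": 5.0},
    {"ph": "X", "name": "MatMul", "pid": 2, "tid": 3, "ts": 120.0, "dur": 40.0},
    {"ph": "X", "name": "Add", "pid": 2, "tid": 3, "ts": 170.0, "dur": 2.5},
]

summary = summarize_by_process(sample)
print(summary)
```

For a real file, `summarize_by_process(load_events(path))` would be the entry point, where `path` points to the exported timeline file.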
Operator Delivery Direction Check
When viewing a .json file in chrome://tracing, select the options under Flow events to display connection lines that show the delivery and execution mappings between application-layer operators and NPU operators. See Figure 2.
The mappings include:
- async_npu: delivery and execution mapping from application-layer operators to NPU operators on Ascend Hardware.
- MsTx: delivery and execution mapping from training/inference process marker tasks to NPU marker operators on Ascend Hardware. This mapping is generated when the aclprofMarkEx API is called to record markers.
- async_task_queue: mapping from enqueuing to dequeuing at the application layer.
- HostToDevice: delivery and execution mapping from CANN-layer nodes (operators) to NPU operators on Ascend Hardware (host to device).
- HostToDevice: delivery and execution mapping from CANN-layer nodes (operators) to communication operators (host to device).
- fwdbwd: mapping from forward APIs to backward APIs.
- Because the Ascend AI Processor frequency measured by software deviates from the actual frequency, and time synchronization between the host and device is not exact, lower-layer operators may appear misplaced on the timeline, and some connection lines may fail to be drawn as a result.
- Whether mappings between layers are displayed depends on whether the data is collected in a specific scenario.
You can click the operator or API at each end of a connection line to view the operator delivery direction. See Figure 3.
View the inbound and outbound directions of an operator or API in the Event(s) column. View the information at both ends of a mapping in the Link column.
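The connection lines can also be examined offline. The sketch below pairs flow-start and flow-end events and reports the delay between delivery and execution for each mapping. It assumes the Chrome trace flow event convention (`ph` equal to `s` for start and `f` for end, matched by `id`) and assumes that the mapping name, such as HostToDevice, is stored in the `cat` field. Both points are assumptions to verify against a real file, and the sample values are hypothetical.

```python
from collections import defaultdict


def pair_flows(events):
    """Pair flow-start ("s") and flow-end ("f") events that share (cat, id)."""
    flows = sorted(
        (e for e in events if e.get("ph") in ("s", "f")),
        key=lambda e: e["ts"],
    )
    open_starts = {}
    delays = defaultdict(list)
    for e in flows:
        key = (e.get("cat"), e.get("id"))
        if e["ph"] == "s":
            open_starts[key] = e
        elif key in open_starts:
            start = open_starts.pop(key)
            # Delay between delivery and the start of execution.
            delays[key[0]].append(e["ts"] - start["ts"])
    # Starts that never found an end correspond to lines that are not drawn.
    return dict(delays), sorted(open_starts)


# Hypothetical flow events; "cat" carrying the mapping name is an assumption.
sample = [
    {"ph": "s", "cat": "HostToDevice", "id": 7, "pid": 1, "tid": 10, "ts": 100.0},
    {"ph": "f", "cat": "HostToDevice", "id": 7, "pid": 2, "tid": 3, "ts": 120.0},
    {"ph": "s", "cat": "async_npu", "id": 8, "pid": 0, "tid": 5, "ts": 90.0},
    {"ph": "f", "cat": "async_npu", "id": 8, "pid": 2, "tid": 3, "ts": 120.0},
    {"ph": "s", "cat": "HostToDevice", "id": 9, "pid": 1, "tid": 10, "ts": 130.0},
]

delays, unmatched = pair_flows(sample)
print(delays)
print(unmatched)
```

Unmatched entries are only a hint: a missing end event may simply mean that the corresponding layer was not collected in the profiling scenario.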
AI Core Frequency Viewing
Availability:
- Atlas 200/500 A2 Inference Product
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
- Atlas A3 Training Series Product
The AI Core Freq layer in the msprof_*.json file displays the frequency changes of AI Cores during AI task running, as shown in Figure 4.
In this example, the AI Cores run at a high frequency at timestamp 148089.72045898438, but the frequency drops at timestamp 170178.44116210938, which indicates degraded AI task performance during that period. The AI Core frequency may be lowered for reasons such as the following. First, when the temperature rises and the built-in protection mechanism is activated, the frequency may be reduced to prevent overheating. Second, when no AI task is in progress and the AI Cores enter a low-power state, the frequency may also be lowered.
When the frequency changes, there is a delay of 0 ms to 1 ms between the actual change and the time recorded by the software. Because of this delay, the measured execution time of operators running just before or after a frequency change may differ from the actual time.
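Frequency drops like the one described above can be located with a script. The following is a minimal sketch, assuming the AI Core Freq layer is stored as Chrome trace counter events (`ph` equal to `C`) named `AI Core Freq` with a single value in `args`, and assuming `ts` is in microseconds as in the Chrome trace format. The key name `MHz` and the frequency values are hypothetical; only the two timestamps come from the example above.

```python
def find_frequency_drops(events, track="AI Core Freq"):
    """Return (timestamp, old_value, new_value) for every decrease on a counter track."""
    samples = sorted(
        (e["ts"], next(iter(e["args"].values())))
        for e in events
        if e.get("ph") == "C" and e.get("name") == track
    )
    return [
        (t1, f0, f1)
        for (_, f0), (t1, f1) in zip(samples, samples[1:])
        if f1 < f0
    ]


def is_near_change(op_ts, change_ts, window_us=1000.0):
    """True if an operator timestamp lies within the 0-1 ms monitoring delay."""
    return abs(op_ts - change_ts) <= window_us


# Hypothetical counter samples; the frequency values are illustrative only.
sample = [
    {"ph": "C", "name": "AI Core Freq", "ts": 148089.72045898438, "args": {"MHz": 1800}},
    {"ph": "C", "name": "AI Core Freq", "ts": 170178.44116210938, "args": {"MHz": 800}},
]

drops = find_frequency_drops(sample)
print(drops)
print(is_near_change(170500.0, drops[0][0]))
```

Operators flagged by `is_near_change` are the ones whose measured execution time is most likely to be affected by the monitoring delay.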
SIO Data Analysis
Availability:
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product (the data is always 0 and cannot be used for reference)
- Atlas A3 Training Series Product
The SIO layer in the msprof_*.json file displays the transmission bandwidth between dies in the Atlas A3 Training Series Product.
The horizontal coordinate of each color block in the figure corresponds to the time (unit: ms), and the vertical coordinate corresponds to the bandwidth value (unit: MB/s).
| Field | Description |
|---|---|
| dat_rx | RX bandwidth of the data stream channel. |
| dat_tx | TX bandwidth of the data stream channel. |
| req_rx | RX bandwidth of the request stream channel. |
| req_tx | TX bandwidth of the request stream channel. |
| rsp_rx | RX bandwidth of the response stream channel. |
| rsp_tx | TX bandwidth of the response stream channel. |
| snp_rx | RX bandwidth of the monitor stream channel. |
| snp_tx | TX bandwidth of the monitor stream channel. |
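To compare channels numerically rather than by eye, the bandwidth samples can be aggregated per field. The sketch below is illustrative only: it assumes the SIO layer is exported as Chrome trace counter events (`ph` equal to `C`) named `SIO`, with the fields in the table as keys of `args` and values in MB/s. The track name and sample values are hypothetical.

```python
from collections import defaultdict


def bandwidth_stats(events, track):
    """Compute mean and peak bandwidth (MB/s) for each field of a counter track."""
    series = defaultdict(list)
    for e in events:
        if e.get("ph") == "C" and e.get("name") == track:
            for field, value in e["args"].items():
                series[field].append(float(value))
    return {
        field: {"mean": sum(values) / len(values), "peak": max(values)}
        for field, values in series.items()
    }


# Hypothetical samples; the track name "SIO" and the numbers are assumptions.
sample = [
    {"ph": "C", "name": "SIO", "ts": 10.0, "args": {"dat_rx": 100, "dat_tx": 50}},
    {"ph": "C", "name": "SIO", "ts": 20.0, "args": {"dat_rx": 300, "dat_tx": 150}},
]

stats = bandwidth_stats(sample, "SIO")
print(stats)
```

The same function could be pointed at another counter track by changing the `track` argument, provided that track uses the same event layout.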
QoS Data Analysis
The QoS layer in the msprof_*.json file displays the device QoS bandwidth information.
Availability:
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
- Atlas A3 Training Series Product
The horizontal coordinate of each color block in the figure corresponds to the time (unit: ms), and the vertical coordinate corresponds to the bandwidth value (unit: MB/s).
Computation and Communication Operator Fusion (MC2)
Availability:
- Atlas Inference Series Product
- Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
This data is available in scenarios where computation and communication operators are fused.
MC2 (Matrix Computation & Communication) is the general name for a series of computation and communication fusion operators in CANN. It fuses a communication operator and a computation operator that would otherwise run serially and, through tiling, splits them into multiple rounds of communication and computation. The rounds are pipelined in parallel, which hides the communication time and improves the overall execution performance.
Generally, a fusion operator is named after the original computation and communication operators and their dependency. For example, the AllgatherMatmul fusion operator fuses the communication operator Allgather with the computation operator Matmul, and Matmul depends on the Allgather output.
commTurn: number of tiles (tiling splits) of the fusion operator. Generally, the value is the total data volume divided by the data volume of a single communication round.
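The effect of tiling can be estimated with simple arithmetic. The following sketch computes commTurn as the total data volume divided by the single communication volume (rounded up, which is an assumption for the case where the division is not exact) and compares serial execution with an idealized two-stage pipeline in which the computation of each round waits for the communication of the same round, as in AllgatherMatmul. The model and all numbers are illustrative and do not describe the actual CANN scheduling.

```python
def comm_turn(total_bytes, bytes_per_round):
    """Number of rounds: total volume over per-round volume, rounded up (assumption)."""
    return (total_bytes + bytes_per_round - 1) // bytes_per_round


def serial_time(rounds, comm_ms, comp_ms):
    """Every round communicates, then computes, with no overlap."""
    return rounds * (comm_ms + comp_ms)


def pipelined_time(rounds, comm_ms, comp_ms):
    """Idealized pipeline: round i+1 communicates while round i computes."""
    return comm_ms + (rounds - 1) * max(comm_ms, comp_ms) + comp_ms


MIB = 1024 * 1024
turns = comm_turn(64 * MIB, 16 * MIB)   # hypothetical data volumes
serial = serial_time(turns, 2.0, 3.0)   # hypothetical per-round times in ms
pipelined = pipelined_time(turns, 2.0, 3.0)
print(turns, serial, pipelined)
```

In this model only the first communication round remains exposed when computation is the slower stage, which is the sense in which the pipeline hides communication time.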
In the MC2 implementation, two operators are loaded to the computation stream and communication stream, respectively. The two operators collaborate to implement parallel pipeline execution.
- The operator name corresponding to the computation stream is the name of the fusion operator, for example, AllgatherMatmul.
- The operator name corresponding to the communication stream is in the format of Fusion operator name+Aicpu, for example, AllgatherMatmulAicpu.
The communication operator performs multiple communication rounds based on the tiling of the fusion operator. In each round, the communication operator executes the collective communication algorithm based on the communication parameters delivered by the computation operator, orchestrates the specific tasks, delivers them to the hardware for execution, waits until the execution is complete, and notifies the computation side of the execution result.
- The MC2 fusion is not supported in the communication API scenario. The communication API scenario includes the MatmulAllReduce operator for low-bit communication and the custom MC2 operator that uses the communication API.
- The communication part of the timeline displays only level-0 data.
An example of the MC2 profile data result is as follows:
Figure 7 shows the MatmulAllReduceAddRmsNormAicpu fusion operator. For details about each stage, see Table 2.
| Field | Description |
|---|---|
| StartServer | KFC initialization time. |
| TaskWaitRequest | Wait for the computation operator to deliver communication parameters. |
| TaskOrchestration | The communication operator executes the collective communication algorithm and orchestrates and executes tasks. |
| TaskLaunch | Time required for issuing tasks. |
| TaskExecute | Time of waiting for the completion of a hardware task. |
| Finalize | KFC end process. |
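A per-stage breakdown helps show where the communication side of a fusion operator spends its time. The sketch below totals the duration of each stage in the table and its share of the whole. It assumes the stages appear in the timeline as Chrome trace complete events (`ph` equal to `X`) whose `name` is the stage name, which should be verified against a real file. The durations are hypothetical.

```python
from collections import defaultdict

# Stage names taken from the table above.
STAGES = (
    "StartServer", "TaskWaitRequest", "TaskOrchestration",
    "TaskLaunch", "TaskExecute", "Finalize",
)


def stage_breakdown(events):
    """Return {stage: (total_duration, share_of_total)} in table order."""
    totals = defaultdict(float)
    for e in events:
        if e.get("ph") == "X" and e.get("name") in STAGES:
            totals[e["name"]] += float(e["dur"])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {s: (totals[s], totals[s] / grand_total) for s in STAGES if s in totals}


# Hypothetical events for two communication rounds of one fusion operator.
sample = [
    {"ph": "X", "name": "StartServer", "ts": 0.0, "dur": 10.0},
    {"ph": "X", "name": "TaskWaitRequest", "ts": 10.0, "dur": 5.0},
    {"ph": "X", "name": "TaskOrchestration", "ts": 15.0, "dur": 15.0},
    {"ph": "X", "name": "TaskLaunch", "ts": 30.0, "dur": 5.0},
    {"ph": "X", "name": "TaskExecute", "ts": 35.0, "dur": 30.0},
    {"ph": "X", "name": "TaskExecute", "ts": 65.0, "dur": 30.0},
    {"ph": "X", "name": "Finalize", "ts": 95.0, "dur": 5.0},
]

breakdown = stage_breakdown(sample)
for stage, (duration, share) in breakdown.items():
    print(f"{stage}: {duration} ({share:.0%})")
```

A large TaskWaitRequest share would suggest that the communication side is waiting on the computation operator, while a large TaskExecute share points at the hardware tasks themselves.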



