msprof (Timeline Report)

Availability

Atlas 200/500 A2 Inference Product

Atlas Inference Series Product

Atlas Training Series Product

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product

Atlas A3 Training Series Product

Timeline report: msprof_*.json.

The following figure shows a sample msprof_*.json file opened in chrome://tracing.

Figure 1 Timeline summary display

As shown in Figure 1, the timeline summary data is displayed in the following areas:

Area 1: data at the application layer, including the time consumption data of upper-layer applications. This data is collected only in msproftx or PyTorch scenarios.
Area 2: data at the CANN layer, including the time consumption data of components (such as Runtime) and nodes (operators).
Area 3: bottom-layer NPU data, including the time consumption data and iteration trace data of each task stream under Ascend Hardware, Communication and Overlap Analysis communication data, and other Ascend AI Processor system data.
Area 4: details about each operator and API in a timeline (displayed when you click a timeline).

Data of the timeline report is described in detail in Profile Data File References.
The data in each area of the figure above depends on the profiling scenario. Area 1 is generated only when data is profiled in msproftx or PyTorch scenarios, and the communication data under Communication and Overlap Analysis is collected only in multi-rank, multi-node, and cluster communication scenarios. The data actually collected prevails.
The msprof_*.json file displays data within iterations. Data outside iterations is not displayed.
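
Because the report opens in chrome://tracing, it can presumably also be read by a script. The following is a minimal sketch that sums event durations per process (layer). It assumes the file follows the Chrome Trace Event Format, with "M" metadata events that name each process and "X" complete events that carry a "dur" field. The operator names, process IDs, and durations in the sample are made up for illustration and are not taken from a real msprof_*.json file.

```python
import json
from collections import defaultdict


def load_events(path):
    """Return the list of trace events from a Chrome-trace-style JSON file."""
    with open(path, encoding="utf-8") as f:
        data = json.load(f)
    # The Trace Event Format allows a bare list or {"traceEvents": [...]}.
    return data["traceEvents"] if isinstance(data, dict) else data


def busy_time_by_process(events):
    """Sum the durations of complete ("X") events per process name."""
    names = {}
    for ev in events:
        # Metadata events give each pid a readable name (the layer label).
        if ev.get("ph") == "M" and ev.get("name") == "process_name":
            names[ev["pid"]] = ev["args"]["name"]
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X":
            label = names.get(ev["pid"], str(ev["pid"]))
            totals[label] += ev.get("dur", 0)
    return dict(totals)


# Hypothetical miniature trace; every value below is invented.
sample = [
    {"ph": "M", "name": "process_name", "pid": 1, "args": {"name": "CANN"}},
    {"ph": "M", "name": "process_name", "pid": 2,
     "args": {"name": "Ascend Hardware"}},
    {"ph": "X", "name": "aclnnMatmul", "pid": 1, "tid": 10,
     "ts": 100.0, "dur": 5.0},
    {"ph": "X", "name": "MatMul", "pid": 2, "tid": 3,
     "ts": 120.0, "dur": 40.0},
    {"ph": "X", "name": "Add", "pid": 2, "tid": 3,
     "ts": 165.0, "dur": 10.0},
]
print(busy_time_by_process(sample))
```

For a real report, pass the result of load_events() to busy_time_by_process() instead of the inline sample.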

Operator Delivery Direction Check

When viewing a .json file in chrome://tracing, enable the options under Flow events. Connection lines then show the delivery and execution mappings between application-layer operators and NPU operators. See Figure 2.

The mappings include:

async_npu: delivery and execution mapping from application-layer operators to NPU operators on Ascend Hardware.
MsTx: delivery and execution mapping from training/inference process marker tasks to NPU marker operators on Ascend Hardware. This mapping is generated when the aclprofMarkEx API is called to record markers.
async_task_queue: mapping from enqueuing to dequeuing at the application layer.
HostToDevice: delivery and execution mapping from CANN-layer nodes (operators) to NPU operators on Ascend Hardware (host to device).
HostToDevice: delivery and execution mapping from CANN-layer nodes (operators) to communication operators (host to device).
fwdbwd: mapping from forward APIs to backward APIs.

Because the Ascend AI Processor frequency measured by software deviates from the actual frequency, and because of the time synchronization error between the host and device, lower-layer operators may appear misplaced on the timeline. In that case, the connection lines may fail to attach to them.

Whether mappings between layers are displayed depends on whether the data is collected in a specific scenario.

Figure 2 Operator mappings

You can click the operator or API at each end of a connection line to view the operator delivery direction. See Figure 3.

Figure 3 Operator information

View the inbound and outbound directions of an operator or API in the Event(s) column. View the information at both ends of a mapping in the Link column.
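
The connection lines can also be inspected outside the viewer. The sketch below pairs the two ends of each line and reports the delivery-to-execution latency. It assumes the mappings are stored as Chrome Trace Event Format flow events ("s" for the start, "f" for the end, matched on "id") and that the mapping name (for example, HostToDevice) is held in the "cat" field. Both points are assumptions about the file layout, and the sample values are invented.

```python
def pair_flows(events):
    """Pair flow start ("s") and finish ("f") events that share (cat, id).

    Returns (links, unmatched): links are (cat, start_ts, end_ts, latency)
    tuples, and unmatched lists the start events that never found an end.
    """
    flows = [ev for ev in events if ev.get("ph") in ("s", "f")]
    starts, links = {}, []
    for ev in sorted(flows, key=lambda e: e["ts"]):
        key = (ev.get("cat"), ev.get("id"))
        if ev["ph"] == "s":
            starts[key] = ev
        elif key in starts:
            src = starts.pop(key)
            links.append((ev.get("cat"), src["ts"], ev["ts"],
                          ev["ts"] - src["ts"]))
    return links, list(starts.values())


# Hypothetical flow events; IDs and timestamps are made up.
sample = [
    {"ph": "s", "cat": "HostToDevice", "id": "1", "ts": 100.0,
     "pid": 1, "tid": 10},
    {"ph": "f", "cat": "HostToDevice", "id": "1", "ts": 130.0,
     "pid": 2, "tid": 3},
    {"ph": "s", "cat": "async_npu", "id": "7", "ts": 90.0,
     "pid": 0, "tid": 1},
]
links, unmatched = pair_flows(sample)
for cat, start, end, latency in links:
    print(f"{cat}: delivered at {start}, executed at {end}, gap {latency}")
print("starts without an end:", [ev["cat"] for ev in unmatched])
```

Start events left in the unmatched list have no counterpart in the file at all, for example because the corresponding data was not collected. This differs from the misplacement case described above, where both ends exist but the viewer cannot attach the line.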

AI Core Frequency Viewing

Availability:

Atlas 200/500 A2 Inference Product
Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
Atlas A3 Training Series Product

The AI Core Freq layer in the msprof_*.json file displays the frequency changes of AI Cores during AI task running, as shown in Figure 4.

Figure 4 AI Core Frequency Viewing

At timestamp 148089.72045898438, the AI Cores were operating at a high frequency. At timestamp 170178.44116210938, the frequency dropped, which indicates that AI task performance deteriorated during that period. The AI Core frequency can be lowered for two reasons. First, when the temperature rises and the built-in protection mechanism is activated, the frequency may be reduced to prevent overheating. Second, when no AI task is in progress and the AI Cores enter a low-power state, the frequency may also be lowered.

When the frequency changes, there is a delay of 0 ms to 1 ms between the actual frequency change time and the time monitored by the software. Because of this delay, the operator execution time reported just before and after a frequency change may be inconsistent with the actual time.
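
Frequency changes can be located programmatically rather than by eye. The following sketch assumes that the AI Core Freq layer is stored as Chrome Trace Event Format counter events ("ph" equal to "C") named "AI Core Freq", each with a single numeric value in "args". The event name, the "MHz" key, and the frequency values 1800 and 800 are hypothetical. Only the two timestamps are taken from the example above.

```python
def frequency_changes(events, counter_name="AI Core Freq"):
    """Return (timestamp, old_value, new_value) for every frequency change."""
    samples = sorted(
        # Assumption: each counter event carries exactly one series value.
        (ev["ts"], next(iter(ev["args"].values())))
        for ev in events
        if ev.get("ph") == "C" and ev.get("name") == counter_name
    )
    changes = []
    for (_, old), (ts, new) in zip(samples, samples[1:]):
        if new != old:
            changes.append((ts, old, new))
    return changes


# Hypothetical counter samples; the frequency values are invented.
sample = [
    {"ph": "C", "name": "AI Core Freq", "ts": 148089.72045898438,
     "args": {"MHz": 1800}},
    {"ph": "C", "name": "AI Core Freq", "ts": 170178.44116210938,
     "args": {"MHz": 800}},
]
for ts, old, new in frequency_changes(sample):
    direction = "dropped" if new < old else "rose"
    print(f"frequency {direction} from {old} to {new} at {ts}")
```

Operators that run close to a reported change should be interpreted with the monitoring delay described above in mind.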

SIO Data Analysis

Availability:

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product (the data is 0 and should not be used as a reference)
Atlas A3 Training Series Product

The SIO layer in the msprof_*.json file displays the transmission bandwidth between dies in the Atlas A3 Training Series Product.

Figure 5 SIO

In the figure, the horizontal axis of each color block represents time (unit: ms), and the vertical axis represents the bandwidth value (unit: MB/s).

**Table 1** Field description

| Field | Description |
|---|---|
| dat_rx | RX bandwidth of the data stream channel. |
| dat_tx | TX bandwidth of the data stream channel. |
| req_rx | RX bandwidth of the request stream channel. |
| req_tx | TX bandwidth of the request stream channel. |
| rsp_rx | RX bandwidth of the response stream channel. |
| rsp_tx | TX bandwidth of the response stream channel. |
| snp_rx | RX bandwidth of the monitor stream channel. |
| snp_tx | TX bandwidth of the monitor stream channel. |
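
Given the units above (time in ms, bandwidth in MB/s), the data volume transferred on a channel can be estimated by integrating the bandwidth curve. The sketch below assumes that each sample holds its value until the next sample, which matches how counter tracks are usually drawn as steps. The sample points are invented, and how the samples are extracted from the file is left out.

```python
def transferred_mb(samples):
    """Integrate (time_ms, bandwidth_mb_per_s) samples with a step-hold rule.

    Each bandwidth value is assumed to stay constant until the next sample.
    """
    total = 0.0
    for (t0, bw), (t1, _) in zip(samples, samples[1:]):
        total += bw * (t1 - t0) / 1000.0  # ms -> s
    return total


def average_bandwidth(samples):
    """Average bandwidth in MB/s over the sampled interval."""
    span_s = (samples[-1][0] - samples[0][0]) / 1000.0
    return transferred_mb(samples) / span_s if span_s > 0 else 0.0


# Hypothetical dat_tx samples: (time in ms, bandwidth in MB/s).
dat_tx = [(0.0, 1000.0), (10.0, 2000.0), (30.0, 0.0)]
print(transferred_mb(dat_tx))     # 1000 MB/s for 10 ms + 2000 MB/s for 20 ms
print(average_bandwidth(dat_tx))
```

The same calculation applies to any field in Table 1, and to the QoS bandwidth data, which uses the same units.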

QoS Data Analysis

The QoS layer in the msprof_*.json file displays the device QoS bandwidth information.

Availability:

Atlas A2 Training Series Product/Atlas 800I A2 Inference Product
Atlas A3 Training Series Product

Figure 6 QoS OTHERS

In the figure, the horizontal axis of each color block represents time (unit: ms), and the vertical axis represents the bandwidth value (unit: MB/s).

Computation and Communication Operator Fusion (MC²)

Availability:

Atlas Inference Series Product
Atlas A2 Training Series Product/Atlas 800I A2 Inference Product

This data is generated in scenarios where computation and communication operators are fused.

MC² (Matrix Computation & Communication) is the general name for a series of computation and communication fusion operators in CANN. It fuses a communication operator and a computation operator that would otherwise run serially, and uses tiling to split them into multiple rounds of communication and computation. The rounds form a parallel pipeline, which hides the communication time and improves the overall execution performance.

Generally, a fusion operator is named after the original computation and communication operators and their dependency. For example, the AllgatherMatmul fusion operator fuses the communication operator Allgather with the computation operator Matmul, and Matmul depends on the Allgather output.

commTurn: number of tiles into which the fusion operator is split, that is, the number of communication rounds. Generally, the value is the total data volume divided by the data volume of a single communication.
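
The effect of tiling can be illustrated with a simple timing model. The sketch below computes commTurn from the two data volumes and compares serial execution with an idealized two-stage pipeline, in which the computation of one round overlaps the communication of the next. Rounding up with ceil, the per-round times, and the pipeline formula are illustrative assumptions and do not describe the actual CANN scheduling.

```python
import math


def comm_turn(total_bytes, bytes_per_round):
    """Number of rounds: total volume divided by the per-round volume.

    Rounding up is an assumption for volumes that do not divide evenly.
    """
    return math.ceil(total_bytes / bytes_per_round)


def serial_time(turns, comm, comp):
    """Every round communicates, then computes, with no overlap."""
    return turns * (comm + comp)


def pipelined_time(turns, comm, comp):
    """Idealized pipeline for a communicate-then-compute operator.

    The first communication and the last computation cannot be hidden;
    in between, each step costs the slower of the two stages.
    """
    return comm + (turns - 1) * max(comm, comp) + comp


# Hypothetical numbers: 64 MiB in total, 16 MiB per round,
# 2.0 ms of communication and 3.0 ms of computation per round.
turns = comm_turn(64 * 2**20, 16 * 2**20)
print(turns)
print(serial_time(turns, 2.0, 3.0))
print(pipelined_time(turns, 2.0, 3.0))
```

In this model, the gain grows with the number of rounds and is largest when the communication and computation times per round are similar.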

In the MC² implementation, two operators are loaded to the computation stream and communication stream, respectively. The two operators collaborate to implement parallel pipeline execution.

The operator name corresponding to the computation stream is the name of the fusion operator, for example, AllgatherMatmul.
The operator name corresponding to the communication stream is in the format of Fusion operator name+Aicpu, for example, AllgatherMatmulAicpu.

The communication operator performs multiple rounds of communication based on the tiling of the fusion operator. In each round, the communication operator executes the collective communication algorithm based on the communication parameters delivered by the computation operator, orchestrates the specific tasks, delivers them to the hardware for execution, waits until the execution is complete, and notifies the computation side of the execution result.

The MC² fusion is not supported in the communication API scenario. The communication API scenario includes the MatmulAllReduce operator for low-bit communication and the custom MC² operator that uses the communication API.
The communication part of the timeline displays only level-0 data.

An example of the MC² profile data result is as follows:

Figure 7 MC²

Figure 7 shows the MatmulAllReduceAddRmsNormAicpu fusion operator. For details about each stage, see Table 2.

**Table 2** Field description

| Field | Description |
|---|---|
| StartServer | KFC initialization time. |
| TaskWaitRequest | Wait for the computation operator to deliver communication parameters. |
| TaskOrchestration | The communication operator executes the collective communication algorithm and orchestrates and executes tasks. |
| TaskLaunch | Time required for issuing tasks. |
| TaskExecute | Time of waiting for the completion of a hardware task. |
| Finalize | KFC end process. |
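
To see where the communication side of a fusion operator spends its time, the stage durations can be added up per stage. The sketch below assumes that each stage in Table 2 appears in the file as a Chrome Trace Event Format complete event ("ph" equal to "X") whose "name" matches the field name exactly. That layout is an assumption, and the durations in the sample are invented.

```python
from collections import defaultdict

# Stage names as listed in Table 2.
STAGES = ("StartServer", "TaskWaitRequest", "TaskOrchestration",
          "TaskLaunch", "TaskExecute", "Finalize")


def stage_breakdown(events):
    """Return {stage: (total_duration, share_of_total)} for Table 2 stages."""
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and ev.get("name") in STAGES:
            totals[ev["name"]] += ev.get("dur", 0.0)
    grand = sum(totals.values())
    return {
        stage: (totals[stage], totals[stage] / grand if grand else 0.0)
        for stage in STAGES if stage in totals
    }


# Hypothetical events for two communication rounds.
sample = [
    {"ph": "X", "name": "StartServer", "ts": 0.0, "dur": 10.0},
    {"ph": "X", "name": "TaskWaitRequest", "ts": 10.0, "dur": 20.0},
    {"ph": "X", "name": "TaskOrchestration", "ts": 30.0, "dur": 5.0},
    {"ph": "X", "name": "TaskLaunch", "ts": 35.0, "dur": 5.0},
    {"ph": "X", "name": "TaskExecute", "ts": 40.0, "dur": 25.0},
    {"ph": "X", "name": "TaskWaitRequest", "ts": 65.0, "dur": 10.0},
    {"ph": "X", "name": "TaskExecute", "ts": 75.0, "dur": 15.0},
    {"ph": "X", "name": "Finalize", "ts": 90.0, "dur": 10.0},
]
for stage, (total, share) in stage_breakdown(sample).items():
    print(f"{stage}: {total} ({share:.0%})")
```

A large TaskWaitRequest share suggests that the communication side is waiting for the computation operator, while a large TaskExecute share points to the hardware communication tasks themselves.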
