Parsing Results

The parsing results are saved in the path specified by --output-path.

Table 1 Mapping between domains and parsing results

| Parsing Result | Domain |
| --- | --- |
| profiler.db | "BatchSchedule; ModelExecute; Request; KVCache" |
| chrome_tracing.json | No mandatory restriction. To view flow events between requests, you must profile the Request domain. |
| batch.csv | "BatchSchedule; ModelExecute" |
| kvcache.csv | "KVCache" |
| request.csv | "Request" |
| forward.csv | "BatchSchedule; ModelExecute" |
| pd_split_communication.csv | "Communication" |
| pd_split_kvcache.csv | "KVCache" |
| coordinator.csv | "Coordinator" |
| {host_name}_eplb_{i}_summed_hot_map_by_expert.png | "eplb_observe" |
| {host_name}_eplb_{i}_summed_hot_map_by_rank.png | "eplb_observe" |
| {host_name}_eplb_{i}_summed_hot_map_by_model_expert.png | "eplb_observe" |

The parsing results of the acl_prof_task_time_level, aclDataTypeConfig, and aclprofAicoreMetrics parameters are not listed in the preceding table. For details, see Profiling Description and op_summary (Operator Details); the actual results may vary. The op_statistic_*.csv and op_summary_*.csv files are written to the PROF_XXX directory under the path specified by --output-path. The profile data files collected using these three parameters are saved in the PROF_XXX/mindstudio_profiler_output directory under the directory specified by prof_dir.

The files are as follows:

profiler.db

SQLite database file used to generate line charts.

It contains the following tables:

Table 2 profiler.db

| Table Name | Description |
| --- | --- |
| batch | Displays batch table data on MindStudio Insight. |
| decode_gen_speed | Generates line charts showing average token latency at different time points in the decode phase. |
| first_token_latency | Generates line charts showing the time to first token (TTFT) of the serving framework. |
| kvcache | Generates line charts showing the KV cache memory usage during serving. |
| prefill_gen_speed | Generates line charts showing average token latency at different time points in the prefill phase. |
| req_latency | Generates line charts showing the end-to-end request latency of the serving framework. |
| request_status | Generates line charts showing the request status of the profile data at different time points. |
| request | Displays request table data on MindStudio Insight. |
| batch_exec | Displays the mapping between batches and model execution. |
| batch_req | Displays the mapping between batches and requests. |
| data_table | Displays table data on MindStudio Insight. |
| counter | Displays counter data in the trace view. |
| flow | Displays flow data in the trace view. |
| process | Displays secondary lane data in the trace view. |
| thread | Displays tertiary lane data in the trace view. |
| slice | Displays slice data in the trace view. |
| pd_split_kvcache | Displays KV cache table data of the decode node on MindStudio Insight, exclusive to prefill-decode (PD) disaggregation scenarios. |
| pd_split_communication | Displays communication table data between prefill and decode nodes on MindStudio Insight, exclusive to PD disaggregation scenarios. |
| ep_balance | Records load imbalance analysis results for the GroupedMatmul operator, profiled via MSPTI during DeepSeek MoE inference serving. |
| moe_analysis | Records fast/slow rank analysis results for the MoeDistributeCombine and MoeDistributeDispatch operators, profiled via MSPTI during DeepSeek MoE inference serving. |
| data_link | Enables drill-down on rid in the trace view to view request input length during the forward. |

profiler.db is intended for visualizing data in Grafana. The individual fields of each table are not described here.
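Because profiler.db is a SQLite file, it can also be inspected directly. The following is a minimal sketch using Python's standard sqlite3 module; the helper names list_tables and preview are illustrative and not part of the tool. Since the columns of each table are not documented, the sketch only lists the tables and previews a few rows.

```python
import sqlite3


def list_tables(db_path):
    """Return the names of all tables in a SQLite file, sorted by name."""
    conn = sqlite3.connect(db_path)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


def preview(db_path, table, limit=5):
    """Return (column_names, rows) for the first `limit` rows of `table`."""
    # Only accept names that really exist, because the table name is
    # interpolated into the SQL text below.
    if table not in list_tables(db_path):
        raise ValueError(f"no such table: {table}")
    conn = sqlite3.connect(db_path)
    try:
        cur = conn.execute(f'SELECT * FROM "{table}" LIMIT ?', (limit,))
        columns = [d[0] for d in cur.description]
        return columns, cur.fetchall()
    finally:
        conn.close()
```

For example, list_tables("profiler.db") should return names such as batch and request if the file matches Table 2. Note that sqlite3.connect creates an empty file when the path does not exist, so check the path first.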

chrome_tracing.json

Records trace data of inference serving requests. You can visualize this data using various tools. Refer to Data Visualization for more information.
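As a quick sanity check before opening the file in a viewer, the events can be counted by phase. This sketch assumes the file follows the Chrome Trace Event Format, as its name suggests: either a bare JSON array of events or an object with a traceEvents key, where each event carries a ph phase code. In that format, flow events use the phases s, t, and f. The sample string contains made-up values for illustration only.

```python
import json
from collections import Counter


def count_phases(trace_text):
    """Count trace events by their phase code ("ph")."""
    data = json.loads(trace_text)
    # The format allows a bare array or an object with a "traceEvents" list.
    events = data.get("traceEvents", []) if isinstance(data, dict) else data
    return Counter(ev.get("ph", "?") for ev in events if isinstance(ev, dict))


# Made-up sample: one complete event ("X") and one flow start/end pair.
SAMPLE = """[
  {"name": "forward", "ph": "X", "ts": 0, "dur": 5, "pid": 1, "tid": 1},
  {"name": "req", "ph": "s", "ts": 1, "id": 7, "pid": 1, "tid": 1},
  {"name": "req", "ph": "f", "ts": 4, "id": 7, "pid": 1, "tid": 2}
]"""

print(count_phases(SAMPLE))
```

If the counts for s and f are zero, the trace contains no flow events, which is consistent with the Request domain not having been profiled.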

batch.csv

Records detailed batch-level data for inference serving.

Table 3 batch.csv

| Field | Description |
| --- | --- |
| name | Batch grouping or execution. batchFrameworkProcessing refers to batch grouping, while modelExec refers to batch execution. |
| res_list | List of grouped batches. |
| start_time(ms) | Start time of batch grouping or execution, in milliseconds. |
| end_time(ms) | End time of batch grouping or execution, in milliseconds. |
| batch_size | Number of requests in a batch. |
| batch_type | Request status (prefill or decode) in a batch. |
| during_time(ms) | Execution time, in milliseconds. |
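The fields above can be aggregated with a few lines of Python. The sketch below averages during_time(ms) per combination of name and batch_type. It assumes a comma-delimited file whose header row uses exactly the field names in Table 3; the sample rows are made-up values for illustration only.

```python
import csv
import io
from collections import defaultdict
from statistics import mean


def mean_duration(stream):
    """Average during_time(ms), grouped by (name, batch_type)."""
    groups = defaultdict(list)
    for row in csv.DictReader(stream):
        key = (row["name"], row["batch_type"])
        groups[key].append(float(row["during_time(ms)"]))
    return {key: mean(values) for key, values in groups.items()}


# Made-up sample rows; real files are produced by the parsing command.
SAMPLE = (
    "name,res_list,start_time(ms),end_time(ms),batch_size,batch_type,during_time(ms)\n"
    "modelExec,[1],0,10,1,prefill,10\n"
    "modelExec,[1],12,16,1,decode,4\n"
    "modelExec,[1],18,24,1,decode,6\n"
)

print(mean_duration(io.StringIO(SAMPLE)))
```

To run it on a real file, pass an open file object instead, for example open("batch.csv", newline="").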

kvcache.csv

Records device memory usage during inference.

Table 4 kvcache.csv

| Field | Description |
| --- | --- |
| domain | KV cache event mark. |
| rid | Request ID. |
| timestamp(ms) | Time when the device memory usage changes, in milliseconds. |
| name | Method of changing the device memory usage. |
| device_kvcache_left | Number of remaining blocks in the device memory. |
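One simple use of this file is to find the moment of highest KV cache pressure, that is, the row with the fewest remaining blocks. The sketch assumes a comma-delimited file with the header names from Table 4; the sample rows, including the name values Allocate and Free, are made up for illustration.

```python
import csv
import io


def lowest_free_blocks(stream):
    """Return (timestamp_ms, blocks_left) for the row with the fewest blocks left."""
    rows = list(csv.DictReader(stream))
    if not rows:
        return None
    low = min(rows, key=lambda r: float(r["device_kvcache_left"]))
    return float(low["timestamp(ms)"]), float(low["device_kvcache_left"])


# Made-up sample rows for illustration only.
SAMPLE = (
    "domain,rid,timestamp(ms),name,device_kvcache_left\n"
    "KVCache,r1,100,Allocate,90\n"
    "KVCache,r1,120,Allocate,70\n"
    "KVCache,r1,150,Free,95\n"
)

print(lowest_free_blocks(io.StringIO(SAMPLE)))
```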

request.csv

Records detailed request-level data for inference serving.

Table 5 request.csv

| Field | Description |
| --- | --- |
| http_rid | HTTP request ID. |
| start_time(ms) | Request arrival time, in milliseconds. |
| recv_token_size | Input token length of a request. |
| reply_token_size | Output token length of a request. |
| execution_time(ms) | End-to-end request duration, in milliseconds. |
| queue_wait_time(ms) | Time for a request to wait in the queue throughout the entire inference process, including the time in the waiting and pending states, in milliseconds. |
| first_token_latency(ms) | TTFT, in milliseconds. |
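As an illustration of how these fields combine, the sketch below reports the median TTFT and the average share of end-to-end time that requests spent queuing. The "queue share" is a derived figure chosen for this example, not a metric the tool reports. A comma-delimited file with the header names from Table 5 is assumed, and the sample rows are made up.

```python
import csv
import io
from statistics import mean, median


def summarize_requests(stream):
    """Summarize TTFT and queue-wait share over all requests in the stream."""
    ttft, queue_share = [], []
    for row in csv.DictReader(stream):
        total = float(row["execution_time(ms)"])
        ttft.append(float(row["first_token_latency(ms)"]))
        if total > 0:  # skip degenerate rows to avoid dividing by zero
            queue_share.append(float(row["queue_wait_time(ms)"]) / total)
    return {
        "requests": len(ttft),
        "median_ttft_ms": median(ttft) if ttft else None,
        "mean_queue_share": mean(queue_share) if queue_share else None,
    }


# Made-up sample rows for illustration only.
SAMPLE = (
    "http_rid,start_time(ms),recv_token_size,reply_token_size,"
    "execution_time(ms),queue_wait_time(ms),first_token_latency(ms)\n"
    "a,0,10,20,100,25,40\n"
    "b,5,12,18,200,50,60\n"
)

print(summarize_requests(io.StringIO(SAMPLE)))
```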

forward.csv

Records detailed execution data during the model forward in inference serving.

Table 6 forward.csv

| Field | Description |
| --- | --- |
| name | Forward event mark, which indicates the forward process of the model. |
| relative_start_time(ms) | Time elapsed since the initial forward on each device, in milliseconds. |
| start_time(ms) | Forward start time, in milliseconds. |
| end_time(ms) | Forward end time, in milliseconds. |
| during_time(ms) | Forward execution time, in milliseconds. |
| bubble_time(ms) | Bubble time between forwards, in milliseconds. |
| batch_size | Number of requests per forward. |
| batch_type | Request status in the forward. |
| forward_iter | Step ID of the forward across ranks. |
| dp_rank | DP information of the forward. The values for the same DP domain are the same. |
| prof_id | Rank ID. The values for the same rank are the same. |
| hostname | Host name. The values for the same device are the same. |
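A common question for forward data is how much time each rank spends idle between forwards. The sketch below computes a bubble ratio per (hostname, prof_id), defined here as total bubble time divided by total bubble plus execution time. This definition is an assumption made for the example, as are the comma-delimited layout and the made-up sample rows; an empty bubble_time(ms) cell is treated as zero.

```python
import csv
import io
from collections import defaultdict


def bubble_ratio(stream):
    """Return {(hostname, prof_id): bubble / (bubble + execution)}."""
    totals = defaultdict(lambda: [0.0, 0.0])  # [execution_ms, bubble_ms]
    for row in csv.DictReader(stream):
        key = (row["hostname"], row["prof_id"])
        totals[key][0] += float(row["during_time(ms)"])
        totals[key][1] += float(row["bubble_time(ms)"] or 0)
    return {
        key: bubble / (execution + bubble)
        for key, (execution, bubble) in totals.items()
        if execution + bubble > 0
    }


# Made-up sample rows for illustration only.
SAMPLE = (
    "name,relative_start_time(ms),start_time(ms),end_time(ms),during_time(ms),"
    "bubble_time(ms),batch_size,batch_type,forward_iter,dp_rank,prof_id,hostname\n"
    "forward,0,100,108,8,2,4,decode,0,0,0,node-a\n"
    "forward,10,110,118,8,2,4,decode,1,0,0,node-a\n"
)

print(bubble_ratio(io.StringIO(SAMPLE)))
```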

pd_split_communication.csv

Records communication data in PD disaggregation scenarios. PD disaggregation works in cluster scenarios with multiple nodes and ranks. It requires using the shared configuration file during profiling (see Profiling).

For details about PD disaggregation and related concepts, see "Cluster Service Deployment" > "Deploying the Prefill-Decode Disaggregation Service" in MindIE Motor Development Guide.

Table 7 pd_split_communication.csv

| Field | Description |
| --- | --- |
| rid | Request ID. |
| http_req_time(ms) | Request arrival time, in milliseconds. |
| send_request_time(ms) | Time when the prefill node starts to send a request to the decode node, in milliseconds. |
| send_request_succ_time(ms) | Time when the request is successfully sent, in milliseconds. |
| prefill_res_time(ms) | Time when prefill completes, in milliseconds. |
| request_end_time(ms) | Time when the request execution ends, in milliseconds. |
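Because every field in this file is a point in time, durations have to be derived by subtraction. The sketch below derives two of them per request: the time taken to send the request to the decode node, and the time from arrival to the end of execution. It assumes a comma-delimited file with the header names from Table 7 and that all timestamps share one clock; the sample row is made up.

```python
import csv
import io


def derive_durations(stream):
    """Return {rid: {"send_ms": ..., "total_ms": ...}} from the timestamps."""
    result = {}
    for row in csv.DictReader(stream):
        send = float(row["send_request_succ_time(ms)"]) - float(row["send_request_time(ms)"])
        total = float(row["request_end_time(ms)"]) - float(row["http_req_time(ms)"])
        result[row["rid"]] = {"send_ms": send, "total_ms": total}
    return result


# Made-up sample row for illustration only.
SAMPLE = (
    "rid,http_req_time(ms),send_request_time(ms),send_request_succ_time(ms),"
    "prefill_res_time(ms),request_end_time(ms)\n"
    "r1,0,50,53,48,300\n"
)

print(derive_durations(io.StringIO(SAMPLE)))
```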

pd_split_kvcache.csv

Records the KV cache transfer between prefill and decode nodes during inference based on PD disaggregation. PD disaggregation works in cluster scenarios with multiple nodes and ranks. It requires using the shared configuration file during profiling (see Profiling).

For details about PD disaggregation and related concepts, see "Cluster Service Deployment" > "Deploying the Prefill-Decode Disaggregation Service" in MindIE Motor Development Guide.

Table 8 pd_split_kvcache.csv

| Field | Description |
| --- | --- |
| domain | PullKVCache event mark. |
| rank | Device ID. |
| rid | Request ID. |
| block_tables | block_tables information. |
| seq_len | Request length. |
| during_time(ms) | Time taken to transfer the KV cache from the prefill node to the decode node, in milliseconds. |
| start_datetime(ms) | Start time of the KV cache transfer from the prefill node to the decode node, displayed as a date and time, in milliseconds. |
| end_datetime(ms) | End time of the KV cache transfer from the prefill node to the decode node, displayed as a date and time, in milliseconds. |
| start_time(ms) | Start time of the KV cache transfer from the prefill node to the decode node, displayed as a timestamp, in milliseconds. |
| end_time(ms) | End time of the KV cache transfer from the prefill node to the decode node, displayed as a timestamp, in milliseconds. |

coordinator.csv

Records changes in the number of requests distributed to each node during inference based on PD disaggregation. PD disaggregation works in cluster scenarios with multiple nodes and ranks. It requires using the shared configuration file during profiling (see Profiling).

For details about PD disaggregation and related concepts, see "Cluster Service Deployment" > "Deploying the Prefill-Decode Disaggregation Service" in MindIE Motor Development Guide.

Table 9 coordinator.csv

| Field | Description |
| --- | --- |
| time | Time when the number of requests changes. |
| address | Address of the node to which requests are distributed, in the format IP address:port number. |
| node_type | Node type (prefill or decode). |
| add_count | Number of added requests on the current node. |
| end_count | Number of ended requests on the current node. |
| running_count | Number of running requests on the current node. |
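To see how evenly the coordinator spread the load, the peak running_count can be taken per node. The sketch assumes a comma-delimited file with the header names from Table 9. The sample rows are made up, and because the format of the time column is not specified here, the sample uses placeholder values for it.

```python
import csv
import io


def peak_running(stream):
    """Return {(node_type, address): highest running_count seen}."""
    peaks = {}
    for row in csv.DictReader(stream):
        key = (row["node_type"], row["address"])
        peaks[key] = max(peaks.get(key, 0), int(row["running_count"]))
    return peaks


# Made-up sample rows; "t0".."t2" are placeholders for the time column.
SAMPLE = (
    "time,address,node_type,add_count,end_count,running_count\n"
    "t0,10.0.0.1:1025,prefill,1,0,1\n"
    "t1,10.0.0.1:1025,prefill,3,1,2\n"
    "t2,10.0.0.2:1025,decode,1,0,1\n"
)

print(peak_running(io.StringIO(SAMPLE)))
```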

ep_balance.csv

Records load imbalance analysis results for the GroupedMatmul operator, profiled via MSPTI during DeepSeek MoE inference serving.

Whenever ep_balance profile data is available, executing the parsing command automatically generates a heatmap in the output directory. See Figure 1. In this heatmap, the x-axis represents the process ID for each device, while the y-axis represents the decoder layer of the model. Brighter pixels indicate longer duration. Greater color variation within a row (that is, across processes for the same decoder layer) indicates more pronounced load imbalance.

Table 10 ep_balance.csv

| Field | Description |
| --- | --- |
| <Process ID> (row header) | Process ID of each device at runtime. |
| <Decoder Layer> (column value) | Decoder layer index of the model running on each device. |

Figure 1 ep_balance.png
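The visual impression of "color variation" in the heatmap can be turned into a number. The sketch below uses the ratio of the slowest process to the mean for each decoder layer, where 1.0 means perfectly balanced. This metric is one possible choice for illustration, not the one the tool uses. To avoid guessing the exact CSV layout, the function works on an in-memory matrix indexed as durations[layer][process], with made-up values.

```python
from statistics import mean


def imbalance_by_layer(durations):
    """Return max/mean duration for each layer; 1.0 means balanced."""
    ratios = []
    for row in durations:
        avg = mean(row)
        # A layer with no recorded time carries no imbalance information.
        ratios.append(max(row) / avg if avg > 0 else 1.0)
    return ratios


# Made-up durations: layer 0 is balanced, layer 1 has one slow process.
SAMPLE = [
    [10, 10, 10, 10],
    [10, 10, 10, 30],
]

print(imbalance_by_layer(SAMPLE))
```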

moe_analysis.csv

Records fast/slow rank analysis results for the MoeDistributeCombine and MoeDistributeDispatch operators, profiled via MSPTI during DeepSeek MoE inference serving.

Whenever the moe_analysis profile data is available, executing the parsing command will automatically generate a box plot in the output directory. See Figure 2. The x-axis represents the process ID for each device, while the y-axis represents the total execution duration. The plot displays the mean and the 2.5th/97.5th percentiles of the total execution duration. Greater disparity between ranks (wider percentile intervals) indicates more pronounced fast/slow rank issues.

Table 11 moe_analysis.csv

| Field | Description |
| --- | --- |
| Dataset | Process ID of the corresponding device. |
| Mean | Mean total duration of the MoeDistributeCombine and MoeDistributeDispatch operators on this device. |
| CI Lower | 2.5th percentile of the total duration for the MoeDistributeCombine and MoeDistributeDispatch operators on this device. |
| CI Upper | 97.5th percentile of the total duration for the MoeDistributeCombine and MoeDistributeDispatch operators on this device. |

Figure 2 moe_analysis.png
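The same comparison the box plot supports visually can be done on the CSV. The sketch below ranks processes by mean duration and reports the width of each percentile interval (CI Upper minus CI Lower). It assumes a comma-delimited file with the header names from Table 11; the sample rows are made up.

```python
import csv
import io


def rank_by_mean(stream):
    """Return [(dataset, mean, interval_width)], slowest process first."""
    rows = []
    for row in csv.DictReader(stream):
        width = float(row["CI Upper"]) - float(row["CI Lower"])
        rows.append((row["Dataset"], float(row["Mean"]), width))
    return sorted(rows, key=lambda item: item[1], reverse=True)


# Made-up sample rows for illustration only.
SAMPLE = (
    "Dataset,Mean,CI Lower,CI Upper\n"
    "1001,5.0,4.0,6.0\n"
    "1002,9.0,5.0,14.0\n"
)

print(rank_by_mean(io.StringIO(SAMPLE)))
```

A process that appears at the top of this list with a much wider interval than the others is a candidate slow rank.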

request_status.csv

Records the request status at each moment during inference serving (number of requests in the waiting, running, or swapped state). This data can be used to generate line charts that visualize request status trends over time.

Table 12 request_status.csv

| Field | Description |
| --- | --- |
| hostuid | Node ID. |
| pid | Process ID. |
| timestamp(ms) | Timestamp, in milliseconds. |
| relative_timestamp(ms) | Relative timestamp, in milliseconds. |
| waiting | Number of requests in the waiting state. |
| running | Number of requests in the running state. |
| swapped | Number of requests in the swapped state. |

{host_name}_eplb_{i}_summed_hot_map_by_expert.png

This is an expert hotspot heatmap. In Figure 3, pixel brightness reflects hotspot intensity (see the colorbar on the right); that is, brighter pixels signify higher heat.

  • host_name indicates the name of the device where data is located.
  • i indicates the number of load balancing table updates during the serving profiling period when dynamic load balancing is enabled on MindIE. If dynamic load balancing is disabled, i is 0.
Figure 3 Heatmap

The x-axis represents the expert ID, while the y-axis represents the MoE layer of the model.

In the model instance, Rank_ID is sorted in ascending order, with experts indexed sequentially within each rank. For example, in a configuration with 16 ranks and 17 experts per rank, expert ID 42 corresponds to expert_7 (the 8th expert) on Rank_2 (the 3rd rank).
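The mapping from a global expert ID to a rank and a position within that rank is an integer division. The sketch below makes the arithmetic explicit. Whether expert IDs on the x-axis are counted from 0 or from 1 is not stated here, so id_base is a hypothetical parameter: with IDs counted from 1, expert ID 42 maps to expert_7 on Rank_2 as in the example above, whereas with IDs counted from 0 it maps to expert_8 on Rank_2.

```python
def locate_expert(expert_id, experts_per_rank, id_base=0):
    """Return (rank, local_index), both zero-based, for a global expert ID.

    id_base is the number the global expert IDs start from (an assumption;
    pass 1 if the IDs are counted from 1).
    """
    offset = expert_id - id_base
    if offset < 0 or experts_per_rank <= 0:
        raise ValueError("expert_id or experts_per_rank out of range")
    return divmod(offset, experts_per_rank)


print(locate_expert(42, 17))             # IDs counted from 0
print(locate_expert(42, 17, id_base=1))  # IDs counted from 1
```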

{host_name}_eplb_{i}_summed_hot_map_by_rank.png

This is an expert hotspot heatmap. In Figure 4, pixel brightness reflects hotspot intensity (see the colorbar on the right); that is, brighter pixels signify higher heat.

  • host_name indicates the name of the device where the expert is located.
  • i indicates the number of load balancing table updates during the serving profiling period when dynamic load balancing is enabled on MindIE. If dynamic load balancing is disabled, i is 0.
Figure 4 Heatmap

The x-axis represents the rank ID, while the y-axis represents the MoE layer of the model.

{host_name}_eplb_{i}_summed_hot_map_by_model_expert.png

This is an expert hotspot heatmap. In Figure 5, pixel brightness reflects hotspot intensity (see the colorbar on the right); that is, brighter pixels signify higher heat.

  • host_name indicates the name of the device where the expert is located.
  • i indicates the number of load balancing table updates during the serving profiling period when dynamic load balancing is enabled on MindIE. If dynamic load balancing is disabled, i is 0.
  • This heatmap is generated only when the dynamic load balancing feature of MindIE is enabled.
Figure 5 Heatmap

The x-axis represents the expert ID, with shared experts positioned at the end of the sequence. The y-axis represents the MoE layer of the model.