step_trace (Iteration Trace Information)
Timeline information for iteration trace data is provided in the step_trace_*.json file, and summary statistics are provided in the step_trace_*.csv file, so that time-consuming iterations can be identified.
Availability
This profile data file does not exist in single-operator scenarios (such as the PyTorch scenario).
step_trace_*.json File
Iteration trace data is stored in step_trace_*.json. You can identify the iteration that takes the longest time based on the iteration duration.
The file content is formatted as follows:
Iteration trace data records the software status of a training job and of the Ascend AI Software Stack, and can be used to analyze training performance. If the default two-segment gradient segmentation policy is applied, the iteration trace points fp_start, bp_end, Reduce Start, and Reduce Duration (us) of a training job are printed to describe the job execution status in an iteration.
In offline inference scenarios, FP (the start point of the forward propagation operator in iteration traces) and BP (the end point of the backward propagation operator in iteration traces) are not collected. In the collection result, FP Start and BP End are displayed as N/A, and no timeline exists.

As shown in the preceding figure, to determine the gradient segmentation policy, calculate the difference between bp_end and allreduce1_end as follows: (BP End – Reduce End)/freq. (Based on the obtained iteration traces, the first batch of HCCS time is used for the calculation.)
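The calculation above can be sketched as follows. All numbers and the timestamp frequency `freq` are hypothetical placeholders; in a real trace, bp_end and the Reduce end point would come from the step_trace_*.json fields described below:

```python
# Sketch of the gradient-segmentation check described above.
# All values are hypothetical placeholders, not from a real trace.
bp_end = 1_000_500          # BP End timestamp (raw ticks, assumed)
reduce_start = 1_000_100    # Reduce Start timestamp (raw ticks, assumed)
reduce_duration = 350       # Reduce Duration (raw ticks, assumed)
freq = 100.0                # timestamp frequency (ticks per us, assumed)

# Reduce End is the start of the first Reduce segment plus its duration.
reduce_end = reduce_start + reduce_duration
gap_us = (bp_end - reduce_end) / freq  # (BP End - Reduce End)/freq
print(f"gap between BP end and first Reduce end: {gap_us:.2f} us")
```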
| Field | Description |
|---|---|
| Title | API name of a component. |
| Start | Start point on the timeline, automatically aligned with the Chrome trace timeline (ms). |
| Wall Duration | Time taken by the calls to an API (ms). |
| Iteration ID | Iteration ID for graph-based statistics collection. The iteration ID increases by 1 each time a graph is executed. When a script is compiled into multiple graphs, the iteration ID differs from the step ID at the script layer. |
| FP Start | FP start time (ns). |
| Iteration End | End time of each iteration (ns). |
| Iteration Time(ns) | Iteration duration (ns). |
| BP End | BP end time (ns). |
| FP_BP Time | FP/BP elapsed time (= BP End – FP Start) (ns). |
| Iteration Refresh | Iteration refresh lag (= Iteration End – BP End) (ns). |
| Data_aug Bound | Data augmentation overhead (= Current FP Start – Previous Iteration End). The value for iteration 0 is N/A because there is no previous Iteration End. |
| Reduce | Collective communication elapsed time (may span groups of iterations). ph:B indicates the start time, and ph:E indicates the end time. If only one device is used, no Reduce data is output. |
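As a sketch of how the fields above might be consumed to find the slowest iteration, the snippet below scans Chrome-trace-style events for the `Iteration Time(ns)` field. The event layout (a list of dicts with `name` and `args` keys, as loaded by `json.load` from a step_trace_*.json file) is an assumption for illustration, not a documented schema:

```python
# Hypothetical Chrome-trace-style events, standing in for the parsed content
# of a step_trace_*.json file (e.g. via json.load); the schema is an assumption.
events = [
    {"name": "Iteration Time(ns)", "args": {"Iteration ID": 1, "Iteration Time(ns)": 950_000}},
    {"name": "Iteration Time(ns)", "args": {"Iteration ID": 2, "Iteration Time(ns)": 1_200_000}},
    {"name": "Iteration Time(ns)", "args": {"Iteration ID": 3, "Iteration Time(ns)": 980_000}},
]

def slowest_iteration(trace_events):
    """Return (iteration_id, duration_ns) of the longest iteration."""
    iters = [e["args"] for e in trace_events if e.get("name") == "Iteration Time(ns)"]
    worst = max(iters, key=lambda a: a["Iteration Time(ns)"])
    return worst["Iteration ID"], worst["Iteration Time(ns)"]

iter_id, dur = slowest_iteration(events)
print(f"slowest iteration: {iter_id} ({dur} ns)")
```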
Data Read Time Analysis
You can use the GetNext time segments to determine whether the interval between the end of the previous iteration and the start of the current iteration is too large due to slow data reading. See Figure 2.
Only the TensorFlow framework supports this function.
| Field | Description |
|---|---|
| GetNext Start | Start time of data reading (ns). |
| GetNext End | End time of data reading (ns). |
| GetNext Time(ns) | Time required for data reading (ns). |
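The fields above can be combined to estimate how much of the inter-iteration gap is spent reading data. The timestamps below are hypothetical placeholders, not values from a real trace:

```python
# Hypothetical GetNext timestamps (ns) for one iteration, following the
# fields described above; real values come from step_trace_*.json.
getnext_start = 2_000_000
getnext_end = 2_450_000
prev_iteration_end = 1_900_000
fp_start = 2_500_000

getnext_time = getnext_end - getnext_start       # GetNext Time(ns)
inter_iter_gap = fp_start - prev_iteration_end   # gap the data read must fit into

# If data reading dominates the gap before FP starts, the input
# pipeline is the likely bottleneck.
read_share = getnext_time / inter_iter_gap
print(f"GetNext time: {getnext_time} ns ({read_share:.0%} of the inter-iteration gap)")
```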
step_trace_*.csv File
The file content is formatted as follows. Conclusions drawn from the step_trace_*.json file can be confirmed against the information in the step_trace_*.csv file.
| Field | Description |
|---|---|
| Device_id | Device ID. |
| Iteration ID | Iteration ID for graph-based statistics collection. The iteration ID increases by 1 each time a graph is executed. When a script is compiled into multiple graphs, the iteration ID differs from the step ID at the script layer. |
| FP Start (μs) | FP start time (μs). |
| BP End (μs) | BP end time (μs). |
| Iteration End (μs) | End time of each iteration (μs). |
| Iteration Time (μs) | Iteration duration (μs). |
| FP to BP Time (μs) | FP/BP elapsed time (= BP End – FP Start) (μs). |
| Iteration Refresh (μs) | Iteration refresh lag (= Iteration End – BP End) (μs). |
| Data Aug Bound (μs) | Data augmentation overhead (= Current FP Start – Previous Iteration End) (μs). The value for iteration 0 is N/A because there is no previous Iteration End. |
| Model ID | Graph ID in the model for a round of iteration. |
| Reduce Start (μs) | Start time of collective communication (μs). |
| Reduce Duration (μs) | Total duration of collective communication (μs). With the default segmentation policy, the communication is divided into two segments; Reduce Start indicates the start time, and Reduce Duration indicates the duration from start to end. If only one device is used, Reduce profile data is not collected. |
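A minimal sketch of cross-checking the CSV fields, assuming the column names match the table above (with "us" in place of "μs"); the two-row sample is fabricated for illustration:

```python
import csv
import io

# Fabricated two-iteration sample standing in for a step_trace_*.csv file;
# column names follow the table above, values are hypothetical.
sample = """Device_id,Iteration ID,FP Start (us),BP End (us),Iteration End (us)
0,1,100.0,900.0,950.0
0,2,1000.0,1850.0,1900.0
"""

rows = list(csv.DictReader(io.StringIO(sample)))
for prev, cur in zip([None] + rows, rows):
    fp_bp = float(cur["BP End (us)"]) - float(cur["FP Start (us)"])         # FP to BP Time
    refresh = float(cur["Iteration End (us)"]) - float(cur["BP End (us)"])  # Iteration Refresh
    # Data Aug Bound is N/A for the first iteration (no previous Iteration End).
    aug = (float(cur["FP Start (us)"]) - float(prev["Iteration End (us)"])) if prev else None
    print(cur["Iteration ID"], fp_bp, refresh, aug)
```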
