Performing Range-Level Replay with mstx APIs

This section demonstrates how to use the msProf tool together with the mstx APIs to perform range-level replay, which retains the L2 cache information of the surrounding context while the operators in a range execute.

Prerequisites

Prepare an operator project and add the mstx extended APIs to the operator code to determine the replay range. For details, see mstx Extended Functions and MindStudio mstx API Reference.

The mstxRangeStartA and mstxRangeEnd APIs must be called in pairs, and the pairs cannot be cross-nested. The operators enclosed by each pair of mstx APIs form a replay range. The stream used by the operators in a replay range cannot be changed.
The number of operators that can be collected in each replay range is limited by the number of operator block dims in OpBasicInfo (Basic Operator Information). It is recommended that the number of operators be less than or equal to 50.
This function cannot be enabled together with --aic-metrics=MemoryDetail, --aic-metrics=TimelineDetail, or --aic-metrics=Source. You are advised not to enable this function together with --kill=on because it may result in missing operator data.
During range-level replay, the SynchronizeStream operator may fail if it is executed inside a range. You are advised to execute it after the mstxRangeEnd API call returns.
This function applies only to Atlas A3 training products/Atlas A3 inference products and Atlas A2 training products/Atlas A2 inference products.

Example

The following Python script (test.py) is used as an example to describe how msProf works with the mstx APIs to implement range-level replay.

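import mstx
import torch
import torch_npu  # registers the NPU backend with PyTorch

x = torch.Tensor([1,2,3,4]).npu()
y = torch.Tensor([1,2,3,4]).npu()

# This Add runs before any range starts, so it is outside every replay range.
a = x + y
# range1 encloses Sub and Mul.
range1_id = mstx.range_start("range1", None)
b = a - x
c = a * x
mstx.range_end(range1_id)
# range2 encloses Div, Abs, and Add.
range2_id = mstx.range_start("range2", None)
d = x / y
# range3 starts while range2 is still open, so the two ranges overlap.
range3_id = mstx.range_start("range3", None)
e = torch.abs(y)
mstx.range_end(range3_id)
f = x + e
mstx.range_end(range2_id)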
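
The range layout of test.py can be checked without an NPU by modeling it as a list of events. The sketch below is only an illustration of the rules stated in this section (paired calls, at most 50 operators per range as recommended, and only the first of two overlapping ranges taking effect). The names plan_ranges, RangePlan, and EVENTS are hypothetical and are not part of mstx or msProf, and the model makes no claim about how msProf implements replay internally.

```python
from dataclasses import dataclass, field

MAX_OPS_PER_RANGE = 50  # recommended upper bound from the prerequisites


@dataclass
class RangePlan:
    effective: dict = field(default_factory=dict)  # range name -> operators replayed together
    ignored: list = field(default_factory=list)    # ranges opened while another range was active
    warnings: list = field(default_factory=list)   # pairing and size problems


def plan_ranges(events):
    """Model which ranges take effect for (kind, name) events.

    kind is "start", "end", or "op". Hypothetical helper for illustration only.
    """
    plan = RangePlan()
    active = None      # the range that currently takes effect, if any
    open_ranges = []   # every range that was started and not yet ended
    for kind, name in events:
        if kind == "start":
            open_ranges.append(name)
            if active is None:
                active = name
                plan.effective[name] = []
            else:
                # Overlaps the active range: only the first range takes effect.
                plan.ignored.append(name)
        elif kind == "end":
            if name not in open_ranges:
                plan.warnings.append(f"range_end without range_start: {name}")
                continue
            open_ranges.remove(name)
            if name == active:
                active = None
        elif kind == "op" and active is not None:
            plan.effective[active].append(name)
    for name in open_ranges:
        plan.warnings.append(f"range_start without range_end: {name}")
    for name, ops in plan.effective.items():
        if len(ops) > MAX_OPS_PER_RANGE:
            plan.warnings.append(f"{name} has {len(ops)} operators (> {MAX_OPS_PER_RANGE})")
    return plan


# Events mirroring test.py: the first Add is outside every range.
EVENTS = [
    ("op", "Add"),
    ("start", "range1"), ("op", "Sub"), ("op", "Mul"), ("end", "range1"),
    ("start", "range2"), ("op", "Div"),
    ("start", "range3"), ("op", "Abs"), ("end", "range3"),
    ("op", "Add"), ("end", "range2"),
]

plan = plan_ranges(EVENTS)
print(plan.effective)  # operators grouped by the ranges that take effect
print(plan.ignored)    # ranges that overlap an earlier range
print(plan.warnings)   # empty when all calls are paired and ranges are small enough
```

Running a check of this kind before profiling can catch an unpaired range_start or an oversized range early, which is cheaper than discovering missing operator data after a full msProf run.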

Procedure

Single-range replay

Run the following command to enable a single mstx API range. The command performs range-level replay for "range1" only.
```
msprof op --replay-mode=range --mstx=on --mstx-include="range1" --launch-count=10 python3 test.py
```

The tool generates the tuning data of the Sub and Mul operators, and the L2 cache information between the two operators is retained. For details about the performance files, see Table 2.

OPPROF_{timestamp}_XXX
├── Mul_XXX // Mul_XXX is the name of the collected operator.
│   └── 0
│       ├── dump
│               ...
│       └── visualize_data.bin
└── Sub_XXX
    └── 0
        ├── dump
                ...
        └── visualize_data.bin
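
To confirm which operators were collected, the result directory can be scanned for operator folders that contain visualize_data.bin. The following is a minimal sketch that assumes the layout shown above (operator directory, then a numbered subdirectory, then visualize_data.bin). The function name list_collected_operators and the mock directory names are hypothetical; the demo builds a temporary mock layout rather than reading real msProf output.

```python
import tempfile
from pathlib import Path


def list_collected_operators(result_dir):
    """Return the sorted names of operator directories that hold visualize_data.bin."""
    root = Path(result_dir)
    found = set()
    # Expected layout: <result_dir>/<Operator_XXX>/<index>/visualize_data.bin
    for bin_file in root.glob("*/*/visualize_data.bin"):
        found.add(bin_file.parent.parent.name)
    return sorted(found)


# Demo on a mock layout; real directory names come from msProf.
with tempfile.TemporaryDirectory() as tmp:
    for op_dir in ("Mul_demo", "Sub_demo"):
        run_dir = Path(tmp) / op_dir / "0"
        (run_dir / "dump").mkdir(parents=True)
        (run_dir / "visualize_data.bin").touch()
    collected = list_collected_operators(tmp)

print(collected)
```

For a real run, pass the path of the generated OPPROF_{timestamp}_XXX directory instead of the temporary directory.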

Multi-range replay

Run the following command to enable all mstx API ranges:

msprof op --replay-mode=range --mstx=on --launch-count=10 python3 test.py

The tool performs range-level replay for "range1" and "range2" in sequence and generates tuning data for the Sub, Mul, Div, Abs, and Add operators (Div appears as RealDiv_XXX in the output directory). Because "range2" and "range3" overlap, only the first of the two ranges takes effect, and "range3" is invalid. The L2 cache information is retained within each replay, but the L2 cache information of the two replays is independent of each other. For details about the performance files, see Table 2.

OPPROF_{timestamp}_XXX
├── Abs_XXX  // Abs_XXX is the name of the collected operator.
│   └── 0
│       ├── dump
│               ...
│       └── visualize_data.bin
├── Add_XXX
│   └── 0
│       ├── dump
│               ...
│       └── visualize_data.bin
├── Mul_XXX
│   └── 0
│       ├── dump
│               ...
│       └── visualize_data.bin
├── RealDiv_XXX
│   └── 0
│       ├── dump
│               ...
│       └── visualize_data.bin
└── Sub_XXX
    └── 0
        ├── dump
                ...
        └── visualize_data.bin
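
The single-range and multi-range commands in this procedure differ only in whether --mstx-include is present. When the profiling run is driven from a script, the argument list can be assembled as in the sketch below. The helper build_msprof_command is hypothetical; it uses only the options that appear in this section and does not launch msprof itself.

```python
import shlex


def build_msprof_command(script, include=None, launch_count=10):
    """Build the msprof argument list for range-level replay.

    include=None omits --mstx-include, so all mstx ranges are enabled.
    """
    argv = ["msprof", "op", "--replay-mode=range", "--mstx=on"]
    if include is not None:
        # The shell strips the quotes in --mstx-include="range1",
        # so the plain value is passed here.
        argv.append(f"--mstx-include={include}")
    argv.append(f"--launch-count={launch_count}")
    argv += ["python3", script]
    return argv


single = build_msprof_command("test.py", include="range1")
multi = build_msprof_command("test.py")
print(shlex.join(single))
print(shlex.join(multi))
```

The returned list can be passed to subprocess.run(argv, check=True) on a machine where msprof is installed. Keeping the command as a list avoids shell quoting issues with range names.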
