Analyzing Extreme Performance

Replace Ascendxxxyy in this document with the actual processor type.

The matmul operator is used as an example. In this example, a [160, 240] matrix is multiplied by a [240, 80] matrix: each operand is tiled into 25 small matrices of size [32, 48] and [48, 16] respectively, yielding 125 small matrix multiplications in total. The following main.py script implements this by calling the APIs provided by msKPP:
from mskpp import mmad, Tensor, Chip

def my_mmad(gm_x, gm_y, gm_z):
    # Basic data paths for matrix multiplication:
    # Left matrix A: GM-L1-L0A
    # Right matrix B: GM-L1-L0B
    # Result matrix C: L0C (initialized)-GM
    l1_x = Tensor("L1")
    l1_y = Tensor("L1")
    l1_x.load(gm_x)
    l1_y.load(gm_y)
    x = Tensor("L0A")
    y = Tensor("L0B")
    x.load(l1_x)
    y.load(l1_y)
    z = Tensor("L0C", "FP32", [32, 16], format="NC1HWC0")
    out = mmad(x, y, z, True)() # The output needs to be returned.
    z = out[0]
    return z

if __name__ == '__main__':
    with Chip("Ascendxxxyy") as chip:
        chip.enable_trace()    # Enable the operator simulation pipeline chart function to generate the trace.json file.
        chip.enable_metrics()  # Enable single-instruction and pipeline statistics to generate the Instruction_statistic.csv and Pipe_statistic.csv files.
        # Here comes the processing logic for data tiling, which involves breaking down a large block of GM data into smaller blocks and transferring them in batches.
        # Buffer sharding and multi-buffer transfer are covered by the tiling policy. Here, we simulate the single-buffer scenario.
        # Tiling policy: the [160, 240] x [240, 80] matmul is split into 25 [32, 48] tiles and 25 [48, 16] tiles, giving 125 small matrix multiplications processed in batches.
        for _ in range(125):
            in_x = Tensor("GM", "FP16", [32, 48], format="ND")
            in_y = Tensor("GM", "FP16", [48, 16], format="ND")
            in_z = Tensor("GM", "FP32", [32, 16], format="NC1HWC0")
            out_z = my_mmad(in_x, in_y, in_z)
            in_z.load(out_z)
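The tile counts behind the loop bound of 125 follow directly from the shapes above and can be checked with plain Python (no msKPP required):

```python
# Sanity check of the tiling arithmetic described above.
M, K, N = 160, 240, 80     # full matmul: [160, 240] x [240, 80]
tm, tk, tn = 32, 48, 16    # tile shapes: [32, 48] x [48, 16]

tiles_m, tiles_k, tiles_n = M // tm, K // tk, N // tn
left_tiles = tiles_m * tiles_k              # [32, 48] tiles of the left matrix
right_tiles = tiles_k * tiles_n             # [48, 16] tiles of the right matrix
small_matmuls = tiles_m * tiles_k * tiles_n # one mmad per (m, k, n) tile triple

print(left_tiles, right_tiles, small_matmuls)  # 25 25 125
```

Each of the 25 output tiles of size [32, 16] accumulates 5 partial products along the K dimension, which is why the loop runs 125 times.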

After the main.py script is executed with Python, the trace.json file (describing a pipeline chart) and the instruction_cycle_consumption.html file (describing an instruction proportion pie chart) are generated in the current directory. You can then view the msKPP modeling result.

If a .csv file with the same name already exists in the current directory, the msKPP tool cannot generate its deliverables.
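Because stale deliverables block generation, a small pre-flight check can catch them before the run. This is a sketch: `find_conflicts` is a hypothetical helper, not part of the msKPP API.

```python
# Hypothetical pre-flight check (not part of the msKPP API): detect deliverable
# files left over from a previous run, which would block msKPP from writing
# its .csv output.
import os

def find_conflicts(directory="."):
    """Return the msKPP .csv deliverables that already exist in `directory`."""
    deliverables = ["Instruction_statistic.csv", "Pipe_statistic.csv"]
    return [f for f in deliverables
            if os.path.exists(os.path.join(directory, f))]

if find_conflicts():
    print("Remove or rename the existing .csv files before running main.py.")
```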

Instruction Pipeline Chart

Examining the trace.json file shows that PIPE-MTE2 is the performance bottleneck: in an ideal pipeline, it must be able to operate continuously.

Enter chrome://tracing in the address box of Google Chrome, drag the .json file into the blank area to open it, and use the keyboard shortcuts (W: zoom in; S: zoom out; A: move left; D: move right) to navigate the chart.
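The trace can also be inspected programmatically. The sketch below assumes the file follows the standard Chrome Trace Event format that chrome://tracing reads (complete events with "ph": "X", "name", and "dur" fields); the events shown are synthetic, not real msKPP output.

```python
from collections import defaultdict

def busy_time_by_name(events):
    """Sum the duration of complete ('X') trace events per event name."""
    totals = defaultdict(int)
    for ev in events:
        if ev.get("ph") == "X":
            totals[ev["name"]] += ev.get("dur", 0)
    return dict(totals)

# Synthetic events in Chrome Trace Event format (made-up numbers):
events = [
    {"ph": "X", "name": "MOV-GM_TO_L1", "ts": 0,   "dur": 120},
    {"ph": "X", "name": "MOV-GM_TO_L1", "ts": 200, "dur": 110},
    {"ph": "X", "name": "MMAD",         "ts": 130, "dur": 60},
]
print(busy_time_by_name(events))  # {'MOV-GM_TO_L1': 230, 'MMAD': 60}
```

Against a real trace.json, the event list can usually be obtained with `json.load`; depending on the writer, it is either the top-level array or the value under a `traceEvents` key.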

Figure 1 trace.json

Click the MOV-GM_TO_L1 instruction in the pipeline chart to view the instruction's cycle count and bandwidth under the current transfer and computation volumes, as shown in Figure 2.

Figure 2 Instruction details

Instruction Proportion Pie Chart

The instruction_cycle_consumption.html file shows that MOV-GM_TO_L1 consumes the largest share of cycles and is therefore the biggest bottleneck in the operator.
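The proportions behind such a pie chart reduce to simple percentage arithmetic. A minimal sketch with made-up cycle counts (the real numbers come from the generated statistics files):

```python
def cycle_shares(cycles):
    """Map each instruction name to its percentage of total cycles."""
    total = sum(cycles.values())
    return {name: round(100 * c / total, 1) for name, c in cycles.items()}

# Made-up cycle counts for illustration only:
cycles = {"MOV-GM_TO_L1": 600, "MOV-L1_TO_L0A": 150, "MMAD": 250}
shares = cycle_shares(cycles)
print(shares)                       # {'MOV-GM_TO_L1': 60.0, 'MOV-L1_TO_L0A': 15.0, 'MMAD': 25.0}
print(max(shares, key=shares.get))  # MOV-GM_TO_L1
```

The instruction with the largest share, here MOV-GM_TO_L1, is the one worth optimizing first.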

Figure 3 Instruction duration statistics