Analyzing Operator Computing and Transfer Specifications

Replace Ascendxxxyy in this document with the actual processor type.

The matmul operator is used as an example. In this example, a [160, 240] x [240, 80] matrix multiplication is tiled into [32, 48] x [48, 16] matrix multiplications (25 small matrices for each input matrix), which are processed in batches. The following main.py script implements this example by calling the APIs provided by msKPP:
from mskpp import mmad, Tensor, Chip
def my_mmad(gm_x, gm_y, gm_z):
    # Basic data paths for matrix multiplication:
    # Left matrix A: GM-L1-L0A
    # Right matrix B: GM-L1-L0B
    # Result matrix C: L0C (initialized)-GM
    l1_x = Tensor("L1")
    l1_y = Tensor("L1")
    l1_x.load(gm_x)
    l1_y.load(gm_y)
    x = Tensor("L0A")
    y = Tensor("L0B")
    x.load(l1_x)
    y.load(l1_y)
    z = Tensor("L0C", "FP32", [32, 16], format="NC1HWC0")
    out = mmad(x, y, z, True)()  # mmad returns its output tensors
    z = out[0]
    return z

if __name__ == '__main__':
    with Chip("Ascendxxxyy") as chip:
        chip.enable_trace()     # Enable the operator simulation pipeline chart function, which generates the trace.json file.
        chip.enable_metrics()   # Enable per-instruction and pipeline statistics, which generate the Instruction_statistic.csv and Pipe_statistic.csv files.
        # Data tiling logic goes here: a large block of GM data is broken into smaller blocks that are transferred in batches.
        # Buffer sharding and multi-buffer transfer are determined by the tiling policy; this example simulates the single-buffer scenario.
        # Tiling policy: the [160, 240] x [240, 80] matrix multiplication is divided into small matrices of [32, 48] and [48, 16] (25 of each) and processed in batches.
        for _ in range(125):
            in_x = Tensor("GM", "FP16", [32, 48], format="ND")
            in_y = Tensor("GM", "FP16", [48, 16], format="ND")
            in_z = Tensor("GM", "FP32", [32, 16], format="NC1HWC0")
            out_z = my_mmad(in_x, in_y, in_z)
            in_z.load(out_z)
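
The loop bound of 125 follows from the tiling arithmetic. As a sanity check (a sketch, not part of the msKPP script, assuming each output tile accumulates over the tiles of the shared dimension), the tile counts can be derived from the shapes:

```python
# Tile arithmetic for the [160, 240] x [240, 80] example above.
M, K, N = 160, 240, 80      # full problem: [M, K] x [K, N]
tm, tk, tn = 32, 48, 16     # tile shapes:  [tm, tk] x [tk, tn]

m_tiles = M // tm           # 5 row tiles of the left matrix
k_tiles = K // tk           # 5 tiles along the shared dimension
n_tiles = N // tn           # 5 column tiles of the right matrix

left_tiles = m_tiles * k_tiles              # 25 tiles of [32, 48]
right_tiles = k_tiles * n_tiles             # 25 tiles of [48, 16]
matmul_calls = m_tiles * k_tiles * n_tiles  # 125, matching range(125)

print(left_tiles, right_tiles, matmul_calls)  # 25 25 125
```

Each [32, 16] output tile is accumulated over the 5 tiles of the shared dimension, which is why the loop runs 125 times rather than 25.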

After the main.py script is executed with Python, the pipeline statistics file Pipe_statistic.csv and the instruction statistics file Instruction_statistic.csv are generated in the current directory, where you can view the msKPP modeling results.

If a .csv file with the same name already exists in the current directory, the msKPP tool does not generate the deliverables.
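
Because the tool stops when same-named deliverables already exist, one option is to delete stale files before executing the script. The following is a hypothetical pre-run cleanup sketch, not part of msKPP:

```python
import os

# Hypothetical cleanup: remove previously generated deliverables from the
# current directory so that msKPP can write fresh ones.
for name in ("Pipe_statistic.csv", "Instruction_statistic.csv", "trace.json"):
    if os.path.exists(name):
        os.remove(name)
```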

Transfer Pipeline Statistics

The Pipe_statistic.csv file collects statistics on the total amount of data transferred, the number of operations, and the time consumed by each pipeline.
Figure 1 Pipe_statistic.csv
See the following table for more details.
Table 1 Field description

Field           Description
Pipe            Name of a pipe unit in an Ascend processor.
Duration(us)    Time consumed by the pipeline (unit: μs).
Cycle           Number of cycles consumed each time an instruction is executed.
Size(B)         Data volume transferred by a transfer pipeline (unit: bytes).
Ops             Number of operations performed by a compute-class pipeline.

For the pipeline that takes the longest time and is clearly the bottleneck of transfer performance, the optimization approach is as follows:

  • If a large amount of data needs to be transferred, transfer as much data as possible in each operation to fully utilize the transfer bandwidth.
  • Keep the bottleneck pipeline continuously busy, with no idle gaps.
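
The statistics can also be inspected programmatically. The following sketch is not part of msKPP, and the "Pipe" and "Duration(us)" column names are assumed from Table 1; it reads Pipe_statistic.csv with the Python standard library and reports the pipeline with the largest time consumption:

```python
import csv

def bottleneck_pipe(path="Pipe_statistic.csv"):
    """Return (pipe_name, duration_us) for the slowest pipeline.

    Column names "Pipe" and "Duration(us)" are assumed from Table 1;
    adjust them if the generated file uses different headers.
    """
    with open(path, newline="") as f:
        rows = list(csv.DictReader(f))
    worst = max(rows, key=lambda row: float(row["Duration(us)"]))
    return worst["Pipe"], float(worst["Duration(us)"])
```

The pipeline this reports is the one the optimization guidance above applies to.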

Instruction Statistics

The Instruction_statistic.csv file collects statistics on the total amount of data transferred, the number of operations, and the time consumed, broken down by instruction. In this example, the instruction-level bottleneck is MOV-GM_TO_L1 (which belongs to PIPE-MTE2), showing how the file helps pinpoint the performance bottleneck at the instruction level.

Figure 2 Instruction_statistic.csv
See the following table for more details.
Table 2 Field description

Field           Description
Instruction     Instruction name.
Duration(us)    Time consumed by the instruction (unit: μs).
Cycle           Number of cycles consumed each time the instruction is executed.
Size(B)         Data volume transferred by a transfer instruction (unit: bytes).
Ops             Number of operations performed by a compute instruction.
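
A similar sketch applies at the instruction level. The snippet below is not part of msKPP, and the "Instruction" and "Duration(us)" column names are assumed from Table 2; it sums the duration per instruction and sorts the result so that a bottleneck such as MOV-GM_TO_L1 in this example would appear first:

```python
import csv
from collections import defaultdict

def instruction_durations(path="Instruction_statistic.csv"):
    """Return [(instruction, total_duration_us), ...], slowest first.

    Column names "Instruction" and "Duration(us)" are assumed from
    Table 2; adjust them if the generated file uses different headers.
    """
    totals = defaultdict(float)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[row["Instruction"]] += float(row["Duration(us)"])
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```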