Analyzing Operator Computing and Transfer Specifications

Replace Ascendxxxyy in this document with the actual processor type.

The following case uses the matmul operator as an example. The case multiplies a [160, 240] matrix by a [240, 80] matrix. For efficient multiplication, each dimension is split into five parts, giving left-matrix tiles of [32, 48] and right-matrix tiles of [48, 16], each pair of which produces a [32, 16] result tile. The following is an example of the main.py script implemented by calling the APIs provided by msKPP:

from mskpp import mmad, Tensor, Chip


def my_mmad(gm_x, gm_y, gm_z):
    # Basic data paths for matrix multiplication:
    # Left matrix x: GM-L1-L0A
    # Right matrix y: GM-L1-L0B
    # Result matrix z: L0C (initialized)-GM
    # Sample mathematical expression: z = x @ y + b
    # Define and allocate variables on L1.
    l1_x = Tensor("L1")
    l1_y = Tensor("L1")
    # Define and allocate variables on L0A and L0B.
    x = Tensor("L0A")
    y = Tensor("L0B")
    # Define and allocate the bias b on L0C. In theory, the bias should be allocated to the accumulator buffer, but allocating it to L0C does not affect the performance.
    b = Tensor("L0C", "FP32", [32, 16], format="NC1HWC0")
    # Move the data on the GM to the memory space corresponding to L1.
    l1_x.load(gm_x)
    l1_y.load(gm_y)
    # Move the left and right matrices on L1 to L0A and L0B.
    x.load(l1_x)
    y.load(l1_y)
    # The current data has been loaded to L0A and L0B. Call the calculation instruction and save the result to L0C. out is the variable allocated by the mmad function in L0C.
    out = mmad(x, y, b, True)()
    # Move the data on L0C to the address space of the GM variable gm_z.
    gm_z.load(out[0])
    return gm_z


if __name__ == '__main__':
    with Chip("Ascendxxxyy") as chip:
        chip.enable_trace()  # Enable the operator simulation pipeline chart function to generate the trace.json file.
        chip.enable_metrics()  # Enable single-instruction and pipeline statistics to generate the Instruction_statistic.csv and Pipe_statistic.csv files.
        # Simulate the scenario where a large matrix is split into five small matrices for computation.
        for _ in range(5):
            # Use the operator for AI Core computation.
            in_x = Tensor("GM", "FP16", [32, 48], format="ND")
            in_y = Tensor("GM", "FP16", [48, 16], format="ND")
            in_z = Tensor("GM", "FP32", [32, 16], format="NC1HWC0")
            my_mmad(in_x, in_y, in_z)
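
The tile shapes used in the script can be checked with plain arithmetic. The following is a minimal sketch that does not call msKPP; the helper names (tile_counts, tile_bytes) are illustrative and not part of any msKPP API. It confirms that each dimension splits into five tiles and computes the size in bytes of one tile of each tensor, using the data types declared in main.py (FP16 inputs, FP32 output).

```python
# Tiling arithmetic for the [160, 240] x [240, 80] example (pure Python, no msKPP needed).
M, K, N = 160, 240, 80       # full problem: x is [M, K], y is [K, N]
TM, TK, TN = 32, 48, 16      # tile shapes: x tile [TM, TK], y tile [TK, TN], z tile [TM, TN]

BYTES = {"FP16": 2, "FP32": 4}  # element sizes in bytes


def tile_counts(m, k, n, tm, tk, tn):
    """Return how many tiles each dimension splits into (tiles must divide evenly)."""
    assert m % tm == 0 and k % tk == 0 and n % tn == 0
    return m // tm, k // tk, n // tn


def tile_bytes(rows, cols, dtype):
    """Return the size in bytes of one [rows, cols] tile of the given data type."""
    return rows * cols * BYTES[dtype]


m_tiles, k_tiles, n_tiles = tile_counts(M, K, N, TM, TK, TN)
x_tile_bytes = tile_bytes(TM, TK, "FP16")
y_tile_bytes = tile_bytes(TK, TN, "FP16")
z_tile_bytes = tile_bytes(TM, TN, "FP32")

print("tiles per dimension (M, K, N):", m_tiles, k_tiles, n_tiles)
print("bytes per tile (x, y, z):", x_tile_bytes, y_tile_bytes, z_tile_bytes)
```

These per-tile sizes are useful as a sanity check when reading the Size(B) values in the generated statistics files.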

After main.py is run with Python, the pipeline statistics file Pipe_statistic.csv and the instruction statistics file Instruction_statistic.csv are generated in the MSKPPTIMESTAMP directory under the current path. Open these files to view the msKPP modeling results.

TIMESTAMP indicates the current timestamp.
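
Because the directory name changes with every run, it can be convenient to locate the newest result directory programmatically. The following is a minimal sketch, assuming only that the directory name starts with MSKPP as described above; the helper names (latest_mskpp_dir, expected_outputs) are hypothetical and not part of msKPP.

```python
import glob
import os


def latest_mskpp_dir(base="."):
    """Return the most recently modified directory under base whose name starts with MSKPP, or None."""
    candidates = [p for p in glob.glob(os.path.join(base, "MSKPP*")) if os.path.isdir(p)]
    return max(candidates, key=os.path.getmtime) if candidates else None


def expected_outputs(result_dir):
    """Report whether each statistics file named in this document exists in result_dir."""
    names = ("Pipe_statistic.csv", "Instruction_statistic.csv")
    return {name: os.path.exists(os.path.join(result_dir, name)) for name in names}


if __name__ == "__main__":
    result_dir = latest_mskpp_dir(".")
    if result_dir is None:
        print("No MSKPP* directory found; run main.py first.")
    else:
        for name, present in expected_outputs(result_dir).items():
            print(f"{name}: {'found' if present else 'missing'}")
```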

Transfer Pipeline Statistics

The Pipe_statistic.csv file collects statistics on the total amount of transferred data, the number of operations, and the time consumption of each pipeline.

Figure 1 Pipe_statistic.csv

See the following table for more details.

**Table 1** Field description
Field	Description
Pipe	Name of a pipe unit in an Ascend processor.
Duration(us)	Pipeline time consumption (unit: μs).
Cycle	Number of cycles consumed each time an instruction is executed.
Size(B)	Transfer volume of a transfer pipeline (unit: B).
Ops	Size of a calculation element in pipelines of the calculation class.
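
The fields above can also be read programmatically to find the pipeline with the longest duration. The following is a minimal sketch, assuming the file is comma-separated and uses exactly the column names listed in Table 1. The sample rows are placeholders invented for illustration (PIPE-OTHER is not a real pipe name, and none of the numbers are msKPP output); replace SAMPLE with the contents of your own Pipe_statistic.csv.

```python
import csv
import io

# Placeholder data for illustration only; these values are not real msKPP results.
SAMPLE = """Pipe,Duration(us),Cycle,Size(B),Ops
PIPE-MTE2,2.0,3600,23040,
PIPE-OTHER,0.5,900,,1000
"""


def load_rows(text):
    """Parse CSV text into a list of dictionaries keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def slowest(rows, key="Duration(us)"):
    """Return the row with the largest duration."""
    return max(rows, key=lambda r: float(r[key]))


def bytes_per_us(row):
    """Return the achieved transfer rate in bytes per microsecond, or None if the row has no size."""
    size, duration = row.get("Size(B)"), float(row["Duration(us)"])
    if not size or duration == 0:
        return None
    return float(size) / duration


rows = load_rows(SAMPLE)
bottleneck = slowest(rows)
print("longest pipeline:", bottleneck["Pipe"])
print("achieved rate (B/us):", bytes_per_us(bottleneck))
```

Comparing the achieved rate of a transfer pipeline against the bandwidth you expect from the hardware indicates whether the pipeline is bandwidth-bound or dominated by per-transfer overhead.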

For the pipeline that takes the longest time and clearly bottlenecks transfer performance, the optimization roadmap is as follows:

If a large amount of data needs to be transferred, maximize the data transferred at once to fully utilize the transfer bandwidth.
Ensure that the pipeline with the performance bottleneck remains continuously operational.
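
The first recommendation can be illustrated with a simple cost model. The following sketch assumes that each transfer pays a fixed startup overhead plus a bandwidth-limited term; this is a generic textbook model, not the model msKPP uses internally, and the overhead and bandwidth values are hypothetical.

```python
def transfer_time_us(total_bytes, n_transfers, overhead_us, bytes_per_us):
    """Estimate total time when total_bytes is moved in n_transfers equal pieces.

    Assumed model: every transfer pays a fixed overhead, and the payload
    moves at a constant bandwidth.
    """
    return n_transfers * overhead_us + total_bytes / bytes_per_us


# Hypothetical parameters chosen only to make the arithmetic easy to follow.
TOTAL_BYTES = 800
OVERHEAD_US = 1.0
BANDWIDTH_B_PER_US = 100.0

one_large = transfer_time_us(TOTAL_BYTES, 1, OVERHEAD_US, BANDWIDTH_B_PER_US)
eight_small = transfer_time_us(TOTAL_BYTES, 8, OVERHEAD_US, BANDWIDTH_B_PER_US)

print("one large transfer (us):", one_large)
print("eight small transfers (us):", eight_small)
```

Under this model the payload term is identical in both cases, so the difference comes entirely from the fixed overhead that is paid once per transfer. That is why moving more data per transfer makes better use of the available bandwidth.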

Instruction Statistics

The Instruction_statistic.csv file collects statistics on the total amount of transferred data, the number of operations, and the time consumption of each instruction. In this example, the bottleneck at the instruction level is MOV-GM_TO_L1 (which belongs to PIPE-MTE2). The file therefore helps pinpoint the performance bottleneck at the instruction level.

Figure 2 Instruction_statistic.csv

See the following table for more details.

**Table 2** Field description
Field	Description
Instruction	Instruction name.
Duration(us)	Instruction time consumption (unit: μs).
Cycle	Number of cycles consumed each time an instruction is executed.
Size(B)	Transfer volume of a transfer pipeline (unit: B).
Ops	Size of a calculation element in pipelines of the calculation class.
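
To see how strongly one instruction dominates, the durations can be ranked and expressed as a share of the total. The following is a minimal sketch, assuming the file is comma-separated and uses the column names listed in Table 2. MOV-GM_TO_L1 is taken from the text above; INSTR-PLACEHOLDER and all numbers are invented for illustration and are not msKPP output.

```python
import csv
import io

# Placeholder data for illustration only; these values are not real msKPP results.
SAMPLE = """Instruction,Duration(us),Cycle,Size(B),Ops
INSTR-PLACEHOLDER,1.0,1800,,1000
MOV-GM_TO_L1,3.0,5400,23040,
"""


def duration_share(rows, name_col="Instruction", dur_col="Duration(us)"):
    """Return (name, share of total duration) pairs, sorted from largest to smallest share."""
    total = sum(float(r[dur_col]) for r in rows)
    if total == 0:
        return []
    ranked = sorted(rows, key=lambda r: float(r[dur_col]), reverse=True)
    return [(r[name_col], float(r[dur_col]) / total) for r in ranked]


instruction_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
shares = duration_share(instruction_rows)
for name, share in shares:
    print(f"{name}: {share:.0%} of total instruction time")
```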
