Analyzing Extreme Performance
Replace Ascendxxxyy in this document with the actual processor type.
from mskpp import mmad, Tensor, Chip


def my_mmad(gm_x, gm_y, gm_z):
    # Basic data paths for matrix multiplication:
    # Left matrix x: GM-L1-L0A
    # Right matrix y: GM-L1-L0B
    # Result matrix z: L0C (initialized)-GM

    # Sample mathematical expression: z = x @ y + b
    # Define and allocate variables on L1.
    l1_x = Tensor("L1")
    l1_y = Tensor("L1")
    # Define and allocate variables on L0A and L0B.
    x = Tensor("L0A")
    y = Tensor("L0B")
    # Define and allocate the offset on L0C. Theoretically, the offset should be
    # allocated to the accumulator buffer. Allocating the offset to L0C does not
    # affect the performance.
    b = Tensor("L0C", "FP32", [32, 16], format="NC1HWC0")
    # Move the data on the GM to the memory space corresponding to L1.
    l1_x.load(gm_x)
    l1_y.load(gm_y)
    # Move the left and right matrices on L1 to L0A and L0B.
    x.load(l1_x)
    y.load(l1_y)
    # The current data has been loaded to L0A and L0B. Call the calculation
    # instruction and save the result to L0C. out is the variable allocated by
    # the mmad function in L0C.
    out = mmad(x, y, b, True)()
    # Move the data on L0C to the address space of the GM variable gm_z.
    gm_z.load(out[0])
    return gm_z


if __name__ == '__main__':
    with Chip("Ascendxxxyy") as chip:
        # Enable the operator simulation pipeline chart function to generate
        # the trace.json file.
        chip.enable_trace()
        # Enable single instruction and pipeline information to generate the
        # Instruction_statistic.csv and Pipe_statistic.csv files.
        chip.enable_metrics()
        # Simulate the scenario where a large matrix is split into five small
        # matrices for computation.
        for _ in range(5):
            # Use the operator for AI Core computation.
            in_x = Tensor("GM", "FP16", [32, 48], format="ND")
            in_y = Tensor("GM", "FP16", [48, 16], format="ND")
            in_z = Tensor("GM", "FP32", [32, 16], format="NC1HWC0")
            my_mmad(in_x, in_y, in_z)
After you run the main.py script with Python, the instruction pipeline chart (trace.json) and the instruction proportion pie chart (instruction_cycle_consumption.html) are generated in the MSKPPTIMESTAMP directory under the current path. Open these files to view the msKPP modeling result.
TIMESTAMP indicates the current timestamp.
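Because the directory name changes on every run, it can be convenient to locate the newest output directory from a script. The following is a minimal sketch, not part of msKPP: the helper name latest_mskpp_dir is hypothetical, and it assumes only that output directories start with MSKPP, as described above. It picks the directory by modification time rather than by parsing the timestamp, since the exact timestamp format is not stated here.

```python
from pathlib import Path
from typing import Optional


def latest_mskpp_dir(base: Path = Path(".")) -> Optional[Path]:
    """Return the most recently modified MSKPP* directory under base.

    Hypothetical helper: assumes only that msKPP output directories are
    named with the MSKPP prefix. Returns None if no such directory exists.
    """
    # Keep directories only; ignore any regular files that share the prefix.
    candidates = [p for p in base.glob("MSKPP*") if p.is_dir()]
    if not candidates:
        return None
    # The newest directory is the one with the largest modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

Calling latest_mskpp_dir() from the directory where main.py was run returns the path that should contain trace.json and instruction_cycle_consumption.html, or None if the script has not produced any output yet.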
Instruction Pipeline Chart
Enter chrome://tracing in the address bar of Google Chrome, drag the .json file into the blank area to open it, and use the keyboard shortcuts (W: zoom in; S: zoom out; A: move left; D: move right) to navigate the chart.
Click the MOV-GM_TO_L1 instruction in the pipeline to view the number of cycles and bandwidth of the instruction under the current transfer volume and calculation volume, as shown in Figure 2.
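To sanity-check the figures shown for a transfer instruction, the transfer volume can be worked out by hand from the tensor shapes in the script. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, FP16 and FP32 are taken to occupy 2 and 4 bytes per element, and bandwidth is expressed as bytes per cycle, which may differ from the unit msKPP displays.

```python
from math import prod

# Bytes per element for the data types used in the example script.
DTYPE_BYTES = {"FP16": 2, "FP32": 4}


def transfer_bytes(shape, dtype):
    """Number of bytes moved when a tensor of this shape and dtype is copied."""
    return prod(shape) * DTYPE_BYTES[dtype]


def bytes_per_cycle(shape, dtype, cycles):
    """Average bandwidth in bytes per cycle, given a cycle count read from the chart."""
    return transfer_bytes(shape, dtype) / cycles


# Left matrix in_x: 32 x 48 FP16 elements.
print(transfer_bytes([32, 48], "FP16"))  # 3072
# Right matrix in_y: 48 x 16 FP16 elements.
print(transfer_bytes([48, 16], "FP16"))  # 1536
```

Dividing these volumes by the cycle count shown for the corresponding MOV-GM_TO_L1 instruction gives a figure to compare against the bandwidth reported in the chart.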
Instruction Proportion Pie Chart
The instruction_cycle_consumption.html file shows that MOV-GM_TO_L1 takes the largest proportion of cycles, which makes it the biggest bottleneck in the operator.
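A similar per-instruction breakdown can be approximated directly from trace.json. The sketch below is not an msKPP API. It assumes the file follows the Chrome Trace Event Format (either a bare list of events or an object with a traceEvents list) and that each instruction is recorded as a complete event (ph equal to "X") carrying name and dur fields. If the file uses begin/end event pairs instead, the aggregation would need to be adapted.

```python
import json
from collections import defaultdict


def load_trace(path):
    """Read a trace file and return the parsed JSON."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def duration_share_by_name(trace):
    """Return each event name's share of the total recorded duration.

    Assumes Chrome Trace Event Format with complete events (ph == "X").
    Events of other phases, such as metadata, are ignored.
    """
    events = trace["traceEvents"] if isinstance(trace, dict) else trace
    totals = defaultdict(float)
    for ev in events:
        if ev.get("ph") == "X" and "dur" in ev:
            totals[ev.get("name", "<unnamed>")] += ev["dur"]
    grand_total = sum(totals.values())
    if not grand_total:
        return {}
    return {name: dur / grand_total for name, dur in totals.items()}


def biggest_consumer(trace):
    """Return the event name with the largest duration share, or None."""
    shares = duration_share_by_name(trace)
    return max(shares, key=shares.get) if shares else None
```

Under these assumptions, biggest_consumer(load_trace("trace.json")) would be expected to name the same instruction that dominates the pie chart.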

