Preliminary Design of Operator Tiling

A tiling policy is simulated by the for loop of an operator function. When tiling, ensure that each iteration of the for loop processes the same amount of data.

Replace Ascendxxxyy in this document with the actual processor type.

Procedure

The matmul operator is used as an example. This case simulates the scenario where a large matrix multiplication is split into multiplications of smaller matrices. The operator function needs to be implemented based on the user's operator logic design. As mentioned above, the simulation of a tiling policy is reflected in the for loop of the operator function (see the code below). For example, if the matrix multiplication of [160, 240] and [240, 80] is processed on a single core, the two input matrices are split into 25 smaller matrices of [32, 48] and [48, 16], respectively. Because each [32, 16] output tile accumulates five [32, 48] x [48, 16] products along the K dimension, the for loop needs to be executed 125 times, creating GM tensors of size [32, 48] and [48, 16] in each iteration.
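As a sanity check, the tile counts and the loop count follow directly from the shapes. The snippet below is a minimal sketch (plain Python arithmetic, not part of the mskpp API) that derives the 25 tiles per input and the 125 loop iterations:

```python
# Hypothetical helper, not part of mskpp: derive tiling counts from shapes.
M, K, N = 160, 240, 80               # full problem: [160, 240] x [240, 80]
tile_m, tile_k, tile_n = 32, 48, 16  # tile shapes: [32, 48] x [48, 16]

m_tiles = M // tile_m  # 160 / 32 = 5
k_tiles = K // tile_k  # 240 / 48 = 5
n_tiles = N // tile_n  # 80 / 16 = 5

# Each input matrix is cut into 5 x 5 = 25 tiles, and every [32, 16]
# output tile accumulates over 5 K-steps, so the loop body runs
# 5 * 5 * 5 = 125 times.
print(m_tiles * k_tiles)            # tiles of the left matrix: 25
print(k_tiles * n_tiles)            # tiles of the right matrix: 25
print(m_tiles * n_tiles * k_tiles)  # loop iterations: 125
```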
from mskpp import mmad, Tensor, Chip
def my_mmad(gm_x, gm_y, gm_z):
    # Basic data paths for matrix multiplication:
    # Left matrix A: GM-L1-L0A
    # Right matrix B: GM-L1-L0B
    # Result matrix C: L0C (initialized)-GM
    l1_x = Tensor("L1")
    l1_y = Tensor("L1")
    l1_x.load(gm_x)
    l1_y.load(gm_y)
    x = Tensor("L0A")
    y = Tensor("L0B")
    x.load(l1_x)
    y.load(l1_y)
    z = Tensor("L0C", "FP32", [32, 16], format="NC1HWC0")
    out = mmad(x, y, z, True)() # The output needs to be returned.
    z = out[0]
    return z

if __name__ == '__main__':
    with Chip("Ascendxxxyy") as chip:
        chip.enable_trace()    # Enable the operator simulation pipeline chart function to generate the trace.json file.
        chip.enable_metrics()   # Enable single-instruction and pipeline statistics to generate the Instruction_statistic.csv and Pipe_statistic.csv files.
        # Here comes the processing logic for data tiling, which involves breaking down a large block of GM data into smaller blocks and transferring them in batches.
        # Buffer sharding and multi-buffer transfer are covered by the tiling policy. Here, we simulate the single-buffer scenario.
        # Multiply the matrices of [160, 240] and [240, 80] by splitting them into 25 small matrices of [32, 48] and [48, 16], respectively, and processing them in batches.
        for _ in range(125):
            in_x = Tensor("GM", "FP16", [32, 48], format="ND")
            in_y = Tensor("GM", "FP16", [48, 16], format="ND")
            in_z = Tensor("GM", "FP32", [32, 16], format="NC1HWC0")
            out_z = my_mmad(in_x, in_y, in_z)
            in_z.load(out_z)
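The flat range(125) loop above can equivalently be written as a triple loop nest over the M, N, and K tile indices, which makes the mapping from iterations to output tiles explicit. The sketch below shows only this iteration structure; the tile-origin variables are illustrative and are not consumed by the simplified mskpp example, which creates fresh GM tensors each iteration instead of addressing sub-blocks:

```python
# The flat range(125) loop maps to a triple loop nest over tile indices.
count = 0
for m in range(160 // 32):          # 5 row tiles of the [160, 80] output
    for n in range(80 // 16):       # 5 column tiles of the output
        for k in range(240 // 48):  # 5 accumulation steps along K
            # Illustrative origins of each tile in the full GM matrices.
            x_tile_origin = (m * 32, k * 48)  # [32, 48] tile of the left matrix
            y_tile_origin = (k * 48, n * 16)  # [48, 16] tile of the right matrix
            z_tile_origin = (m * 32, n * 16)  # [32, 16] tile of the result
            count += 1
print(count)  # 125 iterations, matching range(125) in the example
```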