Auto Tuning Example

Auto Tuning Process

The auto tuning process consists of kernel-level auto tuning and application-level auto tuning. Figure 1 shows the overall process. For details about each level, see Kernel-Level Auto Tuning and Application-Level Auto Tuning.

Figure 1 Auto tuning process

Kernel-Level Auto Tuning

This section uses examples/00_basic_matmul of the catlass-v1-dev branch in the template library as an example to describe how to use the Python APIs provided by msKPP to implement kernel-level auto tuning.

If an exception occurs at run time, set the following environment variable to view debug logs and retain intermediate files for troubleshooting.

export MSKPP_LOG_LEVEL=0
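The same setting can also be applied from inside the tuning script. The following is a minimal sketch, assuming that msKPP reads MSKPP_LOG_LEVEL from the process environment when it is imported or run; that timing is an assumption and is not confirmed by this section, so the shell export above remains the documented method.

```python
import os

# Assumption: msKPP picks up MSKPP_LOG_LEVEL from the process environment,
# so the variable is set before the module is imported.
os.environ["MSKPP_LOG_LEVEL"] = "0"

# import mskpp  # import only after the variable has been set
```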

After the operator kernel is developed, the kernel function is defined and implemented in the basic_matmul.cpp file, as shown in the following code.

// basic_matmul.cpp
// ...
template <class LayoutA, class LayoutB, class LayoutC>
ACT_GLOBAL void BasicMatmul(
    GemmCoord problemShape,
    GM_ADDR gmA, LayoutA layoutA,
    GM_ADDR gmB, LayoutB layoutB,
    GM_ADDR gmC, LayoutC layoutC
)
{
 // Kernel implementation
}
// ...

Create the Python script file basic_matmul_autotune.py and the compilation script file jit_build.sh in the examples/00_basic_matmul directory by referring to the Appendix.

Define the Python API of the operator kernel function: in the Python script, define the basic_matmul function. Its input parameters must match those of the C++ kernel function, in the same order.

# basic_matmul_autotune.py
import mskpp

def get_kernel():
    kernel_file = "./basic_matmul.cpp"
    kernel_name = "BasicMatmul"
    build_script = "./jit_build.sh" # kernel compile script
    config = mskpp.KernelInvokeConfig(kernel_file, kernel_name)
    gen_file = mskpp.Launcher(config).code_gen()
    kernel = mskpp.compile(build_script=build_script, launch_src_file=gen_file)
    return kernel

def basic_matmul(problem_shape, a, layout_a, b, layout_b, c, layout_c):
    # This function's input arguments must exactly match the kernel function.
    kernel = get_kernel()
    blockdim = 20 # use the correct aic number that matches your hardware
    return kernel[blockdim](problem_shape, a, layout_a, b, layout_b, c, layout_c, device_id=1) # invoke the kernel

Construct the kernel input parameters that are passed to the basic_matmul function.

If a kernel function parameter is of type GM_ADDR, construct the corresponding argument with numpy.array.
If a kernel function parameter is a C++ structure object, define an equivalent structure in Python with ctypes.Structure, using the same field order and field types.

# basic_matmul_autotune.py
import numpy as np
from ctypes import Structure, c_uint32, c_int32, c_int64
class GemmCoord(Structure):
    _fields_ = [("m", c_uint32),
                ("n", c_uint32),
                ("k", c_uint32)]
    def __init__(self, m, n, k):
        super().__init__()
        self.m = (c_uint32)(m)
        self.n = (c_uint32)(n)
        self.k = (c_uint32)(k)
    @staticmethod
    def get_namespace():
        return "Catlass::"
class RowMajor(Structure):
    _fields_ = [("shape", c_int32 * 2),
                ("stride", c_int64 * 2)]
    def __init__(self, rows : int = 0, cols : int = 0, ldm : int = None):
        super().__init__()
        self.shape = (c_int32 * 2)(rows, cols)
        if ldm is None:
            self.stride = (c_int64 * 2)(cols, 1)
        else:
            self.stride = (c_int64 * 2)((c_int64)(ldm), 1)
    @staticmethod
    def get_namespace():
        return "Catlass::layout::"
if __name__ == "__main__":
    m = 256
    n = 512
    k = 1024
    problem_shape = GemmCoord(m, n, k)
    layout_a = RowMajor(m, k)
    layout_b = RowMajor(k, n)
    layout_c = RowMajor(m, n)
    a = np.random.randint(1, 2, [m, k]).astype(np.half)
    b = np.random.randint(1, 2, [k, n]).astype(np.half)
    c = np.zeros([m, n]).astype(np.half)
    basic_matmul(problem_shape, a, layout_a, b, layout_b, c, layout_c)
    # check if the output tensor c is consistent with the golden data
    golden = np.matmul(a, b)
    is_equal = np.array_equal(c, golden)
    result = "success" if is_equal else "failed"
    print("compare {}.".format(result))
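Because the Python structures must mirror the C++ ones field by field, it can help to inspect their memory layout before launching the kernel. The following sketch repeats the two ctypes definitions from the script above (without the constructors) and prints each field's offset and size. The helper name describe is hypothetical, and the expected C++ layout is inferred only from the Python definitions in this section, not from the template library headers.

```python
import ctypes
from ctypes import Structure, c_uint32, c_int32, c_int64

class GemmCoord(Structure):
    _fields_ = [("m", c_uint32), ("n", c_uint32), ("k", c_uint32)]

class RowMajor(Structure):
    _fields_ = [("shape", c_int32 * 2), ("stride", c_int64 * 2)]

def describe(struct_type):
    """Return (field name, byte offset, byte size) for every field of a ctypes structure."""
    return [
        (name, getattr(struct_type, name).offset, getattr(struct_type, name).size)
        for name, _ in struct_type._fields_
    ]

gemm_layout = describe(GemmCoord)
row_major_layout = describe(RowMajor)

for struct_type, layout in ((GemmCoord, gemm_layout), (RowMajor, row_major_layout)):
    print(struct_type.__name__, "size:", ctypes.sizeof(struct_type), "fields:", layout)
```

If the sizes or offsets differ from what the C++ compiler produces for the corresponding structure, the kernel would read the arguments incorrectly, so a mismatch here is worth resolving first.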

Run the Python script. If the following information is displayed, the operator kernel can be started using the Python API.
```
$ python3 basic_matmul_autotune.py
compare success.
```
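The exact comparison with np.array_equal works in this example because both inputs contain only ones, so every output element equals k = 1024, which float16 represents exactly. The sketch below checks that property on a smaller output shape and adds a tolerance-based comparison that may be preferable once the inputs are random. The helper name compare and its default tolerances are assumptions chosen for illustration, not values taken from msKPP.

```python
import numpy as np

def compare(actual, golden, rtol=1e-3, atol=1e-3):
    """Return True if actual matches golden within tolerance, comparing in float32."""
    return bool(np.allclose(actual.astype(np.float32),
                            golden.astype(np.float32),
                            rtol=rtol, atol=atol))

m, n, k = 4, 8, 1024
a = np.ones([m, k], dtype=np.half)
b = np.ones([k, n], dtype=np.half)
golden = np.matmul(a, b)

# Each output element is a sum of k ones, which float16 can represent exactly.
all_exact = bool(np.all(golden == k))
print("golden is exact:", all_exact)
print("within tolerance:", compare(golden, golden))
```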
Mark the parameters to be tuned in the basic_matmul.cpp operator code.
Append // tunable to the end of the line that declares a template parameter. During tuning, the code after the equal sign (=) on that line is replaced.
```
using L1TileShape = GemmShape<128, 256, 256>; // tunable
using L0TileShape = GemmShape<128, 256, 64>; // tunable
```
Alternatively, append // tunable: alias (for example, // tunable: L0Shape) to the end of a code line that needs to be replaced as a whole. The alias is the key used to look up the replacement value in the search space.
using L0TileShape = MatmulShape<128, 256, 64>; // tunable: L0Shape
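The two marking styles can be illustrated with a small text-substitution sketch. This is not the msKPP implementation; the function apply_config and its behavior are assumptions for illustration only: a line marked with // tunable keeps its declaration and has the code after the equal sign replaced, a line marked with // tunable: alias is replaced as a whole, only the first matching line is modified, and an unmatched key raises an error.

```python
import re

MARKER = "// tunable"

def apply_config(source, config):
    """Return a copy of source with each config key applied to its first matching marked line."""
    lines = source.splitlines()
    for key, value in config.items():
        for i, line in enumerate(lines):
            code, found, rest = line.partition(MARKER)
            if not found:
                continue  # the line carries no marker
            if rest.startswith(":"):
                # "// tunable: alias": replace the whole line when the alias equals the key
                if rest[1:].strip() == key:
                    indent = line[: len(line) - len(line.lstrip())]
                    lines[i] = indent + value
                    break
            elif re.search(rf"\b{re.escape(key)}\s*=", code):
                # "// tunable": keep the declaration, replace only the code after "="
                head = code.split("=", 1)[0]
                lines[i] = f"{head}= {value}; {MARKER}"
                break
        else:
            raise ValueError(f"no marked line matches {key!r}")
    return "\n".join(lines)

kernel_src = (
    "using L1TileShape = GemmShape<128, 256, 256>; // tunable\n"
    "using L0TileShape = GemmShape<128, 256, 64>; // tunable\n"
)
tuned_src = apply_config(kernel_src, {
    "L1TileShape": "GemmShape<64, 128, 256>",
    "L0TileShape": "GemmShape<64, 128, 128>",
})
print(tuned_src)
```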

Define the parameter search space through the configs input parameter of the autotune API. For each parameter combination in configs, the marked lines of the operator kernel code are replaced, and the kernel is then compiled and run to collect its performance data. The following is an example of defining the search space:

The replacement values must yield valid code; otherwise, compilation or runtime errors occur.
The parameter replacement principles are as follows (using the first line in configs as an example):
1. For a line marked with // tunable: alias (for example, // tunable: L0Shape), the entire marked line is replaced with the value string that configs provides for that alias.
2. For a line marked with // tunable, only the code after the equal sign (=) is replaced. For example, the code after the equal sign in the L1TileShape declaration is replaced with the L1TileShape value string in configs (GemmShape<128, 256, 256> in the first config).
  - In different scopes, two variables with the same name may be declared. If the two variables both meet the matching rule, only the first variable is modified.
  - If one of the configs fails to be matched, the task corresponding to the config is stopped and an error is reported. However, parameters of the configs that are successfully matched are replaced.

@mskpp.autotune(configs=[ # add and try your own config here for a better kernel performance
    {'L1TileShape': 'GemmShape<128, 256, 256>', 'L0TileShape': 'GemmShape<128, 256, 64>'}, #0 the same config as in basic_matmul.cpp
    {'L1TileShape': 'GemmShape<128, 256, 128>', 'L0TileShape': 'GemmShape<128, 256, 64>'},
    {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 64>'},
    {'L1TileShape': 'GemmShape<64, 128, 128>', 'L0TileShape': 'GemmShape<64, 128, 128>'},
    {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'},
    {'L1TileShape': 'GemmShape<64, 128, 512>', 'L0TileShape': 'GemmShape<64, 128, 128>'},
    {'L1TileShape': 'GemmShape<64, 64, 128>', 'L0TileShape': 'GemmShape<64, 64, 128>'},
    {'L1TileShape': 'GemmShape<64, 64, 256>', 'L0TileShape': 'GemmShape<64, 64, 128>'},
    {'L1TileShape': 'GemmShape<64, 64, 512>', 'L0TileShape': 'GemmShape<64, 64, 128>'},
    {'L1TileShape': 'GemmShape<128, 128, 128>', 'L0TileShape': 'GemmShape<128, 128, 128>'},
    {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 128>'},
    {'L1TileShape': 'GemmShape<128, 128, 512>', 'L0TileShape': 'GemmShape<128, 128, 128>'},
], warmup=1000, repeat=10, device_ids=[0]) # set kernel warmup 1000us
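Hand-written config lists become error-prone as the search space grows. As a sketch, the list can also be generated with itertools.product. The helper make_configs and its parameters are hypothetical; which tile shapes are actually valid depends on the hardware and the kernel and is not checked here.

```python
from itertools import product

def make_configs(l1_mn, l1_k, l0_k):
    """Build one config dict for every combination of an (M, N) pair and an L1 K value."""
    configs = []
    for (m, n), k1 in product(l1_mn, l1_k):
        configs.append({
            "L1TileShape": f"GemmShape<{m}, {n}, {k1}>",
            "L0TileShape": f"GemmShape<{m}, {n}, {l0_k}>",
        })
    return configs

configs = make_configs(
    l1_mn=[(64, 128), (64, 64), (128, 128)],
    l1_k=[128, 256, 512],
    l0_k=128,
)
for config in configs:
    print(config)
```

With these inputs the helper produces the last nine entries of the configs list above, in the same order; the resulting list can then be passed as the configs argument.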

Run basic_matmul_autotune.py to execute the operator with every parameter combination and obtain the time consumed by each one, together with the optimal parameter set. The following is one possible command output; actual timings vary by environment.

# python3 basic_matmul_autotune.py 
No.0: 22.562μs, {'L1TileShape': 'GemmShape<128, 256, 256>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
No.1: 22.109μs, {'L1TileShape': 'GemmShape<128, 256, 128>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
No.2: 17.778μs, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 64>'}
No.3: 15.378μs, {'L1TileShape': 'GemmShape<64, 128, 128>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.4: 14.982μs, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.5: 15.671μs, {'L1TileShape': 'GemmShape<64, 128, 512>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.6: 19.592μs, {'L1TileShape': 'GemmShape<64, 64, 128>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
No.7: 18.340μs, {'L1TileShape': 'GemmShape<64, 64, 256>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
No.8: 18.541μs, {'L1TileShape': 'GemmShape<64, 64, 512>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
No.9: 20.652μs, {'L1TileShape': 'GemmShape<128, 128, 128>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
No.10: 17.728μs, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
No.11: 17.637μs, {'L1TileShape': 'GemmShape<128, 128, 512>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
Best config: No.4
compare success.

The comparison shows that No.4 has the shortest execution time (14.982 μs) and is therefore the optimal parameter set.
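If the output is captured to a file or a string, the best configuration can be recovered programmatically. The following sketch assumes the line format shown in the sample output above (No.<index>: <time>μs, <config dict>); the function parse_results is hypothetical and not part of msKPP.

```python
import ast
import re

# Accepts "μs" (Greek mu or micro sign) as well as the plain "us" spelling.
LINE_RE = re.compile(r"No\.(\d+):\s*([\d.]+)\s*[\u03bc\u00b5u]s,\s*(\{.*\})")

def parse_results(output):
    """Return a list of (index, time in microseconds, config dict) parsed from the tuning output."""
    results = []
    for line in output.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            index, micros, config = match.groups()
            results.append((int(index), float(micros), ast.literal_eval(config)))
    return results

sample_output = """\
No.3: 15.378μs, {'L1TileShape': 'GemmShape<64, 128, 128>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.4: 14.982μs, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.5: 15.671μs, {'L1TileShape': 'GemmShape<64, 128, 512>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
Best config: No.4
"""

results = parse_results(sample_output)
best = min(results, key=lambda item: item[1])
print("best index:", best[0], "time:", best[1])
```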

Application-Level Auto Tuning

This section uses examples/00_basic_matmul of the master branch in the template library as an example to describe how to use the Python APIs provided by msKPP to implement application-level auto tuning.

If an exception occurs at run time, set the following environment variable to view debug logs and retain intermediate files for troubleshooting.

export MSKPP_LOG_LEVEL=0

Implement the operator with the Device-layer API of the template library by referring to the examples/00_basic_matmul sample, and append the // tunable comment to the end of lines 115 and 117 so that the code after the equal sign (=) on those lines is replaced during tuning.
```
...
115 using L1TileShape = GemmShape<128, 256, 256>; // tunable
116   
117 using L0TileShape = GemmShape<128, 256, 64>; // tunable
...
```
Create the Python script file basic_matmul_executable_autotune.py and the compilation script file jit_build_executable.sh in the examples/00_basic_matmul directory.
To search custom tiling parameter combinations, modify the configs parameter passed to the autotune_v2 API in the basic_matmul_executable_autotune.py script.

Run the Python script basic_matmul_executable_autotune.py to obtain the time consumed by each parameter combination and the optimal parameter set. The following shows only one possible command output.

# python3 basic_matmul_executable_autotune.py
No.0: 64.081 us, {'L1TileShape': 'GemmShape<128, 256, 256>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
No.1: 68.041 us, {'L1TileShape': 'GemmShape<256, 128, 256>', 'L0TileShape': 'GemmShape<256, 128, 64>'}
No.2: 60.701 us, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 64>'}
No.3: 61.121 us, {'L1TileShape': 'GemmShape<128, 128, 512>', 'L0TileShape': 'GemmShape<128, 128, 64>'}
No.4: 62.361 us, {'L1TileShape': 'GemmShape<64, 256, 128>', 'L0TileShape': 'GemmShape<64, 256, 64>'}
No.5: 60.661 us, {'L1TileShape': 'GemmShape<64, 256, 256>', 'L0TileShape': 'GemmShape<64, 256, 64>'}
No.6: 58.261 us, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 64>'}
No.7: 62.381 us, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
No.8: 62.621 us, {'L1TileShape': 'GemmShape<128, 128, 512>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
No.9: 57.501 us, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.10: 59.281 us, {'L1TileShape': 'GemmShape<64, 128, 512>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.11: 65.041 us, {'L1TileShape': 'GemmShape<128, 64, 512>', 'L0TileShape': 'GemmShape<128, 64, 128>'}
No.12: 63.561 us, {'L1TileShape': 'GemmShape<64, 64, 256>', 'L0TileShape': 'GemmShape<64, 64, 256>'}
No.13: 65.121 us, {'L1TileShape': 'GemmShape<64, 64, 512>', 'L0TileShape': 'GemmShape<64, 64, 256>'}
No.14: 65.081 us, {'L1TileShape': 'GemmShape<64, 64, 1024>', 'L0TileShape': 'GemmShape<64, 64, 256>'}
Best config: No.9
autotune results saved in MSKPP_AUTOTUNE_RESULTS_20250604195710.csv

The comparison shows that No.9 has the shortest execution time (57.501 us) and is therefore the optimal parameter set.
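To put the result in perspective, the best time can be compared with No.0, whose tile shapes are the same as those on the original lines 115 and 117. The following is a small worked calculation using the timings from the sample output above; the function names are chosen for illustration.

```python
def speedup(baseline_us, tuned_us):
    """Ratio of baseline time to tuned time (greater than 1 means the tuned config is faster)."""
    return baseline_us / tuned_us

def time_saved_percent(baseline_us, tuned_us):
    """Share of the baseline time removed by tuning, in percent."""
    return (baseline_us - tuned_us) / baseline_us * 100.0

baseline_us = 64.081  # No.0, the configuration already present in the source
best_us = 57.501      # No.9, the best configuration in the sample output

print(f"speedup: {speedup(baseline_us, best_us):.3f}x")
print(f"time saved: {time_saved_percent(baseline_us, best_us):.1f}%")
```

For this sample run the gain is about 1.11x, or roughly 10% less time per invocation. These figures describe only the sample output shown here and will differ on other hardware or problem sizes.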
