Quick Start
This section uses the single-operator sample 00_basic_matmul to show how to get started with the kernel-level auto tuning function of msKPP.
Procedure
- Run the following command to download the Ascend C template library (catlass):
git clone https://gitcode.com/cann/catlass.git -b catlass-v1-stable
- Go to the 00_basic_matmul sample code directory in the template library.
cd catlass/examples/00_basic_matmul
- Edit basic_matmul.cpp and append a // tunable comment to the end of the L1TileShape and L0TileShape declaration lines.
// basic_matmul.cpp
...
using L1TileShape = GemmShape<128, 256, 256>; // tunable
using L0TileShape = GemmShape<128, 256, 64>; // tunable
...
- Save the Python script basic_matmul_autotune.py and the build script jit_build.sh, both provided in the appendix, to the 00_basic_matmul directory.
- Run the sample script basic_matmul_autotune.py.
$ python3 basic_matmul_autotune.py
No.0: 22.562μs, {'L1TileShape': 'GemmShape<128, 256, 256>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
No.1: 22.109μs, {'L1TileShape': 'GemmShape<128, 256, 128>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
No.2: 17.778μs, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 64>'}
No.3: 15.378μs, {'L1TileShape': 'GemmShape<64, 128, 128>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.4: 14.982μs, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.5: 15.671μs, {'L1TileShape': 'GemmShape<64, 128, 512>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.6: 19.592μs, {'L1TileShape': 'GemmShape<64, 64, 128>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
No.7: 18.340μs, {'L1TileShape': 'GemmShape<64, 64, 256>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
No.8: 18.541μs, {'L1TileShape': 'GemmShape<64, 64, 512>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
No.9: 20.652μs, {'L1TileShape': 'GemmShape<128, 128, 128>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
No.10: 17.728μs, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
No.11: 17.637μs, {'L1TileShape': 'GemmShape<128, 128, 512>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
Best config: No.4
compare success.
The output shows that, of the 12 configurations tested, setting L1TileShape to GemmShape<64, 128, 256> and L0TileShape to GemmShape<64, 128, 128> in basic_matmul.cpp delivers the best performance (14.982 μs).
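If you save the tuning output to a file, a small script can pick the fastest configuration and compare it with the starting point. The following is a minimal sketch, not part of msKPP: the helper names parse_results and summarize are hypothetical, the line format is inferred from the sample run above, and treating No.0 as the baseline is an assumption based on its values matching the original declarations in basic_matmul.cpp.

```python
import ast
import re

# A few lines copied from the sample run above; in practice, read the saved output instead.
LOG = """\
No.0: 22.562μs, {'L1TileShape': 'GemmShape<128, 256, 256>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
No.3: 15.378μs, {'L1TileShape': 'GemmShape<64, 128, 128>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.4: 14.982μs, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.11: 17.637μs, {'L1TileShape': 'GemmShape<128, 128, 512>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
Best config: No.4
"""

# Assumed line format: "No.<index>: <time>μs, {<config dict>}".
# Both the Greek mu (U+03BC) and the micro sign (U+00B5) are accepted.
LINE_RE = re.compile(r"^No\.(\d+):\s*([\d.]+)\s*[\u03bc\u00b5]s,\s*(\{.*\})\s*$")


def parse_results(text):
    """Return a list of (index, time_us, config) tuples; other lines are skipped."""
    results = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            index = int(match.group(1))
            time_us = float(match.group(2))
            config = ast.literal_eval(match.group(3))  # safe parse of the dict literal
            results.append((index, time_us, config))
    return results


def summarize(results, baseline_index=0):
    """Return (best_index, best_config, speedup of best over the baseline entry)."""
    times = {index: time_us for index, time_us, _ in results}
    best_index, best_time, best_config = min(results, key=lambda r: r[1])
    return best_index, best_config, times[baseline_index] / best_time


results = parse_results(LOG)
best_index, best_config, speedup = summarize(results)
print(f"Best: No.{best_index} {best_config} ({speedup:.2f}x faster than No.0)")
```

For the sample run, this selects No.4, in agreement with the Best config line printed by the tuning script, and shows that it runs roughly 1.5 times faster than the No.0 configuration.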