Quick Start

This section uses the single-operator sample 00_basic_matmul to show how to get started with the kernel-level autotuning feature of msKPP.

Procedure

  1. Run the following command to download the Ascend C template library (catlass):
    git clone https://gitcode.com/cann/catlass.git -b catlass-v1-stable
  2. Go to the 00_basic_matmul sample code directory in the template library.
    cd catlass/examples/00_basic_matmul
  3. Edit basic_matmul.cpp and append the comment // tunable to the end of the L1TileShape and L0TileShape declaration lines.
    // basic_matmul.cpp
    ...
    using L1TileShape = GemmShape<128, 256, 256>; // tunable
    using L0TileShape = GemmShape<128, 256, 64>; // tunable
    ...
  4. Save the Python script basic_matmul_autotune.py and the build script jit_build.sh, both provided in the appendix, to the 00_basic_matmul directory.
  5. Run the sample script basic_matmul_autotune.py.
    $ python3 basic_matmul_autotune.py 
    No.0: 22.562μs, {'L1TileShape': 'GemmShape<128, 256, 256>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
    No.1: 22.109μs, {'L1TileShape': 'GemmShape<128, 256, 128>', 'L0TileShape': 'GemmShape<128, 256, 64>'}
    No.2: 17.778μs, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 64>'}
    No.3: 15.378μs, {'L1TileShape': 'GemmShape<64, 128, 128>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
    No.4: 14.982μs, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
    No.5: 15.671μs, {'L1TileShape': 'GemmShape<64, 128, 512>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
    No.6: 19.592μs, {'L1TileShape': 'GemmShape<64, 64, 128>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
    No.7: 18.340μs, {'L1TileShape': 'GemmShape<64, 64, 256>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
    No.8: 18.541μs, {'L1TileShape': 'GemmShape<64, 64, 512>', 'L0TileShape': 'GemmShape<64, 64, 128>'}
    No.9: 20.652μs, {'L1TileShape': 'GemmShape<128, 128, 128>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
    No.10: 17.728μs, {'L1TileShape': 'GemmShape<128, 128, 256>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
    No.11: 17.637μs, {'L1TileShape': 'GemmShape<128, 128, 512>', 'L0TileShape': 'GemmShape<128, 128, 128>'}
    Best config: No.4
    compare success.

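    The output shows that No.4 has the shortest measured time (14.982 μs) of the 12 configurations tested. Setting L1TileShape to GemmShape<64, 128, 256> and L0TileShape to GemmShape<64, 128, 128> in basic_matmul.cpp therefore delivers the best performance among them.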
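
If you save the script output to a file or capture it from a pipeline, the result lines can be checked programmatically. The following is a minimal sketch, not part of msKPP: the helper names parse_results and best_config are hypothetical, and the line format is assumed to match the sample output shown in step 5 exactly.

```python
import ast
import re

# Assumed line format, taken from the sample output in step 5:
#   No.4: 14.982μs, {'L1TileShape': '...', 'L0TileShape': '...'}
# Both the Greek mu (U+03BC) and the micro sign (U+00B5) are accepted.
RESULT_RE = re.compile(r"^\s*No\.(\d+):\s*([\d.]+)\s*[\u03bc\u00b5]s,\s*(\{.*\})\s*$")


def parse_results(log_text):
    """Return a list of (index, time_us, config) tuples found in the log text."""
    results = []
    for line in log_text.splitlines():
        match = RESULT_RE.match(line)
        if match:
            index = int(match.group(1))
            time_us = float(match.group(2))
            # The config is printed as a Python dict literal, so literal_eval
            # can parse it without executing arbitrary code.
            config = ast.literal_eval(match.group(3))
            results.append((index, time_us, config))
    return results


def best_config(results):
    """Return the (index, time_us, config) tuple with the smallest time."""
    return min(results, key=lambda item: item[1])


# Three lines copied from the sample output above, plus the summary lines,
# which the regular expression ignores.
sample_log = """\
No.3: 15.378\u03bcs, {'L1TileShape': 'GemmShape<64, 128, 128>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.4: 14.982\u03bcs, {'L1TileShape': 'GemmShape<64, 128, 256>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
No.5: 15.671\u03bcs, {'L1TileShape': 'GemmShape<64, 128, 512>', 'L0TileShape': 'GemmShape<64, 128, 128>'}
Best config: No.4
compare success.
"""

if __name__ == "__main__":
    index, time_us, config = best_config(parse_results(sample_log))
    print(f"Fastest: No.{index} at {time_us} us with {config}")
```

Selecting by minimum time mirrors the "Best config" line that the script prints, so the sketch is mainly useful for cross-checking a saved log or for comparing results across several runs.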