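Debugging an Operator Invoked Through the PyTorch Interface

This section shows how to use the msdebug tool to debug, on the board, an add operator invoked through the PyTorch interface. The add operator adds two vectors and outputs the result.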

Prerequisites

Procedure

  1. Build the operator.

    1. Go to the AddCustom operator project and change the build configuration in AddCustom/CMakePresets.json from Release to Debug.
      # CMakePresets.json
      ...
          "configurePresets": [
              {
                  "name": "default",
                  ...
                      "CMAKE_BUILD_TYPE": {
                          "type": "STRING",
                          "value": "Debug" # Build configuration changed to Debug
                      }
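    2. Run the build.sh script to build and deploy the operator. After a successful build, the build_out directory is created in the current directory, and the custom operator installation package custom_opp_<target os>_<target architecture>.run is generated in it.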
      bash build.sh
      bash ./build_out/custom_opp_ubuntu_aarch64.run  
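    3. Go to the PyTorch integration project, which invokes the AddCustom operator project through the PyTorch invocation method, and complete the build by following the guide.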
      PytorchInvocation
      ├── op_plugin_patch         
      ├── run_op_plugin.sh      //  Used when running the sample in substep 4
      └── test_ops_custom.py    //  Used when starting the tool in step 3
    4. Run the sample. The sample automatically generates test data, runs the PyTorch sample, and then verifies the result.
      bash run_op_plugin.sh
      -- CMAKE_CCE_COMPILER: /usr/local/Ascend/ascend-toolkit/latest/toolkit/tools/ccec_compiler/bin/ccec
      -- CMAKE_CURRENT_LIST_DIR: ${INSTALL_DIR}/AddKernelInvocation/cmake/Modules
      -- ASCEND_PRODUCT_TYPE:
        Ascendxxxyy
      -- ASCEND_CORE_TYPE:
        VectorCore
      -- ASCEND_INSTALL_PATH:
        /usr/local/Ascend/ascend-toolkit/latest
      -- The CXX compiler identification is GNU 10.3.1
      -- Detecting CXX compiler ABI info
      -- Detecting CXX compiler ABI info - done
      -- Check for working CXX compiler: /usr/bin/c++ - skipped
      -- Detecting CXX compile features
      -- Detecting CXX compile features - done
      -- Configuring done
      -- Generating done
      -- Build files have been written to: ${INSTALL_DIR}/AddKernelInvocation/build
      Scanning dependencies of target add_npu
      [ 33%] Building CCE object cmake/npu/CMakeFiles/add_npu.dir/__/__/add_custom.cpp.o
      [ 66%] Building CCE object cmake/npu/CMakeFiles/add_npu.dir/__/__/main.cpp.o
      [100%] Linking CCE executable ../../../add_npu
      [100%] Built target add_npu
      ${INSTALL_DIR}/AddKernelInvocation
      INFO: compile op on ONBOARD succeed!
      INFO: execute op on ONBOARD succeed!
      test pass
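
The Release-to-Debug change from substep 1 can also be scripted. The following is a minimal sketch, not part of the AddCustom project: the function name set_build_type and the sample dictionary are hypothetical, and it assumes that CMAKE_BUILD_TYPE sits under a "cacheVariables" key, as in the standard CMake presets layout (the snippet above elides that level).

```python
import copy
import json


def set_build_type(presets, preset_name="default", build_type="Debug"):
    """Return a copy of `presets` with CMAKE_BUILD_TYPE set for one configure preset."""
    updated = copy.deepcopy(presets)
    for preset in updated.get("configurePresets", []):
        if preset.get("name") != preset_name:
            continue
        # Assumption: the variable lives under "cacheVariables", as in the
        # standard CMake presets schema; the original snippet elides this level.
        cache = preset.setdefault("cacheVariables", {})
        entry = cache.get("CMAKE_BUILD_TYPE")
        if isinstance(entry, dict):
            entry["value"] = build_type
        else:
            cache["CMAKE_BUILD_TYPE"] = {"type": "STRING", "value": build_type}
        return updated
    raise KeyError(f"configure preset {preset_name!r} not found")


# Hypothetical, trimmed-down stand-in for AddCustom/CMakePresets.json.
example = {
    "configurePresets": [
        {
            "name": "default",
            "cacheVariables": {
                "CMAKE_BUILD_TYPE": {"type": "STRING", "value": "Release"}
            },
        }
    ]
}

debug_presets = set_build_type(example)
print(json.dumps(debug_presets, indent=4))
```

To apply this to the real file, load it with json.load and write the result back with json.dump. Note that strict JSON does not allow the "#" comment shown in the snippet above, so the file on disk should not contain it.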

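  2. Manually import the operator debugging information.

    • Replace ${INSTALL_DIR} with the path where the CANN software files are stored after installation. For example, if the Ascend-cann-toolkit package is installed, the path is $HOME/Ascend/ascend-toolkit/latest.
    • Replace SOC_VERSION with the model of the Ascend AI processor. To obtain it, run the npu-smi info command on the server where the Ascend AI processor is installed and read the Chip Name field. The value to configure is Ascend followed by the Chip Name. For example, if Chip Name is xxxyy, the value to configure is Ascendxxxyy.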
    (msdebug) export LAUNCH_KERNEL_PATH=${INSTALL_DIR}/opp/vendors/customize/op_impl/ai_core/tbe/kernel/SOC_VERSION/add_custom/AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o
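
The two substitutions described in this step (${INSTALL_DIR} and SOC_VERSION) can be sketched as a small helper that assembles the LAUNCH_KERNEL_PATH value. This is only an illustration: the function names are hypothetical, the install directory below is a placeholder, and the directory layout is copied from the path shown above rather than taken from any official API.

```python
import posixpath


def soc_version_from_chip_name(chip_name):
    """Build the SOC_VERSION value: "Ascend" followed by the Chip Name from npu-smi info."""
    return "Ascend" + chip_name.strip()


def launch_kernel_path(install_dir, chip_name, kernel_object, op_dir="add_custom"):
    """Assemble the path to the operator's kernel object file.

    The directory layout mirrors the path shown in this step; it is an
    assumption that other custom operators follow the same layout.
    """
    return posixpath.join(
        install_dir,
        "opp", "vendors", "customize", "op_impl", "ai_core", "tbe", "kernel",
        soc_version_from_chip_name(chip_name),
        op_dir,
        kernel_object,
    )


# Placeholder values: substitute the real install path and Chip Name.
path = launch_kernel_path(
    "/home/user/Ascend/ascend-toolkit/latest",
    "xxxyy",
    "AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o",
)
print(f"export LAUNCH_KERNEL_PATH={path}")
```

The printed line has the same shape as the export command in this step, with both placeholders filled in.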

  3. Start the msdebug tool to launch the Python program and enter the debugging interface.

    msdebug python3 test_ops_custom.py
    (msdebug) target create "python3"
    Current executable set to '/home/mindstudio/miniconda3/envs/py37/bin/python3' (aarch64).
    (msdebug) settings set -- target.run-args  "test_ops_custom.py"
    (msdebug)

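  4. Set a breakpoint.

    Set an NPU breakpoint in the kernel function by specifying the source file and line number.
    • If the host side and the kernel side contain operator implementation files with the same name, use an absolute path when setting the breakpoint so that it lands in the intended file.
    • When you set a breakpoint in a host-side source file, a warning that no actual location can be resolved may appear, similar to the following: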
      (msdebug) b /home/xx/op_host/add_custom.cpp:24
      Breakpoint 1: no locations (pending).
      WARNING:  Unable to resolve breakpoint to any actual locations.
      (msdebug)

      After the operator starts running, the actual location is found automatically and the breakpoint is set automatically.

    (msdebug) b add_custom.cpp:60
    Breakpoint 1: where = AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o`::AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b_1(uint8_t *, uint8_t *, uint8_t *, uint8_t *, uint8_t *) + 9912 [inlined] KernelAdd::Compute(int) + 3400 at add_custom.cpp:60:9, address = 0x00000000000026b8
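
When msdebug sessions are logged or scripted, it can help to tell a pending breakpoint apart from a resolved one. The sketch below parses the two message shapes shown in this step. The function name classify_breakpoint and the regular expressions are assumptions derived only from the sample output above, so other msdebug versions may print different text.

```python
import re

# Patterns derived from the sample output in this step (an assumption, not a
# documented msdebug format).
PENDING = re.compile(r"^Breakpoint (\d+): no locations \(pending\)\.$")
RESOLVED = re.compile(
    r"^Breakpoint (\d+): where = .* at (\S+?):(\d+)(?::(\d+))?, "
    r"address = (0x[0-9a-fA-F]+)$"
)


def classify_breakpoint(line):
    """Return a dict describing a breakpoint message, or None if it is not one."""
    line = line.strip()
    match = PENDING.match(line)
    if match:
        return {"id": int(match.group(1)), "status": "pending"}
    match = RESOLVED.match(line)
    if match:
        column = match.group(4)
        return {
            "id": int(match.group(1)),
            "status": "resolved",
            "file": match.group(2),
            "line": int(match.group(3)),
            "column": int(column) if column is not None else None,
            "address": match.group(5),
        }
    return None


pending = classify_breakpoint("Breakpoint 1: no locations (pending).")
resolved = classify_breakpoint(
    "Breakpoint 1: where = AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o`"
    "::AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b_1(uint8_t *, uint8_t *, "
    "uint8_t *, uint8_t *, uint8_t *) + 9912 [inlined] KernelAdd::Compute(int) "
    "+ 3400 at add_custom.cpp:60:9, address = 0x00000000000026b8"
)
print(pending)
print(resolved)
```

A pending result for a host-side file is expected before the operator runs, as noted above. A pending result for a kernel-side file may indicate that the debugging information from step 2 was not imported.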

  5. Run the program and wait until the breakpoint is hit.

    (msdebug) r
    Process 197189 launched: '/home/miniconda3/envs/py38/bin/python3' (aarch64)
    Process 197189 stopped and restarted: thread 1 received signal: SIGCHLD
    ...
    [Launch of Kernel anonymous on Device 0]
    Process 197189 stopped
    [Switching to focus on Kernel anonymous, CoreId 8, Type aiv]
    * thread #1, name = 'python3', stop reason = breakpoint 2.1
        frame #0: 0x00000000000026b8 AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b.o`::AddCustom_1e04ee05ab491cc5ae9c3d5c9ee8950b_1(uint8_t *, uint8_t *, uint8_t *, uint8_t *, uint8_t *) [inlined] KernelAdd::Compute(this=0x000000000020efb8, progress=1) at add_custom.cpp:60:9
       57              LocalTensor<DTYPE_Y> yLocal = inQueueY.DeQue<DTYPE_Y>();
       58              LocalTensor<DTYPE_Z> zLocal = outQueueZ.AllocTensor<DTYPE_Z>();
       59              Add(zLocal, xLocal, yLocal, this->tileLength);
    -> 60              outQueueZ.EnQue<DTYPE_Z>(zLocal);
       61              inQueueX.FreeTensor(xLocal);
       62              inQueueY.FreeTensor(yLocal);
       63          }
    (msdebug)

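    Other debugging operations work the same way as described in Importing Debugging Information, Printing Memory and Variables, msdebug Debugging Information Display, and Core Switching.

  6. Delete the breakpoint. For details, see Deleting a Line Breakpoint.
  7. After debugging is complete, run the q command and enter Y or y to end the debugging session.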

    (msdebug) q
    Quitting LLDB will kill one or more processes. Do you really want to proceed: [Y/n] y