
Single-Operator Execution Error

Analysis Result

If info.txt gives the following conclusion, the failure is a single-operator execution error or a pass exception.

**********************Root cause conclusion******************
Single op test aicore error, please check op.
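
As a minimal sketch, the check for this conclusion can be scripted when many info.txt files have to be triaged. The function name `is_single_op_error` is hypothetical and not part of msaicerr; it only looks for the two strings shown in the sample above.

```python
def is_single_op_error(info_text: str) -> bool:
    """Return True if the text contains the single-op conclusion shown above.

    Hypothetical helper: it only matches the banner and the conclusion
    sentence from the sample output, nothing else.
    """
    banner_seen = False
    for line in info_text.splitlines():
        if "Root cause conclusion" in line:
            banner_seen = True
        elif banner_seen and "Single op test aicore error" in line:
            return True
    return False


sample = (
    "**********************Root cause conclusion******************\n"
    "Single op test aicore error, please check op.\n"
)
print(is_single_op_error(sample))  # True
```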

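When info.txt reports this conclusion, use the following information from info.txt to handle the problem by reproducing it with the single operator. You can also run python3 test_single_op.py to see the errors reported while the faulty operator executes, which helps in analyzing why it fails.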

  1. Operator information given in Basic information, for example:
    ***********************1. Basic information********************
    error time   : 2023-06-09-06:55:34.798.772
    device id    : 0
    core id      : 0
    task id      : 6
    stream id    : 7
    node name    : GatherV2
    kernel name  : te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1
  2. Detailed error information given in AICERROR code, for example:
    ***********************2. AICERROR code***********************
    error code  : 0x10
    error bits :
    CCU_ERR_INFO: 0x2c6290000324442
    ccu_err_addr bit[22:8]=011001001000100  meaning:CCU Error Address [17:3]  approximate:0x19220
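    ccu_illegal_instr, illegal execution: 1. the instruction binary is wrong  2. the instruction address is not aligned
  3. The approximate CCE line number of the error given in Instructions, for example: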
    ***********************3. Instructions************************
    start   pc   : 0x1000124080064000
    current pc   : 0x124080067d2c
    instruction  :
    Error occured most likely at line: 3d08
    
    /home/HwHiAiUser/tf/info_20230609065654/aicerror_0_20230609065534/te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1.o.txt:3d08
    te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1.cce:1364
    /usr/local/Ascend/latest/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/gather_v2.py:1214
    /usr/local/Ascend/latest/opp/built-in/op_impl/ai_core/tbe/impl/dynamic/gather_v2.py:1214
    
    related instructions (error occured before the mark *):
    
    3d08: <not available>
    3d0c: <not available>
    3d10: <not available>
    3d14: <not available>
    3d18: <not available>
    3d1c: <not available>
    3d20: <not available>
    3d24: <not available>
    3d28: <not available>
    *   3d2c: <not available>
    
    For complete instructions, please view /home/tf/info_20230609065654/aicerror_0_20230609065534/te_gatherv2_657cb48fa1743a43209d7bc779fe8c294760a5b09b3079a3323fdf18376fc408_1.o.txt
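
The bit-field arithmetic behind the sample values above can be reproduced by hand. The sketch below is not msaicerr code. It takes bit[22:8] of CCU_ERR_INFO, treats it as address bits [17:3] (so shifts left by 3), and computes the offset of current pc from start pc. Masking both pc values to their low 48 bits is an assumption made only so that the sample start pc (which has extra high bits set) lines up with the sample current pc. Both function names are hypothetical.

```python
from typing import Tuple


def decode_ccu_err_addr(ccu_err_info: int) -> Tuple[str, int]:
    """Extract bit[22:8] of CCU_ERR_INFO and rebuild the approximate address."""
    field = (ccu_err_info >> 8) & 0x7FFF   # 15 bits: bit 22 down to bit 8
    bits = format(field, "015b")           # zero-padded binary string
    approx_addr = field << 3               # the field holds address bits [17:3]
    return bits, approx_addr


def pc_offset(start_pc: int, current_pc: int, addr_bits: int = 48) -> int:
    """Offset of current pc from start pc; the 48-bit mask is an assumption."""
    mask = (1 << addr_bits) - 1
    return (current_pc & mask) - (start_pc & mask)


bits, addr = decode_ccu_err_addr(0x2C6290000324442)
print(bits, hex(addr))                                      # 011001001000100 0x19220
print(hex(pc_offset(0x1000124080064000, 0x124080067D2C)))   # 0x3d2c
```

With the sample values, the decoded address matches the "approximate:0x19220" field, and the pc offset 0x3d2c matches the instruction marked with * in the listing.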

Root Cause

This problem may be caused by any of the following:

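  • A logic error in the operator itself, such as reading or writing unallocated addresses.
  • A compiler error: the operator logic is correct, but the compiler generates incorrect low-level assembly instructions, so the chip reports an AI Core error.
  • Incorrect input data. Typically the input data is used as an index, and a wrong index causes the operator to make an illegal access.
  • A problem in computing or selecting the tilingdata.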

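    The msaicerr tool can parse tilingdata in one step. Go to the tools/msaicerr directory under the Toolkit installation path (${install_path}/tools/msaicerr) and run the following command, replacing the plog log file path with the actual one: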

    python3 msaicerr.py -t plog_xxxxxx.log

    The tool output is as follows:

    root@ubuntu:/home/HwHiAiUser/tf/aicerr/asys_output_20240124171303003/dfx/log/host/cann/debug/plog# python3 /usr/local/Ascend/latest/tools/msaicerr/msaicerr.py -t plog-141054_20240124171303866.log
    2024-04-25 11:30:43 (488) - [INFO] The tool directory will be used to as the output address of the analysis report. 
    2024-04-25 11:30:43 (488) - [INFO] Total device count: 1
    2024-04-25 11:30:43 (488) - [INFO] Start to get tiling date.
    2024-04-25 11:30:43 (488) - [INFO] Failed to execute command:['grep', 'exception info dump args data', '-inrE', 'plog-141054_20240124171303866.log'].
    2024-04-25 11:30:43 (488) - [INFO] Unable to get L0 dump log, start to guess tilingptr index.
    2024-04-25 11:30:43 (488) - [INFO] Find tiling_data index: 13
    2024-04-25 11:30:43 (488) - [INFO] get args: ['0x1240413d0000', '0x12404133e0000',  '0x12404133c0000',  '0x124100090068', '0', '0', '0', '0', '0', '0', '0', '0', '0x1', '0x1', '0x40', '0x1', '0x2', '0x1', '0x1'], tiling_ptr: 0x124100090068, offset: 13
    2024-04-25 11:30:43 (488) - [INFO] Get tiling data success!
    2024-04-25 11:30:43 (488) - [INFO] tiling data in uint32: tiling data: [1, 0, 1, 0, 64, 0, 1, 0, 2, 0, 1, 0, 1, 0]
    tiling data in int64: tiling data: [1, 1, 64, 1, 2, 1, 1]
    tiling data in float16: tiling data: [5.960464477539063e-08, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0, 3.814697265625e-06, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0, 1.1920928955078125e-07, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0, 5.960464477539063e-08, 0.0, 0.0, 0.0]
     
    2024-04-25 11:30:43 (488) - [INFO] Tiling data saved in tilingdata_1714015843.bin

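    Because the tool cannot know the data type of the tiling data, it parses the data as uint32, int64, and float16. Operator developers can read the view that matches the tiling data type and then judge from the meaning of each tiling parameter whether the tiling is correct. For example, if tilingdata contains core_num and that value is not a number such as 32 or 48, the tiling data can be considered faulty.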
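
    To see why one buffer produces three different views, the sketch below reinterprets the same bytes with Python's struct module. It is an illustration, not the tool's implementation. Little-endian byte order is an assumption (it is consistent with the sample output above), and `reinterpret_tiling` is a hypothetical name.

```python
import struct


def reinterpret_tiling(raw: bytes) -> dict:
    """Decode the same bytes as little-endian uint32, int64 and float16."""
    views = {}
    for name, code, size in (("uint32", "I", 4), ("int64", "q", 8), ("float16", "e", 2)):
        count = len(raw) // size  # ignore trailing bytes that do not fill a value
        views[name] = list(struct.unpack(f"<{count}{code}", raw[:count * size]))
    return views


# Rebuild the 56 bytes from the uint32 view in the sample output above.
raw = struct.pack("<14I", 1, 0, 1, 0, 64, 0, 1, 0, 2, 0, 1, 0, 1, 0)
views = reinterpret_tiling(raw)
print(views["int64"])        # [1, 1, 64, 1, 2, 1, 1]
print(views["float16"][:4])  # [5.960464477539063e-08, 0.0, 0.0, 0.0]
```

    If the saved tilingdata_*.bin file holds the raw tiling bytes (an assumption about its layout), its contents could be passed to the same function. Which field holds core_num depends on the operator's tiling structure.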

Solution

This type of problem can be located and handled by reproducing it with the single operator:

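When the msaicerr.py tool finishes its analysis, it generates a single-operator test script (see the "Generate case file" and "Run '...' can test op!" lines in the output below). Developers can run this script to reproduce the AI Core error.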

2023-06-09 06:56:58 (101494) - [INFO] The find single op log /home/HwHiAiUser/tf/single_op_log_20230609065654/debug/plog/plog-101494_20230609065657791.log
2023-06-09 06:56:58 (101494) - [INFO] Generate case file /home/HwHiAiUser/AicoreError/tools/msaicerr/test_single_op.py
2023-06-09 06:56:58 (101494) - [INFO] ##################################################
2023-06-09 06:56:58 (101494) - [INFO] single op test failed! Please check OP or input data!
2023-06-09 06:56:58 (101494) - [INFO] Run 'export PYTHONPATH=/usr/local/Ascend/CANN-7.3/tools/msaicerr/:$PYTHONPATH;cd /usr/local/Ascend/CANN-7.3/tools/msaicerr;python3 /home/xxxxxxx/xxx/info_xxxx/aicerror_xxxx/test_single_op.py' can test op!
2023-06-09 06:56:58 (101494) - [INFO] ##################################################
2023-06-09 06:56:58 (101494) - [INFO] The ai core error info for No.0 is saved in /home/HwHiAiUser/tf/info_20230609065654/aicerror_0_20230609065534/info.txt
2023-06-09 06:56:58 (101494) - [INFO] Finish to analyze each ai core error.
2023-06-09 06:56:58 (101494) - [INFO] The summary info is saved in /home/HwHiAiUser/tf/info_20230609065654/README.txt
2023-06-09 06:56:58 (101494) - [INFO] Analysis finished, please check /home/HwHiAiUser/tf/info_20230609065654, you can view README.txt first.

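For this problem, collect the analysis results of the msaicerr.py tool (all files under the info_<timestamp> directory). If the problem cannot be located or solved from these files, submit an issue at https://gitee.com/ascend to get help.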
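
One possible way to collect the whole info_<timestamp> directory before filing an issue is to pack it into a single archive. The sketch below uses only the Python standard library. The function name `pack_analysis_dir` and the choice of a .tar.gz archive are assumptions, not something the tool or the issue tracker prescribes.

```python
import shutil
from pathlib import Path


def pack_analysis_dir(info_dir: str, out_dir: str = ".") -> str:
    """Pack every file under info_dir into <out_dir>/<dir name>.tar.gz."""
    src = Path(info_dir).resolve()
    if not src.is_dir():
        raise NotADirectoryError(str(src))
    base = Path(out_dir).resolve() / src.name  # archive path without extension
    # root_dir/base_dir keep the top-level directory name inside the archive.
    return shutil.make_archive(
        str(base), "gztar", root_dir=str(src.parent), base_dir=src.name
    )


# Example call (the path is taken from the sample log above):
# pack_analysis_dir("/home/HwHiAiUser/tf/info_20230609065654", "/tmp")
```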