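After completing the whole-network data dump and overflow detection, refer to this section to learn how to locate problems at the operator level. This material is intended for advanced study only. To locate a problem quickly, contact Huawei engineers; you can open an issue in the Ascend open source community to communicate with them.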
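The parse interface displays the stack information and dump data statistics for a specific API. Combine these with the stack information to analyze where precision problems may exist in the code.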
from ptdbg_ascend import *

# Extract the stack information and data statistics for the 1st call of the API Torch_batch_normal from the dump information
parse("./stack_dump.pkl", "Torch_batch_normal_1_forward")
python3 parse.py
The output is similar to the following. Use the Trace back stack information in the output to analyze possible precision problems:
Statistic Info:
[Functional_conv2d_0_forward_input.0][dtype: torch.float32][shape: [1, 3, 244, 244]][max: 4.149261951446533][min: -4.198638916015625][mean: 0.002332142787054181]
[Functional_conv2d_0_forward_input.1][dtype: torch.float32][shape: [64, 3, 7, 7]][max: 0.1047200858592987][min: -0.10305210202932358][mean: 0.0004971990711055696]
[Functional_conv2d_0_forward_output][dtype: torch.float32][shape: [1, 64, 122, 122]][max: 1.4602006673812866][min: -1.434123158454895][mean: -9.677638445282355e-05]
Trace back(Functional_conv2d_0_forward_stack_info):
  File "npu_jd.py", line 35, in <module>
    output = model_npu(inputs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torchvision/models/resnet.py", line 285, in forward
    return self._forward_impl(x)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torchvision/models/resnet.py", line 268, in _forward_impl
    x = self.conv1(x)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 447, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/torch/nn/modules/conv.py", line 444, in _conv_forward
    self.padding, self.dilation, self.groups)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/ptdbg_ascend/hooks/wrap_functional.py", line 59, in functional_op_template
    return FunctionalOPTemplate(op_name, hook)(*args, **kwargs)
  File "/usr/local/python3.7.5/lib/python3.7/site-packages/ptdbg_ascend/hooks/module.py", line 70, in __call__
    hook_result = hook(self, input, result)
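When the output covers many APIs, it can help to load the Statistic Info records into a structure and scan them automatically. The following is a minimal sketch, assuming each record follows the bracketed layout shown in the sample output. The helper names `parse_statistics` and `flag_suspicious` and the default threshold (the largest finite float16 value, 65504) are illustrative assumptions and are not part of ptdbg_ascend.

```python
import math
import re

# One record looks like:
# [name][dtype: torch.float32][shape: [1, 3, 244, 244]][max: ...][min: ...][mean: ...]
STAT_RE = re.compile(
    r"\[(?P<name>[^\]\[]+)\]"
    r"\[dtype: (?P<dtype>[^\]]+)\]"
    r"\[shape: \[(?P<shape>[^\]]*)\]\]"
    r"\[max: (?P<max>[^\]]+)\]"
    r"\[min: (?P<min>[^\]]+)\]"
    r"\[mean: (?P<mean>[^\]]+)\]"
)


def parse_statistics(text):
    """Return one dict per statistics record found in `text`."""
    records = []
    for match in STAT_RE.finditer(text):
        shape = [int(dim) for dim in match.group("shape").split(",") if dim.strip()]
        records.append({
            "name": match.group("name"),
            "dtype": match.group("dtype"),
            "shape": shape,
            "max": float(match.group("max")),
            "min": float(match.group("min")),
            "mean": float(match.group("mean")),
        })
    return records


def flag_suspicious(records, limit=65504.0):
    """Return names of records with non-finite values or magnitudes above `limit`.

    `limit` is a hypothetical threshold; adjust it to the dtype under study.
    """
    flagged = []
    for record in records:
        values = (record["max"], record["min"], record["mean"])
        if any(not math.isfinite(v) or abs(v) > limit for v in values):
            flagged.append(record["name"])
    return flagged
```

Pass the captured output text to `parse_statistics`, then pass the result to `flag_suspicious` to get the names of tensors worth inspecting first.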
from ptdbg_ascend import register_hook, overflow_check, seed_all, set_dump_path, set_dump_switch, acc_cmp_dump, set_overflow_check_switch

seed_all()
...
# Dump operator-level overflow data for the specified API
register_hook(model, overflow_check, dump_mode='acl', dump_config='./dump.json')
# Turn on the overflow check switch before the step where overflow detection should start
set_overflow_check_switch("ON")
...
# Turn off the overflow check switch at the end of the step
set_overflow_check_switch("OFF")
...
from ptdbg_ascend import register_hook, overflow_check, seed_all, set_dump_path, set_dump_switch, acc_cmp_dump, set_backward_input

seed_all()
...
# Dump operator-level overflow data for the specified backward API
register_hook(model, acc_cmp_dump, dump_mode='acl', dump_config='./dump.json')
set_dump_switch("ON", mode="acl", scope=["Functional_conv2d_1_backward"])
set_backward_input(["acl_dump_xxx//Functional_conv2d_1_backward_input.0.npy"])
For details on configuring the dump.json file, see the dump.json configuration example.
├── 20230131172437
│   └── 1
│       ├── 0
│       │   ├── Add.Add.45.0.1675157077183551
│       │   ├── Cast.trans_Cast_0.31.0.1675157077159449
│       │   ├── Cast.trans_Cast_5.43.0.1675157077180129
│       │   ├── MatMul.MatMul.39.0.1675157077172961
│       │   ├── Mul.Mul.29.0.1675157077155731
│       │   ├── NPUAllocFloatStatus.NPUAllocFloatStatus.24.0.1675157077145262
│       │   ├── TransData.trans_TransData_1.33.0.1675157077162791
│       │   └── TransData.trans_TransData_4.41.0.1675157077176648
│       ├── 1701737061
│       │   └── Cast.trans_Cast_2.35.0.1675157077166214
│       ├── 25
│       │   └── NPUClearFloatStatus.NPUClearFloatStatus.26.0.1675157077150342
│       └── 68
│           └── TransData.trans_TransData_3.37.0.1675157077169473
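To see the order in which operators were dumped, the files in such a directory tree can be listed and sorted by the trailing number in their names. The sketch below assumes that each file name has five dot-separated parts: operator type, operator name, two numeric IDs (labelled here as task ID and stream ID, which is an assumption), and a timestamp. `DumpFile`, `parse_dump_name`, and `list_dump_files` are hypothetical helpers and are not part of CANN or ptdbg_ascend.

```python
from pathlib import Path
from typing import List, NamedTuple


class DumpFile(NamedTuple):
    op_type: str
    op_name: str
    task_id: int      # assumed meaning of the third field
    stream_id: int    # assumed meaning of the fourth field
    timestamp: int    # trailing numeric field, used only for ordering
    path: Path


def parse_dump_name(path) -> DumpFile:
    """Split a name such as 'Cast.trans_Cast_0.31.0.1675157077159449'."""
    path = Path(path)
    # Split from the right so the three numeric fields are isolated first.
    parts = path.name.rsplit(".", 3)
    if len(parts) != 4:
        raise ValueError(f"unexpected dump file name: {path.name}")
    head, task_id, stream_id, timestamp = parts
    op_type, sep, op_name = head.partition(".")
    if not sep:
        raise ValueError(f"unexpected dump file name: {path.name}")
    return DumpFile(op_type, op_name, int(task_id), int(stream_id), int(timestamp), path)


def list_dump_files(root) -> List[DumpFile]:
    """Collect every parsable dump file under `root`, ordered by timestamp."""
    found = []
    for candidate in Path(root).rglob("*"):
        if not candidate.is_file():
            continue
        try:
            found.append(parse_dump_name(candidate))
        except ValueError:
            continue  # skip files that do not follow the assumed naming scheme
    return sorted(found, key=lambda item: item.timestamp)
```

Calling `list_dump_files` on the dump root returns the entries in ascending timestamp order, which makes it easier to find the operator that ran just before an overflow was reported.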
cd ${CANN_INSTALL_PATH}/latest/toolkit/tools/operator_cmp/compare
python3 msaccucmp.py convert -d /home/HwHiAiUser/dump -out /home/HwHiAiUser/dumptonumpy -v 2
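-d: Accepts either a single dump file, which is converted on its own, or a directory, in which case all dump files under that path are converted.
-out: Directory where the output files are saved.
-v: Version of the dump file. "1" indicates a protobuf dump file and "2" indicates a binary dump file. The default value is 2.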
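When the conversion has to be repeated for several dump directories, the command line can be assembled in a script. The following is a minimal sketch that uses only the convert subcommand and the -d, -out, and -v options described above. The helper name `build_convert_command` is hypothetical, and the script is assumed to be run from the directory that contains msaccucmp.py.

```python
def build_convert_command(dump_path, out_dir, version=2, script="msaccucmp.py"):
    """Return the argument list for one 'msaccucmp.py convert' invocation."""
    if version not in (1, 2):
        # Only the two documented values are accepted: 1 (protobuf), 2 (binary).
        raise ValueError("version must be 1 or 2")
    return [
        "python3", script, "convert",
        "-d", str(dump_path),
        "-out", str(out_dir),
        "-v", str(version),
    ]


# Example: build the command from this section, then hand it to subprocess.run
# (uncomment on a machine where the CANN toolkit is installed).
command = build_convert_command("/home/HwHiAiUser/dump", "/home/HwHiAiUser/dumptonumpy")
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems when a path contains spaces.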
import numpy as np

path = 'path_to_numpy_file'  # Set the file path according to your environment
tensor = np.load(path)
print(tensor)
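Printing a large tensor is rarely enough to judge precision. The following is a minimal sketch that summarizes one loaded array and compares two arrays of the same shape, for example an NPU result against a reference result. The helper names `summarize` and `compare` and the chosen metrics (maximum absolute error and cosine similarity) are illustrative assumptions, not the metrics of any particular tool.

```python
import numpy as np


def summarize(tensor):
    """Return basic statistics plus NaN/Inf counts for one array."""
    data = np.asarray(tensor, dtype=np.float64)
    finite = data[np.isfinite(data)]
    return {
        "shape": data.shape,
        "nan_count": int(np.isnan(data).sum()),
        "inf_count": int(np.isinf(data).sum()),
        # Statistics are computed over finite values only.
        "max": float(finite.max()) if finite.size else float("nan"),
        "min": float(finite.min()) if finite.size else float("nan"),
        "mean": float(finite.mean()) if finite.size else float("nan"),
    }


def compare(actual, expected):
    """Return the maximum absolute error and cosine similarity of two arrays."""
    a = np.asarray(actual, dtype=np.float64)
    b = np.asarray(expected, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    if a.size == 0:
        raise ValueError("cannot compare empty arrays")
    a, b = a.ravel(), b.ravel()
    norm_product = np.linalg.norm(a) * np.linalg.norm(b)
    cosine = float(np.dot(a, b) / norm_product) if norm_product > 0 else float("nan")
    return {
        "max_abs_err": float(np.abs(a - b).max()),
        "cosine_similarity": cosine,
    }
```

For two converted files, load each with `np.load` and pass the arrays to `compare`. A cosine similarity close to 1 together with a small maximum absolute error suggests that the operator outputs agree.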