AOE (Ascend Optimization Engine) is an automatic tuning tool. Through a closed feedback loop of generating tuning strategies, compiling them, and verifying them in the runtime environment, it iterates toward better strategies until it converges on the best one, making fuller use of hardware resources and improving network performance.
As shown in Figure 1, the summary table lists the PMU data for each shape. In the example, nearly every shape is MTE2-bound: mte2-ratio exceeds 99%, while mac-ratio hovers around 63%. Cases where mac-ratio falls below 70% are considered to have optimization headroom and can be tuned with AOE; contact Huawei engineers for further optimization.
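As a rough illustration of the screening rule above, the filter over per-shape summary rows can be sketched as follows. The rows here are hypothetical PMU numbers, not taken from the figure; only the 0.70 mac-ratio threshold comes from the text.

```python
# Hypothetical per-shape PMU summary rows (ratios as fractions).
# Shapes whose mac-ratio is below 0.70 are flagged as AOE tuning candidates,
# per the threshold described in the text.
summary = [
    {"shape": "1024x10240x5120", "mte2_ratio": 0.99, "mac_ratio": 0.63},
    {"shape": "2048x12288x6144", "mte2_ratio": 0.99, "mac_ratio": 0.72},
]

tuning_candidates = [row["shape"] for row in summary if row["mac_ratio"] < 0.70]
print(tuning_candidates)
```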
| Tuning support | FP16 | BF16 | FP32 | INT8 (fully quantized) |
|---|---|---|---|---|
| TensorFlow (.pb) | Supported | Supported | Supported | To be tested |
| PyTorch (.txt) | Supported | Supported | Supported | To be tested |
| MindSpore (.air/proto*.txt) | To be tested | To be tested | To be tested | To be tested |
```shell
bash run.sh --case_file=./aicore_test/msopst/msopst_case.xlsx --sheet_name=matmul_v2 --fusion_file=op_template/fusion_mm_gen_pb_model.py --case_type=stc --host_process_count=1 --testcase_id=case1
```
In principle, add the following line at the start of the model script to generate the dumped txt files.
```python
torch_npu.npu.set_aoe('./dump_path')
```

Before running, source the CANN environment:

```shell
source /usr/local/Ascend/latest/bin/setenv.bash
```
```python
import torch
import torch_npu
import numpy as np

torch_npu.npu.set_compile_mode(jit_compile=True)  # AOE currently generates dump graphs only in static compilation mode
torch_npu.npu.set_aoe('./dump_path')

def case1():
    input_shape_0 = (1024, 10240)
    input_shape_1 = (5120, 10240)
    fmap = torch.from_numpy(np.ones(input_shape_0).astype(np.float16))
    fmap_npu = fmap.npu()
    weight = torch.from_numpy(np.ones(input_shape_1).astype(np.float16))
    weight_npu = weight.npu()
    attn_output_npu = torch.mm(fmap_npu, weight_npu.t())
    print("succeeded!")

case1()
```
```python
import torch
import torch_npu
import numpy as np

torch_npu.npu.set_compile_mode(jit_compile=True)
torch_npu.npu.set_aoe('./1119ms')

def case1():
    input_shape_0 = (2048, 12288)
    input_shape_1 = (12288, 6144)
    fmap = torch.from_numpy(np.ones(input_shape_0)).to(torch.float16)
    fmap_npu = fmap.npu()
    weight = torch.from_numpy(np.ones(input_shape_1)).to(torch.float16)
    weight_npu = weight.npu()
    attn_output_npu = torch.mm(fmap_npu, weight_npu)
    print("succeeded!")

def case2():
    input_shape_0 = (2048, 12288)
    input_shape_1 = (1536, 12288)
    fmap = torch.from_numpy(np.ones(input_shape_0)).to(torch.float16)
    fmap_npu = fmap.npu()
    weight = torch.from_numpy(np.ones(input_shape_1)).to(torch.float16)
    weight_npu = weight.npu()
    attn_output_npu = torch.mm(fmap_npu, weight_npu.t())
    print("succeeded!")

def case3():
    input_shape_0 = (2048, 12288)
    input_shape_1 = (2048, 6144)
    fmap = torch.from_numpy(np.ones(input_shape_0)).to(torch.float16)
    fmap_npu = fmap.npu()
    weight = torch.from_numpy(np.ones(input_shape_1)).to(torch.float16)
    weight_npu = weight.npu()
    attn_output_npu = torch.mm(fmap_npu.t(), weight_npu)
    print("succeeded!")

case1()
case2()
case3()
```
```
dump_path
├── ge_proto_00000_MatMul1.txt
├── ge_proto_00001_MatMul3.txt
├── ge_proto_00002_MatMul5.txt
└── ge_proto_00003_MatMul7.txt
```
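A quick sanity check is to confirm that one `ge_proto_*.txt` file was produced per MatMul call. This sketch mocks the dump directory with temporary files so it is self-contained; in practice `dump_dir` is the path passed to `torch_npu.npu.set_aoe()`.

```python
import glob
import os
import tempfile

# Mocked dump directory: in real use this is the set_aoe() path.
dump_dir = tempfile.mkdtemp()
for name in ("ge_proto_00000_MatMul1.txt", "ge_proto_00001_MatMul3.txt"):
    open(os.path.join(dump_dir, name), "w").close()

# Collect the dumped GE proto files for MatMul ops.
dumps = sorted(os.path.basename(p)
               for p in glob.glob(os.path.join(dump_dir, "ge_proto_*_MatMul*.txt")))
print(dumps)
```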
```shell
source /usr/local/Ascend/latest/bin/setenv.bash
source /usr/local/Ascend/latest/tools/aoe/bin/setenv.bash
```
```shell
export REPEAT_TUNE=True
export TUNE_BANK_PATH=./matmul_bank
```
```shell
source /usr/local/Ascend/latest/bin/setenv.bash
source /usr/local/Ascend/latest/tools/aoe/bin/setenv.bash
export REPEAT_TUNE=True              # run the tuning flow
rm -rf ./matmul_bank
mkdir ./matmul_bank
export TUNE_BANK_PATH=./matmul_bank  # knowledge-bank output path
aoe -m case10_model.pb -f 3 -j 2
```
```shell
source /usr/local/Ascend/latest/bin/setenv.bash
source /usr/local/Ascend/latest/tools/aoe/bin/setenv.bash
export REPEAT_TUNE=True              # run the tuning flow
rm -rf ./matmul_bank
mkdir ./matmul_bank
export TUNE_BANK_PATH=./matmul_bank  # knowledge-bank output path
aoe --precision_mode=must_keep_origin_dtype -m case10_model.pb -f 3 -j 2
```
```shell
source /usr/local/Ascend/latest/bin/setenv.bash
source /usr/local/Ascend/latest/tools/aoe/bin/setenv.bash
export REPEAT_TUNE=True              # run the tuning flow
rm -rf ./matmul_bank
mkdir ./matmul_bank
export TUNE_BANK_PATH=./matmul_bank  # knowledge-bank output path
aoe --job_type=2 --model_path=./dump_path/
```
```shell
source /usr/local/Ascend/latest/bin/setenv.bash
source /usr/local/Ascend/latest/tools/aoe/bin/setenv.bash
export REPEAT_TUNE=True              # run the tuning flow
rm -rf ./matmul_bank
mkdir ./matmul_bank
export TUNE_BANK_PATH=./matmul_bank  # knowledge-bank output path
aoe --job_type=2 --precision_mode=must_keep_origin_dtype --model_path=./dump_path/
```
```shell
export DUMP_GE_GRAPH=1
export DUMP_GRAPH_LEVEL=4
```
```shell
source /usr/local/Ascend/latest/bin/setenv.bash
source /usr/local/Ascend/latest/tools/aoe/bin/setenv.bash
export REPEAT_TUNE=True              # run the tuning flow
rm -rf ./matmul_bank
mkdir ./matmul_bank
export TUNE_BANK_PATH=./matmul_bank  # knowledge-bank output path
aoe --job_type=2 --precision_mode=must_keep_origin_dtype --model_path=./dump_path/  # torch txt bf16 fp32
```
Under the configured knowledge-bank output path, tuning succeeded if ./AscendXX/AscendXX_24_AiCore_MatMulV2_runtime_kb.json contains the corresponding entry (reference below).
```json
{"id":1550289820,"info_dict":{"a_dtype":1,"a_format":2,"aub_double_num":1.0,"b_dtype":1,"b_format":2,"batch_a1":1,"batch_a2":1,"batch_a3":1,"batch_a4":1,"batch_b1":1,"batch_b2":1,"batch_b3":1,"batch_b4":1,"bias_flag":false,"bub_double_num":1.0,"fused_double_operand_num":0.0,"k":8000,"k_align_flag":true,"l1_fused_num":0.0,"m":4096,"m_align_flag":true,"n":10240,"n_align_flag":true,"out_dtype":1,"out_format":2,"reserved_bool":false,"reserved_params1":2,"reserved_params2":0,"reserved_params3":0,"reserved_params4":0,"reserved_params5":0,"reserved_params6":0,"trans_a_flag":false,"trans_b_flag":false},"knowledge":{"batch_aub":1,"batch_bub":1,"batch_cub":1,"batch_dim":1,"batch_l0":1,"db_al1":2,"db_aub":2,"db_bl1":2,"db_bub":2,"db_cub":2,"db_l0a":2,"db_l0b":2,"db_l0c":1,"k_al1":16,"k_aub":8192,"k_bl1":8,"k_bub":9,"k_dim":1,"k_l0":4,"m_al1":1,"m_aub":12,"m_dim":8,"m_l0":16,"n_bl1":1,"n_bub":10240,"n_cub":1,"n_dim":3,"n_l0":8},"op":"MatMulV2","version":0}
```
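To check programmatically that the bank contains an entry for a given GEMM, one can parse the JSON record and compare the shape fields in `info_dict`. This is a minimal sketch using a record trimmed from the reference entry above; the `matches` helper is hypothetical, not part of any AOE tooling.

```python
import json

# Trimmed copy of the reference knowledge-bank record shown above.
record = json.loads(
    '{"id":1550289820,'
    '"info_dict":{"m":4096,"k":8000,"n":10240,'
    '"trans_a_flag":false,"trans_b_flag":false},'
    '"knowledge":{"m_dim":8,"n_dim":3,"k_dim":1},'
    '"op":"MatMulV2","version":0}'
)

def matches(rec, m, k, n):
    """Hypothetical helper: does this bank record describe the (m, k, n) GEMM?"""
    info = rec["info_dict"]
    return rec["op"] == "MatMulV2" and (info["m"], info["k"], info["n"]) == (m, k, n)

print(matches(record, 4096, 8000, 10240))
```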
```shell
cd /usr/local/Ascend/CANN-7.x/opp/built-in/data/op/
```

```
./AscendXX/unified_bank/AscendXX_24_AiCore_MatMulV2_runtime_kb.json
```