ABI Configuration FAQ
Problem Description
- Symptom 1
Compiling or linking against atb or mki fails with undefined reference to "std::__cxx11 ***"
    fffec0150000-fffec0160000 rw-p 214a70000 08:11 42336262 /datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec0160000-fffec0170000 rw-p e8d90000 08:11 42336262 /datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec0170000-fffec0180000 rw-p 1a03c0000 08:11 42336262 /datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec0180000-fffec01a0000 rw-p 1a03b0000 08:11 42336262 /datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec01a0000-fffec01c0000 rw-p 1b8530000 08:11 42336262 /datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    Program received signal SIGABRT, Aborted.
    [Switching to Thread 0xfff7aa3a1990 (LWP 41895)]
    0x0000ffffbc3850e8 in raise () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7.aarch64 zlib-1.2.7-18.el7.aarch64
    (gdb) bt
    #0  0x0000ffffbc3850e8 in raise () from /lib64/libc.so.6
    #1  0x0000ffffbc386760 in abort () from /lib64/libc.so.6
    #2  0x0000ffffbc3c5048 in __libc_message () from /lib64/libc.so.6
    #3  0x0000ffffbc3cd58c in _int_free () from /lib64/libc.so.6
    #4  0x0000ffffadd85c08 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler(char const*, std::locale const&, std::regex_constants::syntax_option_type) () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #5  0x0000ffffadd7450c in Mki::LogSinkFile::DeleteOldestFile() () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #6  0x0000ffffadd74df4 in Mki::LogSinkFile::OpenFile() () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #7  0x0000ffffadd75080 in Mki::LogSinkFile::Log(char const*, unsigned long) () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #8  0x0000ffffadd6f4e4 in Mki::LogCore::Log(char const*, unsigned long) () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #9  0x0000ffffadd8630c in Mki::LogStream::~LogStream() () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #10 0x0000ffffb0494ab0 in atb::SelfAttentionOperation::PAMaskDimCheckNz(atb::SVector<atb::TensorDesc> const&) const () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #11 0x0000ffffb0496984 in atb::SelfAttentionOperation::InferShapePADimCheck(atb::SVector<atb::TensorDesc> const&) const () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #12 0x0000ffffb0498ae0 in atb::SelfAttentionOperation::SetupCheckImpl(atb::SVector<atb::Tensor> const&, atb::SVector<atb::Tensor> const&) const () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #13 0x0000ffffb04e77ac in atb::OperationBase::SetupCheck(atb::VariantPack const&, atb::Context*) () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #14 0x0000ffffb04eca4c in atb::OperationBase::Setup(atb::VariantPack const&, unsigned long&, atb::Context*) () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
- Symptom 2
Running atb fails with undefined symbol: ***cxx11***
    python {source_code_path}/ascend-transformer-boost/tests/apitest/torch_atb_test/op_test/linear_parallel/matmul_allreduce_pertoken_test.py
    Traceback (most recent call last):
      File "{python_install_path}/site-packages/torch_atb/__init__.py", line 25, in _load_atb_libs
        ctypes.CDLL(str(atb_lib_path / lib_file), mode=ctypes.RTLD_GLOBAL)
      File "{python_install_path}/ctypes/__init__.py", line 374, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: {CANN_nnal_install_path}/nnal/atb/latest/atb/cxx_abi_1/lib/libcann_ops_adapter.so: undefined symbol: _ZN3Mki6Status10FailStatusEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE

    The above exception was the direct cause of the following exception:

    Traceback (most recent call last):
      File "{source_code_path}/ascend-transformer-boost/tests/apitest/torch_atb_test/op_test/linear_parallel/matmul_allreduce_pertoken_test.py", line 12, in <module>
        import torch_atb
      File "{python_install_path}/site-packages/torch_atb/__init__.py", line 35, in <module>
        _load_atb_libs()
      File "{python_install_path}/site-packages/torch_atb/__init__.py", line 27, in _load_atb_libs
        raise RuntimeError(f"Failed to load {lib_file}: {err}") from err
    RuntimeError: Failed to load libasdops.so: {CANN_nnal_install_path}/nnal/atb/latest/atb/cxx_abi_1/lib/libcann_ops_adapter.so: undefined symbol: _ZN3Mki6Status10FailStatusEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
    [ERROR] 2025-07-28-11:58:16 (PID:1974512, Device:-1, RankID:-1) ERR99999 UNKNOWN application exception
- Symptom 3
Creating an atb operation from a param that has a string-type member fails, and the log reports: atb create operation failed
    [2025-07-28 14:24:11.340266] [error] [2063924] [all_reduce_operation.cpp:86] param rsv has a non-zero value, please check the compilation version.
    [2025-07-28 14:24:11.340266] [error] [2063923] [all_reduce_operation.cpp:86] param rsv has a non-zero value, please check the compilation version.
    all_reduce_demo.cpp:58 [error]: all_reduce_demo.cpp:58 [error]: 11 atb create operation failedatb create operation failed
    all_reduce_demo.cpp:137 [error]: 1 multithread task 0 failed
    all_reduce_demo.cpp:137 [error]: 1 multithread task 1 failed
Cause Analysis
- All of the examples above occur when ATB functionality is compiled or run with ABI1 inside a framework that was built with ABI0.
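- The two ABI versions (the old ABI and the C++11 ABI) implement std::string differently. The old ABI (the pre-C++11 layout, ABI0) uses a reference-counted, copy-on-write string, whereas the C++11 ABI (ABI1) uses a non-shared string with small-string optimization (SSO) and places the type in the std::__cxx11 namespace.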
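Because of these differences, passing std::string objects between code compiled with different ABI versions can cause memory corruption, crashes, or link errors. For how ATB currently selects the ABI version, see "ABI Configuration Method".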
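The link and load errors above name symbols that contain `__cxx11`, which is how the C++11 ABI shows up in mangled names. The following is a minimal sketch, not part of ATB, of a heuristic that flags such symbols. The helper names `needs_cxx11_abi` and `split_by_abi` are hypothetical, and the input is assumed to be a list of mangled names such as those printed by `nm -D --undefined-only` on a shared library.

```python
import re

# Heuristic markers of the C++11 ABI in Itanium-mangled names:
#   "7__cxx11" encodes the inline namespace std::__cxx11
#   "B5cxx11"  encodes the [abi:cxx11] tag on functions and variables
_CXX11_MARKER = re.compile(r"7__cxx11|B5cxx11")


def needs_cxx11_abi(mangled: str) -> bool:
    """Return True if the mangled symbol appears to depend on the C++11 ABI."""
    return bool(_CXX11_MARKER.search(mangled))


def split_by_abi(symbols):
    """Split mangled symbols into (cxx11_symbols, other_symbols)."""
    cxx11 = [s for s in symbols if needs_cxx11_abi(s)]
    other = [s for s in symbols if not needs_cxx11_abi(s)]
    return cxx11, other


if __name__ == "__main__":
    # The symbol reported as undefined in Symptom 2.
    missing = (
        "_ZN3Mki6Status10FailStatusEiRKNSt7__cxx1112basic_string"
        "IcSt11char_traitsIcESaIcEEE"
    )
    print(needs_cxx11_abi(missing))  # prints True
```

If the undefined symbol carries one of these markers but the library that should provide it was built with the old ABI (or the reverse), the two sides disagree on the ABI setting. This is a string match only, so treat the result as a hint rather than proof.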
Solution
- torch is installed in the environment
    # Use torch to check which ABI version the current framework was built with
    python3 -c 'import torch; print(torch.compiled_with_cxx11_abi())'
    # If the output is True
    source ${nnal_install_path}/nnal/atb/set_env.sh --cxx_abi=1
    # or
    source ${nnal_install_path}/nnal/atb/set_env.sh
    # If the output is False
    source ${nnal_install_path}/nnal/atb/set_env.sh --cxx_abi=0
    # or
    source ${nnal_install_path}/nnal/atb/set_env.sh
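The torch-based check can also be scripted. The sketch below maps the result of `torch.compiled_with_cxx11_abi()` to the matching `set_env.sh` command line. The helper `atb_env_command` is hypothetical, and the install path `/usr/local/Ascend` is only an assumed example value taken from the backtrace in Symptom 1, so substitute your own nnal install path.

```python
def atb_env_command(cxx11_abi: bool, nnal_install_path: str) -> str:
    """Build the set_env.sh command that matches the framework's ABI."""
    flag = 1 if cxx11_abi else 0
    return f"source {nnal_install_path}/nnal/atb/set_env.sh --cxx_abi={flag}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        print("torch is not installed; this check does not apply")
    else:
        # Assumed example path; replace with the real nnal install path.
        print(atb_env_command(torch.compiled_with_cxx11_abi(), "/usr/local/Ascend"))
```

The script only prints the command. `source` must still be run in the shell that will launch the ATB program, because environment changes made in a child process do not propagate to the parent shell.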
- torch is not needed in the environment