ABI Configuration FAQ

Problem Description

  • Symptom 1

    Compiling or linking against atb or mki fails with undefined reference to "std::__cxx11 ***"

    fffec0150000-fffec0160000 rw-p 214a70000 08:11 42336262	/datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec0160000-fffec0170000 rw-p e8d90000 08:11 42336262	/datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec0170000-fffec0180000 rw-p 1a03c0000 08:11 42336262	/datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec0180000-fffec01a0000 rw-p 1a03b0000 08:11 42336262	/datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    fffec01a0000-fffec01c0000 rw-p 1b8530000 08:11 42336262	/datal/models/Lite-32K/SFT-V4.0-20240621-INT8/SFT-13B v20240621 longbase_ep3.data
    Program received signal SIGABRT, Aborted.
    [Switching to Thread 0xfff7aa3a1990 (LWP 41895)]
    0x0000ffffbc3850e8 in raise () from /lib64/libc.so.6
    Missing separate debuginfos, use: debuginfo-install glibc-2.17-260.el7.aarch64 zlib-1.2.7-18.el7.aarch64
    (gdb) bt
    #0  0x0000ffffbc3850e8 in raise () from /lib64/libc.so.6
    #1  0x0000ffffbc386760 in abort () from /lib64/libc.so.6
    #2  0x0000ffffbc3c5048 in __libc_message () from /lib64/libc.so.6
    #3  0x0000ffffbc3cd58c in _int_free () from /lib64/libc.so.6
    #4  0x0000ffffadd85c08 in std::__detail::_Compiler<std::__cxx11::regex_traits<char> >::_Compiler(char const*, std::locale const&, std::regex_constants::syntax_option_type) ()
        from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #5  0x0000ffffadd7450c in Mki::LogSinkFile::DeleteOldestFile() () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #6  0x0000ffffadd74df4 in Mki::LogSinkFile::OpenFile() () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #7  0x0000ffffadd75080 in Mki::LogSinkFile::Log(char const*, unsigned long) () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #8  0x0000ffffadd6f4e4 in Mki::LogCore::Log(char const*, unsigned long) () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #9  0x0000ffffadd8630c in Mki::LogStream::~LogStream() () from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libmki.so
    #10 0x0000ffffb0494ab0 in atb::SelfAttentionOperation::PAMaskDimCheckNz(atb::SVector<atb::TensorDesc> const&) const ()
        from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #11 0x0000ffffb0496984 in atb::SelfAttentionOperation::InferShapePADimCheck(atb::SVector<atb::TensorDesc> const&) const ()
        from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #12 0x0000ffffb0498ae0 in atb::SelfAttentionOperation::SetupCheckImpl(atb::SVector<atb::Tensor> const&, atb::SVector<atb::Tensor> const&) const
        from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #13 0x0000ffffb04e77ac in atb::OperationBase::SetupCheck(atb::VariantPack const&, atb::Context*) ()
        from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so
    #14 0x0000ffffb04eca4c in atb::OperationBase::Setup(atb::VariantPack const&, unsigned long&, atb::Context*) ()
        from /usr/local/Ascend/nnal/atb/latest/atb/cxx_abi_1/lib/libatb.so

  • Symptom 2

    Running atb fails with undefined symbol: ***cxx11***

    python {source_code_path}/ascend-transformer-boost/tests/apitest/torch_atb_test/op_test/linear_parallel/matmul_allreduce_pertoken_test.py
    Traceback (most recent call last):
      File "{python_install_path}/site-packages/torch_atb/__init__.py", line 25, in _load_atb_libs
        ctypes.CDLL(str(atb_lib_path / lib_file), mode=ctypes.RTLD_GLOBAL)
      File "{python_install_path}/ctypes/__init__.py", line 374, in __init__
        self._handle = _dlopen(self._name, mode)
    OSError: {CANN_nnal_install_path}/nnal/atb/latest/atb/cxx_abi_1/lib/libcann_ops_adapter.so: undefined symbol: _ZN3Mki6Status10FailStatusEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
    
    The above exception was the direct cause of the following exception:
    
    Traceback (most recent call last):
      File "{source_code_path}/ascend-transformer-boost/tests/apitest/torch_atb_test/op_test/linear_parallel/matmul_allreduce_pertoken_test.py", line 12, in <module>
        import torch_atb
      File "{python_install_path}/site-packages/torch_atb/__init__.py", line 35, in <module>
        _load_atb_libs()
      File "{python_install_path}/site-packages/torch_atb/__init__.py", line 27, in _load_atb_libs
        raise RuntimeError(f"Failed to load {lib_file}: {err}") from err
    RuntimeError: Failed to load libasdops.so: {CANN_nnal_install_path}/nnal/atb/latest/atb/cxx_abi_1/lib/libcann_ops_adapter.so: undefined symbol: _ZN3Mki6Status10FailStatusEiRKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
    [ERROR] 2025-07-28-11:58:16 (PID:1974512, Device:-1, RankID:-1) ERR99999 UNKNOWN application exception

  • Symptom 3

    Creating an atb operation from a param that has a string-typed member fails, and the log shows: atb create operation failed

    [2025-07-28 14:24:11.340266] [error] [2063924] [all_reduce_operation.cpp:86] param rsv has a non-zero value, please check the compilation version.
    [2025-07-28 14:24:11.340266] [error] [2063923] [all_reduce_operation.cpp:86] param rsv has a non-zero value, please check the compilation version.
    all_reduce_demo.cpp:58 [error]: all_reduce_demo.cpp:58 [error]: 11
    
    atb create operation failedatb create operation failed
    
    all_reduce_demo.cpp:137 [error]: 1
    multithread task 0 failed
    all_reduce_demo.cpp:137 [error]: 1
    multithread task 1 failed

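Cause Analysis

  • All of the examples above are errors raised when ATB functionality is compiled or run with ABI1 inside a framework that was built with ABI0.
  • The two ABI versions (the old ABI and the C++11 ABI) implement std::string differently. The old ABI (C++98/03) uses a reference-counted, copy-on-write string, while the C++11 ABI uses a non-shared string with small-string optimization (SSO) and places the type in the std::__cxx11 namespace, so the two have different memory layouts and different symbol names.

These differences can cause memory corruption, crashes, or link errors when std::string objects are passed between binaries compiled with different ABI versions. For ATB's current ABI configuration logic, see "ABI configuration method" (ABI配置方式).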
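
The namespace difference is what makes the Symptom 2 error recognizable: a symbol built against the C++11 ABI carries a `__cxx11` or `B5cxx11` marker in its mangled name. The following is a minimal sketch of that check, not part of ATB. The helper name `references_cxx11_abi` is hypothetical, and the old-ABI symbol is an illustrative spelling of the same function (with std::string mangled as `Ss`), not something taken from a real library.

```python
# Markers that libstdc++ adds to symbols built with _GLIBCXX_USE_CXX11_ABI=1:
# the std::__cxx11 inline namespace and the [abi:cxx11] tag (mangled as B5cxx11).
CXX11_MARKERS = ("__cxx11", "B5cxx11")


def references_cxx11_abi(symbol: str) -> bool:
    """Return True if a mangled symbol name carries a C++11 ABI marker."""
    return any(marker in symbol for marker in CXX11_MARKERS)


# Symbol copied from the Symptom 2 log.
missing = ("_ZN3Mki6Status10FailStatusEiRKNSt7__cxx1112basic_string"
           "IcSt11char_traitsIcESaIcEEE")
# Illustrative old-ABI spelling of the same function (assumption):
# std::string is mangled with the short substitution "Ss".
old_abi = "_ZN3Mki6Status10FailStatusEiRKSs"

print(references_cxx11_abi(missing))  # True
print(references_cxx11_abi(old_abi))  # False
```

The same check can be applied to each symbol name listed by a tool such as `nm -D` to see which ABI a given shared library expects.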

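Solution

  • torch is available in the environment
    # Use torch to check which ABI version the current framework was built with
    python3 -c 'import torch; print(torch.compiled_with_cxx11_abi())'
    # If it prints True
    source ${nnal_install_path}/nnal/atb/set_env.sh --cxx_abi=1
    # or
    source ${nnal_install_path}/nnal/atb/set_env.sh
    # If it prints False
    source ${nnal_install_path}/nnal/atb/set_env.sh --cxx_abi=0
    # or
    source ${nnal_install_path}/nnal/atb/set_env.sh
  • torch is not needed in the environment
    • The correct ABI version cannot be detected automatically, so you must determine it yourself. For the default setting, see "ABI configuration method" (ABI配置方式).
    • You can also try the ABI setting opposite to the one currently in use.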
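
The torch-based choice above can be sketched as a small script that prints the matching command. This is only an illustration: the helper names `set_env_flag` and `suggested_command` are hypothetical, the install path is a placeholder, and only `torch.compiled_with_cxx11_abi()` and the `--cxx_abi` option come from the steps above.

```python
def set_env_flag(compiled_with_cxx11_abi: bool) -> str:
    """Map torch's ABI report to the matching set_env.sh option."""
    return "--cxx_abi=1" if compiled_with_cxx11_abi else "--cxx_abi=0"


def suggested_command(nnal_install_path: str, compiled_with_cxx11_abi: bool) -> str:
    """Build the source command for the given install path and ABI."""
    script = f"{nnal_install_path}/nnal/atb/set_env.sh"
    return f"source {script} {set_env_flag(compiled_with_cxx11_abi)}"


if __name__ == "__main__":
    try:
        import torch
    except ImportError:
        # Without torch the ABI has to be chosen manually.
        print("torch not found; choose the ABI version manually")
    else:
        # "${nnal_install_path}" is a placeholder, as in the commands above.
        print(suggested_command("${nnal_install_path}",
                                torch.compiled_with_cxx11_abi()))
```

The script only prints the command rather than running it, because `set_env.sh` has to be sourced in the calling shell for its environment variables to take effect.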