Core Dump Occurs in the Application Process Because the AscendCL Deinitialization API aclFinalize Is Called in a Destructor
Symptom
A core dump occurs during the running of an application, and the application stops abnormally.
Possible Cause
- Generate a core dump file.
- On a physical machine, run the ulimit -c unlimited command so that a core dump file can be generated when the program crashes.
If you no longer need core dump files after the fault is located, run the ulimit -c 0 command.
- In a Docker container, add the --ulimit core=-1 setting to the container startup command.
- Run the application. If the process crashes, a core dump file is generated in the current directory.
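The limit that ulimit -c sets can also be inspected from inside a process, which helps confirm that a core dump file will actually be written. The following is a minimal C++ sketch, assuming a Linux or other POSIX system that provides getrlimit and setrlimit. The helper names CoreDumpsEnabled and RaiseCoreLimitToHardMax are hypothetical and are not part of AscendCL.

```cpp
#include <sys/resource.h>

// Hypothetical helper: returns true if the soft core-file size limit is non-zero,
// that is, if the kernel is allowed to write a core dump file for this process.
bool CoreDumpsEnabled()
{
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;  // The limit could not be queried.
    }
    return lim.rlim_cur != 0;
}

// Hypothetical helper: raises the soft core-file size limit to the hard limit.
// An unprivileged process may always do this. Returns true on success.
bool RaiseCoreLimitToHardMax()
{
    struct rlimit lim{};
    if (getrlimit(RLIMIT_CORE, &lim) != 0) {
        return false;
    }
    lim.rlim_cur = lim.rlim_max;
    return setrlimit(RLIMIT_CORE, &lim) == 0;
}
```

If the hard limit is 0, the second helper succeeds but core dumps remain disabled. In that case the limit has to be raised outside the process, as described in the steps above.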
- Use the GDB tool to debug the core file and print stack information.
Start GDB and load the core dump file. In the following example command, main is the name of the executable application that generated the core dump file. Replace main and the core dump file name with the actual names.
gdb main core*.*
After the command is executed, GDB prints the location where the exception occurred, including the function, file name, and line number. Frame #0 at the top of the stack information is the innermost call, that is, the function in which the exception occurred, so reading the stack from the top down is a convenient way to locate the fault. The following is an example of stack information.
Thread 1 "main" received signal SIGSEGV, Segmentation fault.
0x0000ffffa70747c8 in ge::PluginManager::~PluginManager() () from /usr/local/Ascend/latest/lib64/libge_common.so
(gdb) bt
#0 0x0000ffffa70747c8 in ge::PluginManager::~PluginManager() () from /usr/local/Ascend/latest/lib64/libge_common.so
#1 0x0000ffffa707c900 in ge::RuntimePluginLoader::Finalize() () from /usr/local/Ascend/latest/lib64/libge_common.so
#2 0x0000ffffa29485d0 in ge::GeExecutor::FinalizeEx() () from /usr/local/Ascend/latest/lib64/libge_executor.so
#3 0x0000ffffb06fabc in aclFinalize() from /usr/local/Ascend/latest/lib64/libascendcl.so
#4 0x0000ffffbd5a98ec in ResourceManager::~ResourceManager() () from /home/miniconda3/envs/gly/lib/pythons3.7/site-packages/mindspore/_c_dataengine.cpython-37m-aarch64-linux-gnu.so
#5 0x0000ffffbd5a9f80 in std::Sp_counted_ptr<ResourceManager*, (__gnu_cxx::Lock_policy)2>::_M_dispose() () from /home/miniconda3/envs/gly/lib/pythons3.7/site-packages/mindspore/_c_dataengine.cpython-37m-aarch64-linux-gnu.so
#6 0x0000ffffbd5a97f0 in std::shared_ptr<ResourceManager>::~shared_ptr() () from /home/miniconda3/envs/gly/lib/pythons3.7/site-packages/mindspore/_c_dataengine.cpython-37m-aarch64-linux-gnu.so
Note that debugging the core dump file and printing stack information should be done in the operating environment where the exception occurred. If you switch to a different environment, the debugged stack information may be inaccurate.
If GDB is not installed in the environment, install it using a package manager (for example, apt-get install gdb or yum install gdb). For details about installation and usage, see the official GDB documentation.
- Analyze the stack information.
The stack information shows that the application crashes while aclFinalize is executing (frame #3). The segmentation fault occurs in ge::PluginManager::~PluginManager() (frame #0), which is reached from aclFinalize, and aclFinalize is itself called from ResourceManager::~ResourceManager() (frame #4). It can therefore be preliminarily determined that the problem lies in how the aclFinalize API, which deinitializes AscendCL, is used.
- Check how the aclFinalize API is called in the application code.
The code shows that aclFinalize is called in a destructor. However, the API has the following restriction: You are advised not to call aclFinalize in a destructor. Otherwise, the process may exit abnormally because the destruction order of singletons is unknown. It can therefore be determined that the core dump occurs because aclFinalize is called in a destructor.
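The effect of an unknown destruction order can be modeled without AscendCL. The following is a simplified C++ sketch, not the actual AscendCL implementation: FakeRuntime is a hypothetical stand-in for a singleton owned by a runtime library, and FakeFinalize is a hypothetical stand-in for a deinitialization call that depends on it. The sketch reports the problem as a string instead of dereferencing freed memory, which in a real process is where a segmentation fault could occur.

```cpp
#include <memory>
#include <string>

// Hypothetical stand-in for state owned by a runtime library singleton.
struct FakeRuntime {
    bool pluginsLoaded = true;
};

// Hypothetical stand-in for a deinitialization call such as aclFinalize.
// It needs the runtime state to still exist in order to clean up safely.
std::string FakeFinalize(const FakeRuntime *runtime)
{
    if (runtime == nullptr || !runtime->pluginsLoaded) {
        return "finalize: runtime already destroyed";
    }
    return "finalize: ok";
}

// Models process teardown. When runtimeGoesFirst is true, the runtime state is
// destroyed before the application's cleanup runs, which is what can happen when
// cleanup is deferred to a destructor that executes after main returns.
std::string SimulateTeardown(bool runtimeGoesFirst)
{
    auto runtime = std::make_unique<FakeRuntime>();
    if (runtimeGoesFirst) {
        runtime.reset();  // The library state is torn down first.
    }
    return FakeFinalize(runtime.get());
}
```

Calling SimulateTeardown(false) corresponds to deinitializing while everything is still alive, and SimulateTeardown(true) corresponds to deinitializing after the library state has gone. The application cannot choose between the two once cleanup is left to static destruction.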
Solution
Modify the application code so that the aclFinalize API is not called in a destructor. Call it explicitly before the main function returns. The following provides correct and incorrect code examples.
- The following is an example of correctly calling the aclFinalize API:
int main()
{
    // Initialization
    // .. indicates the directory relative to the directory of the executable file. For example, if the executable file
    // is stored in the out directory, .. indicates the upper-level directory of the out directory.
    const char *aclConfigPath = "../src/acl.json";
    aclError ret = aclInit(aclConfigPath);

    // Service processing code

    // Deinitialization. The main function is not exited, and all resources are available.
    ret = aclFinalize();
    return 0;
}
- The following is an example of incorrectly calling the aclFinalize API: deinitialization is performed in the destructor of a singleton.
class ResourceManager {
public:
    ResourceManager() = default;

    // Singleton destruction
    ~ResourceManager()
    {
        // Deinitialization
        (void) aclFinalize();
    }

    // Singleton construction
    static ResourceManager &Instance()
    {
        static ResourceManager instance;
        return instance;
    }

    aclError Init()
    {
        // Initialization
        // .. indicates the directory relative to the directory of the executable file. For example, if the executable file
        // is stored in the out directory, .. indicates the upper-level directory of the out directory.
        const char *aclConfigPath = "../src/acl.json";
        return aclInit(aclConfigPath);
    }
};

int main()
{
    // Initialization
    aclError ret = ResourceManager::Instance().Init();

    // Service processing code

    // No explicit deinitialization. aclFinalize is called during singleton destruction of ResourceManager.
    // The singleton destruction is executed after the main function exits. Therefore, the order of the singleton
    // destruction relative to the unloading of the SO files that the process depends on cannot be controlled.
    // SO files containing resources accessed by aclFinalize may already have been unloaded. As a result, the process exits abnormally.
    return 0;
}
- The following is an example of incorrectly calling the aclFinalize API: deinitialization is performed in the destructor of a global variable.
class ResourceManager {
public:
    ResourceManager() = default;

    // Global variable destruction
    ~ResourceManager()
    {
        // Deinitialization
        (void) aclFinalize();
    }

    aclError Init()
    {
        // Initialization
        // .. indicates the directory relative to the directory of the executable file. For example, if the executable file
        // is stored in the out directory, .. indicates the upper-level directory of the out directory.
        const char *aclConfigPath = "../src/acl.json";
        return aclInit(aclConfigPath);
    }
};

// Global variable construction
ResourceManager g_resource_manager;

int main()
{
    // Initialization
    aclError ret = g_resource_manager.Init();

    // Service processing code

    // No explicit deinitialization. aclFinalize is called during global variable destruction of ResourceManager.
    // The global variable destruction is executed after the main function exits. Therefore, the order of the global
    // variable destruction relative to the unloading of the SO files that the process depends on cannot be controlled.
    // SO files containing resources accessed by aclFinalize may already have been unloaded. As a result, the process exits abnormally.
    return 0;
}
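If the application still needs a ResourceManager-style wrapper, one possible approach is to move deinitialization out of the destructor into an explicit method that is called before the main function returns. The following is a minimal sketch under stated assumptions: stub::Init and stub::Finalize are hypothetical stand-ins for aclInit and aclFinalize so that the code builds without the AscendCL headers, and Finalize and RunApplication are illustrative names rather than AscendCL APIs.

```cpp
#include <string>
#include <vector>

// Hypothetical stubs standing in for aclInit and aclFinalize. They only record
// the order of calls. In a real application, call the AscendCL APIs instead.
namespace stub {
inline std::vector<std::string> &Calls()
{
    static std::vector<std::string> calls;
    return calls;
}
inline int Init(const char * /*configPath*/) { Calls().push_back("init"); return 0; }
inline int Finalize() { Calls().push_back("finalize"); return 0; }
}  // namespace stub

class ResourceManager {
public:
    static ResourceManager &Instance()
    {
        static ResourceManager instance;
        return instance;
    }

    int Init(const char *configPath)
    {
        if (initialized_) {
            return 0;
        }
        int ret = stub::Init(configPath);
        initialized_ = (ret == 0);
        return ret;
    }

    // Explicit deinitialization. Call this before the main function returns.
    // Repeated calls are harmless because the flag is cleared on the first call.
    int Finalize()
    {
        if (!initialized_) {
            return 0;
        }
        initialized_ = false;
        return stub::Finalize();
    }

    // The destructor intentionally performs no deinitialization.
    ~ResourceManager() = default;

private:
    ResourceManager() = default;
    bool initialized_ = false;
};

// Hypothetical body of the main function: initialize, do the work, then
// deinitialize explicitly while all dependencies are still loaded.
int RunApplication()
{
    ResourceManager &manager = ResourceManager::Instance();
    if (manager.Init("../src/acl.json") != 0) {
        return 1;
    }
    // Service processing code
    return manager.Finalize() == 0 ? 0 : 1;
}
```

With this layout, the main function can simply return the result of RunApplication(). The singleton destructor still runs after the main function exits, but it no longer touches AscendCL.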