Core Dump Occurs When a Model Is Trained Using an Image with glibc 2.17

Symptom

When a recommendation model is trained in an environment whose base image is CentOS 7.6 (which ships glibc 2.17), the training process core dumps and the following stack is displayed.

Possible Causes

glibc 2.17 has a defect in its handling of thread-local storage (TLS): when a large number of dlopen, dlclose, and pthread_create calls are executed concurrently, a segmentation fault may occur in _dl_allocate_tls_init. For details about the root cause, test code, and fix code, click here.

Solution

  1. This issue has been resolved in glibc 2.34. You are advised to use glibc 2.34 or later to train models.
  2. Patch glibc in the training environment by referring to the fix code mentioned in Possible Causes.
  3. Before starting the training process, run the following command in the same shell: export LD_PRELOAD=/usr/lib64/libstdc++.so.6
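
Before choosing one of the solutions above, it can help to confirm which glibc version the training environment actually runs and whether libstdc++ is already preloaded. The following is a minimal sketch, not part of the original procedure. The helper names (parse_glibc_version, current_glibc_version, predates_fix, libstdcxx_preloaded) are hypothetical, and the check only compares the version number against 2.34, the release named above. It cannot tell whether a distribution has backported the fix to an older glibc.

```python
import os
import re


def parse_glibc_version(text):
    """Extract (major, minor) from a string such as 'glibc 2.17'."""
    match = re.search(r"(\d+)\.(\d+)", text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def current_glibc_version():
    """Return the running glibc version as (major, minor), or None if unknown."""
    try:
        # CS_GNU_LIBC_VERSION is a glibc-specific confstr name, e.g. 'glibc 2.17'.
        text = os.confstr("CS_GNU_LIBC_VERSION")
    except (ValueError, OSError, AttributeError):
        # Not a glibc system, or confstr is unavailable on this platform.
        return None
    return parse_glibc_version(text) if text else None


def predates_fix(version, fixed_in=(2, 34)):
    """True if the version is older than the release in which the issue is resolved.

    Assumption: only the version number is compared; backported patches
    are not detected.
    """
    return version is not None and version < fixed_in


def libstdcxx_preloaded(environ=None):
    """True if LD_PRELOAD lists a libstdc++ shared object."""
    env = os.environ if environ is None else environ
    # LD_PRELOAD entries may be separated by colons or whitespace.
    entries = re.split(r"[:\s]+", env.get("LD_PRELOAD", ""))
    return any(os.path.basename(e).startswith("libstdc++.so") for e in entries if e)


if __name__ == "__main__":
    version = current_glibc_version()
    if version is None:
        print("Could not determine the glibc version.")
    else:
        print("glibc version: %d.%d" % version)
        if predates_fix(version):
            print("This glibc predates 2.34; consider one of the solutions above.")
    print("libstdc++ preloaded:", libstdcxx_preloaded())
```

Run the script inside the training container with the same environment variables that the training process uses, so that the LD_PRELOAD check reflects what the training process will see.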