Error: "Segmentation fault"

Symptom

When the migrated code runs, it reports no other error and prints only the message "Segmentation fault".

Possible Cause

Possible Cause 1:

TensorBoard is referenced in the code, or a third-party library in use references TensorBoard. Known third-party libraries that reference TensorBoard include:

  • wandb: If this library is used only for printing logs, you can delete the call to it.
  • transformers: This library is deeply bound to TensorFlow and TensorBoard.

Possible Cause 2:

The training script compares two 0-dimensional tensors that are on different devices. torch_npu does not currently support this comparison.

Solution

Solution for Cause 1:

Comment out the related Summary and Writer calls to avoid this error. Summary and Writer are mainly used to record logs and plot charts, so removing them does not affect network execution or accuracy convergence.
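If the writer is called in many places, commenting out every call can be tedious. One possible alternative, shown here only as a sketch, is to replace the writer object with a stub that silently ignores every call. NoOpWriter, the log_dir value, and the tag name are hypothetical and are not part of TensorBoard or torch_npu. The TensorBoard import itself would presumably still need to be removed or commented out.

```python
class NoOpWriter:
    """Hypothetical stand-in for a TensorBoard writer; it records nothing."""

    def __init__(self, *args, **kwargs):
        # Accept and discard whatever arguments the real writer was given.
        pass

    def __getattr__(self, name):
        # Called only for attributes that do not exist on the stub, so any
        # method name (add_scalar, flush, close, ...) resolves to a function
        # that ignores its arguments and returns None.
        def _ignore(*args, **kwargs):
            return None
        return _ignore


# Before (assumed original code): writer = SummaryWriter(log_dir="runs/exp1")
writer = NoOpWriter(log_dir="runs/exp1")
writer.add_scalar("loss/train", 0.42, 1)  # silently ignored
writer.close()                            # silently ignored
```

The existing call sites stay unchanged, which makes it easy to restore the real writer later.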

Solution for Cause 2:

Start the script with python -X faulthandler (that is, insert -X faulthandler between python and the script name in the startup command) so that thread information is printed when the crash occurs. Use this output to locate the failing line, then debug with PDB. Check whether the script compares two 0-dimensional tensors that are on different devices. If it does, manually change the code so that both tensors are on the same device before they are compared. For example:

Before the modification, a is on the CPU and b is on the NPU:

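import torch
import torch_npu  # registers the .npu() method

a = torch.tensor(123)        # 0-dimensional tensor on the CPU
b = torch.tensor(456).npu()  # 0-dimensional tensor on the NPU
print(a == b)                # cross-device comparison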

After the modification, .npu() is added to the definition of a so that both tensors are compared on the NPU:

a = torch.tensor(123).npu()
b = torch.tensor(456).npu()
print(a == b)
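For the locating step in the solution for Cause 2, faulthandler can also be enabled from inside the script instead of on the command line. The following is a minimal sketch that uses only the Python standard library. Placing the call at the very top of the training entry script is an assumption about how the script is organized.

```python
import faulthandler

# Has the same effect as starting the interpreter with `python -X faulthandler`:
# if the process receives a fatal signal such as SIGSEGV, the Python traceback
# of every thread is written to stderr before the process exits.
faulthandler.enable()

# ... the rest of the training script would follow here ...
```

Enabling it as early as possible helps ensure that a crash during later imports is also captured.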