amp_C Cannot Be Found When a Training Job Using the PyTorch Framework Is Executed

Symptom

After the watchdog function is enabled, a PyTorch training job is submitted and fails with an error message indicating that amp_C cannot be found.

Cause Analysis

Python cannot find the megatron_npu path in the image because that path is not included in PYTHONPATH.

Solution

In train_start.sh, add the export PYTHONPATH line shown at the end of the following excerpt so that Python can find megatron_npu.
...
# env for breakpoint ckpt
export RESUME_MODE_ENABLE=1

# MEGATRON_LMpath is a placeholder; replace it with the Megatron-LM path in the image.
export PYTHONPATH=$PYTHONPATH:MEGATRON_LMpath/megatron_npu
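
If the path is mistyped, the export above still succeeds silently and the job fails later with the same error. As an optional safeguard, the export can be wrapped in a small check. The following is a minimal sketch only: add_to_pythonpath is a hypothetical helper name that is not part of train_start.sh or megatron_npu, and it assumes a POSIX-compatible shell.

```shell
# Hypothetical helper (not part of the original script): append a directory
# to PYTHONPATH only if it exists and is not already listed.
add_to_pythonpath() {
    dir="$1"
    if [ ! -d "$dir" ]; then
        # Report the problem now instead of failing later at import time.
        echo "warning: directory not found: $dir" >&2
        return 1
    fi
    case ":${PYTHONPATH:-}:" in
        *":$dir:"*)
            # Already on the search path; nothing to do.
            ;;
        *)
            # Avoid a leading colon when PYTHONPATH is empty or unset.
            PYTHONPATH="${PYTHONPATH:+$PYTHONPATH:}$dir"
            export PYTHONPATH
            ;;
    esac
}
```

With this helper defined in train_start.sh, the export line could be replaced by add_to_pythonpath MEGATRON_LMpath/megatron_npu, where MEGATRON_LMpath is still the placeholder for the actual Megatron-LM path. A warning on stderr then indicates that the path does not exist in the image.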