amp_C Cannot Be Found When a Training Job Using the PyTorch Framework Is Executed
Symptom
After the watchdog function is enabled and a PyTorch training job is submitted, the job fails with an error message indicating that amp_C cannot be found.
Cause Analysis
The megatron_npu path cannot be found in the image.
Solution
Add the PYTHONPATH line shown below to train_start.sh so that Python can locate megatron_npu. MEGATRON_LMpath is a placeholder for the Megatron-LM path in the image.
...
# env for breakpoint ckpt
export RESUME_MODE_ENABLE=1
export PYTHONPATH=$PYTHONPATH:MEGATRON_LMpath/megatron_npu
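To confirm that the change took effect, a quick check can be run inside the container with the same environment as the training job. The following is a minimal sketch, not part of the product: the helper names find_on_pythonpath and module_available are hypothetical, and it assumes that amp_C becomes importable once the megatron_npu directory is on the Python search path, as the cause analysis suggests.

```python
import importlib.util
import os
import sys


def find_on_pythonpath(dirname, paths=None):
    """Return the search-path entries whose last component equals dirname.

    Hypothetical helper: defaults to sys.path, which includes PYTHONPATH entries.
    """
    paths = sys.path if paths is None else paths
    return [p for p in paths if os.path.basename(os.path.normpath(p)) == dirname]


def module_available(name):
    """Return True if Python can locate the top-level module, without importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # An empty list here suggests the export in train_start.sh was not applied.
    print("megatron_npu entries on sys.path:", find_on_pythonpath("megatron_npu"))
    # Assumption: amp_C is resolved through the megatron_npu path.
    print("amp_C can be located:", module_available("amp_C"))
```

If the first line prints an empty list, the PYTHONPATH export was probably not picked up by the process that starts the training job. If the directory is listed but amp_C still cannot be located, the path substituted for MEGATRON_LMpath may not match the actual location in the image.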