Job Progress Viewing
After a training job is delivered successfully, it starts running. You can check its running status as follows.
Viewing All Training Jobs
To view the pods of all training jobs in the cluster, perform the following steps:
- Log in to the management node and go to the directory where the YAML file is stored.
- Run the following command to check the running status of the training job:
kubectl get pods -A -o wide
A sample command output is as follows:
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP                NODE    NOMINATED NODE   READINESS GATES
default     default-test-pytorch-master-0   1/1     Running   0          5s    xxx.xxx.xxx.xxx   node1   <none>           <none>
default     default-test-pytorch-worker-0   1/1     Running   0          5s    xxx.xxx.xxx.xxx   node2   <none>           <none>
...
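When many pods are listed, it can help to print only those that are not in the Running state. The following is a minimal sketch, not part of the product: pods_not_running is a hypothetical helper name, and it assumes that STATUS is the fourth column, as it is in the -A output shown above.

```shell
# Hypothetical helper: reads `kubectl get pods -A --no-headers` output on stdin
# and prints "namespace/name status" for every pod whose STATUS is not Running.
# Assumption: STATUS is the fourth whitespace-separated field (true with -A).
pods_not_running() {
    awk '$4 != "Running" { print $1 "/" $2, $4 }'
}

# Usage on the management node (requires cluster access):
#   kubectl get pods -A --no-headers | pods_not_running
```

Empty output means every listed pod reports Running. Pods that have already finished (for example, in the Completed state) are also printed, so review the list before treating an entry as a fault.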
Viewing the Training Job on a Single Pod
To view the training job running on a single pod, perform the following steps:
- Run the following command to check the running status of the training job:
kubectl logs default-test-pytorch-worker-0 -n default -f
A sample command output is as follows. If "loss" is displayed, the job is running properly.
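As a non-interactive alternative to following the log with -f, the check for "loss" can be scripted. This is only a sketch under stated assumptions: last_loss_line is a hypothetical helper name, and the match is a plain case-insensitive search for the word "loss", so the exact log format of your training script does not matter.

```shell
# Hypothetical helper: reads log text on stdin and prints the most recent line
# that mentions "loss". Returns non-zero if no such line is found.
last_loss_line() {
    # The final `grep .` makes the pipeline fail when nothing matched.
    grep -i 'loss' | tail -n 1 | grep .
}

# Usage on the management node (requires cluster access):
#   kubectl logs default-test-pytorch-worker-0 -n default --tail=200 | last_loss_line
```

A non-zero exit status only means that "loss" did not appear in the lines inspected. A job that is still initializing may not have printed it yet.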

Checking Whether Checkpoint Files Exist
The fault recovery function is implemented based on checkpoint files. You need to check whether the checkpoint files exist on the storage node.
Allow the training job to run beyond the checkpoint saving interval, then verify that periodic checkpoint files appear in the specified path. The procedure is as follows:
- Log in to the storage node and go to the checkpoint file path.
cd /data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/output/ckpt
- Check whether periodic checkpoint files exist in the current directory.
ll ./
If the following information is displayed, periodic checkpoint files exist.
total 8
drwx-xr-x- 18 root root 8192 Jun 22 18:39 iter_0000100
-rw-r--r-- 1 root root 2 Jun 22 18:39 latest_checkpointed_iteration.txt
- (Optional) If dying gasp is used, run the following command in the path where the checkpoint is saved to check whether the dying gasp checkpoint file exists in the current directory.
ll ./
If the following information is displayed, the dying gasp checkpoint file exists.
total 8
drwx-xr-x- 18 root root 8192 Jun 22 15:39 iter_0000009
-rw-r--r-- 1 root root 2 Jun 22 15:39 latest_checkpointed_iteration.txt
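The manual checks above can also be sketched as a small script. Two assumptions are made here and should be verified against your own checkpoint directory: latest_checkpointed_iteration.txt contains a plain iteration number, and checkpoint directories are named iter_ followed by that number zero-padded to seven digits, as in the listings above. The name latest_ckpt_dir is hypothetical.

```shell
# Hypothetical helper: prints the checkpoint directory that
# latest_checkpointed_iteration.txt points to, and fails if the tracker file
# or the directory is missing.
# Assumption: the tracker file holds a plain iteration number (for example 100).
latest_ckpt_dir() {
    ckpt_root=$1
    tracker="$ckpt_root/latest_checkpointed_iteration.txt"
    [ -f "$tracker" ] || return 1
    # Build the directory name, for example 100 -> iter_0000100.
    iter_dir=$(awk 'NR == 1 { printf "iter_%07d", $1 }' "$tracker")
    [ -n "$iter_dir" ] || return 1
    [ -d "$ckpt_root/$iter_dir" ] || return 1
    printf '%s\n' "$ckpt_root/$iter_dir"
}

# Usage on the storage node:
#   latest_ckpt_dir /data/atlas_dls/public/code/QWEN3_for_PyTorch_2.7_code/output/ckpt
```

The function only confirms that the directory named by the tracker file exists. It does not validate the contents of the checkpoint.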