Viewing Results
This section uses four-node training of the pangu-alpha model as an example to describe how to check that resumable training works as expected when a training node becomes faulty.
- Symptom 1: Check whether the model storage path contains model files with the suffix breakpoint.ckpt. The number of files equals the number of training nodes multiplied by the number of processors on a single node. In this example, 32 model files are generated (4 nodes × 8 processors per node), as shown in the following figure.
Figure 1 Generating model files

- Symptom 2: View the ModelArts logs to check whether the breakpoint.ckpt model file is loaded successfully. If it is, log information similar to the following is displayed. (This is the output of a single process on one node. Check that the logs of all nodes contain the complete output. In this example, there are 32 such log entries.)
Start to load from /efs/pangu/ckpt_deviceos/rank_0/pangu0-493_2_breakpoint.ckpt
- Symptom 3: When a fault occurs, the MindSpore framework captures the exception. Check the ModelArts logs, where the following information is displayed.
Figure 2 ModelArts logs
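The file count described in Symptom 1 can also be checked with a short script instead of by eye. The following is a minimal sketch, not part of ModelArts or MindSpore: the function name count_breakpoint_checkpoints is hypothetical, and it assumes that the model storage path is mounted locally and that each process writes exactly one file ending in breakpoint.ckpt somewhere below that path (for example, in a rank_N subdirectory, as the sample log path suggests).

```python
from pathlib import Path


def count_breakpoint_checkpoints(ckpt_root, nodes, devices_per_node):
    """Return (found, expected) counts of breakpoint checkpoint files.

    ckpt_root: local path of the model storage directory (assumed to be mounted).
    nodes, devices_per_node: training topology, e.g. 4 and 8 in this example.
    """
    # Expected count = number of training nodes x processors per node.
    expected = nodes * devices_per_node
    # Search all subdirectories for files whose names end in breakpoint.ckpt.
    found = sum(
        1 for path in Path(ckpt_root).rglob("*breakpoint.ckpt") if path.is_file()
    )
    return found, expected
```

For the example in this section, calling count_breakpoint_checkpoints("/efs/pangu/ckpt_deviceos", nodes=4, devices_per_node=8) is expected to report 32 for both values if every process saved its checkpoint. A smaller found value suggests that some processes did not write a breakpoint file.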

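The log check described in Symptom 2 can be sketched in the same way. This example assumes that the ModelArts logs have been downloaded as plain text and that the load message has the same form as the sample line above ("Start to load from" followed by a path ending in breakpoint.ckpt). The function name loaded_checkpoints is hypothetical and is used only for illustration.

```python
import re

# Pattern based on the sample log line in this section; the exact
# message format in other versions is an assumption.
LOAD_PATTERN = re.compile(r"Start to load from (\S+breakpoint\.ckpt)")


def loaded_checkpoints(log_text):
    """Return the set of distinct breakpoint checkpoint paths reported as loaded."""
    return set(LOAD_PATTERN.findall(log_text))
```

In this example, the combined logs of all four nodes are expected to yield 32 distinct paths, one per process. Using a set means a message that is printed more than once for the same file is counted only once.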