Viewing Results

This part uses four-node training of the pangu-alpha model as an example to describe how to verify that resumable training works as expected after a training node becomes faulty.

  1. Symptom 1: Check whether the model storage path contains model files with the suffix breakpoint.ckpt. The number of files equals the number of training nodes multiplied by the number of processors on a single node. In this example, 32 model files are generated (4 nodes with 8 processors each), as shown in the following figure.
    Figure 1 Generating model files
  2. Symptom 2: View the ModelArts logs to check whether the breakpoint.ckpt model files are loaded successfully. If they are, log information similar to the following is displayed. (This is the output of a single process on one node. Check that the logs of all nodes contain the complete output. In this example, the logs contain 32 entries similar to the following.)
    Start to load from /efs/pangu/ckpt_deviceos/rank_0/pangu0-493_2_breakpoint.ckpt
  3. Symptom 3: When a fault occurs, the MindSpore framework captures the exception. Check the ModelArts logs for information similar to that shown in the following figure.
    Figure 2 ModelArts logs
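
The first two checks above can be scripted. The following is a minimal sketch, not part of ModelArts or MindSpore: the function names, the directory layout (one rank_N folder per process), and the assumption that the logs are available as plain text are all hypothetical. Only the breakpoint.ckpt suffix, the "Start to load from" log text, and the 4-node, 32-file figures come from the example above.

```python
import re
from pathlib import Path

# Suffix of resumable-training checkpoints, as described in Symptom 1.
CKPT_SUFFIX = "breakpoint.ckpt"

# Log line shown in Symptom 2; the capture group is the checkpoint path.
LOAD_PATTERN = re.compile(r"Start to load from (\S+breakpoint\.ckpt)")


def expected_ckpt_count(num_nodes, processors_per_node):
    """Expected file count: nodes multiplied by processors per node."""
    return num_nodes * processors_per_node


def count_breakpoint_ckpts(ckpt_root):
    """Count files under ckpt_root (searched recursively) with the suffix.

    Assumes ckpt_root is a locally mounted directory (hypothetical layout).
    """
    return sum(
        1 for p in Path(ckpt_root).rglob("*" + CKPT_SUFFIX) if p.is_file()
    )


def loaded_ckpts(log_text):
    """Return the distinct checkpoint paths reported as loaded in the logs."""
    return set(LOAD_PATTERN.findall(log_text))


def resume_looks_healthy(ckpt_root, log_text, num_nodes, processors_per_node):
    """True if both the file count and the load-log count match expectations."""
    expected = expected_ckpt_count(num_nodes, processors_per_node)
    return (
        count_breakpoint_ckpts(ckpt_root) == expected
        and len(loaded_ckpts(log_text)) == expected
    )
```

For the example in this part, resume_looks_healthy would be called with num_nodes=4 and processors_per_node=8, so both the file count and the number of distinct load entries would need to equal 32. Symptom 3 is not covered, because the exact exception text appears only in the figure.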