Constraints
- Only Kubernetes clusters that use the cluster scheduling components are supported. In addition, ensure that the system time is synchronized across all nodes in the Kubernetes cluster to avoid misjudgment.
- When using NFS, isolate directories as required. The random read/write performance of the NFS must be sufficient to save complete checkpoint (CKPT) files within 15 minutes. You are advised to use professional storage servers. The specific performance requirements are as follows:
- Resumable training supports only the detection of device faults and server network faults. Device faults are the critical and major errors reported by DCMI, as listed in the Ascend 910 Black Box Error Code Information List, plus the device network faults detected by the device network detection tool (hccn_tool). Detection of server network faults depends on the NodeD heartbeat mechanism. If NodeD is installed incorrectly or the network between nodes is disconnected, fault detection is affected.
- This feature does not apply to virtual instance scenarios.
- This feature depends on the following cluster scheduling components: Volcano, HCCL-Controller, NodeD, and Ascend Device Plugin. To use the dying gasp function, the mindx_elastic binary file is also required.
- Currently, this feature supports only vcjobs.
- Add the fault-scheduling switch to the YAML file of a vcjob. For details, see Table 1.
- For the dying gasp function, set terminationGracePeriodSeconds in the YAML file of a vcjob. For details, see Table 2.
- Configure the retry mechanism maxRetry in the YAML file of a vcjob. For details, see YAML Parameters.
- Volcano is responsible for checking node faults. If NodeD sends a heartbeat message and then does not send another one within the threshold of the interval between two heartbeat messages, Volcano considers the node where NodeD is located to be faulty and triggers rescheduling. If the interval between two heartbeat messages sent by NodeD is less than or equal to the threshold, Volcano considers that the node has recovered.
The calculation formula is as follows: Threshold of the interval between two heartbeat messages = Interval for sending heartbeat messages × 3 (the factor 3 indicates that Volcano retries three times).
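
The heartbeat rule above can be illustrated with a minimal Python sketch. It is not Volcano's implementation: the function names `heartbeat_threshold` and `node_is_faulty` and the 5-second sending interval are hypothetical and are used only to work through the formula.

```python
# Illustrative only: models the documented rule, not Volcano's actual code.
RETRY_COUNT = 3  # from the formula: Volcano retries three times


def heartbeat_threshold(send_interval_s: float, retries: int = RETRY_COUNT) -> float:
    """Threshold = interval for sending heartbeat messages x retries."""
    if send_interval_s <= 0:
        raise ValueError("send_interval_s must be positive")
    return send_interval_s * retries


def node_is_faulty(last_heartbeat_s: float, now_s: float, send_interval_s: float) -> bool:
    """Return True when the time since the last heartbeat exceeds the threshold."""
    gap = now_s - last_heartbeat_s
    return gap > heartbeat_threshold(send_interval_s)


if __name__ == "__main__":
    interval = 5  # assumed sending interval in seconds, chosen for the example
    print(heartbeat_threshold(interval))     # threshold in seconds
    print(node_is_faulty(0, 16, interval))   # gap above the threshold
    print(node_is_faulty(0, 15, interval))   # gap equal to the threshold
```

With the assumed 5-second interval, the threshold is 15 seconds: a gap of 16 seconds marks the node as faulty, while a gap of exactly 15 seconds does not, because the rule treats "less than or equal to the threshold" as healthy.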
