Constraints

  • Ensure that the system time is synchronized across all nodes in the Kubernetes cluster; otherwise, faults may be determined incorrectly.
  • While the Resilience-Controller is rescheduling a job, it does not handle new faults that occur on that job.
  • In scenarios where cluster resources are limited, if multiple jobs are faulty at the same time and rescheduling is triggered, the jobs may remain in the Pending state due to insufficient resources.
  • Users must isolate NFS directories as required. The random read/write performance of the NFS storage must be sufficient to write a complete checkpoint (CKPT) file within 15 minutes. You are advised to use professional storage servers. The specific performance requirements are as follows:
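Independently of the formal performance requirements, a quick way to sanity-check whether an NFS mount can absorb a checkpoint within the 15-minute window is a short sequential-write probe. This is a hypothetical sketch, not part of the product; the function names, probe size, and estimation method are assumptions:

```python
import os
import tempfile
import time

def estimate_write_seconds(ckpt_bytes, target_dir, probe_bytes=8 * 1024 * 1024):
    """Probe sequential write throughput of target_dir (e.g. an NFS mount)
    and estimate how long a checkpoint of ckpt_bytes would take to write."""
    fd, path = tempfile.mkstemp(dir=target_dir)
    try:
        chunk = b"\0" * (1024 * 1024)
        written = 0
        start = time.monotonic()
        while written < probe_bytes:
            written += os.write(fd, chunk)
        os.fsync(fd)  # flush to storage so client-side caching does not skew timing
        elapsed = time.monotonic() - start
    finally:
        os.close(fd)
        os.remove(path)
    throughput = written / max(elapsed, 1e-9)  # bytes per second
    return ckpt_bytes / throughput

def fits_in_window(ckpt_bytes, target_dir, window_s=15 * 60):
    """True if a full CKPT file is likely to be written within window_s seconds."""
    return estimate_write_seconds(ckpt_bytes, target_dir) <= window_s
```

A short probe only approximates sustained throughput; treat the result as a coarse screen, not a substitute for the storage requirements above.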

Dependency Configuration

  • This feature detects only device faults and server network faults. For device faults, see the critical and major errors reported by DCMI in the Ascend 910 Black Box Error Code Information List, as well as the device network faults detected by the device network detection tool (hccn_tool). Server network fault detection depends on the NodeD heartbeat mechanism: if NodeD is incorrectly installed or the network between nodes is disconnected, fault detection is affected.
  • This feature does not apply to virtual instance scenarios.
  • This feature depends on the Resilience-Controller, Volcano, HCCL-Controller, NodeD, and Ascend Device Plugin cluster scheduling components. To use the dying gasp function, the mindx_elastic binary file is also required.
  • Currently, this feature supports only distributed vcjob training across servers in data parallel and hybrid parallel modes.
  • Add fault-scheduling: grace (fault rescheduling switch), elastic-scheduling, and minReplicas (minimum nodes required by a job) to the YAML file of a vcjob. For details about their values, see Job Configuring.
  • For the dying gasp function, set terminationGracePeriodSeconds in the YAML file of a vcjob. For details, see Table 2.
  • Set the maxRetry field to 0 in the YAML file of the vcjob.
  • Node fault rule: If NodeD does not send another heartbeat message within the threshold interval after the previous one, Resilience-Controller and Volcano consider the node where NodeD is located faulty and trigger rescheduling. If the interval between two heartbeat messages sent by NodeD is less than or equal to the threshold, Resilience-Controller and Volcano consider the node healthy.

    The calculation formula is as follows: Threshold of the interval between two heartbeat messages = Interval for sending heartbeat messages x 3 (3 indicates that Resilience-Controller and Volcano retry three times.)

    • The dying gasp function uses the SIGTERM and SIGINT signals.
    • The dying gasp function does not support encryption and decryption of saved checkpoint files.
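Taken together, the YAML settings listed above might look like the following hypothetical vcjob fragment. The job name, image, replica counts, and the exact placement of the fault-scheduling, elastic-scheduling, and minReplicas labels are assumptions; check them against Job Configuring and Table 2:

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-demo-job                  # hypothetical job name
  labels:
    fault-scheduling: "grace"           # fault rescheduling switch
    elastic-scheduling: "on"            # assumed value; see Job Configuring
    minReplicas: "2"                    # minimum nodes required by the job
spec:
  maxRetry: 0                           # required: set maxRetry to 0
  tasks:
    - replicas: 4
      template:
        spec:
          terminationGracePeriodSeconds: 360   # dying gasp; see Table 2
          containers:
            - name: train
              image: example/train:latest      # hypothetical image
```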
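The node fault rule above can be sketched in a few lines; the function names are illustrative, and the retry count of 3 comes from the formula in this section:

```python
HEARTBEAT_RETRIES = 3  # Resilience-Controller and Volcano retry three times

def heartbeat_threshold(interval_s: float) -> float:
    """Threshold between two heartbeats = send interval x 3."""
    return interval_s * HEARTBEAT_RETRIES

def node_faulty(last_heartbeat_s: float, now_s: float, interval_s: float) -> bool:
    """A node is considered faulty when the gap since its last NodeD heartbeat
    exceeds the threshold; a gap less than or equal to the threshold means the
    node is considered healthy."""
    return (now_s - last_heartbeat_s) > heartbeat_threshold(interval_s)
```

For example, with a 30-second heartbeat interval the threshold is 90 seconds, so a node whose last heartbeat is more than 90 seconds old triggers rescheduling.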