Feature Introduction
If a training node managed by the cluster scheduling component becomes faulty (for example, a network or chip fault on a node that has an Ascend AI processor installed and NodeD enabled), the cluster scheduling component isolates the faulty node and resets the number of job replicas based on the preset job scale and the number of nodes currently available in the cluster. The job is then rescheduled and retrained. The training script must be adapted to support this.
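The following is a minimal Python sketch of the isolate-and-rescale decision described above. It is an illustration, not the cluster scheduling component's actual code. The names Node, isolate_faulty, and reset_replicas are hypothetical. The selection policy (choose the largest preset job scale that does not exceed the number of available nodes) is also an assumption, because the text does not state the exact rule.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    healthy: bool  # hypothetical flag standing in for the fault status reported by NodeD


def isolate_faulty(nodes):
    """Return only the nodes that remain schedulable after isolation."""
    return [n for n in nodes if n.healthy]


def reset_replicas(preset_scales, available_count):
    """Pick the largest preset job scale that fits the available nodes.

    Assumed policy for illustration. Returns None when no preset scale
    fits, meaning the job cannot be rescheduled yet.
    """
    fitting = [s for s in preset_scales if s <= available_count]
    return max(fitting) if fitting else None


nodes = [
    Node("node-0", True),
    Node("node-1", False),  # simulated chip or network fault
    Node("node-2", True),
    Node("node-3", True),
]

available = isolate_faulty(nodes)
replicas = reset_replicas([1, 2, 4], len(available))
print(replicas)  # prints 2: three nodes remain, and 2 is the largest preset scale that fits
```

In this sketch, a job that started on four nodes is rescaled to two after one node is isolated, because three is not among the assumed preset scales. The training script must then resume with the new replica count, which is why script adaptation is required.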