Feature Description
- To use this feature, you must enable NodeD monitoring on the node where faults are to be monitored. For details about the configuration method, see NodeD Monitoring Configuration on a Node.
- The number of replicas of a new job ranges from minReplicas to replicas; the actual value depends on the number of available nodes in the current cluster. This parameter is valid only for multi-node distributed training.
- When the rescheduling policy is enabled, a fault in the Ascend Device Plugin also triggers job rescheduling.
- It is recommended that the IP address used to check the connectivity between NPU chips be set to the IP address of the router.
- If a fault occurs during training on a single node with multiple devices, the job is preferentially restored with its original specifications; otherwise, the specifications fall back according to the 8/4/2/1-device restoration policy.
- The following table lists the system specifications supported by the minimum service system.
Table 1 System specifications

| Type | Configuration |
| --- | --- |
| Server | Atlas 800 training server (model 9000), fully configured with NPUs |
| Training framework | MindSpore, TensorFlow, or PyTorch. Only the MindSpore framework supports the dying gasp feature of resumable training. |
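The two sizing rules described above (clamping the replica count between minReplicas and replicas, and falling back through the 8/4/2/1-device restoration specifications) can be sketched as follows. This is a minimal illustration only; the function names and exact clamping behavior are assumptions, not part of the product's API.

```python
# Hypothetical sketch of the replica-scaling and restoration rules
# described in this section; names here are illustrative assumptions.
from typing import Optional

# Allowed single-node device counts under the 8/4/2/1 restoration policy.
RESTORE_SPECS = [8, 4, 2, 1]

def effective_replicas(min_replicas: int, replicas: int,
                       available_nodes: int) -> int:
    """Clamp the replica count of a new job between minReplicas and
    replicas, based on the number of available nodes in the cluster."""
    return max(min_replicas, min(replicas, available_nodes))

def restore_spec(healthy_devices: int) -> Optional[int]:
    """Return the largest device count from the 8/4/2/1 policy that
    fits the healthy devices remaining on the node, or None if none fits."""
    for spec in RESTORE_SPECS:
        if healthy_devices >= spec:
            return spec
    return None
```

For example, with minReplicas=2, replicas=8, and 5 available nodes, the job is rescheduled with 5 replicas; a single-node job left with 6 healthy devices would be restored with the 4-device specification.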
Parent topic: Example of Minimum Service System