Feature Introduction
This section describes the features supported in installation and deployment scenarios.
NPU Device Management
Supports NPU device discovery and status monitoring based on the Kubernetes device plug-in mechanism.
NPU Scheduling Optimization
Selects NPUs for a job based on their physical topology to maximize NPU performance.
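The idea behind topology-based selection can be illustrated with a minimal sketch. It is not the product's scheduling algorithm: the two four-NPU interconnect groups, the `GROUPS` constant, and the `select_npus` function are all hypothetical and assumed only for illustration. The sketch prefers to place a job inside a single group and spans groups only when no single group has enough free NPUs.

```python
# Hypothetical topology (an assumption, not taken from the product):
# 8 NPUs on one server, split into two interconnect groups of 4.
GROUPS = [frozenset({0, 1, 2, 3}), frozenset({4, 5, 6, 7})]


def select_npus(free, count, groups=GROUPS):
    """Pick `count` free NPUs, preferring a single interconnect group.

    Returns a list of NPU IDs, or None if too few NPUs are free.
    """
    # Ignore IDs that do not belong to any known group.
    free = set(free) & set().union(*groups)
    if count <= 0 or len(free) < count:
        return None

    # Free NPUs per group, keeping only groups that can hold the whole job.
    fitting = [g & free for g in groups if len(g & free) >= count]
    if fitting:
        # Use the tightest fit so that emptier groups stay intact.
        best = min(fitting, key=len)
        return sorted(best)[:count]

    # No single group fits: span groups, fullest group first.
    chosen = []
    for part in sorted((g & free for g in groups), key=len, reverse=True):
        chosen.extend(sorted(part))
    return chosen[:count]
```

For example, with NPUs 0, 1, 4, 5, 6, and 7 free, a request for two NPUs returns `[0, 1]`, which leaves the fully free group available for a later four-NPU job.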
Resumable Training
When an NPU or a server becomes faulty, the training job is automatically rescheduled to a healthy NPU or node, where training continues.
Rescheduling Upon Inference Card Faults
When an NPU becomes faulty, the inference job is automatically rescheduled to a healthy device, where inference continues.
Minimum Service System
When an NPU or a server is faulty, a training job is automatically rescheduled and a healthy device is used to continue the training job.