Feature Introduction

This section describes the features available in installation and deployment scenarios.

NPU Device Management

Discovers NPU devices and monitors their status through the Kubernetes device plug-in mechanism.

NPU Scheduling Optimization

Selects NPUs according to their physical topology to maximize NPU performance.
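
The topology-based selection described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that NPUs are partitioned into interconnect groups and that a job performs best when all of its NPUs come from a single group. The function name `select_npus`, the group labels, and the best-fit policy are hypothetical and do not represent the product's actual scheduling algorithm.

```python
from typing import Dict, List, Optional


def select_npus(free_by_group: Dict[str, List[int]], requested: int) -> Optional[List[int]]:
    """Pick `requested` free NPUs from one interconnect group (hypothetical policy).

    free_by_group maps an assumed group label to the IDs of its free NPUs.
    Returns None when no single group can satisfy the request.
    """
    # Only groups with enough free NPUs can keep the job inside one group.
    candidates = [ids for ids in free_by_group.values() if len(ids) >= requested]
    if not candidates:
        return None
    # Best fit: use the smallest group that still fits, leaving larger
    # groups available for jobs that request more NPUs.
    best = min(candidates, key=len)
    return sorted(best)[:requested]
```

For example, with four free NPUs in one group and two in another, a two-NPU request is placed on the smaller group so that the four-NPU group stays intact for a larger job.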

Resumable Training

When an NPU or a server fails, the training job is automatically rescheduled to a healthy NPU or node, where training continues.
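
The rescheduling behavior described above can be sketched as follows. This is a simplified illustration under stated assumptions: each job occupies exactly one device, device health is supplied as a set of healthy device names, and the function name `reschedule` and the device names are hypothetical. Restoring training state after the move is outside the scope of this sketch.

```python
from typing import Dict, Optional, Set


def reschedule(assignments: Dict[str, str], healthy: Set[str]) -> Dict[str, Optional[str]]:
    """Move jobs off faulty devices onto spare healthy ones (hypothetical sketch).

    assignments maps a job name to the device it currently runs on.
    A job is mapped to None when no spare healthy device is left.
    """
    # Healthy devices that already run a job are not available as spares.
    in_use = {device for device in assignments.values() if device in healthy}
    spare = sorted(healthy - in_use)

    result: Dict[str, Optional[str]] = {}
    for job, device in assignments.items():
        if device in healthy:
            result[job] = device        # Device is healthy: the job stays in place.
        elif spare:
            result[job] = spare.pop(0)  # Device is faulty: move to a spare device.
        else:
            result[job] = None          # No capacity left: the job must wait.
    return result
```

In this sketch, a job on a faulty device is moved only if a spare healthy device exists; otherwise it is left unassigned until capacity becomes available.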

Rescheduling Upon Inference Card Faults

When an NPU fails, the inference job is automatically rescheduled to a healthy device, where inference continues.

Minimum Service System

When an NPU or a server fails, the training job is automatically rescheduled and continues on a healthy device.