Feature Introduction

This section describes the features supported in installation and deployment scenarios.

NPU Device Management

Discovers NPU devices and monitors their status through the Kubernetes device plugin mechanism.
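As an illustration only, the sketch below models the reporting pattern that a Kubernetes device plugin follows: it advertises each device with an ID and a health state ("Healthy" or "Unhealthy") and sends a fresh device list whenever the state changes. The names `Device` and `watch_devices` and the snapshot format are hypothetical; this is not the actual plugin code, which talks to the kubelet over gRPC.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Iterator, List, Optional

# Health strings used by the Kubernetes device plugin API.
HEALTHY = "Healthy"
UNHEALTHY = "Unhealthy"


@dataclass(frozen=True)
class Device:
    """Hypothetical stand-in for one advertised NPU."""
    id: str
    health: str


def watch_devices(snapshots: Iterable[Dict[str, bool]]) -> Iterator[List[Device]]:
    """Yield the full device list each time the observed state changes.

    `snapshots` is an assumed input: successive {device_id: is_ok} readings
    from some health probe.
    """
    last: Optional[List[Device]] = None
    for snapshot in snapshots:
        current = [
            Device(dev_id, HEALTHY if ok else UNHEALTHY)
            for dev_id, ok in sorted(snapshot.items())
        ]
        if current != last:
            yield current
            last = current


if __name__ == "__main__":
    readings = [
        {"npu-0": True, "npu-1": True},
        {"npu-0": True, "npu-1": True},   # unchanged, so nothing is reported
        {"npu-0": True, "npu-1": False},  # npu-1 turned faulty
    ]
    for update in watch_devices(readings):
        print(update)
```

Reporting only on change keeps the update stream small while still letting the scheduler stop placing workloads on a device as soon as it is marked unhealthy.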

NPU Scheduling Optimization

Selects NPUs for a job based on the physical topology of the NPUs to maximize NPU performance.
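The actual selection policy is not described here. The following minimal sketch shows one plausible form of topology-aware selection, under the assumption that the NPUs on a server are divided into interconnect groups and that a job performs best when all of its NPUs come from the same group. The function `select_npus` and the group names are hypothetical.

```python
from typing import Dict, List, Optional


def select_npus(free_by_group: Dict[str, List[int]], count: int) -> Optional[List[int]]:
    """Pick `count` free NPUs from a single interconnect group.

    Among the groups that can satisfy the request, choose the one with the
    fewest free NPUs (best fit) so that larger groups stay available for
    larger jobs. Return None if no single group has enough free NPUs.
    """
    if count <= 0:
        return []
    candidates = [
        (len(ids), name)
        for name, ids in free_by_group.items()
        if len(ids) >= count
    ]
    if not candidates:
        return None
    _, best_group = min(candidates)
    return sorted(free_by_group[best_group])[:count]


if __name__ == "__main__":
    # Assumed layout: two groups of four NPUs, with NPU 3 already in use.
    free = {"group-a": [0, 1, 2], "group-b": [4, 5, 6, 7]}
    print(select_npus(free, 2))
    print(select_npus(free, 4))
    print(select_npus(free, 5))
```

A production scheduler would also have to handle requests that span groups or nodes; the sketch simply declines them.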

Resumable Training

When an NPU or a server becomes faulty, the training job is automatically rescheduled to a healthy NPU or node, where training continues.
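The rescheduling decision can be sketched as follows. This is a simplified model, not the product's implementation: the `Node` and `Job` types, the `reschedule` function, and the assumption that the job resumes from its last saved checkpoint step are all introduced here for illustration.

```python
from dataclasses import dataclass, replace
from typing import Dict, Optional


@dataclass
class Node:
    """Hypothetical view of a server and its free NPUs."""
    name: str
    healthy: bool
    free_npus: int


@dataclass(frozen=True)
class Job:
    """Hypothetical training job; the checkpoint step is an assumed resume point."""
    name: str
    node: str
    npus: int
    last_checkpoint_step: int


def reschedule(job: Job, nodes: Dict[str, Node]) -> Optional[Job]:
    """Return the job placed on a healthy node, or None if none has capacity."""
    current = nodes.get(job.node)
    if current is not None and current.healthy:
        return job  # nothing to do, the job's node is still healthy

    # Scan nodes in name order so that the choice is deterministic.
    for node in sorted(nodes.values(), key=lambda n: n.name):
        if node.healthy and node.free_npus >= job.npus:
            node.free_npus -= job.npus
            # The checkpoint step is carried over so training can resume from it.
            return replace(job, node=node.name)
    return None


if __name__ == "__main__":
    cluster = {
        "node-a": Node("node-a", healthy=False, free_npus=8),
        "node-b": Node("node-b", healthy=True, free_npus=8),
    }
    job = Job("train-1", node="node-a", npus=8, last_checkpoint_step=1200)
    print(reschedule(job, cluster))
```

Returning None when no healthy node has enough free NPUs leaves the caller free to queue the job and retry later.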

Rescheduling Upon Inference Card Faults

When an NPU becomes faulty, the inference job is automatically rescheduled to a healthy device, where inference continues.

Minimum Service System

When an NPU or a server becomes faulty, the training job is automatically rescheduled and continues on a healthy device.