Cluster Scheduling Scenario (Full-Stack)

Scenario

This scenario applies when you have one or more NPU servers and want to manage them with Kubernetes. It also lets you use features such as NPU Device Management, NPU Scheduling Optimization, Resumable Training, and Rescheduling Upon Inference Card Faults. In this scenario, Docker, Kubernetes, and the NPU cluster scheduling components are installed on the NPU servers.

List of Components to Be Installed

Table 1 Components to be installed in the cluster scheduling scenario (full-stack)

Component              Description
---------------------  ------------------------------------------------------------
Docker                 Container engine
K8s                    Container orchestration system
Ascend Docker Runtime  Allows containers to use Ascend NPUs.
Ascend Device Plugin   Supports NPU device management.
Volcano                Supports NPU scheduling optimization, resumable training, and rescheduling upon inference card faults.
HCCL-Controller        Generates the ranktable file (also called the hccl.json file) for NPU training jobs.
NodeD                  Supports resumable training (upon node faults).
NPU-Exporter           Supports monitoring of NPU device management status.
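
The component list in Table 1 can also be kept as a small lookup structure, for example to check an installation plan against it. The following is a minimal sketch in Python, not part of any Ascend tooling. The names COMPONENTS, components_for, and missing_components are hypothetical and introduced only for illustration. The keyword search only matches text in the descriptions transcribed from Table 1, so it should be read as a convenience lookup rather than an authoritative dependency list.

```python
# Component -> description, transcribed from Table 1.
# This dictionary is an illustrative assumption, not an official manifest.
COMPONENTS = {
    "Docker": "Container engine",
    "K8s": "Container orchestration system",
    "Ascend Docker Runtime": "Allows containers to use Ascend NPUs",
    "Ascend Device Plugin": "Supports NPU device management",
    "Volcano": (
        "Supports NPU scheduling optimization, resumable training, "
        "and rescheduling upon inference card faults"
    ),
    "HCCL-Controller": (
        "Generates the ranktable file (hccl.json) for NPU training jobs"
    ),
    "NodeD": "Supports resumable training (upon node faults)",
    "NPU-Exporter": "Supports monitoring of NPU device management status",
}


def components_for(keyword):
    """Return components whose description mentions the keyword.

    The match is a case-insensitive substring search over Table 1 text.
    """
    needle = keyword.lower()
    return [name for name, desc in COMPONENTS.items() if needle in desc.lower()]


def missing_components(installed):
    """Return Table 1 components that are absent from the given names."""
    installed_set = set(installed)
    return [name for name in COMPONENTS if name not in installed_set]


if __name__ == "__main__":
    print(components_for("resumable training"))
    print(missing_components(["Docker", "K8s"]))
```

For instance, searching for "resumable training" returns the two components whose descriptions mention it (Volcano and NodeD), and passing a list of already installed component names to missing_components returns the remaining entries of Table 1 in table order.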