Cluster Scheduling Scenario

Scenario

In this scenario, you already have a Kubernetes cluster and want to bring new NPU servers under its management. Once the servers are managed, you can use features such as NPU Device Management, NPU Scheduling Optimization, Resumable Training, and Rescheduling Upon Inference Card Faults. To do this, deploy the master-node NPU management components on the master node of the existing Kubernetes cluster, and deploy the worker-node NPU management components on each newly managed NPU server.

List of Components to Be Installed

Table 1 Components to be installed in the cluster scheduling scenario

| Component | Function |
| --- | --- |
| Ascend Docker Runtime | Allows containers to use Ascend NPUs. |
| Ascend Device Plugin | Supports NPU device management. |
| Volcano | Supports NPU scheduling optimization, resumable training, and rescheduling upon inference card faults. |
| (Optional) HCCL-Controller | Generates the ranktable file (also called the hccl.json file) for NPU training jobs. Install it when you need to use the function. |
| (Optional) NodeD | Supports resumable training upon node faults. Install it when you need to use the function. |
| (Optional) NPU-Exporter | Supports monitoring of NPU device management status. Install it when you need to use the function. |
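
The split between required and optional components in Table 1 can be expressed as a small lookup. The following Python snippet is only an illustrative sketch: the component names come from Table 1, but the feature keys (ranktable_generation, node_fault_resumable_training, npu_monitoring) and the function name are hypothetical labels invented for this example and are not part of any Ascend tool.

```python
# Components listed in Table 1 that are always installed in this scenario.
REQUIRED = ["Ascend Docker Runtime", "Ascend Device Plugin", "Volcano"]

# Optional components from Table 1, keyed by a HYPOTHETICAL feature label
# chosen for this sketch (these keys are not real product identifiers).
OPTIONAL = {
    "ranktable_generation": "HCCL-Controller",
    "node_fault_resumable_training": "NodeD",
    "npu_monitoring": "NPU-Exporter",
}


def components_to_install(features):
    """Return the required components plus the optional ones for `features`."""
    unknown = set(features) - set(OPTIONAL)
    if unknown:
        raise ValueError(f"unknown feature(s): {sorted(unknown)}")
    selected = list(REQUIRED)
    # Iterate over OPTIONAL so the result keeps the order used in Table 1.
    selected += [comp for feat, comp in OPTIONAL.items() if feat in features]
    return selected


if __name__ == "__main__":
    for name in components_to_install({"npu_monitoring"}):
        print(name)
```

Calling components_to_install with an empty set returns only the three required components; each requested feature adds the matching optional component.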

Component Deployment Modes

Installation Procedure

  1. Obtain the software packages for installing the components. For details, see Software Package Description.
  2. Install the Ascend Docker Runtime. For details, see Installing the Ascend Docker Runtime.
  3. Perform operations based on the component deployment mode.
    • If the components are deployed in containers, create images for the components by referring to Creating an Image.
    • If the Ascend Device Plugin and NPU-Exporter are deployed using binary files, skip this step.
  4. Create a user on the nodes where the components are deployed. For details, see Creating a User Account.
  5. Create a log directory on the nodes where the components are deployed. For details, see Creating a Log Directory.
  6. (Optional) Import the certificate that the NPU-Exporter requires to start its HTTPS service. For details, see Importing a Certificate and KubeConfig File. If the NPU-Exporter is started with the HTTP service instead, skip this step.
  7. (Optional) Import the KubeConfig file that the HCCL-Controller, NodeD, Resilience-Controller, and Ascend Device Plugin use to connect to Kubernetes. For details, see Importing a Certificate and KubeConfig File. If a ServiceAccount is used instead, skip this step.
  8. Create a namespace in Kubernetes. For details, see Creating a Namespace.
  9. Label the nodes. For details, see Creating a Node Label.
  10. Start each component. For details, see Common Operations.
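
Which steps of the procedure above apply depends on three choices: the deployment mode (step 3), whether the NPU-Exporter serves HTTPS (step 6), and whether the components use a ServiceAccount or a KubeConfig file (step 7). The Python sketch below shows one way to turn those choices into a checklist. It is a simplification, not an official tool: the DeployOptions class, its field names, and its default values are assumptions made for this example, and the deployment mode is reduced to a single flag even though only the Ascend Device Plugin and NPU-Exporter are described as deployable from binary files.

```python
from dataclasses import dataclass


@dataclass
class DeployOptions:
    """HYPOTHETICAL option set for this sketch; defaults are assumptions."""
    containerized: bool = True        # False: components are deployed using binary files
    exporter_https: bool = False      # True: NPU-Exporter starts the HTTPS service
    use_service_account: bool = True  # False: components connect with a KubeConfig file


def installation_steps(opts):
    """Return the steps of the installation procedure that apply to `opts`."""
    steps = [
        "Obtain the software packages",
        "Install the Ascend Docker Runtime",
    ]
    if opts.containerized:
        # Step 3: images are only needed for container deployment.
        steps.append("Create images for the components")
    steps += ["Create a user account", "Create a log directory"]
    if opts.exporter_https:
        # Step 6: skipped when the HTTP service is used.
        steps.append("Import the NPU-Exporter certificate")
    if not opts.use_service_account:
        # Step 7: skipped when a ServiceAccount is used.
        steps.append("Import the KubeConfig file")
    steps += ["Create a namespace", "Label the nodes", "Start each component"]
    return steps


if __name__ == "__main__":
    for number, step in enumerate(installation_steps(DeployOptions()), start=1):
        print(f"{number}. {step}")
```

The printed checklist is renumbered after skipped steps are removed, so its numbers do not necessarily match the step numbers in the procedure above.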