Full Deployment Scenario
Scenario
In this scenario, you already have a Kubernetes cluster and want to use all features of the cluster scheduling components: NPU Device Management, NPU Scheduling Optimization, Resumable Training (dying gasp), Rescheduling Upon Inference Card Faults, and Minimum Service System. To do so, deploy all cluster scheduling components in the existing Kubernetes cluster.
List of Components to Be Installed
| Component | Function |
|---|---|
| Ascend Docker Runtime | Allows containers to use Ascend NPUs. |
| Ascend Device Plugin | Supports NPU device management. |
| Volcano | Supports NPU scheduling optimization, resumable training, and rescheduling upon inference card faults. |
| HCCL-Controller | Generates the ranktable file (also called the hccl.json file) for NPU training jobs. |
| NodeD | Supports resumable training (upon node fault). |
| NPU-Exporter | Supports monitoring of NPU device management status. |
| Resilience-Controller | Supports the minimum service system. |
| Elastic-Agent | Supports the dying gasp function of resumable training. |
Component Deployment Modes
- For details about the component installation positions, see Component Installation Positions.
- Both the Ascend Device Plugin and NPU-Exporter support container deployment and binary deployment. For details about the differences, see Differences Between Container and Binary Deployment.
- The HCCL-Controller, NodeD, and Resilience-Controller are deployed in containers. They can authenticate to Kubernetes using either a ServiceAccount or a KubeConfig file. For details about the differences between the two modes, see Differences Between ServiceAccount and KubeConfig.
- The NPU-Exporter can provide the HTTP or HTTPS service during startup. For details about the differences between the two services, see Differences Between HTTP and HTTPS.
- The Ascend Docker Runtime is installed using the .run package. For details about how to obtain the .run package, see Software Package Description.
- Volcano is deployed in containers.
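The options above can be captured in a small lookup table so that a chosen configuration is checked before installation begins. The following is only a minimal sketch: the mode labels (`"container"`, `"binary"`, `"run-package"`) and the function name `validate_modes` are hypothetical names chosen for illustration and are not part of any cluster scheduling tool. Elastic-Agent is omitted because the list above does not state its deployment mode.

```python
# Deployment options per component, transcribed from the list above.
# The mode labels are illustrative names, not product terminology.
DEPLOYMENT_MODES = {
    "Ascend Docker Runtime": {"run-package"},
    "Ascend Device Plugin": {"container", "binary"},
    "NPU-Exporter": {"container", "binary"},
    "HCCL-Controller": {"container"},
    "NodeD": {"container"},
    "Resilience-Controller": {"container"},
    "Volcano": {"container"},
}


def validate_modes(chosen):
    """Return a list of error messages; an empty list means the choice is valid."""
    errors = []
    for component, allowed in DEPLOYMENT_MODES.items():
        mode = chosen.get(component)
        if mode is None:
            errors.append(f"{component}: no deployment mode chosen")
        elif mode not in allowed:
            errors.append(
                f"{component}: '{mode}' is not supported (allowed: {sorted(allowed)})"
            )
    return errors
```

For example, choosing `"binary"` for NodeD would be reported as an error, because NodeD is listed as container-only.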
Installation Procedure
- Obtain the software packages for installing the components. For details, see Software Package Description.
- Install the Ascend Docker Runtime. For details, see Installing the Ascend Docker Runtime.
- Prepare component images according to the deployment mode of each component.
  - For components that are deployed in containers, create images by referring to Creating an Image.
  - For the Ascend Device Plugin and NPU-Exporter, skip this step if they are deployed using binary files.
- Create a user on the nodes where the components are deployed. For details, see Creating a User Account.
- Create a log directory on the nodes where the components are deployed. For details, see Creating a Log Directory.
- (Optional) Import the certificate required by the NPU-Exporter to start the HTTPS service. For details, see Importing a Certificate and KubeConfig File. If the NPU-Exporter provides the HTTP service instead, skip this step.
- (Optional) Import the KubeConfig file that the HCCL-Controller, NodeD, Resilience-Controller, and Ascend Device Plugin use to connect to Kubernetes. For details, see Importing a Certificate and KubeConfig File. If the ServiceAccount is used, skip this step.
- Create a namespace in Kubernetes. For details, see Creating a Namespace.
- Label the nodes. For details, see Creating a Node Label.
- Start each component. For details, see Common Operations.
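Which steps in the procedure actually apply depends on three choices: the deployment mode of the Ascend Device Plugin and NPU-Exporter, the service type of the NPU-Exporter, and the Kubernetes authentication mode. The sketch below derives a step checklist from those choices. It is an illustration only: the `Choices` class, its field names, and the `plan` function are hypothetical, and the assumption that the container-only components always need images is inferred from the procedure rather than stated by it.

```python
from dataclasses import dataclass


@dataclass
class Choices:
    device_plugin_mode: str = "container"   # "container" or "binary"
    npu_exporter_mode: str = "container"    # "container" or "binary"
    npu_exporter_service: str = "https"     # "http" or "https"
    k8s_auth: str = "serviceaccount"        # "serviceaccount" or "kubeconfig"


# Components that the procedure describes as container-only (assumed to always need images).
ALWAYS_CONTAINER = ["Volcano", "HCCL-Controller", "NodeD", "Resilience-Controller"]


def images_to_build(c):
    """List the components that need an image for the given choices."""
    images = list(ALWAYS_CONTAINER)
    if c.device_plugin_mode == "container":
        images.append("Ascend Device Plugin")
    if c.npu_exporter_mode == "container":
        images.append("NPU-Exporter")
    return images


def plan(c):
    """Return the ordered installation steps that apply to the given choices."""
    steps = ["Obtain the software packages", "Install the Ascend Docker Runtime"]
    steps.append("Create images for: " + ", ".join(images_to_build(c)))
    steps += ["Create a user account", "Create a log directory"]
    if c.npu_exporter_service == "https":
        steps.append("Import the NPU-Exporter certificate")
    if c.k8s_auth == "kubeconfig":
        steps.append("Import the KubeConfig file")
    steps += ["Create a namespace", "Label the nodes", "Start each component"]
    return steps
```

Calling `plan(Choices(npu_exporter_service="http"))` yields a checklist without the certificate import step, which matches the skip condition in the procedure.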