Training

The components installed in training scenarios correspond to those in installation and deployment scenarios.

Table 1 Training job scenario

Installation Scenario

Component

Job Type

Operations to Be Performed in a Deployment Job

Description

Full deployment scenario

Ascend Docker Runtime

Ascend Device Plugin

Volcano

HCCL-Controller

NodeD

NPU-Exporter

Resilience-Controller

Elastic-Agent

  • vcjob (Volcano's custom resource object)
  • Deployment (Kubernetes' resource object)

Familiarize yourself with the basic process described in NPU Training Job. Then, try the subsequent sections, for example, delivering NPU training jobs using the CLI or by programming. Finally, integrate advanced features.

For details about the functions provided in each scenario, see the description during installation and deployment.

Cluster scheduling scenario

Ascend Docker Runtime

Ascend Device Plugin

Volcano

(Optional) HCCL-Controller

(Optional) NodeD

(Optional) NPU-Exporter

Device management scenario

Ascend Docker Runtime

Ascend Device Plugin (The startup parameter volcanoType is set to false.)

(Optional) NPU-Exporter

  • Job (Kubernetes' resource object)
  • Deployment (Kubernetes' resource object)
  • Other resource types