Kubernetes Identifier Description

Table 1 Node labels used in cluster scheduling components

| Node Label | Description | Value | Required Component |
| --- | --- | --- | --- |
| accelerator | Processor of a node. | huawei-Ascend910, huawei-Ascend310, and huawei-Ascend310P | Ascend Device Plugin |
| host-arch | CPU architecture of a node. | huawei-x86 and huawei-arm | Volcano |
| masterselector | Master node of MindX DL. | dls-master-node | Volcano, HCCL-Controller, and Resilience-Controller |
| nodeDEnable | Whether to enable NodeD. | on | Volcano and Resilience-Controller |
| workerselector | Worker node of MindX DL. | dls-worker-node | Ascend Device Plugin, NodeD, and NPU-Exporter |
| accelerator-type | Type of the Atlas training server. | card, module, and half | Ascend Device Plugin and Volcano |
| servertype | Atlas 200I SoC A1 core board. | soc | Volcano |
| huawei.com/Ascend910-Recover | Faulty Ascend 910 AI Processor. | ID of the faulty processor | Ascend Device Plugin |
| huawei.com/Ascend910-NetworkUnhealthyRecover | Network of an Ascend 910 AI Processor recovers. | ID of the faulty processor | Ascend Device Plugin |
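
The fixed values in Table 1 lend themselves to a simple pre-flight check before a node is labeled. The following is a minimal sketch, not part of MindX DL: the function name `check_node_labels` and the sample node are hypothetical, and the allowed values are copied from Table 1. Labels whose value is a processor ID are left unchecked because the table does not give their format.

```python
# Allowed values copied from Table 1. The huawei.com/Ascend910-* labels carry
# a processor ID whose format the table does not specify, so they are skipped.
ALLOWED_NODE_LABEL_VALUES = {
    "accelerator": {"huawei-Ascend910", "huawei-Ascend310", "huawei-Ascend310P"},
    "host-arch": {"huawei-x86", "huawei-arm"},
    "masterselector": {"dls-master-node"},
    "nodeDEnable": {"on"},
    "workerselector": {"dls-worker-node"},
    "accelerator-type": {"card", "module", "half"},
    "servertype": {"soc"},
}


def check_node_labels(labels):
    """Return a list of problems; an empty list means none were found.

    Hypothetical helper: labels not listed in Table 1 are ignored.
    """
    problems = []
    for key, value in labels.items():
        allowed = ALLOWED_NODE_LABEL_VALUES.get(key)
        if allowed is not None and value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    return problems


# Hypothetical worker node used only for illustration.
worker = {
    "accelerator": "huawei-Ascend910",
    "host-arch": "huawei-arm",
    "workerselector": "dls-worker-node",
    "accelerator-type": "module",
}
print(check_node_labels(worker))                   # prints []
print(check_node_labels({"host-arch": "x86_64"}))  # reports one problem
```
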

Table 2 Node annotations used in cluster scheduling components

| Node Annotation | Description | Value | Required Component |
| --- | --- | --- | --- |
| noded/heartbeat | NodeD heartbeat that indicates whether a node is healthy. | string | Volcano, NodeD, and Resilience-Controller |
| noded/heartbeat-interval | NodeD heartbeat interval. | string | Volcano, NodeD, and Resilience-Controller |
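
A consumer of the two annotations in Table 2 could combine them to decide whether a heartbeat is stale. The sketch below is illustrative only. Table 2 states only that both values are strings, so the encoding used here (heartbeat as a Unix timestamp in seconds, interval as a number of seconds) is an assumption, as are the function name `heartbeat_is_fresh` and the `missed_beats` tolerance.

```python
def heartbeat_is_fresh(annotations, now, missed_beats=3):
    """Return True if the last heartbeat is recent enough.

    ASSUMPTION: "noded/heartbeat" holds a Unix timestamp in seconds and
    "noded/heartbeat-interval" holds a number of seconds, both as decimal
    strings. The source table does not confirm this encoding.
    """
    try:
        last = float(annotations["noded/heartbeat"])
        interval = float(annotations["noded/heartbeat-interval"])
    except (KeyError, ValueError):
        # A missing or unparsable annotation is treated as "not fresh".
        return False
    # Tolerate up to `missed_beats` intervals without a new heartbeat.
    return now - last <= interval * missed_beats


# Hypothetical annotation values for illustration.
node_annotations = {"noded/heartbeat": "1000", "noded/heartbeat-interval": "5"}
print(heartbeat_is_fresh(node_annotations, now=1010.0))
print(heartbeat_is_fresh(node_annotations, now=1016.0))
```

Passing `now` explicitly, rather than reading the clock inside the function, keeps the check deterministic and easy to test.
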

Table 3 Pod labels used in cluster scheduling components

| Pod Label | Description | Value | Required Component |
| --- | --- | --- | --- |
| ring-controller.atlas | An Atlas pod. | ascend-910 | Ascend Device Plugin and HCCL-Controller |
| fault-scheduling | Whether to enable fault rescheduling. | grace, force, and off | Volcano and Resilience-Controller |
| elastic-scheduling | Whether to enable job elastic scheduling. | on | Resilience-Controller |
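
The pod labels in Table 3 can be assembled programmatically before they are written into a job definition. The following is a minimal sketch: the function name `build_pod_labels` is hypothetical, and omitting `elastic-scheduling` to leave the feature disabled is an assumption, since Table 3 lists `on` as the only value.

```python
# Values for fault-scheduling as listed in Table 3.
FAULT_SCHEDULING_VALUES = ("grace", "force", "off")


def build_pod_labels(fault_scheduling="off", elastic=False):
    """Build the Table 3 pod labels as a plain dict (hypothetical helper)."""
    if fault_scheduling not in FAULT_SCHEDULING_VALUES:
        raise ValueError(
            f"fault-scheduling must be one of {FAULT_SCHEDULING_VALUES}, "
            f"got {fault_scheduling!r}"
        )
    labels = {
        "ring-controller.atlas": "ascend-910",
        "fault-scheduling": fault_scheduling,
    }
    if elastic:
        # ASSUMPTION: the label is simply left out when elastic scheduling
        # is not wanted; the table documents no "off" value for it.
        labels["elastic-scheduling"] = "on"
    return labels


print(build_pod_labels("grace", elastic=True))
```
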

Table 4 Pod annotations used in cluster scheduling components

| Pod Annotation | Description | Value | Required Component |
| --- | --- | --- | --- |
| ascend.kubectl.kubernetes.io/ascend-910-configuration | Data source of the hccl.json file generated by HCCL-Controller. | String in map format | Ascend Device Plugin and HCCL-Controller |
| hccl/rankIndex | Basis for retaining the original rank ID during resumable training. | [0,1000] | Volcano and HCCL-Controller |
| huawei.com/Ascend910 | Basis for Ascend Device Plugin to allocate processors to pods. | String | Volcano and Ascend Device Plugin |
| huawei.com/AscendReal | Records of the processors allocated to pods by Ascend Device Plugin. | String | Volcano and Ascend Device Plugin |
| huawei.com/kltDev | Records of the processors allocated to pods by kubelet. | String | Ascend Device Plugin |
| predicate-time | Sequence for Ascend Device Plugin to allocate processors to pods. | String | Volcano and Ascend Device Plugin |
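
Kubernetes stores every annotation value as a string, so a reader of `hccl/rankIndex` has to convert the value and check it against the [0,1000] range given in Table 4. The sketch below shows one way to do that. The function name `parse_rank_index` is hypothetical, and returning `None` when the annotation is absent is a design choice made for this example.

```python
def parse_rank_index(annotations):
    """Return the rank index as an int, or None if the annotation is absent.

    Hypothetical helper. The [0, 1000] bounds come from Table 4.
    """
    raw = annotations.get("hccl/rankIndex")
    if raw is None:
        return None
    rank = int(raw)  # raises ValueError if the text is not an integer
    if not 0 <= rank <= 1000:
        raise ValueError(f"hccl/rankIndex {rank} is outside [0, 1000]")
    return rank


print(parse_rank_index({"hccl/rankIndex": "7"}))  # prints 7
```
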

Table 5 ConfigMaps used in cluster scheduling components

| ConfigMap | Namespace | Description | Required Component |
| --- | --- | --- | --- |
| vcjob-fault-npu-cm | volcano-system | Fixes fault rescheduling content. | Volcano |
| volcano-scheduler-configmap | volcano-system | Volcano-Scheduler configuration file (native). | Volcano |
| mindx-dl-deviceinfo-{node name} | kube-system | Processor information on a node reported by Ascend Device Plugin. | Volcano, Ascend Device Plugin, and Resilience-Controller |
| fault-config-{job name} | Namespace of the job | Information about faulty rank IDs required for resumable training. | Volcano and Elastic-Agent |
| rings-config-{job name} | Namespace of the job | hccl.json content. | Ascend Device Plugin and HCCL-Controller |
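
Three of the ConfigMaps in Table 5 have names derived from a node or job name. The following is a minimal sketch of how those names might be composed. It rests on several assumptions: the prefix and the node or job name are joined with a hyphen, the prefixes are written in lowercase (Kubernetes object names must be lowercase), and "job space" means the namespace of the job. The function names and the sample node and job names are hypothetical.

```python
def deviceinfo_configmap(node_name):
    """Return (namespace, name) of the per-node device information ConfigMap."""
    # ASSUMPTION: name format is "mindx-dl-deviceinfo-" plus the node name.
    return ("kube-system", f"mindx-dl-deviceinfo-{node_name}")


def job_configmaps(job_name, job_namespace):
    """Return (namespace, name) pairs of the per-job ConfigMaps in Table 5."""
    # ASSUMPTION: lowercase prefixes joined to the job name with a hyphen,
    # created in the namespace of the job itself.
    return {
        "fault": (job_namespace, f"fault-config-{job_name}"),
        "rings": (job_namespace, f"rings-config-{job_name}"),
    }


# Hypothetical node and job names for illustration.
print(deviceinfo_configmap("worker-1"))
print(job_configmaps("train-job", "default"))
```
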