Kubernetes Identifier Description

| Node Label | Description | Value | Required Component |
|---|---|---|---|
| accelerator | Processor of a node. | huawei-Ascend910, huawei-Ascend310, and huawei-Ascend310P | Ascend Device Plugin |
| host-arch | CPU architecture of a node. | huawei-x86 and huawei-arm | Volcano |
| masterselector | Master node of MindX DL. | dls-master-node | Volcano, HCCL-Controller, and Resilience-Controller |
| nodeDEnable | Whether to enable NodeD. | on | Volcano and Resilience-Controller |
| workerselector | Worker node of MindX DL. | dls-worker-node | Ascend Device Plugin, NodeD, and NPU-Exporter |
| accelerator-type | Type of the Atlas training server. | card, module, and half | Ascend Device Plugin and Volcano |
| servertype | Atlas 200I SoC A1 core board. | soc | Volcano |
| huawei.com/Ascend910-Recover | Faulty Ascend 910 AI Processor. | ID of the faulty processor | Ascend Device Plugin |
| huawei.com/Ascend910-NetworkUnhealthyRecover | Network of an Ascend 910 AI Processor recovers. | ID of the faulty processor | Ascend Device Plugin |
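
The fixed-value node labels listed above lend themselves to a simple pre-flight check. The following is a minimal sketch, not part of MindX DL: the names `ALLOWED_NODE_LABELS` and `check_node_labels` are hypothetical, and the allowed values are copied from the Value column. Labels whose value is a processor ID (the two `huawei.com/...Recover` labels) are deliberately left unchecked.

```python
# Allowed values copied from the node label table.
# ALLOWED_NODE_LABELS and check_node_labels are illustrative names, not MindX DL APIs.
ALLOWED_NODE_LABELS = {
    "accelerator": {"huawei-Ascend910", "huawei-Ascend310", "huawei-Ascend310P"},
    "host-arch": {"huawei-x86", "huawei-arm"},
    "masterselector": {"dls-master-node"},
    "nodeDEnable": {"on"},
    "workerselector": {"dls-worker-node"},
    "accelerator-type": {"card", "module", "half"},
    "servertype": {"soc"},
}


def check_node_labels(labels):
    """Return a list of problems; an empty list means no listed label has an unexpected value."""
    problems = []
    for key, value in labels.items():
        allowed = ALLOWED_NODE_LABELS.get(key)
        # Labels not present in the table (or with free-form values) are ignored.
        if allowed is not None and value not in allowed:
            problems.append(f"{key}={value!r} is not one of {sorted(allowed)}")
    return problems


if __name__ == "__main__":
    node = {"accelerator": "huawei-Ascend910", "host-arch": "x86"}
    for problem in check_node_labels(node):
        print(problem)
```

The dictionary passed in would typically come from a node's `metadata.labels`; how it is fetched is left out of the sketch.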

| Node Annotation | Description | Value | Required Component |
|---|---|---|---|
| noded/heartbeat | NodeD heartbeat that indicates whether a node is healthy. | string | Volcano, NodeD, and Resilience-Controller |
| noded/heartbeat-interval | NodeD heartbeat interval. | string | Volcano, NodeD, and Resilience-Controller |

| Package | Function | Value | Required Component |
|---|---|---|---|
| ring-controller.atlas | An Atlas pod. | ascend-910 | Ascend Device Plugin and HCCL-Controller |
| fault-scheduling | Whether to enable fault rescheduling. | grace, force, and off | Volcano and Resilience-Controller |
| elastic-scheduling | Whether to enable job elastic scheduling. | on | Resilience-Controller |

| Name | Function | Value | Required Component |
|---|---|---|---|
| ascend.kubectl.kubernetes.io/ascend-910-configuration | Data source of hccl.json generated by HCCL-Controller. | String in MAP format | Ascend Device Plugin and HCCL-Controller |
| hccl/rankIndex | Basis for retaining the original rank ID during resumable training. | [0,1000] | Volcano and HCCL-Controller |
| huawei.com/Ascend910 | Basis for Ascend Device Plugin to allocate processors to pods. | String | Volcano and Ascend Device Plugin |
| huawei.com/AscendReal | Records of the processors allocated to pods by Ascend Device Plugin. | String | Volcano and Ascend Device Plugin |
| huawei.com/kltDev | Records of the processors allocated to pods by kubelet. | String | Ascend Device Plugin |
| predicate-time | Sequence for Ascend Device Plugin to allocate processors to pods. | String | Volcano and Ascend Device Plugin |
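
Of the entries above, only `hccl/rankIndex` has a numeric range. A minimal sketch of a range check follows, assuming the value is stored as a decimal integer string and that the range [0,1000] is inclusive at both ends. The function name `is_valid_rank_index` is hypothetical.

```python
def is_valid_rank_index(raw):
    """Check a hccl/rankIndex value against the documented range [0, 1000].

    Assumption: the value is a plain decimal integer string, and both
    bounds are inclusive.
    """
    text = raw.strip()
    # Accept digits only, so signs, decimals and empty strings are rejected.
    if not text.isdigit():
        return False
    return 0 <= int(text) <= 1000


if __name__ == "__main__":
    for candidate in ("0", "1000", "1001", "-1", "abc"):
        print(candidate, is_valid_rank_index(candidate))
```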

| ConfigMap | Namespace | Description | Required Component |
|---|---|---|---|
| vcjob-fault-npu-cm | volcano-system | Fixes fault rescheduling content. | Volcano |
| volcano-scheduler-configmap | volcano-system | Volcano-Scheduler configuration file (native). | Volcano |
| `mindx-dl-deviceinfo-<node name>` | kube-system | Processor information on a node reported by Ascend Device Plugin. | Volcano, Ascend Device Plugin, and Resilience-Controller |
| `fault-config-<job name>` | Namespace of the job | Information about faulty rank IDs required for resumable training. | Volcano and Elastic-Agent |
| `rings-config-<job name>` | Namespace of the job | hccl.json content. | Ascend Device Plugin and HCCL-Controller |
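
Three of the ConfigMap names above end in a node name or job name. The sketch below shows how such names could be assembled, under two assumptions: the trailing "node name" and "job name" are placeholders substituted at run time, and the prefixes are lowercase (Kubernetes object names must be lowercase). The helper `configmap_refs` is hypothetical and not part of MindX DL.

```python
def configmap_refs(node_name, job_name, job_namespace):
    """Return (namespace, name) pairs for the per-node and per-job ConfigMaps.

    Assumptions: the suffix is the literal node or job name, and the
    prefixes are lowercase.
    """
    return {
        # Processor information reported by Ascend Device Plugin.
        "device_info": ("kube-system", f"mindx-dl-deviceinfo-{node_name}"),
        # Faulty rank IDs used for resumable training.
        "fault_config": (job_namespace, f"fault-config-{job_name}"),
        # hccl.json content.
        "rings_config": (job_namespace, f"rings-config-{job_name}"),
    }


if __name__ == "__main__":
    refs = configmap_refs("worker-1", "resnet50", "vcjob")
    for purpose, (namespace, name) in refs.items():
        print(f"{purpose}: {namespace}/{name}")
```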