Native Kubernetes Objects

Service Labels

**Table 1** Service labels used in cluster scheduling components
Label	Description	Value	Required Component
group-name	Group name of the acjob corresponding to a pod	mindxdl.gitee.com	Volcano, Ascend Operator
job-name	Name of the acjob corresponding to a pod	String	Ascend Operator
replica-index	Index of a pod (to be deleted later)	[0 – {Number of pods - 1}]	Ascend Operator
replica-type	Type of a pod (to be deleted later)	master, chief, scheduler, worker	Ascend Operator
training.kubeflow.org/job-name	Name of the acjob corresponding to a pod	String	Ascend Operator
training.kubeflow.org/operator-name	Name of the operator that creates a pod	ascendjob-controller	Ascend Operator
training.kubeflow.org/replica-index	Index of a pod	[0 – {Number of pods - 1}]	Ascend Operator
training.kubeflow.org/replica-type	Type of a pod	master, chief, scheduler, worker	Ascend Operator

Job Labels

**Table 2** Job labels used in cluster scheduling components
Label	Description	Value	Required Component
mind-cluster/scaling-rule: scaling-rule	Name of the ConfigMap of the scaling rule	String	Ascend Operator
mind-cluster/group-name: group0	Name of the group of the scaling rule	String	Ascend Operator

Node Labels

**Table 3** Node labels used in cluster scheduling components
Label	Description	Value	Required Component
accelerator	Processor of a node	huawei-Ascend910, huawei-Ascend310, huawei-Ascend310P	Ascend Device Plugin
host-arch	CPU architecture of a node	huawei-x86, huawei-arm	Volcano
masterselector	Management node of MindCluster	dls-master-node	Volcano, Ascend Operator, Resilience Controller, ClusterD
node.kubernetes.io/npu.chip.name	Specific type of the current processor	310, 310P1, 310P2, 310P3, 310P4, {xxx}A, 910PremiumA, 910ProA, 910ProB, {xxx}Bx (x can be 1, 2, 3, or 4)	Ascend Device Plugin NOTE: {xxx} is the number in the chip model name. To query it, run the npu-smi info command and check the Name field in the output. For example, if the Name field shows a 910-series model, the value of {xxx} is 910.
nodeDEnable	Whether to enable NodeD	on	Volcano, Resilience Controller NOTE: nodeDEnable=on enables the node status monitoring function of NodeD, which obtains node status information and determines whether a node is faulty. If the value is off or the label is absent, only node information is reported and no fault determination is made. This label is not required when only the containerization or resource monitoring function is used. It is mandatory for all other features.
workerselector	Compute node of MindCluster	dls-worker-node	Ascend Device Plugin, NodeD, NPU Exporter
accelerator-type	Type of Atlas servers	card, module, half, module-{xxx}b-8, module-{xxx}b-16, card-{xxx}b-2, card-{xxx}b-infer, module-a3-16, module-a3-16-super-pod	Ascend Device Plugin, Volcano
servertype	Type of an Atlas 200I SoC A1 core board	soc, Ascend910-{Number of AI Cores}, Ascend310P-{Number of AI Cores}	Volcano, Ascend Device Plugin
huawei.com/Ascend910-Recover	Fault identifier of Atlas training product	ID of the faulty processor	Ascend Device Plugin
huawei.com/Ascend910-NetworkRecover	Network fault identifier of Atlas training product	ID of the faulty processor	Ascend Device Plugin
infer-card-type	Inference card type of a node, which is written by Ascend Device Plugin	card-300i-duo	Volcano
mind-cluster/npu-chip-memory	On-chip memory	mind-cluster/npu-chip-memory=64G	Volcano, Ascend Device Plugin
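
Several node labels in Table 3 accept only a fixed set of values, so a typo in a label value can be caught before the node is labeled. The following is a minimal sketch of such a check, not part of MindCluster: the function name `check_node_labels` and the constant `ALLOWED_NODE_LABEL_VALUES` are hypothetical, and only labels whose complete value set is listed in the table are covered.

```python
from __future__ import annotations

# Allowed values copied from Table 3. Labels with open-ended or templated
# values (for example accelerator-type) are intentionally left out.
ALLOWED_NODE_LABEL_VALUES = {
    "accelerator": {"huawei-Ascend910", "huawei-Ascend310", "huawei-Ascend310P"},
    "host-arch": {"huawei-x86", "huawei-arm"},
    "masterselector": {"dls-master-node"},
    "workerselector": {"dls-worker-node"},
}


def check_node_labels(labels: dict[str, str]) -> list[str]:
    """Return one message per label whose value is not listed in Table 3."""
    problems = []
    for key, allowed in ALLOWED_NODE_LABEL_VALUES.items():
        if key in labels and labels[key] not in allowed:
            problems.append(f"{key}={labels[key]!r} not in {sorted(allowed)}")
    return problems


if __name__ == "__main__":
    # Hypothetical label set with a mistyped host-arch value.
    node_labels = {
        "accelerator": "huawei-Ascend910",
        "host-arch": "huawei-x64",
        "workerselector": "dls-worker-node",
    }
    for problem in check_node_labels(node_labels):
        print(problem)
```

Labels that are absent from the input are not reported, because which labels are required depends on the components deployed on the node.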

Pod Labels

**Table 4** Pod labels used in cluster scheduling components
Label	Description	Value	Required Component
ring-controller.atlas	Identifies the Atlas product that a pod uses	ascend-910, ascend-{xxx}b	Ascend Device Plugin
vnpu-dvpp	DVPP setting of a pod	yes: The pod uses DVPP. no: The pod does not use DVPP. null (default): Whether DVPP is used is not considered.	Volcano
vnpu-level	Level of the selected virtual instance template	low (default): low configuration. high: performance first.	Volcano
version	Version of a pod	String	Ascend Operator
volcano.sh/job-name	Name of the vcjob corresponding to a pod	String	Volcano
volcano.sh/job-namespace	vcjob namespace corresponding to a pod	String	Volcano
volcano.sh/queue-name	Queue name corresponding to a pod	String	Volcano
volcano.sh/task-spec	Name of the task corresponding to a pod	String	Volcano
fault-type	Pod fault handling policy	SubHealth Separate	Volcano
deploy-name	Name of the deployment corresponding to a pod	String	Ascend Operator
group-name	Group name of the acjob corresponding to a pod	mindxdl.gitee.com	Volcano, Ascend Operator
job-name	Name of the acjob corresponding to a pod	String	Ascend Operator
replica-index	Index of a pod (to be deleted later)	[0 – {Number of pods - 1}]	Ascend Operator
replica-type	Type of a pod (to be deleted later)	master, chief, scheduler, worker	Ascend Operator
training.kubeflow.org/job-name	Name of the acjob corresponding to a pod	String	Ascend Operator
training.kubeflow.org/job-role	Type of a pod	master	Ascend Operator
training.kubeflow.org/operator-name	Name of the operator that creates a pod	ascendjob-controller	Ascend Operator
training.kubeflow.org/replica-index	Index of a pod	[0 – {Number of pods - 1}]	Ascend Operator
training.kubeflow.org/replica-type	Type of a pod	master, chief, scheduler, worker	Ascend Operator
super-pod-affinity	SuperPoD affinity scheduling policy	soft, hard	Ascend Operator, Volcano
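
The pod labels in Table 4 can be combined into a standard Kubernetes equality-based label selector to find the pods of one acjob. The sketch below only builds the selector string. The function name `acjob_pod_selector` and the job name used in the example are hypothetical, and the replica types are the values listed in the table.

```python
from typing import Optional

# Replica types listed for training.kubeflow.org/replica-type in Table 4.
REPLICA_TYPES = ("master", "chief", "scheduler", "worker")


def acjob_pod_selector(job_name: str, replica_type: Optional[str] = None) -> str:
    """Build an equality-based label selector for the pods of one acjob."""
    if not job_name:
        raise ValueError("job_name must not be empty")
    parts = [f"training.kubeflow.org/job-name={job_name}"]
    if replica_type is not None:
        if replica_type not in REPLICA_TYPES:
            raise ValueError(f"unknown replica type: {replica_type}")
        parts.append(f"training.kubeflow.org/replica-type={replica_type}")
    # Requirements joined by commas must all match.
    return ",".join(parts)


if __name__ == "__main__":
    # "demo-job" is a placeholder job name.
    print(acjob_pod_selector("demo-job", "worker"))
```

The resulting string can be passed to `kubectl get pods -l "<selector>"` or to the label selector parameter of a Kubernetes client library.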

Pod Annotations

**Table 5** Pod annotations used in cluster scheduling components
Annotation	Description	Value	Required Component
ascend.kubectl.kubernetes.io/ascend-910-configuration	Data source of hccl.json generated by Ascend Operator	String in MAP format	Ascend Device Plugin, Ascend Operator
super_pod_id	SuperPoD ID information provided to Ascend Operator	Number	Ascend Operator
hccl/rankIndex	Basis for retaining the original rank ID during resumable training	[0, 1000]	Volcano, Ascend Operator
distributed-job	Type of training jobs	true: distributed job. false: single-server job.	Volcano
huawei.com/Ascend910	Principles for Ascend Device Plugin to allocate processors to pods	String	Volcano, Ascend Device Plugin
huawei.com/AscendReal	Record of the processors allocated to pods by Ascend Device Plugin	String	Volcano, Ascend Device Plugin
huawei.com/npu-core	Physical ID of an NPU and its virtualization template used by pods	String	Volcano, Ascend Device Plugin
huawei.com/kltDev	Record of the processors allocated to pods by kubelet	String	Ascend Device Plugin
huawei.com/recover_policy_path	Rescheduling policy	pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported.	Volcano
huawei.com/schedule_minAvailable	Minimum number of replicas that can be scheduled by a job	Integer	Volcano
predicate-time	Sequential principles for Ascend Device Plugin to allocate processors to pods	String	Volcano, Ascend Device Plugin
isSharedTor	Switch attribute corresponding to a pod	Integer	Volcano
isHealthy	Status of the switch corresponding to a pod	Integer	Volcano
scheduling.k8s.io/group-name	Name of the podGroup corresponding to a pod	String	Volcano
volcano.sh/job-name	Name of the vcjob corresponding to a pod	String	Volcano
volcano.sh/job-version	Version of the vcjob corresponding to a pod	String	Volcano
volcano.sh/queue-name	Name of the queue corresponding to a pod	String	Volcano
volcano.sh/task-spec	Name of the task corresponding to a pod	String	Volcano
volcano.sh/template-uid	Name of the pod-template corresponding to a pod	String	Volcano
sharedTorIp	Information about the shared switch used by a job	String	Volcano, ClusterD
fault-job-delete	Rank information of a job	String	Volcano
mind-cluster/hardware-type=800I-A2-xx	xx indicates the on-chip memory of the current node, for example, mind-cluster/hardware-type=800I-A2-64G.	String	Volcano
super-pod-rank	Logical SuperPoD rank	Number	Ascend Operator, Volcano
inHotSwitchFlow	Hot switching process of pods (faulty and backup pods)	true	ClusterD, Ascend Operator
backupNewPodName	Name of the backup pod started by the faulty pod	Name of the backup pod	ClusterD, Ascend Operator
backupSourcePodName	Name of the original pod corresponding to the backup pod	Name of the original pod	Ascend Operator
needOperatorOpe	Designation of the current pod to be processed by Ascend Operator	create: Ascend Operator needs to create a backup pod based on the current pod. delete: Ascend Operator needs to delete the current pod.	ClusterD, Ascend Operator
needVolcanoOpe	Designation of the current pod to be processed by Volcano	delete: Volcano needs to delete the current pod.	ClusterD, Volcano
podType	Backup pod designation	backup	ClusterD, Ascend Operator
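
Kubernetes stores annotation values as strings, so numeric and boolean annotations from Table 5 must be converted before use. The following sketch shows one way to read `hccl/rankIndex` and `distributed-job` from a pod's annotation map while enforcing the value ranges given in the table. The helper names `parse_rank_index` and `is_distributed_job` are hypothetical and do not belong to any MindCluster component.

```python
from typing import Mapping, Optional


def parse_rank_index(annotations: Mapping[str, str]) -> Optional[int]:
    """Return hccl/rankIndex as an int, or None if the annotation is absent."""
    raw = annotations.get("hccl/rankIndex")
    if raw is None:
        return None
    rank = int(raw)  # raises ValueError for non-numeric text
    if not 0 <= rank <= 1000:  # range given in Table 5
        raise ValueError(f"hccl/rankIndex out of range [0, 1000]: {rank}")
    return rank


def is_distributed_job(annotations: Mapping[str, str]) -> bool:
    """Return True only when distributed-job is set to the string 'true'."""
    return annotations.get("distributed-job") == "true"


if __name__ == "__main__":
    # Hypothetical annotation map of a pod.
    pod_annotations = {"hccl/rankIndex": "3", "distributed-job": "true"}
    print(parse_rank_index(pod_annotations), is_distributed_job(pod_annotations))
```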

Node Annotations

**Table 6** Node annotations used in cluster scheduling components
Annotation	Description	Value	Required Component
baseDeviceInfos	Basic processor information, such as the IP address, used for Volcano scheduling	String	Volcano
product-serial-number	Serial number (SN) of the node. NodeD obtains the SN through IPMI and writes it to this annotation. ClusterD uses it when receiving public faults.	String	ClusterD
superPodID	ID of the SuperPoD that a node belongs to	String	ClusterD
ResetInfo	Information, such as the physical ID and card ID, about a processor that Ascend Device Plugin fails to reset automatically	String	Ascend Device Plugin

Example of the ResetInfo content:

{
    "ThirdPartyResetDevs": [
        {
            "CardId": 0,
            "DeviceId": 0,
            "AssociatedCardId": 4,
            "PhyID": 0,
            "LogicID": 0
        }
    ],
    "ManualResetDevs": [
        {
            "CardId": 1,
            "DeviceId": 0,
            "AssociatedCardId": 5,
            "PhyID": 2,
            "LogicID": 2
        }
    ]
}
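
Because the ResetInfo annotation value is a JSON string, it can be decoded with any JSON parser. The sketch below groups the `PhyID` values by the top-level list they appear in, using the example above as input. The function name `phy_ids_by_category` is hypothetical, and the code assumes every entry carries the fields shown in the example.

```python
import json
from typing import Dict, List


def phy_ids_by_category(reset_info: str) -> Dict[str, List[int]]:
    """Map each top-level list in ResetInfo to the PhyID values it contains."""
    data = json.loads(reset_info)
    return {
        category: [dev["PhyID"] for dev in (devs or [])]
        for category, devs in data.items()
    }


if __name__ == "__main__":
    # Same content as the ResetInfo example above.
    example = """
    {
        "ThirdPartyResetDevs": [
            {"CardId": 0, "DeviceId": 0, "AssociatedCardId": 4, "PhyID": 0, "LogicID": 0}
        ],
        "ManualResetDevs": [
            {"CardId": 1, "DeviceId": 0, "AssociatedCardId": 5, "PhyID": 2, "LogicID": 2}
        ]
    }
    """
    print(phy_ids_by_category(example))
```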

Kubernetes Service Accounts

**Table 7** List of service accounts created by each component in Kubernetes
Account Name	Description
volcano-controllers	Created by the open source volcano-controller.
volcano-scheduler	Created by the open source volcano-scheduler.
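ascend-device-plugin-sa-910	If you use YAML to start the service, the account is created in Kubernetes. The account name varies depending on the device model.
ascend-device-plugin-sa-310p	If you use YAML to start the service, the account is created in Kubernetes. The account name varies depending on the device model.
ascend-device-plugin-sa-310	If you use YAML to start the service, the account is created in Kubernetes. The account name varies depending on the device model.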
ascend-operator-manager	If you use YAML to start the service, the account is created in Kubernetes, for example, ascend-operator-v{version}.yaml.
resilience-controller	If you use YAML without tokens to start the service, the account is created in Kubernetes, and proper permissions are granted to the account. Security hardening is recommended.
noded	If you use YAML to start the service, the account is created in Kubernetes, for example, noded-v{version}.yaml.
clusterd	If you use YAML to start the service, the account is created in Kubernetes, for example, clusterd-v{version}.yaml.
default	Automatically created in Kubernetes when MindCluster components or open-source Volcano is deployed.
