Native Kubernetes Objects
Service Labels
| Label | Description | Value | Required Component |
|---|---|---|---|
| group-name | Group name of the acjob corresponding to a pod | mindxdl.gitee.com | Volcano, Ascend Operator |
| job-name | Name of the acjob corresponding to a pod | String | Ascend Operator |
| replica-index | Index of a pod (label to be removed later) | [0, {Number of pods} - 1] | Ascend Operator |
| replica-type | Type of a pod (label to be removed later) | | Ascend Operator |
| training.kubeflow.org/job-name | Name of the acjob corresponding to a pod | String | Ascend Operator |
| training.kubeflow.org/operator-name | Name of the operator that creates a pod | ascendjob-controller | Ascend Operator |
| training.kubeflow.org/replica-index | Index of a pod | [0, {Number of pods} - 1] | Ascend Operator |
| training.kubeflow.org/replica-type | Type of a pod | | Ascend Operator |

Job Labels
| Label | Description | Value | Required Component |
|---|---|---|---|
| mind-cluster/scaling-rule: scaling-rule | Name of the ConfigMap of the scaling rule | String | Ascend Operator |
| mind-cluster/group-name: group0 | Name of the group of the scaling rule | String | Ascend Operator |

Node Labels
| Label | Description | Value | Required Component |
|---|---|---|---|
| accelerator | Processor of a node | huawei-Ascend910, huawei-Ascend310, huawei-Ascend310P | Ascend Device Plugin |
| host-arch | CPU architecture of a node | | Volcano |
| masterselector | Management node of MindCluster | dls-master-node | Volcano, Ascend Operator, Resilience Controller, ClusterD |
| node.kubernetes.io/npu.chip.name | Specific type of the current processor. NOTE: Run the npu-smi info command to query the number in the chip model name, which is shown in the Name field of the output. For example, the value of {xxx} can be 910. | | Ascend Device Plugin |
| nodeDEnable | Whether to enable NodeD | on | Volcano, Resilience Controller |
| workerselector | Compute node of MindCluster | dls-worker-node | Ascend Device Plugin, NodeD, NPU Exporter |
| accelerator-type | Type of Atlas servers | | Ascend Device Plugin, Volcano |
| servertype | Type of an Atlas 200I SoC A1 core board | | Volcano, Ascend Device Plugin |
| huawei.com/Ascend910-Recover | Fault identifier of Atlas training product | ID of the faulty processor | Ascend Device Plugin |
| huawei.com/Ascend910-NetworkRecover | Network fault identifier of Atlas training product | ID of the faulty processor | Ascend Device Plugin |
| infer-card-type | Inference card type of a node, which is written by Ascend Device Plugin | card-300i-duo | Volcano |
| mind-cluster/npu-chip-memory | On-chip memory | mind-cluster/npu-chip-memory=64G | Volcano, Ascend Device Plugin |

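As an illustration of how the node labels above might be consumed, the following minimal sketch filters an in-memory mapping of node names to labels for compute nodes with a given processor and a minimum on-chip memory. The function names `parse_chip_memory_gb` and `select_worker_nodes` are hypothetical, and the sketch assumes that the memory label value has the form shown in the table (for example, 64G). It does not call the Kubernetes API.

```python
from typing import Dict, List

# Label keys and values taken from the node label table.
WORKER_LABEL = "workerselector"
WORKER_VALUE = "dls-worker-node"
MEMORY_LABEL = "mind-cluster/npu-chip-memory"


def parse_chip_memory_gb(value: str) -> int:
    """Parse a value such as "64G" into an integer number of gigabytes.

    Assumption: the value is a decimal number followed by "G".
    """
    text = value.strip().upper()
    if not text.endswith("G") or not text[:-1].isdigit():
        raise ValueError(f"unexpected on-chip memory value: {value!r}")
    return int(text[:-1])


def select_worker_nodes(
    nodes: Dict[str, Dict[str, str]],
    accelerator: str,
    min_memory_gb: int = 0,
) -> List[str]:
    """Return the sorted names of compute nodes that match the filters."""
    selected = []
    for name, labels in nodes.items():
        if labels.get(WORKER_LABEL) != WORKER_VALUE:
            continue  # not labeled as a MindCluster compute node
        if labels.get("accelerator") != accelerator:
            continue  # different processor type
        memory = labels.get(MEMORY_LABEL)
        # Nodes without the memory label are treated as having 0 GB.
        memory_gb = parse_chip_memory_gb(memory) if memory else 0
        if memory_gb >= min_memory_gb:
            selected.append(name)
    return sorted(selected)
```

The label dictionaries could come from any source, such as the parsed output of a node listing. Treating a missing memory label as 0 GB is a design choice of this sketch, not documented behavior.
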
Pod Labels
| Label | Description | Value | Required Component |
|---|---|---|---|
| ring-controller.atlas | Identifies a pod as an "Atlas" pod | | Ascend Device Plugin |
| vnpu-dvpp | DVPP set for a pod | yes: The pod uses DVPP. no: The pod does not use DVPP. null (default): Whether DVPP is used is not considered. | Volcano |
| vnpu-level | Level of the selected virtual instance template | low (default): low configuration. high: performance first. | Volcano |
| version | Version of a pod | String | Ascend Operator |
| volcano.sh/job-name | Name of the vcjob corresponding to a pod | String | Volcano |
| volcano.sh/job-namespace | Namespace of the vcjob corresponding to a pod | String | Volcano |
| volcano.sh/queue-name | Name of the queue corresponding to a pod | String | Volcano |
| volcano.sh/task-spec | Name of the task corresponding to a pod | String | Volcano |
| fault-type | Pod fault handling policy | | Volcano |
| deploy-name | Name of the deployment corresponding to a pod | String | Ascend Operator |
| group-name | Group name of the acjob corresponding to a pod | mindxdl.gitee.com | Volcano, Ascend Operator |
| job-name | Name of the acjob corresponding to a pod | String | Ascend Operator |
| replica-index | Index of a pod (label to be removed later) | [0, {Number of pods} - 1] | Ascend Operator |
| replica-type | Type of a pod (label to be removed later) | | Ascend Operator |
| training.kubeflow.org/job-name | Name of the acjob corresponding to a pod | String | Ascend Operator |
| training.kubeflow.org/job-role | Type of a pod | master | Ascend Operator |
| training.kubeflow.org/operator-name | Name of the operator that creates a pod | ascendjob-controller | Ascend Operator |
| training.kubeflow.org/replica-index | Index of a pod | [0, {Number of pods} - 1] | Ascend Operator |
| training.kubeflow.org/replica-type | Type of a pod | | Ascend Operator |
| super-pod-affinity | SuperPoD affinity scheduling policy | | Ascend Operator, Volcano |

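The pod labels above can be combined into a standard Kubernetes equality-based label selector (key=value pairs joined by commas). The following minimal sketch builds such a selector for the pods of one acjob. The helper names `build_label_selector` and `acjob_pod_selector` and the job name `demo-job` are hypothetical and used for illustration only.

```python
from typing import Dict, Optional


def build_label_selector(labels: Dict[str, str]) -> str:
    """Join labels into an equality-based selector, sorted by key."""
    return ",".join(f"{key}={value}" for key, value in sorted(labels.items()))


def acjob_pod_selector(job_name: str, replica_index: Optional[int] = None) -> str:
    """Build a selector for the pods of an acjob, optionally for one replica."""
    labels = {"training.kubeflow.org/job-name": job_name}
    if replica_index is not None:
        if replica_index < 0:
            # The table gives the range [0, {Number of pods} - 1].
            raise ValueError("replica index must not be negative")
        labels["training.kubeflow.org/replica-index"] = str(replica_index)
    return build_label_selector(labels)


print(acjob_pod_selector("demo-job", 0))
# training.kubeflow.org/job-name=demo-job,training.kubeflow.org/replica-index=0
```

The resulting string can be passed to a label selector option, for example `kubectl get pods -l <selector>`.
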
Pod Annotations
| Annotation | Description | Value | Required Component |
|---|---|---|---|
| ascend.kubectl.kubernetes.io/ascend-910-configuration | Data source of hccl.json generated by Ascend Operator | String in MAP format | Ascend Device Plugin, Ascend Operator |
| super_pod_id | SuperPoD ID information provided to Ascend Operator | Number | Ascend Operator |
| hccl/rankIndex | Basis for retaining the original rank ID during resumable training | [0, 1000] | Volcano, Ascend Operator |
| distributed-job | Type of training jobs | | Volcano |
| huawei.com/Ascend910 | Principles for Ascend Device Plugin to allocate processors to pods | String | Volcano, Ascend Device Plugin |
| huawei.com/AscendReal | Record of the processors allocated to pods by Ascend Device Plugin | String | Volcano, Ascend Device Plugin |
| huawei.com/npu-core | Physical ID of an NPU and its virtualization template used by pods | String | Volcano, Ascend Device Plugin |
| huawei.com/kltDev | Record of the processors allocated to pods by kubelet | String | Ascend Device Plugin |
| huawei.com/recover_policy_path | Rescheduling policy | pod: Only pod-level rescheduling is supported. Job-level rescheduling is not supported. | Volcano |
| huawei.com/schedule_minAvailable | Minimum number of replicas that can be scheduled by a job | Integer | Volcano |
| predicate-time | Sequential principles for Ascend Device Plugin to allocate processors to pods | String | Volcano, Ascend Device Plugin |
| isSharedTor | Switch attribute corresponding to a pod | Integer | Volcano |
| isHealthy | Status of the switch corresponding to a pod | Integer | Volcano |
| scheduling.k8s.io/group-name | Name of the podGroup corresponding to a pod | String | Volcano |
| volcano.sh/job-name | Name of the vcjob corresponding to a pod | String | Volcano |
| volcano.sh/job-version | Version of the vcjob corresponding to a pod | String | Volcano |
| volcano.sh/queue-name | Name of the queue corresponding to a pod | String | Volcano |
| volcano.sh/task-spec | Name of the task corresponding to a pod | String | Volcano |
| volcano.sh/template-uid | Name of the pod-template corresponding to a pod | String | Volcano |
| sharedTorIp | Information about the shared switch used by a job | String | Volcano, ClusterD |
| fault-job-delete | Rank information of a job | String | Volcano |
| mind-cluster/hardware-type=800I-A2-xx | xx indicates the on-chip memory of the current node, for example, mind-cluster/hardware-type=800I-A2-64G. | String | Volcano |
| super-pod-rank | Logical SuperPoD rank | Number | Ascend Operator, Volcano |
| inHotSwitchFlow | Indicates that a pod (faulty or backup) is in the hot switching process | true | ClusterD, Ascend Operator |
| backupNewPodName | Name of the backup pod started by the faulty pod | Name of the backup pod | ClusterD, Ascend Operator |
| backupSourcePodName | Name of the original pod corresponding to the backup pod | Name of the original pod | Ascend Operator |
| needOperatorOpe | Marks the current pod for processing by Ascend Operator | | ClusterD, Ascend Operator |
| needVolcanoOpe | Marks the current pod for processing by Volcano | delete: Volcano needs to delete the current pod. | ClusterD, Volcano |
| podType | Marks a pod as a backup pod | backup | ClusterD, Ascend Operator |

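Several annotations in the table above have a constrained Value column. As a rough sketch, the check below reports annotation values that fall outside what the table lists. It assumes that the listed values are the only valid ones, which the table does not guarantee, and the function name `check_pod_annotations` is hypothetical.

```python
from typing import Dict, List, Optional

# Values listed in the Value column of the pod annotation table.
# Assumption: no other values are valid for these keys.
LISTED_VALUES = {
    "huawei.com/recover_policy_path": {"pod"},
    "inHotSwitchFlow": {"true"},
    "needVolcanoOpe": {"delete"},
    "podType": {"backup"},
}


def _to_int(text: str) -> Optional[int]:
    """Return the integer value of text, or None if it is not an integer."""
    try:
        return int(text)
    except ValueError:
        return None


def check_pod_annotations(annotations: Dict[str, str]) -> List[str]:
    """Return a list of human-readable problems; empty if none are found."""
    problems = []

    rank_text = annotations.get("hccl/rankIndex")
    if rank_text is not None:
        rank = _to_int(rank_text)
        if rank is None or not 0 <= rank <= 1000:
            problems.append(f"hccl/rankIndex not in [0, 1000]: {rank_text!r}")

    min_text = annotations.get("huawei.com/schedule_minAvailable")
    if min_text is not None and _to_int(min_text) is None:
        problems.append(
            f"huawei.com/schedule_minAvailable is not an integer: {min_text!r}"
        )

    for key, listed in LISTED_VALUES.items():
        value = annotations.get(key)
        if value is not None and value not in listed:
            problems.append(f"{key} has an unlisted value: {value!r}")

    return problems
```

Annotations that are absent are not reported, because the table does not state which annotations are mandatory.
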
Node Annotations
| Annotation | Description | Value | Required Component |
|---|---|---|---|
| baseDeviceInfos | Basic processor information, such as the IP address, which is used for Volcano scheduling. | String | Volcano |
| product-serial-number | Serial number (SN) of the node. NodeD obtains the SN through IPMI and writes it to this annotation. ClusterD uses it to receive public faults. | String | ClusterD |
| superPodID | ID of the SuperPoD that a node belongs to. | String | ClusterD |
| ResetInfo | Information about the processor that fails to be automatically reset by Ascend Device Plugin, such as its physical ID and card ID. | String | Ascend Device Plugin |

Example of the ResetInfo content:
{
  "ThirdPartyResetDevs": [
    {
      "CardId": 0,
      "DeviceId": 0,
      "AssociatedCardId": 4,
      "PhyID": 0,
      "LogicID": 0
    }
  ],
  "ManualResetDevs": [
    {
      "CardId": 1,
      "DeviceId": 0,
      "AssociatedCardId": 5,
      "PhyID": 2,
      "LogicID": 2
    }
  ]
}
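To show how the ResetInfo content might be read programmatically, the following minimal sketch parses the annotation value as JSON and collects the physical IDs listed under each category. The function name `summarize_reset_info` is hypothetical, and the sketch assumes that every entry carries a PhyID field, as in the example above.

```python
import json
from typing import Dict, List


def summarize_reset_info(reset_info: str) -> Dict[str, List[int]]:
    """Map each category in ResetInfo to the PhyID values of its devices."""
    data = json.loads(reset_info)
    summary = {}
    for category, devices in data.items():
        # Assumption: each device entry has a "PhyID" field.
        summary[category] = [device["PhyID"] for device in devices]
    return summary


# Same content as the ResetInfo example, written compactly.
EXAMPLE = """
{
  "ThirdPartyResetDevs": [
    {"CardId": 0, "DeviceId": 0, "AssociatedCardId": 4, "PhyID": 0, "LogicID": 0}
  ],
  "ManualResetDevs": [
    {"CardId": 1, "DeviceId": 0, "AssociatedCardId": 5, "PhyID": 2, "LogicID": 2}
  ]
}
"""

print(summarize_reset_info(EXAMPLE))
# {'ThirdPartyResetDevs': [0], 'ManualResetDevs': [2]}
```
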
Kubernetes Service Accounts
| Account Name | Description |
|---|---|
| volcano-controllers | Created by the open source volcano-controller. |
| volcano-scheduler | Created by the open source volcano-scheduler. |
| ascend-device-plugin-sa-910, ascend-device-plugin-sa-310p, ascend-device-plugin-sa-310 | If you use YAML to start the service, the account is created in Kubernetes. The account name varies depending on the device model. |
| ascend-operator-manager | If you use YAML to start the service, the account is created in Kubernetes, for example, ascend-operator-v{version}.yaml. |
| resilience-controller | If you use YAML without tokens to start the service, the account is created in Kubernetes, and proper permissions are granted to the account. Security hardening is recommended. |
| noded | If you use YAML to start the service, the account is created in Kubernetes, for example, noded-v{version}.yaml. |
| clusterd | If you use YAML to start the service, the account is created in Kubernetes, for example, clusterd-v{version}.yaml. |
| default | Automatically created in Kubernetes when MindCluster components or open-source Volcano is deployed. |