YAML Parameters
Before configuring the YAML file of an acjob, familiarize yourself with its parameters. For details, see Table 1.
Each acjob YAML file contains some fixed fields, such as apiVersion and kind. For more information about these fields, see Key Fields in acjob.
Parameter |
Value |
Description |
|---|---|---|
(.kind=="AscendJob").metadata.labels.framework |
|
Framework type. Currently, only three types are supported. |
(.kind=="AscendJob").metadata.labels.jobID |
Unique ID of the MindIE Motor job in the cluster. Set this parameter as required. |
This parameter can be used only on the Atlas 800I A3 SuperPoD Server and Atlas 800I A2 inference server. |
(.kind=="AscendJob").metadata.labels.app |
Role of the MindIE Motor task in acjob (Ascend job). The value can be mindie-ms-controller, mindie-ms-coordinator, or mindie-ms-server. |
NOTE:
|
(.kind=="AscendJob").metadata.labels.mind-cluster/scaling-rule: scaling-rule |
Name of the ConfigMap of the scaling rule. |
This parameter can be used only for MindIE Motor inference jobs on the Atlas 800I A3 SuperPoD Server and Atlas 800I A2 inference server. |
(.kind=="AscendJob").metadata.labels.mind-cluster/group-name: group0 |
Name of the group of the scaling rule. |
This parameter can be used only for MindIE Motor inference jobs on the Atlas 800I A3 SuperPoD Server and Atlas 800I A2 inference server. |
(.kind=="AscendJob").metadata.labels."ring-controller.atlas" |
|
Processor type for specified products. You need to set this parameter in the ConfigMap and task. |
(.kind=="AscendJob").metadata.labels.tor-affinity |
NOTE: You need to select a job type based on the number of job replicas. If the number of job replicas is less than 4, the job is a padding job. If the number of job replicas is greater than or equal to 4, the job is a foundation model job. The number of replicas of a common job is not limited. |
The default value is null, indicating that switch affinity scheduling is not used. You need to set this parameter based on the job type. NOTE:
|
(.kind=="AscendJob").metadata.labels.pod-rescheduling |
|
For pod-level rescheduling, if a job is faulty, the system does not delete all pods of the job. Instead, the system deletes the faulty pods, creates new pods, and reschedules the pods. NOTE:
|
(.kind=="AscendJob").metadata.labels.process-recover-enable |
|
Ascend Operator automatically adds the process-recover-enable=on label to the job based on the configured recover-strategy. You do not need to manually specify the label. |
(.kind=="AscendJob").metadata.annotations.recover-strategy |
Available recovery policy.
|
recover-strategy is configured in annotations of the job YAML file. The value can be any combination of the six strategies. Use commas (,) to separate the strategies. |
(.kind=="AscendJob").metadata.labels.subHealthyStrategy |
|
Processing policy for nodes in the SubHealthy status. NOTE:
|
(.kind=="AscendJob").spec.schedulerName |
The default value is volcano. Set this parameter based on your actual requirements. |
Scheduler selected when Ascend Operator enables gang scheduling. |
(.kind=="AscendJob").spec.runPolicy.schedulingPolicy.minAvailable |
The default value is the total number of job replicas. |
Total number of job replicas when Ascend Operator enables gang scheduling and Volcano is used as the scheduler. |
(.kind=="AscendJob").spec.runPolicy.schedulingPolicy.queue |
The default value is default. Set this parameter based on your actual requirements. |
Queue to which a job belongs when Ascend Operator enables gang scheduling and Volcano is used as the scheduler. |
(Optional) (.kind=="AscendJob").spec.successPolicy |
|
Condition for a job to be considered successful. If the value is empty, the entire job is considered successful as soon as one pod succeeds. AllWorkers indicates that all pods must succeed for the job to be considered successful. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers[0].name |
ascend |
The container name must be ascend. |
(Optional) (.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers[0].ports |
If you do not set corresponding parameters, the system fills in the following values by default:
|
Collective communication port for distributed training. The value of name can only be ascendjob-port. You can set containerPort as required. If containerPort is not set, the default port 2222 is used. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.replicas |
|
N indicates the number of job replicas. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers[0].image |
- |
Training image name. Set this parameter as required. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.nodeSelector.host-arch |
Arm environment: huawei-arm; x86_64 environment: huawei-x86 |
Architecture of the node where a training job is executed. Set this parameter as required. In a distributed training job, ensure that the nodes running the training job have the same architecture. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.nodeSelector.accelerator-type |
|
Set this parameter based on the type of the node where a training job is executed.
NOTE:
You can run the npu-smi info command to query the number in the chip model name, which is indicated by the Name field in the returned message. As an example below, the value of {xxx} is 910. |
(.kind=="AscendJob").metadata.annotations."huawei.com/schedule_policy" |
See Table 2 for its configurations. |
Job's AI processor layout to be scheduled. Volcano selects a proper scheduling policy based on this field. If this parameter is not set, the scheduling policy is selected based on accelerator-type. NOTE:
This field can be used only on the Atlas training product, |
(.kind=="AscendJob").metadata.annotations.sp-block |
Number of processors on logical SuperPoDs.
|
Cluster scheduling components divide logical SuperPoDs on physical SuperPoDs based on the division policy for affinity scheduling of training jobs. If this field is not specified, Volcano sets the size of the logical SuperPoD of a job to the total number of NPUs configured for the job during scheduling. For details, see UnifiedBus Interconnect Device Network Description. NOTE:
|
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers[0].resources.requests."huawei.com/Ascend910" |
Atlas 800 training server (fully populated with NPUs):
Atlas 800 training server (half populated with NPUs):
Server (with Atlas 300T training cards):
Atlas 800T A2 training server and Atlas 900 A2 PoD cluster basic unit:
Atlas 200T A2 Box16 heterogeneous subrack:
Atlas 900 A3 SuperPoD, A200T A3 Box8 SuperPoD Server, and Atlas 800T A3 SuperPoD Server:
|
Number of requested NPUs. Set this parameter as required. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.containers[0].env[name==ASCEND_VISIBLE_DEVICES].valueFrom.fieldRef.fieldPath |
The value is in the format of metadata.annotations['huawei.com/AscendXXX'], where XXX indicates the processor model (910, 310, or 310P). The value must be the same as the actual processor type in the environment. |
Ascend Docker Runtime obtains the value of this parameter and mounts NPUs of the corresponding type to a container. NOTE:
This parameter applies only to the full NPU scheduling feature that uses the Volcano scheduler. If you use static vNPU scheduling and other schedulers, delete fields of this parameter from the example YAML file. |
(.kind=="AscendJob").metadata.labels.fault-scheduling |
grace |
Enables graceful deletion. The original pod is gracefully deleted first. If graceful deletion does not succeed within 15 minutes, the pod is forcibly deleted. Set this parameter to grace for process-level rescheduling and process-level online recovery. |
force |
Enables forced deletion. The original pod of the job is forcibly deleted. |
|
off |
The job does not use resumable training, but maxRetry of Kubernetes still takes effect. |
|
None (no fault-scheduling field) |
||
Other values |
||
(.kind=="AscendJob").metadata.labels.fault-retry-times |
fault-retry-times > 0 |
To rectify service plane faults, you must configure the number of unconditional retries on the service plane. NOTE:
|
None (no fault-retry-times) or 0 |
The job does not use the unconditional retry function and cannot detect service plane faults, but maxRetry of vcjob still takes effect. |
|
(.kind=="AscendJob").spec.runPolicy.backoffLimit |
backoffLimit > 0 |
Maximum number of times a faulty job is rescheduled. Once the number of rescheduling times reaches the value of backoffLimit, the job is no longer rescheduled. NOTE:
If both backoffLimit and fault-retry-times are configured, rescheduling stops as soon as the number of rescheduling times reaches the value of either backoffLimit or fault-retry-times. |
None (no backoffLimit) or backoffLimit ≤ 0 |
The total number of rescheduling times is not limited. NOTE:
If backoffLimit is not configured but fault-retry-times is configured, the number of rescheduling times is specified by fault-retry-times. |
|
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.restartPolicy |
NOTE:
Training jobs of the vcjob type do not support ExitCode. |
Container restart policy. When unconditional retry upon service plane faults is configured, the value of this parameter must be Never. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.terminationGracePeriodSeconds |
0 < terminationGracePeriodSeconds < grace-over-time |
Duration from the time when the container receives SIGTERM to the time when the container is forcibly stopped by Kubernetes. The value must be greater than 0 and less than the value of grace-over-time in the volcano-v{version}.yaml file. In addition, ensure that the checkpoint file can be saved completely. Change the value as required. For details, see Container Lifecycle Hooks on the Kubernetes official website. NOTE:
This field takes effect only when fault-scheduling is set to grace. If fault-scheduling is set to force, this field is invalid. |
(.kind=="AscendJob").spec.replicaSpecs.{Master|Scheduler|Worker}.template.spec.hostNetwork |
|
|
(.kind=="AscendJob").metadata.annotations.wait-reschedule-timeout |
30~270 |
Timeout interval for waiting for the rescheduling of the faulty node during process-level rescheduling, in seconds. The default value is 270. |
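The following is a minimal sketch of how several of the parameters in Table 1 might fit together in one acjob YAML file. It is illustrative only and not a complete or verified job definition. The apiVersion value, the job name, the image name, the replica count, the retry counts, the NPU count, and the terminationGracePeriodSeconds value are assumptions chosen for the example, and only one replica type is shown for brevity. The limits entry is added because Kubernetes requires limits for extended resources such as NPUs, and they must equal the requests.

```yaml
# Illustrative sketch; every value marked "assumed" or "hypothetical" is not taken from Table 1.
apiVersion: mindxdl.gitee.com/v1        # assumed; see Key Fields in acjob
kind: AscendJob
metadata:
  name: example-acjob                   # hypothetical job name
  labels:
    fault-scheduling: grace             # graceful deletion of the original pod
    fault-retry-times: "3"              # assumed count; label values must be strings
spec:
  schedulerName: volcano                # default scheduler
  runPolicy:
    backoffLimit: 3                     # assumed; rescheduling stops when either limit is reached
    schedulingPolicy:
      minAvailable: 2                   # defaults to the total number of job replicas
      queue: default                    # default queue
  successPolicy: AllWorkers             # optional; all pods must succeed
  replicaSpecs:
    Worker:
      replicas: 2                       # assumed replica count
      restartPolicy: Never              # required when fault-retry-times is configured
      template:
        spec:
          # Assumed value; must be greater than 0 and less than grace-over-time
          # in the volcano-v{version}.yaml file.
          terminationGracePeriodSeconds: 600
          nodeSelector:
            host-arch: huawei-arm       # use huawei-x86 in an x86_64 environment
          containers:
            - name: ascend              # the container name must be ascend
              image: training-image:tag # hypothetical image name
              ports:
                - name: ascendjob-port  # the only permitted port name
                  containerPort: 2222   # default port
              env:
                - name: ASCEND_VISIBLE_DEVICES
                  valueFrom:
                    fieldRef:
                      fieldPath: metadata.annotations['huawei.com/Ascend910']
              resources:
                requests:
                  huawei.com/Ascend910: 8   # assumed NPU count for an eight-processor node
                limits:
                  huawei.com/Ascend910: 8   # must equal requests for extended resources
```

The accelerator-type node selector and the ring-controller.atlas label are omitted here because their values depend on the product and are not listed in this section. Set them according to your hardware before using a file like this.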
| Configuration | Description |
|---|---|
| chip4-node8 | One node has eight processors, and every four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010). |
| chip1-node2 | One node has two processors. For example, one Atlas 300T training card can be equipped with only one processor, and one node can be equipped with a maximum of two Atlas 300T training cards. |
| chip4-node4 | One node has four processors, and the four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010). |
| chip8-node8 | One node has eight processors, and the eight processors form one interconnection ring, for example, the processor layout of the Atlas 800T A2 training server. |
| chip8-node16 | One node has 16 processors, and every eight processors form one interconnection ring, for example, the processor layout of the Atlas 200T A2 Box16 heterogeneous subrack. |
| chip2-node16 | One node has 16 processors, and every two processors form one interconnection ring, for example, the processor layout of the Atlas 800T A3 SuperPoD Server. |
| chip2-node16-sp | One node has 16 processors, every two processors form one interconnection ring, and multiple servers form a SuperPoD, for example, the processor layout of the Atlas 900 A3 SuperPoD. |
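As a sketch of how one of the configurations listed above could be applied, the fragment below sets the huawei.com/schedule_policy annotation on an acjob. The choice of chip8-node8 is an assumption for illustration. Pick the configuration that matches the processor layout of your nodes.

```yaml
# Fragment of an acjob YAML file; only the annotation is shown.
metadata:
  annotations:
    # Assumed layout: eight processors per node on one interconnection ring.
    huawei.com/schedule_policy: chip8-node8
```

If this annotation is omitted, the scheduling policy is selected based on accelerator-type, as described in Table 1.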