YAML Parameters

The following table describes only the fields related to MindCluster in the StormService YAML file of AIBrix.

Table 1 YAML parameters

Parameter

Value

Description

schedulerName

volcano

Volcano is used as a scheduler.

(Optional) host-arch

  • Arm: huawei-arm
  • x86_64: huawei-x86

Architecture of the node where a training job is executed. Set this parameter as required.

In a distributed training job, ensure that the nodes running the training job have the same architecture.

sp-block

Number of processors of logical SuperPoDs.

The value must be an integer multiple of the number of processors on a node, and the total number of processors requested by prefill/decode instances must be an integer multiple of the value.

Based on the division policy, the cluster scheduling components partition physical SuperPoDs into logical SuperPoDs for affinity scheduling of training jobs. If this field is not specified, Volcano sets the logical SuperPoD size of a job to the total number of NPUs configured for the job during scheduling.

For details, see UnifiedBus Interconnect Device Network Description.

NOTE:

It can be used only on the Atlas 900 A3 SuperPoD.
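The two divisibility rules for sp-block can be checked before a job is submitted. The following is a minimal sketch only: the function name and its arguments are hypothetical and are not part of MindCluster or Volcano, and "processors" is taken to mean NPUs.

```python
def validate_sp_block(sp_block: int, npus_per_node: int, total_requested_npus: int) -> list[str]:
    """Return the list of violated sp-block rules (empty list means valid).

    Hypothetical helper, not a MindCluster API.
    """
    if sp_block <= 0 or npus_per_node <= 0:
        return ["sp-block and npus_per_node must be positive integers"]
    errors = []
    # Rule 1: sp-block is an integer multiple of the processors on one node.
    if sp_block % npus_per_node != 0:
        errors.append("sp-block is not a multiple of the NPUs per node")
    # Rule 2: the total requested by prefill/decode instances is an integer
    # multiple of sp-block.
    if total_requested_npus % sp_block != 0:
        errors.append("total requested NPUs is not a multiple of sp-block")
    return errors


# 16 NPUs per node, 64 NPUs requested in total.
print(validate_sp_block(32, 16, 64))  # no violations
print(validate_sp_block(24, 16, 64))  # violates both rules
```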

pod-rescheduling

  • on: Enable pod-level rescheduling.
  • Any other value, or field not set: Disable pod-level rescheduling.

Pod-level rescheduling means that when a job fails, not all pods in the PodGroup are deleted. Instead, only the faulty pods are deleted, and the controller creates new pods for rescheduling.

NOTE:

If podGroupSize is set to 1, pod-rescheduling must be set to on. If podGroupSize is greater than 1, do not set this parameter.
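The rule in this note can be expressed as a small helper. This is a sketch under assumptions: the function name is hypothetical, and where the resulting key is placed in the StormService YAML file is not covered here.

```python
def pod_rescheduling_setting(pod_group_size: int) -> dict[str, str]:
    """Return the pod-rescheduling entry implied by podGroupSize.

    Hypothetical helper, not a MindCluster API.
    """
    if pod_group_size < 1:
        raise ValueError("podGroupSize must be at least 1")
    if pod_group_size == 1:
        # podGroupSize = 1: pod-rescheduling must be set to on.
        return {"pod-rescheduling": "on"}
    # podGroupSize > 1: the field must not be set at all.
    return {}


print(pod_rescheduling_setting(1))
print(pod_rescheduling_setting(2))
```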

huawei.com/schedule_minAvailable

Numeric string

Minimum number of replicas scheduled in the Gang scheduling policy. In StormService:

  • Instances with podGroupSize = 1 form one PodGroup for scheduling, and the schedulable replica count ranges from 1 to the sum of all instance replicas (recommended).
  • Each instance with podGroupSize > 1 forms an independent PodGroup and the schedulable replica count ranges from 1 to podGroupSize (recommended).

For example, for a prefill instance with podGroupSize = 1 and decode instance with podGroupSize = 2, the minimum number of schedulable replicas of the prefill instance is its number of replicas, and the minimum number of schedulable replicas of the decode instance is equal to its podGroupSize.
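The recommended values can be derived from the instance definitions. The sketch below assumes that "(recommended)" refers to the upper bound of each range. The function name and the input layout are hypothetical; only replicas and podGroupSize come from the table.

```python
def recommended_min_available(instances: dict[str, dict[str, int]]) -> dict[str, str]:
    """Map each instance name to its recommended minAvailable value.

    Values are returned as numeric strings, as the parameter requires.
    Hypothetical helper, not a MindCluster API.
    """
    # Instances with podGroupSize = 1 share one PodGroup, so the
    # recommended value is the sum of their replicas.
    shared_total = sum(
        spec["replicas"] for spec in instances.values() if spec["podGroupSize"] == 1
    )
    result = {}
    for name, spec in instances.items():
        if spec["podGroupSize"] == 1:
            result[name] = str(shared_total)
        else:
            # Each instance with podGroupSize > 1 is its own PodGroup.
            result[name] = str(spec["podGroupSize"])
    return result


roles = {
    "prefill": {"replicas": 3, "podGroupSize": 1},
    "decode": {"replicas": 2, "podGroupSize": 2},
}
print(recommended_min_available(roles))  # {'prefill': '3', 'decode': '2'}
```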

huawei.com/recover_policy_path

"pod"

Path for job execution recovery when pod-rescheduling is set to on. If it is set to pod, job-level rescheduling is not triggered when pod-level rescheduling fails. Each pod in the current PodGroup is an independent instance, so fault handling cannot be spread to other instances.

accelerator-type

  • Atlas 800I A2 inference server: module-910b-8
  • Atlas 800I A3 SuperPoD Server: module-a3-16
  • Atlas 900 A3 SuperPoD: module-a3-16-super-pod

Set this parameter based on the type of the node where a training job is executed.

huawei.com/Ascend910

  • Atlas 800I A2 inference server: 8
  • Atlas 900 A3 SuperPoD/Atlas 800I A3 SuperPoD Server: 16

Number of required NPUs. Currently, only full-server scheduling is supported. Set the value to the actual number of used NPUs.
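Because accelerator-type and huawei.com/Ascend910 both depend on the server model, the two values can be kept together in one lookup table. This is an illustrative sketch: the dictionary and function names are hypothetical, and the values are copied from the table entries above.

```python
# Server model -> (accelerator-type value, NPUs per full server).
NODE_SETTINGS = {
    "Atlas 800I A2 inference server": ("module-910b-8", 8),
    "Atlas 800I A3 SuperPoD Server": ("module-a3-16", 16),
    "Atlas 900 A3 SuperPoD": ("module-a3-16-super-pod", 16),
}


def node_settings(server_model: str) -> dict[str, object]:
    """Return the accelerator-type and NPU count for a server model.

    Hypothetical helper; raises KeyError for unknown models.
    """
    accelerator_type, npu_count = NODE_SETTINGS[server_model]
    return {
        "accelerator-type": accelerator_type,
        # Only full-server scheduling is supported, so request all NPUs.
        "huawei.com/Ascend910": npu_count,
    }


print(node_settings("Atlas 900 A3 SuperPoD"))
```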

env[name==ASCEND_VISIBLE_DEVICES].valueFrom.fieldRef.fieldPath

The value is metadata.annotations['huawei.com/Ascend910'], which must be the same as the actual processor type used in the environment.

Ascend Docker Runtime obtains the value of this parameter and mounts NPUs of the corresponding type to a container.

NOTE:

This parameter applies only to the full NPU scheduling feature of the Volcano scheduler. If you use static vNPU scheduling or another scheduler, delete this parameter from the example YAML file.

fault-scheduling

grace

Enables graceful deletion. The original pod is gracefully deleted first. If graceful deletion has not been successful within 15 minutes, it is forcibly deleted.

force

Enables forcible deletion. The original pod is forcibly deleted.

  • off
  • None (no fault-scheduling field)
  • Other values

Rescheduling upon faults is disabled.

fault-retry-times

0 < fault-retry-times

To rectify service plane faults, you must configure the number of unconditional retries on the service plane.

None (no fault-retry-times field) or 0

Unconditional retry is not triggered, and Volcano does not delete the faulty pod after a service plane fault occurs.

restartPolicy

  • Never: never restart
  • Always: always restart
  • OnFailure: restart upon failures
  • ExitCode: restart is decided by the process exit code. If the code is in the range 1 to 127, the pod is not restarted. If the code is in the range 128 to 255, the pod is restarted.
    NOTE:

    Training jobs of the vcjob type do not support ExitCode.

Container restart policy. When unconditional retry upon service plane faults is configured, the value of this parameter must be Never.
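The ExitCode ranges and the constraint tied to unconditional retry can be sketched as two small checks. The function names are hypothetical. Exit codes outside 1 to 255 (including 0) are not covered by the description, so the sketch returns None for them rather than guessing. "Unconditional retry is configured" is taken to mean fault-retry-times greater than 0.

```python
from typing import Optional

VALID_POLICIES = {"Never", "Always", "OnFailure", "ExitCode"}


def exit_code_restarts(exit_code: int) -> Optional[bool]:
    """Restart decision under the ExitCode policy (hypothetical helper)."""
    if 1 <= exit_code <= 127:
        return False  # not restarted
    if 128 <= exit_code <= 255:
        return True  # restarted
    return None  # behavior not described for this code


def restart_policy_allowed(restart_policy: str, fault_retry_times: int) -> bool:
    """Check restartPolicy against the unconditional-retry rule."""
    if restart_policy not in VALID_POLICIES:
        return False
    if fault_retry_times > 0:
        # Unconditional retry is configured: only Never is allowed.
        return restart_policy == "Never"
    return True


print(exit_code_restarts(1), exit_code_restarts(137))
print(restart_policy_allowed("OnFailure", 3))
```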