PodGroup

Table 1 PodGroup labels of the cluster scheduling components

Parameter

Description

Value

Required Component

ring-controller.atlas

Identifies the Atlas processor type used by the pod

  • ascend-910
  • ascend-{xxx}b

Ascend Device Plugin, Ascend Operator, Volcano

fault-scheduling

Whether to enable fault rescheduling

grace, force, and off

Volcano, Resilience Controller

elastic-scheduling

Whether to enable job elastic scheduling

on

Resilience Controller, Volcano

fault-retry-times

Number of times that a job can be rescheduled due to a service plane fault

0–100

Volcano, Ascend Operator

tor-affinity

Affinity policy for top-of-rack (ToR) switches

  • normal-schema
  • large-model-schema
  • null

Volcano

npu-310-strategy

Scheduling policy of the inference server (with Atlas 300I inference cards)

  • card
  • chip

Volcano

pod-rescheduling

Whether to enable pod-level rescheduling

  • on: Enables pod-level rescheduling.
  • Any other value, or the field is omitted: Disables pod-level rescheduling.

Volcano

process-recover-enable

Whether to enable process-level rescheduling

  • on: Enables process-level rescheduling.
  • Any other value, or the field is omitted: Disables process-level rescheduling.

Volcano

subHealthyStrategy

Handling policy for subhealthy nodes

  • ignore: Ignores the subhealthy node and keeps using it. The node is deprioritized during affinity scheduling of subsequent jobs.
  • graceExit: Stops using the subhealthy node and performs rescheduling after the dying gasp checkpoint file is saved. Subsequent jobs are not scheduled to this node.
  • forceExit: Stops using the subhealthy node, exits the job without saving files, and performs rescheduling. Subsequent jobs are not scheduled to this node.
  • hotSwitch: Performs hot switching. A backup pod is started, the training job is suspended, and the training job is then restarted on the new node.

Volcano
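The value constraints in Table 1 can be checked before a job is submitted. The Python sketch below is illustrative only: the function name validate_podgroup_labels and the dict-based representation of labels are assumptions, not part of any cluster scheduling component. It covers only the labels whose values Table 1 lists explicitly. ring-controller.atlas is skipped because its value list contains the placeholder ascend-{xxx}b, and pod-rescheduling and process-recover-enable are skipped because any value is accepted for them.

```python
# Allowed values copied from Table 1. Labels that accept any value are omitted.
ALLOWED_VALUES = {
    "fault-scheduling": {"grace", "force", "off"},
    "elastic-scheduling": {"on"},
    "tor-affinity": {"normal-schema", "large-model-schema", "null"},
    "npu-310-strategy": {"card", "chip"},
    "subHealthyStrategy": {"ignore", "graceExit", "forceExit", "hotSwitch"},
}


def validate_podgroup_labels(labels):
    """Return a list of problems found in a dict of label name -> string value."""
    problems = []
    for name, value in labels.items():
        if name == "fault-retry-times":
            # Table 1 gives the range 0 to 100 for this label.
            is_number = value.isascii() and value.isdigit()
            if not (is_number and 0 <= int(value) <= 100):
                problems.append(f"{name}: expected an integer from 0 to 100, got {value!r}")
        elif name in ALLOWED_VALUES and value not in ALLOWED_VALUES[name]:
            allowed = ", ".join(sorted(ALLOWED_VALUES[name]))
            problems.append(f"{name}: expected one of {allowed}, got {value!r}")
    return problems


if __name__ == "__main__":
    example = {"fault-scheduling": "grace", "fault-retry-times": "3", "tor-affinity": "fast"}
    for problem in validate_podgroup_labels(example):
        print(problem)
```

Labels that are not listed in ALLOWED_VALUES pass through unchecked, so the sketch does not reject labels that belong to other components.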

Table 2 PodGroup annotations of the cluster scheduling components

Parameter

Description

Value

Required Component

sp-block

Number of processors in a logical SuperPoD

Integer

Volcano, Ascend Operator

huawei.com/schedule_policy

Scheduling policy

See Table 3 for the supported configurations.

Volcano

sp-fit

SuperPoD scheduling policy

idlest: Schedules the job to a more idle physical SuperPoD.

Volcano

huawei.com/schedule_minAvailable

Minimum number of replicas that must be schedulable for a job

Integer

Volcano

huawei.com/recover_policy_path

Rescheduling policy

pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported.

Volcano

huawei.com/schedule_enable_dequeue

Whether to dequeue a job (changing its status from Inqueue to Pending). This parameter must be configured manually.

  • on: enabled
  • Other values: disabled

If this parameter is not set, the function is disabled by default.

Volcano

huawei.com/schedule_dequeue_frequency

Number of times that a job is dequeued. The value is automatically updated by Volcano.

The value increases by 1 each time a job is dequeued.

NOTE:

Delete the value if the job is not in the Inqueue or Pending status.

Volcano

huawei.com/schedule_enqueue_time

Time when a job is enqueued (changing its status from Pending to Inqueue). The value is automatically updated by Volcano.

Millisecond-level timestamp.

NOTE:
  • If a job has been enqueued for longer than 5 minutes and the dequeue function is enabled, the job is dequeued to free resources for other jobs.
  • Delete the value if the job is not in the Inqueue status.

Volcano
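The interaction between huawei.com/schedule_enable_dequeue and huawei.com/schedule_enqueue_time can be sketched in Python as follows. This is a hedged illustration of the documented rule, not Volcano's implementation: the function name should_dequeue, the constant names, and the reading of the 5-minute limit as time elapsed since the recorded enqueue timestamp are all assumptions.

```python
DEQUEUE_TIMEOUT_MS = 5 * 60 * 1000  # the 5-minute limit from the note in Table 2

ENABLE_KEY = "huawei.com/schedule_enable_dequeue"
ENQUEUE_TIME_KEY = "huawei.com/schedule_enqueue_time"


def should_dequeue(annotations, now_ms):
    """Decide whether an enqueued job is due to be dequeued.

    annotations: dict of annotation name -> string value
    now_ms: current time as a millisecond-level timestamp
    """
    # Only the value "on" enables the function; unset or any other value disables it.
    if annotations.get(ENABLE_KEY) != "on":
        return False
    enqueue_time = annotations.get(ENQUEUE_TIME_KEY)
    if enqueue_time is None:
        # Assumption: without a recorded enqueue time the job is not in the Inqueue status.
        return False
    return now_ms - int(enqueue_time) > DEQUEUE_TIMEOUT_MS
```

For example, with the function enabled and an enqueue timestamp six minutes in the past, should_dequeue returns True; with the annotation unset it always returns False.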

Table 3 huawei.com/schedule_policy configuration description

Configuration

Description

chip4-node8

One node has eight processors, and every four processors form one interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).

chip1-node2

One node has two processors. For example, each Atlas 300T training card carries only one processor, and one node can hold a maximum of two Atlas 300T training cards.

chip4-node4

One node has four processors, and the four processors form one interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).

chip8-node8

One node has eight processors, and the eight processors form one interconnection ring, for example, the processor layout of the Atlas 800T A2 training server.

chip8-node16

One node has 16 processors, and every eight processors form one interconnection ring, for example, the processor layout of the Atlas 200T A2 Box16 heterogeneous subrack.

chip2-node16

One node has 16 processors, and every two processors form one interconnection ring, for example, the processor layout of the Atlas 800T A3 SuperPoD Server.

chip2-node16-sp

One node has 16 processors, every two processors form one interconnection ring, and multiple servers form a SuperPoD, for example, the processor layout of the Atlas 900 A3 SuperPoD.
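The configuration names in Table 3 follow a visible pattern, which the Python sketch below decodes. It is illustrative only: the function name parse_schedule_policy and the returned tuple are assumptions, and reading the chip number as the number of processors per interconnection ring is inferred from the descriptions above. For chip1-node2 the table describes one processor per card rather than a ring.

```python
import re

# Pattern inferred from the seven values in Table 3: chip<group>-node<total>[-sp]
_POLICY_RE = re.compile(r"chip(\d+)-node(\d+)(-sp)?")

# The configurations listed in Table 3; anything else is rejected.
KNOWN_POLICIES = {
    "chip4-node8", "chip1-node2", "chip4-node4", "chip8-node8",
    "chip8-node16", "chip2-node16", "chip2-node16-sp",
}


def parse_schedule_policy(policy):
    """Split a Table 3 name into (chips_per_group, processors_per_node, is_superpod)."""
    if policy not in KNOWN_POLICIES:
        raise ValueError(f"unknown schedule policy: {policy!r}")
    match = _POLICY_RE.fullmatch(policy)
    chips_per_group = int(match.group(1))
    processors_per_node = int(match.group(2))
    return chips_per_group, processors_per_node, match.group(3) is not None
```

Checking against KNOWN_POLICIES first keeps the sketch from accepting names that merely match the pattern but are not documented in Table 3.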