PodGroup

**Table 1** PodGroup labels of the cluster scheduling components
Parameter	Description	Value	Required Component
ring-controller.atlas	Pod that identifies "Atlas"	ascend-910 or *ascend-{xxx}b*	Ascend Device Plugin, Ascend Operator, Volcano
fault-scheduling	Whether to enable fault rescheduling	grace, force, and off	Volcano, Resilience Controller
elastic-scheduling	Whether to enable job elastic scheduling	on	Resilience Controller, Volcano
fault-retry-times	Number of times that a job can be rescheduled due to a service plane fault	0–100	Volcano, Ascend Operator
tor-affinity	Switch affinity policy	normal-schema, large-model-schema, or null	Volcano
npu-310-strategy	Scheduling policy of the inference server (with Atlas 300I inference cards)	card or chip	Volcano
pod-rescheduling	Whether to enable pod-level rescheduling	on: Enables pod-level rescheduling. Any other value, or omitting this field: Disables pod-level rescheduling.	Volcano
process-recover-enable	Whether to enable process-level rescheduling	on: Enables process-level rescheduling. Any other value, or omitting this field: Disables process-level rescheduling.	Volcano
subHealthyStrategy	Subhealthy node handling policy	ignore: Ignore the subhealthy node. The node is not preferentially scheduled during affinity scheduling of subsequent jobs. graceExit: Stop using the subhealthy node and perform rescheduling after the dying gasp checkpoint file is saved. Subsequent jobs will not be scheduled to this node. forceExit: Stop using the subhealthy node, exit the job without saving files, and perform rescheduling. Subsequent jobs will not be scheduled to this node. hotSwitch: Perform hot switching. After the backup pod starts, suspend the training job and restart it on the new node.	Volcano
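
The Value column of Table 1 can be turned into a simple pre-submission check. The following Python sketch is illustrative only: `ALLOWED_VALUES` and `validate_labels` are hypothetical names, not part of Volcano or any Ascend component. It assumes label values are passed as strings and that `null` for tor-affinity is the literal string. Labels such as pod-rescheduling are not checked because any value other than `on` is valid and simply disables the feature.

```python
# Illustrative only: these checks mirror the Value column of Table 1.
# They are not part of Volcano or any Ascend component.
ALLOWED_VALUES = {
    "fault-scheduling": {"grace", "force", "off"},
    "elastic-scheduling": {"on"},
    # Assumption: "null" is the literal string shown in the table.
    "tor-affinity": {"normal-schema", "large-model-schema", "null"},
    "npu-310-strategy": {"card", "chip"},
    "subHealthyStrategy": {"ignore", "graceExit", "forceExit", "hotSwitch"},
}


def validate_labels(labels):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for key, value in labels.items():
        if key in ALLOWED_VALUES:
            if value not in ALLOWED_VALUES[key]:
                problems.append(f"{key}: unexpected value {value!r}")
        elif key == "fault-retry-times":
            # Table 1 gives the range 0 to 100 for this label.
            is_int = value.isascii() and value.isdigit()
            if not (is_int and 0 <= int(value) <= 100):
                problems.append(f"{key}: expected an integer from 0 to 100, got {value!r}")
    return problems
```

For example, `validate_labels({"fault-scheduling": "grace", "fault-retry-times": "3"})` returns an empty list, while a value such as `"101"` for fault-retry-times is reported as a problem.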

**Table 2** PodGroup annotations of the cluster scheduling components
Resource	Description	Value	Required Component
sp-block	Number of processors in a logical SuperPoD	Integer	Volcano, Ascend Operator
huawei.com/schedule_policy	Scheduling policy.	See Table 3 for its configurations.	Volcano
sp-fit	SuperPoD scheduling policy	idlest: Schedules the job to a more idle physical SuperPoD.	Volcano
huawei.com/schedule_minAvailable	Minimum number of replicas that can be scheduled by a job.	Integer	Volcano
huawei.com/recover_policy_path	Rescheduling policy	pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported.	Volcano
huawei.com/schedule_enable_dequeue	Whether to dequeue a job (changing its status from Inqueue to Pending). This parameter must be configured manually.	on: Enabled. Other values: Disabled. If this parameter is not set, the function is disabled by default.	Volcano
huawei.com/schedule_dequeue_frequency	Number of times that a job is dequeued. The value is automatically updated by Volcano.	The value increases by 1 each time a job is dequeued. NOTE: Delete the value if the job is not in the Inqueue or Pending status.	Volcano
huawei.com/schedule_enqueue_time	Time when a job is enqueued (changing its status from Pending to Inqueue). The value is automatically updated by Volcano.	Millisecond-level timestamp. NOTE: If enqueuing a job takes longer than 5 minutes and the dequeue function is enabled, the job is removed to free resources for other jobs. Delete the value if the job is not in the Inqueue status.	Volcano
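
The interaction between the dequeue switch and the enqueue timestamp described in Table 2 can be sketched as follows. This is a minimal illustration, not Volcano's implementation: the function names are hypothetical, and it assumes the timestamp is a Unix epoch value in milliseconds stored as a string, which the table does not state explicitly.

```python
import time

# The 5-minute limit stated in the NOTE for huawei.com/schedule_enqueue_time.
ENQUEUE_TIMEOUT_MS = 5 * 60 * 1000


def dequeue_enabled(annotations):
    # Only the value "on" enables the function; a missing key or any
    # other value leaves it disabled, as described in Table 2.
    return annotations.get("huawei.com/schedule_enable_dequeue") == "on"


def exceeds_enqueue_timeout(annotations, now_ms=None):
    """Return True if dequeue is enabled and enqueuing has taken over 5 minutes."""
    if not dequeue_enabled(annotations):
        return False
    enqueue_time = annotations.get("huawei.com/schedule_enqueue_time")
    if enqueue_time is None:
        return False
    if now_ms is None:
        # Assumption: the annotation uses the same clock and epoch as time.time().
        now_ms = int(time.time() * 1000)
    return now_ms - int(enqueue_time) > ENQUEUE_TIMEOUT_MS
```

Passing `now_ms` explicitly keeps the check deterministic, which is convenient when testing against recorded annotation values.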

**Table 3** huawei.com/schedule_policy configuration description
Configuration	Description
chip4-node8	One node has eight processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).
chip1-node2	One node has two processors. For example, one Atlas 300T training card can be equipped with only one processor, and one node can be equipped with a maximum of two Atlas 300T training cards.
chip4-node4	One node has four processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010).
chip8-node8	One node has eight processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 800T A2 training server.
chip8-node16	One node has 16 processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 200T A2 Box16 heterogeneous subrack.
chip2-node16	One node has 16 processors, and two processors form one interconnection ring, for example, the processor layout of the Atlas 800T A3 SuperPoD Server.
chip2-node16-sp	One node has 16 processors, two processors form one interconnection ring, and multiple servers form a SuperPoD, for example, the processor layout of the Atlas 900 A3 SuperPoD.
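
As a quick reference, the layouts in Table 3 can be encoded as a lookup table. The sketch below is illustrative: `Layout`, `POLICY_LAYOUTS`, and `layout_for` are hypothetical names, and the values are copied from the descriptions above rather than derived from any Volcano API. The ring size for chip1-node2 is left as `None` because the table does not state one.

```python
from typing import NamedTuple, Optional


class Layout(NamedTuple):
    processors_per_node: int
    ring_size: Optional[int]  # None where Table 3 does not state a ring size
    superpod: bool            # True when multiple servers form a SuperPoD


# Values transcribed from the Table 3 descriptions.
POLICY_LAYOUTS = {
    "chip4-node8": Layout(8, 4, False),
    "chip1-node2": Layout(2, None, False),
    "chip4-node4": Layout(4, 4, False),
    "chip8-node8": Layout(8, 8, False),
    "chip8-node16": Layout(16, 8, False),
    "chip2-node16": Layout(16, 2, False),
    "chip2-node16-sp": Layout(16, 2, True),
}


def layout_for(policy):
    """Look up the layout for a huawei.com/schedule_policy value."""
    try:
        return POLICY_LAYOUTS[policy]
    except KeyError:
        raise ValueError(f"unknown schedule policy: {policy!r}") from None
```

For example, `layout_for("chip8-node16").ring_size` evaluates to `8`, and an unlisted policy name raises `ValueError`.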
