PodGroup

| Parameter | Description | Value | Required Component |
|---|---|---|---|
| ring-controller.atlas | Pod that identifies "Atlas" | | Ascend Device Plugin, Ascend Operator, Volcano |
| fault-scheduling | Whether to enable fault rescheduling | grace, force, and off | Volcano, Resilience Controller |
| elastic-scheduling | Whether to enable job elastic scheduling | on | Resilience Controller, Volcano |
| fault-retry-times | Number of times that a job can be rescheduled due to a service plane fault | 0–100 | Volcano, Ascend Operator |
| tor-affinity | Switch affinity policy | | Volcano |
| npu-310-strategy | Scheduling policy of the inference server (with Atlas 300I inference cards) | | Volcano |
| pod-rescheduling | Whether to enable pod-level rescheduling | | Volcano |
| process-recover-enable | Whether to enable process-level rescheduling | | Volcano |
| subHealthyStrategy | Subhealth node handling policy | | Volcano |
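
The value constraints listed for `fault-scheduling`, `elastic-scheduling`, and `fault-retry-times` can be checked before a job is submitted. The following is a minimal sketch, not part of Volcano or any Ascend component: the function name `validate_parameters` and the error messages are hypothetical, and it assumes that the 0–100 range is inclusive and that values arrive as strings.

```python
# Allowed values copied from the parameter table; everything else here
# (function name, messages) is illustrative and not an official API.
FAULT_SCHEDULING_VALUES = {"grace", "force", "off"}
FAULT_RETRY_MIN, FAULT_RETRY_MAX = 0, 100  # assumed to be inclusive


def validate_parameters(params: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []

    mode = params.get("fault-scheduling")
    if mode is not None and mode not in FAULT_SCHEDULING_VALUES:
        problems.append(f"fault-scheduling: unexpected value {mode!r}")

    elastic = params.get("elastic-scheduling")
    if elastic is not None and elastic != "on":
        problems.append(f"elastic-scheduling: unexpected value {elastic!r}")

    retries = params.get("fault-retry-times")
    if retries is not None:
        try:
            count = int(retries)
        except ValueError:
            problems.append(f"fault-retry-times: {retries!r} is not an integer")
        else:
            if not FAULT_RETRY_MIN <= count <= FAULT_RETRY_MAX:
                problems.append(f"fault-retry-times: {count} is outside 0-100")

    return problems
```

Parameters whose Value column is empty in the table are deliberately not checked, because the table gives no constraint for them.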

| Resource | Description | Value | Required Component |
|---|---|---|---|
| sp-block | Number of processors on logical SuperPoDs | Integer | Volcano, Ascend Operator |
| huawei.com/schedule_policy | Scheduling policy. | See Table 3 for its configurations. | Volcano |
| sp-fit | SuperPoD scheduling policy | idlest: Scheduling to a more idle physical SuperPoD. | Volcano |
| huawei.com/schedule_minAvailable | Minimum number of replicas that can be scheduled by a job. | Integer | Volcano |
| huawei.com/recover_policy_path | Rescheduling policy | pod: Only pod-level rescheduling is supported. Rescheduling at the job level is not supported. | Volcano |
| huawei.com/schedule_enable_dequeue | Whether to dequeue a job (changing its status from Inqueue to Pending). This parameter needs to be manually configured. | If this parameter is not set, the function is disabled by default. | Volcano |
| huawei.com/schedule_dequeue_frequency | Number of times that a job is dequeued. The value is automatically updated by Volcano. | The value increases by 1 each time a job is dequeued. NOTE: Delete the value if the job is not in the Inqueue or Pending status. | Volcano |
| huawei.com/schedule_enqueue_time | Time when a job is enqueued (changing its status from Pending to Inqueue). The value is automatically updated by Volcano. | Millisecond-level timestamp. NOTE: | Volcano |
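
Because `huawei.com/schedule_enqueue_time` is a millisecond-level timestamp, it usually needs converting before it can be compared with other times. The sketch below assumes the value counts milliseconds since the Unix epoch and is stored as a decimal string. The table states neither, so treat both as assumptions. The function name is hypothetical.

```python
from datetime import datetime, timezone


def enqueue_time_to_datetime(value: str) -> datetime:
    """Convert a millisecond timestamp string to an aware UTC datetime.

    Assumption: the value is milliseconds since the Unix epoch.
    """
    millis = int(value)
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
```

For example, `enqueue_time_to_datetime("1700000000000")` yields 2023-11-14 22:13:20 UTC under that assumption.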

| Configuration | Description |
|---|---|
| chip4-node8 | One node has eight processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010). |
| chip1-node2 | One node has two processors. For example, one Atlas 300T training card carries only one processor, and one node can hold a maximum of two Atlas 300T training cards. |
| chip4-node4 | One node has four processors, and four processors form an interconnection ring, for example, the processor layout of the Atlas 800 training server (model 9000) or Atlas 800 training server (model 9010). |
| chip8-node8 | One node has eight processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 800T A2 training server. |
| chip8-node16 | One node has 16 processors, and eight processors form one interconnection ring, for example, the processor layout of the Atlas 200T A2 Box16 heterogeneous subrack. |
| chip2-node16 | One node has 16 processors, and two processors form one interconnection ring, for example, the processor layout of the Atlas 800T A3 SuperPoD Server. |
| chip2-node16-sp | One node has 16 processors, two processors form one interconnection ring, and multiple servers form a SuperPoD, for example, the processor layout of the Atlas 900 A3 SuperPoD. |
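
The configuration names follow a visible pattern: `chip<N>` matches the number of processors per interconnection ring, `node<M>` matches the number of processors per node, and the `-sp` suffix marks a SuperPoD layout. The sketch below decodes a name on that reading. The pattern is inferred from the rows above rather than stated by the documentation, and `Layout` and `parse_schedule_policy` are hypothetical names.

```python
import re
from typing import NamedTuple


class Layout(NamedTuple):
    ring_size: int        # processors per interconnection ring (inferred)
    chips_per_node: int   # processors per node
    superpod: bool        # True when the name ends in "-sp"


# Matches names such as "chip4-node8" or "chip2-node16-sp".
_POLICY_PATTERN = re.compile(r"chip(\d+)-node(\d+)(-sp)?")


def parse_schedule_policy(name: str) -> Layout:
    """Decode a configuration name from the table into its numeric parts."""
    match = _POLICY_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized schedule policy: {name!r}")
    return Layout(
        ring_size=int(match.group(1)),
        chips_per_node=int(match.group(2)),
        superpod=match.group(3) is not None,
    )
```

Note that for `chip1-node2` the table describes one processor per card rather than a ring, so `ring_size` should be read loosely in that case.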