Affinity Scheduling of Logical SuperPoDs
Instructions
- The number of logical SuperPoDs must be less than the number of physical SuperPoDs.
- Nodes in a logical SuperPoD must be in the corresponding physical SuperPoD.
- The rank IDs of NPUs in a logical SuperPoD are consecutive.
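The last two constraints above can be checked mechanically. The following is a minimal sketch, assuming node names are plain strings and rank IDs are integers; the function names are hypothetical and are not part of Volcano or any scheduler API.

```python
def ranks_are_consecutive(rank_ids):
    """Return True if the rank IDs form one unbroken run (order-insensitive)."""
    ordered = sorted(rank_ids)
    # Every neighbouring pair must differ by exactly 1; duplicates fail the check.
    return all(b - a == 1 for a, b in zip(ordered, ordered[1:]))


def logical_fits_physical(logical_nodes, physical_nodes):
    """Return True if every node of the logical SuperPoD is in the physical one."""
    return set(logical_nodes) <= set(physical_nodes)
```

For example, rank IDs 8, 9, 10, 11 pass the check, while 0, 1, 3 do not.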
Common Job Scheduling
- Logical SuperPoD scheduling first tries to keep reserved nodes available in a physical SuperPoD, and then prefers the physical SuperPoD with fewer remaining nodes.
- Set the sp-block field in the job YAML file to specify the number of processors in a logical SuperPoD. In single-node mode, the value must be the same as the number of processors requested by the job. In distributed mode, the value must be an integer multiple of the number of processors on the node, and the total number of processors for the job must be an integer multiple of the number of processors on the node. If this field is not specified, Volcano sets the logical SuperPoD size of the job to the total number of NPUs configured for the job during scheduling.
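The two rules above can be sketched as follows. This is an illustration only, not Volcano's implementation: the function names, the explicit distributed flag, and the reading of "fewer remaining nodes" as "fewest currently free nodes" are all assumptions.

```python
def resolve_sp_block(total_npus, npus_per_node, distributed, sp_block=None):
    """Validate sp-block against the rules above and return the effective size."""
    if sp_block is None:
        # Field not specified: the size defaults to the job's total NPU count.
        return total_npus
    if not distributed:
        if sp_block != total_npus:
            raise ValueError("single-node: sp-block must equal the requested processors")
        return sp_block
    if sp_block % npus_per_node != 0:
        raise ValueError("distributed: sp-block must be a multiple of processors per node")
    if total_npus % npus_per_node != 0:
        raise ValueError("distributed: total processors must be a multiple of processors per node")
    return sp_block


def pick_physical_superpod(free_nodes, nodes_needed, reserved_nodes):
    """Pick a physical SuperPoD name from {name: free node count}, or None."""
    fitting = {n: f for n, f in free_nodes.items() if f >= nodes_needed}
    if not fitting:
        return None
    # Prefer SuperPoDs that still hold the reserve after placement (assumed soft rule).
    keeps_reserve = {n: f for n, f in fitting.items() if f - nodes_needed >= reserved_nodes}
    pool = keeps_reserve or fitting
    # Among those, prefer the one with the fewest free nodes; break ties by name.
    return min(pool, key=lambda name: (pool[name], name))
```

With free node counts of 10, 5, and 4, a request for 4 nodes and a reserve of 1 selects the SuperPoD with 5 free nodes, because the one with 4 would be left without a reserved node.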
Rescheduling Upon Faults
- If no node in a logical SuperPoD is faulty, nodes in the logical SuperPoD are used during rescheduling.
- If some nodes in a logical SuperPoD are faulty and unavailable, nodes are selected from the corresponding physical SuperPoD, and other nodes remain unchanged.
- If the remaining nodes in a physical SuperPoD cannot meet the requirements of the logical SuperPoD, all jobs on the logical SuperPoD are scheduled to other physical SuperPoDs.
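The three rescheduling cases above can be summarized as a small decision function. This is a minimal sketch with hypothetical names; it assumes that spare_nodes lists the other nodes of the same physical SuperPoD and that a faulty node is replaced one-for-one.

```python
def plan_reschedule(logical_nodes, faulty_nodes, spare_nodes):
    """Return (action, nodes) for a logical SuperPoD after a fault.

    action is "reuse", "replace", or "migrate" (hypothetical labels).
    """
    faulty = [n for n in logical_nodes if n in faulty_nodes]
    if not faulty:
        # Case 1: no faulty node, so the same nodes are used again.
        return ("reuse", list(logical_nodes))
    spares = [n for n in spare_nodes
              if n not in faulty_nodes and n not in logical_nodes]
    if len(spares) < len(faulty):
        # Case 3: the physical SuperPoD cannot cover the loss, so move elsewhere.
        return ("migrate", [])
    # Case 2: swap only the faulty nodes; healthy nodes keep their positions.
    replacements = iter(spares)
    new_nodes = [next(replacements) if n in faulty_nodes else n
                 for n in logical_nodes]
    return ("replace", new_nodes)
```

For instance, if node b of a logical SuperPoD [a, b] fails and node c is free in the same physical SuperPoD, the result is ("replace", [a, c]).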
MindIE Service Inference Job Scheduling
MindIE Service inference jobs must comply with the following affinity scheduling policies. For more details, see Configuring Instance-Level Affinity Scheduling.
- Specify the sp-block field in the job YAML file. The value of sp-block must be the same as the number of processors required by the job to ensure that the entire job can be scheduled to a physical SuperPoD.
- Ensure that there are reserved nodes in a physical SuperPoD for logical SuperPoD scheduling.
- Nodes in the same physical SuperPoD communicate with each other using the internal HCCS network.
- If sp-fit is set to idlest, the job is scheduled to a more idle physical SuperPoD.
- If podAffinity is set, the job is scheduled to a physical SuperPoD with more affinity pods.
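The effect of sp-fit and podAffinity on SuperPoD selection can be sketched as a ranking function. This is a hedged illustration, not the scheduler's algorithm: the function name, the candidate dictionary keys, the rule that affinity outranks idleness, and the fallback to the fewest-free-nodes preference when sp-fit is not idlest are all assumptions.

```python
def choose_superpod(candidates, sp_fit=None, use_pod_affinity=False):
    """Return the name of the preferred physical SuperPoD.

    candidates: list of dicts with "name", "free_nodes", "affinity_pods"
    (hypothetical keys chosen for this sketch).
    """
    def score(c):
        # With podAffinity set, more affinity pods ranks higher.
        affinity = c["affinity_pods"] if use_pod_affinity else 0
        # With sp-fit=idlest, more free nodes ranks higher; otherwise fewer does.
        idle = c["free_nodes"] if sp_fit == "idlest" else -c["free_nodes"]
        return (affinity, idle)

    return max(candidates, key=score)["name"]
```

Given one SuperPoD with 10 free nodes and no affinity pods and another with 4 free nodes and 2 affinity pods, sp_fit="idlest" selects the first, while enabling podAffinity selects the second.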