Node Pre-selection
Description
This step checks whether a node meets a job's requirements by comparing the number of Ascend AI processors the job requires with the number of Ascend AI processors available on the node. For Atlas training products, a job can request only 1, 2, or 4 processors, and all of them must be selected from a single HCCL ring.
For example, if a job requires four Ascend AI processors and a node has four available Ascend AI processors, two in each of two HCCL rings, the node is not selected for the job, because no single ring can supply all four processors.
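The single-ring rule can be illustrated with a minimal Go sketch. The function fitsInOneRing and its per-ring slice are hypothetical names introduced only for this illustration; they do not appear in the open source code, and the real check may represent rings differently.

```go
package main

import "fmt"

// fitsInOneRing reports whether a request for `need` processors can be
// served from a single HCCL ring. freePerRing[i] is the number of idle
// processors in ring i. Hypothetical helper for illustration only.
func fitsInOneRing(need int, freePerRing []int) bool {
	// Only 1, 2, or 4 processors may be requested.
	if need != 1 && need != 2 && need != 4 {
		return false
	}
	// The request fits if any one ring alone has enough idle processors.
	for _, free := range freePerRing {
		if free >= need {
			return true
		}
	}
	return false
}

func main() {
	fmt.Println(fitsInOneRing(4, []int{2, 2})) // false: four idle, but split across two rings
	fmt.Println(fitsInOneRing(4, []int{4, 0})) // true: one ring holds all four
	fmt.Println(fitsInOneRing(2, []int{2, 2})) // true: either ring can serve two
}
```

The first call reproduces the example above: the node has enough idle processors in total, but it is still rejected because the check is made per ring rather than per node.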
Implementation Details
For the code implementation, see CheckNodeNPUByTask in the open source code. It relies on three functions: GetTaskReqNPUNum obtains the number of Ascend AI processors requested by a training job, GetUsableTopFromNode obtains the available NPU resources of a node, and JudgeNodeAndTaskNPU checks whether the NPU resources on the node meet the job requirements.
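The three steps described above can be sketched as follows. This is not the open source implementation: the lowercase functions are hypothetical stand-ins that only mirror the roles of the real ones, and their signatures are invented for the example. The sketch also assumes that idle processors are given as comma-separated numeric IDs and that every four consecutive IDs form one HCCL ring.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ringSize is an assumption of this sketch: IDs 0-3 form one HCCL ring, 4-7 the next.
const ringSize = 4

// taskReqNPUNum parses the processor count requested by the job (step 1).
func taskReqNPUNum(request string) (int, error) {
	n, err := strconv.Atoi(strings.TrimSpace(request))
	if err != nil {
		return 0, fmt.Errorf("invalid NPU request %q: %w", request, err)
	}
	return n, nil
}

// usableTopFromNode parses the idle processor IDs of a node, e.g. "0,1,4,5" (step 2).
// Empty fields and duplicate IDs are skipped.
func usableTopFromNode(idle string) ([]int, error) {
	var ids []int
	seen := map[int]bool{}
	for _, field := range strings.Split(idle, ",") {
		field = strings.TrimSpace(field)
		if field == "" {
			continue
		}
		id, err := strconv.Atoi(field)
		if err != nil || id < 0 {
			return nil, fmt.Errorf("invalid processor ID %q", field)
		}
		if !seen[id] {
			seen[id] = true
			ids = append(ids, id)
		}
	}
	return ids, nil
}

// judgeNodeAndTaskNPU checks the request against the idle processors (step 3).
func judgeNodeAndTaskNPU(need int, ids []int) error {
	if need != 1 && need != 2 && need != 4 {
		return fmt.Errorf("unsupported request for %d processors", need)
	}
	perRing := map[int]int{} // ring index -> idle processor count
	for _, id := range ids {
		perRing[id/ringSize]++
	}
	for _, free := range perRing {
		if free >= need {
			return nil
		}
	}
	return fmt.Errorf("no single HCCL ring has %d idle processors", need)
}

// checkNodeNPUByTask chains the three steps for one job and one node.
func checkNodeNPUByTask(request, idle string) error {
	need, err := taskReqNPUNum(request)
	if err != nil {
		return err
	}
	ids, err := usableTopFromNode(idle)
	if err != nil {
		return err
	}
	return judgeNodeAndTaskNPU(need, ids)
}

func main() {
	fmt.Println(checkNodeNPUByTask("4", "0,1,4,5")) // non-nil error: two idle in each ring
	fmt.Println(checkNodeNPUByTask("4", "0,1,2,3")) // <nil>: one ring has all four idle
}
```

Returning an error instead of a boolean lets the caller report why a node was filtered out, which is a design choice of this sketch rather than a statement about the real functions.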