O&M
Prerequisites
This procedure applies only when the rescheduling function is in use and the autoStowing parameter is set to false in the YAML startup file of Ascend Device Plugin.
Procedure
- Run the following command to return processors whose health status has recovered from unhealthy to healthy to the resource pool:
kubectl label nodes node_name huawei.com/Ascend910-Recover-
After the command is executed, the huawei.com/Ascend910-Recover label is deleted, and the processors associated with it are placed back in the resource pool for scheduling.
This command is only used to clear the Recover label information. Do not use it to add labels.
- Run the following command to return processors whose parameter plane network health status has recovered from unhealthy to healthy to the resource pool:
kubectl label nodes node_name huawei.com/Ascend910-NetworkRecover-
After the command is executed, the huawei.com/Ascend910-NetworkRecover label is deleted, and the corresponding processors are also removed from huawei.com/Ascend910-NetworkUnhealthy.
This command is only used to clear the NetworkRecover label information. Do not use it to add labels.
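The two commands above differ only in the label name, so they can be wrapped in one small helper. The following is a minimal sketch, not part of Ascend Device Plugin: the function name clear_recover_label, the KUBECTL override variable, and the node name worker-1 are hypothetical and used for illustration only. The helper can only remove the two labels; it cannot add one.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Ascend Device Plugin).
# KUBECTL may be overridden, for example KUBECTL="echo kubectl",
# to preview the command instead of running it.
KUBECTL="${KUBECTL:-kubectl}"

clear_recover_label() {
    node="$1"   # node name as shown by "kubectl get nodes"
    kind="$2"   # "chip" for processor health, "network" for parameter plane network health

    case "$kind" in
        chip)    label="huawei.com/Ascend910-Recover" ;;
        network) label="huawei.com/Ascend910-NetworkRecover" ;;
        *)
            echo "usage: clear_recover_label NODE chip|network" >&2
            return 2
            ;;
    esac

    if [ -z "$node" ]; then
        echo "usage: clear_recover_label NODE chip|network" >&2
        return 2
    fi

    # The trailing "-" tells kubectl to delete the label rather than set it.
    $KUBECTL label nodes "$node" "${label}-"
}
```

For example, running KUBECTL="echo kubectl" clear_recover_label worker-1 network prints the command that would be executed without changing the node. To confirm the result on a real cluster, the node labels can be listed before and after with kubectl get node worker-1 --show-labels.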