Cluster Scheduling Component Configuration
To use the resumable training feature, first complete the Ascend Device Plugin Configuration, NodeD Configuration, and Volcano Configuration steps described below.
Ascend Device Plugin Configuration
Select a modification mode based on the startup mode of the Ascend Device Plugin. When the rescheduling policy is enabled, an exception in the Ascend Device Plugin itself also triggers fault rescheduling.
The Ascend Device Plugin component can be started in either of the following modes:
- Binary mode. For details about the procedure, see Start in Binary Mode.
- Containerized mode. For details about the procedure, see Start in Containerized Mode.
Start in Binary Mode
- Open the device-plugin.service configuration file of the Ascend Device Plugin service.
# The service configuration file is stored in this path by default.
vim /etc/systemd/system/device-plugin.service
Set -volcanoType and -autoStowing to true in the ExecStart line:
...
[Service]
ExecStart=/bin/bash -c "/usr/local/bin/device-plugin -volcanoType=true -autoStowing=true ..."
...
-volcanoType=true: Volcano must be used in the rescheduling scenario.
-autoStowing=true: enables automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, when the processor health status changes from unhealthy to healthy, or a network fault on the processor parameter plane is recovered, the processor is not automatically added back to the schedulable resource pool. This parameter applies only to Ascend 910 AI Processors.
- Restart the Ascend Device Plugin service.
systemctl daemon-reload
systemctl restart device-plugin.service
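Before restarting the service, it can be useful to confirm that both flags actually made it into the unit file. The following is a minimal sketch, assuming the default path given above; check_device_plugin_flags is a hypothetical helper name and is not part of the Ascend Device Plugin.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with the Ascend Device Plugin):
# checks that the ExecStart line of a unit file contains the two
# flags required for rescheduling.
check_device_plugin_flags() {
    local unit_file="${1:-/etc/systemd/system/device-plugin.service}"
    local exec_line flag status=0

    # Collect the ExecStart line(s); fail if the file has none.
    exec_line="$(grep '^ExecStart=' "$unit_file")" || {
        echo "no ExecStart line found in $unit_file" >&2
        return 2
    }

    for flag in "-volcanoType=true" "-autoStowing=true"; do
        case "$exec_line" in
            *"$flag"*) ;;                                   # flag present
            *) echo "missing $flag" >&2; status=1 ;;        # flag absent
        esac
    done
    return "$status"
}

# Example:
# check_device_plugin_flags && systemctl daemon-reload && systemctl restart device-plugin.service
```

The function only reads the file, so it is safe to run repeatedly. It returns 0 when both flags are present, 1 when a flag is missing, and 2 when no ExecStart line is found.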
Start in Containerized Mode
- Modify the startup YAML file of the Ascend Device Plugin component as follows:
...
containers:
- image: ascend-k8sdeviceplugin:v3.0.0
  name: device-plugin-01
  resources:
    requests:
      memory: 500Mi
      cpu: 500m
    limits:
      memory: 500Mi
      cpu: 500m
  command: [ "/bin/bash", "-c", "--"]
  # -volcanoType=true: Volcano must be used in the rescheduling scenario.
  # -autoStowing=true: Indicates whether to enable automatic management. The default value is true.
  #   If this parameter is set to false, automatic management is disabled. In this case, when the
  #   processor health status changes from unhealthy to healthy, or the network fault on the
  #   processor parameter plane is recovered, the processor is not automatically added to the
  #   schedulable resource pool. It applies only to the Ascend 910 AI Processors.
  # -listWatchPeriod=5: Health check period. The value range is [3, 60]. The default value is 5 seconds.
  args: [ "device-plugin -useAscendDocker=true -volcanoType=true
           -autoStowing=true -listWatchPeriod=5
           -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ]
  securityContext:
    privileged: true
    readOnlyRootFilesystem: true
...
- Run the following command on the Kubernetes master node to start Ascend Device Plugin:
kubectl apply -f device-plugin-xxx-*.yaml
NodeD Configuration
The NodeD component configuration includes label configuration, NodeD monitoring configuration, and (optionally) heartbeat sending interval configuration. For details, see the following sections.
Label Configuration
NodeD needs to be installed on all worker nodes. Therefore, before installing NodeD, run the following command to label all worker nodes with workerselector=dls-worker-node:
kubectl label node nodeName workerselector=dls-worker-node --overwrite
In the preceding command, nodeName indicates the name of a node in the Kubernetes cluster.
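Because every worker node needs the label, the command is usually repeated once per node. The following is a minimal sketch of such a loop; label_worker_nodes is a hypothetical helper name, and the node names in the example are placeholders.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies the workerselector label to each node
# name passed as an argument and stops at the first failure.
label_worker_nodes() {
    local node
    for node in "$@"; do
        kubectl label node "$node" workerselector=dls-worker-node --overwrite || return 1
    done
}

# Example (placeholder node names):
# label_worker_nodes worker-1 worker-2
```

The --overwrite flag makes the call idempotent, so running the helper again on an already labeled node does no harm.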
(Optional) Configuring the Interval for Sending Heartbeat Messages
Edit the startup YAML file of NodeD and change the interval for NodeD to send heartbeat messages by setting -heartbeatInterval.
vim noded-*.yaml
By default, Kubernetes sets a node's status to NotReady if it receives no response from the node within 40 seconds. If this Kubernetes configuration has not been modified, use the default NodeD heartbeat interval (5 seconds). If it has been modified, change the NodeD heartbeat interval so that it is less than or equal to one sixth (rounded down) of the value configured in Kubernetes.
Add the -heartbeatInterval parameter to the args line as follows:
...
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
imagePullPolicy: Never
command: [ "/bin/bash", "-c", "--"]
args: [ "noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -heartbeatInterval=5" ]
securityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: false
  capabilities:
    drop: [ "ALL" ]
  runAsUser: 9000
  runAsGroup: 9000
volumeMounts:
  - name: log-noded
...
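The one-sixth rule for the heartbeat interval can be checked with simple integer arithmetic. The following is a minimal sketch, assuming both values are whole numbers of seconds; the function names are hypothetical and are not part of NodeD.

```shell
#!/usr/bin/env bash
# Hypothetical helpers illustrating the one-sixth rule.

# Prints the largest allowed heartbeat interval for a given Kubernetes
# NotReady timeout. Integer division rounds down.
max_heartbeat_interval() {
    local grace_period_s="$1"
    echo $(( grace_period_s / 6 ))
}

# Succeeds if the chosen interval does not exceed the allowed maximum.
heartbeat_interval_ok() {
    local interval_s="$1" grace_period_s="$2"
    [ "$interval_s" -le "$(max_heartbeat_interval "$grace_period_s")" ]
}

# Example:
# max_heartbeat_interval 40     # default Kubernetes timeout of 40 seconds
# heartbeat_interval_ok 5 40    # default NodeD interval against that timeout
```

With the default 40-second timeout the maximum is 6, so the default interval of 5 satisfies the rule.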
NodeD Monitoring Configuration on a Node
The NodeD component for cluster scheduling periodically reports the node status. Whether NodeD also determines if the node is faulty is controlled by the nodeDEnable node label (NodeD must be installed first). If nodeDEnable is set to on, NodeD is allowed to obtain the node information and determine whether the node is faulty. If the label is set to any other value or does not exist, NodeD only reports the node information and does not determine whether the node is faulty.
Run the following command on the master node:
kubectl label nodes nodeName nodeDEnable=on --overwrite
In the preceding command, nodeName indicates the node whose information is to be reported by NodeD.
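After labeling, the current value of the label can be read back to confirm that fault determination is switched on for a node. The following is a minimal sketch; noded_fault_detection_enabled is a hypothetical helper name, and it relies only on the standard kubectl get output option -o jsonpath.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only if the node's nodeDEnable label
# is exactly "on". A missing label yields an empty string and fails.
noded_fault_detection_enabled() {
    local node="$1" value
    value="$(kubectl get node "$node" -o jsonpath='{.metadata.labels.nodeDEnable}')" || return 2
    [ "$value" = "on" ]
}

# Example (placeholder node name):
# noded_fault_detection_enabled worker-1 && echo "fault determination enabled"
```

The return code 2 distinguishes a failed kubectl call from a label that is simply absent or set to another value.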
Volcano Configuration
In volcano-*.yaml, set the time allowed for gracefully deleting the original pod as required. This option must be set to enable the dying gasp function. The configuration takes effect globally and affects training jobs in the current environment. You are advised to configure this parameter during Volcano installation and not to modify it while the system is running. The default value, value range, and an example are as follows:
| Parameter | Default Value | Value Range | Description |
|---|---|---|---|
| grace-over-time | 900, in seconds | [2, 3600] | Interval between the time when pod deletion is triggered and the time when the pod is forcibly deleted. After the interval expires, the original pod is forcibly deleted. |
volcano-*.yaml example:
...
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: volcano-npu-v3.0.0
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
    configurations:
      ...
      - name: init-params
        arguments: {"grace-over-time":"900","presetVirtualDevice":"true"}
...
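Since grace-over-time applies globally, it may be worth validating a candidate value against the documented range of [2, 3600] seconds before editing the ConfigMap. The following is a minimal sketch; grace_over_time_ok is a hypothetical helper name and is not part of Volcano.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds only for a whole number of seconds
# within the documented range [2, 3600].
grace_over_time_ok() {
    local value="$1"
    case "$value" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$value" -ge 2 ] && [ "$value" -le 3600 ]
}

# Example:
# grace_over_time_ok 900 && echo "value is within range"
```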