Cluster Scheduling Component Configuration

To use the resumable training feature, first complete the Ascend Device Plugin configuration, NodeD configuration, and Volcano configuration described below.

Ascend Device Plugin Configuration

Choose the configuration procedure that matches how the Ascend Device Plugin is started. When the rescheduling policy is enabled, an exception in the Ascend Device Plugin itself also triggers rescheduling.

To start the Ascend Device Plugin component, perform the following steps:

Start in Binary Mode

  1. Open the device-plugin.service configuration file of the Ascend Device Plugin service.
    # The service configuration file is stored in this path by default.
    vim /etc/systemd/system/device-plugin.service

    Set volcanoType and autoStowing to true on the ExecStart line, as in the following example:

    ...
    [Service]
    ExecStart=/bin/bash -c "/usr/local/bin/device-plugin -volcanoType=true -autoStowing=true ..."
    ...

    -volcanoType=true: Volcano must be used in the rescheduling scenario.

    -autoStowing=true: indicates whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, when the processor health status changes from unhealthy to healthy, or the network fault on the processor parameter plane is recovered, the processor is not automatically added to the schedulable resource pool. This feature applies only to the Ascend 910 AI Processors.

  2. Restart the Ascend Device Plugin service.
    systemctl daemon-reload 
    systemctl restart device-plugin.service
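
To confirm that the unit file carries both flags (for example, before the restart in step 2), a small check along the following lines may help. This is a minimal sketch, not part of the product: the function name check_device_plugin_flags is hypothetical, and it assumes both flags appear literally on the ExecStart line, as in the example above.

```shell
# Hypothetical helper (not shipped with the Ascend Device Plugin):
# succeeds only if the ExecStart line of the given unit file contains
# both -volcanoType=true and -autoStowing=true.
check_device_plugin_flags() {
    unit_file=$1
    # Keep only the ExecStart line(s); fail if there is none.
    exec_line=$(grep '^ExecStart=' "$unit_file") || return 1
    for flag in -volcanoType=true -autoStowing=true; do
        # -F matches the flag as a fixed string; "--" stops grep from
        # treating the leading "-" of the flag as one of its own options.
        printf '%s\n' "$exec_line" | grep -qF -- "$flag" || {
            echo "missing $flag in $unit_file" >&2
            return 1
        }
    done
}
```

Example use: check_device_plugin_flags /etc/systemd/system/device-plugin.service && echo "flags OK"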

Start in Containerized Mode

  1. Modify the args of the device-plugin container in the startup YAML file of the Ascend Device Plugin component, as in the following example:
    ...
          containers:
          - image: ascend-k8sdeviceplugin:v3.0.0
            name: device-plugin-01
            resources:
              requests:
                memory: 500Mi
                cpu: 500m
              limits:
                memory: 500Mi
                cpu: 500m
            command: [ "/bin/bash", "-c", "--"]
            # -volcanoType=true: Volcano must be used in the rescheduling scenario.
            # -autoStowing=true: enables automatic management (default: true). If set to false, a processor whose health status changes from unhealthy to healthy, or whose parameter-plane network fault is recovered, is not automatically added back to the schedulable resource pool. Applies only to the Ascend 910 AI Processors.
            # -listWatchPeriod=5: health check period in seconds. Value range: [3, 60]. Default: 5.
            args: [ "device-plugin
                     -useAscendDocker=true
                     -volcanoType=true
                     -autoStowing=true
                     -listWatchPeriod=5
                     -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log
                     -logLevel=0" ]
            securityContext:
              privileged: true
              readOnlyRootFilesystem: true
    ...
  2. Run the following command on the Kubernetes master node to start Ascend Device Plugin:
    kubectl apply -f device-plugin-xxx-*.yaml

NodeD Configuration

The NodeD component configuration includes label configuration, heartbeat sending interval configuration (optional), and NodeD monitoring configuration. For details, see the following sections.

Label Configuration

NodeD needs to be installed on all worker nodes. Therefore, before installing NodeD, run the following command to label all worker nodes with workerselector=dls-worker-node:

kubectl label node nodeName workerselector=dls-worker-node --overwrite

In the preceding command, nodeName indicates the name of a node in the Kubernetes cluster.
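
When there are many worker nodes, the labeling command can be generated once per node. The following is a minimal sketch under stated assumptions: the function name print_label_commands is hypothetical, and the node names are supplied by the caller. It only prints the commands so they can be reviewed before anything is run.

```shell
# Hypothetical helper: prints one "kubectl label" command for each node
# name passed as an argument. Nothing is executed by the function itself.
print_label_commands() {
    for node in "$@"; do
        echo "kubectl label node $node workerselector=dls-worker-node --overwrite"
    done
}
```

Example use, with worker-1 and worker-2 as placeholder node names: print_label_commands worker-1 worker-2. After checking the output, pipe it to sh to run the commands.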

(Optional) Configuring the Interval for Sending Heartbeat Messages

Edit the startup YAML file of NodeD and change the interval for NodeD to send heartbeat messages by setting -heartbeatInterval.

vim noded-*.yaml

By default, Kubernetes sets a node's status to NotReady if it receives no response from the node within 40 seconds. If this Kubernetes setting has not been modified, use the default heartbeat interval (5) of NodeD. If the setting has been modified, change the NodeD heartbeat interval so that it is less than or equal to one sixth (rounded down) of the value configured in Kubernetes.

Add the -heartbeatInterval parameter to the args line as follows:

...
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          imagePullPolicy: Never
          command: [ "/bin/bash", "-c", "--"]
          args: [ "noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -heartbeatInterval=5" ]
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: false
            capabilities:
              drop: [ "ALL" ]
            runAsUser: 9000
            runAsGroup: 9000
          volumeMounts:
            - name: log-noded
...
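
The one-sixth rule for the heartbeat interval can be worked out with integer arithmetic. The following is a minimal sketch; the function name max_heartbeat_interval is hypothetical, and its argument is assumed to be the Kubernetes node-response timeout in seconds.

```shell
# Hypothetical helper: largest NodeD heartbeat interval allowed for a
# given Kubernetes node-response timeout, using the one-sixth,
# rounded-down rule described above.
max_heartbeat_interval() {
    k8s_timeout=$1
    # Shell integer division truncates, which rounds down for positive values.
    echo $(( k8s_timeout / 6 ))
}
```

For the default timeout of 40 seconds, max_heartbeat_interval 40 prints 6, so the default -heartbeatInterval=5 is within the limit.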

NodeD Monitoring Configuration on a Node

The NodeD component for cluster scheduling periodically reports the node status. Whether NodeD also determines node faults is controlled by the nodeDEnable node label, which can be set to on or off. (NodeD must be installed first.) If nodeDEnable is set to on, NodeD is allowed to obtain the node information to determine whether the node is faulty. If the label is set to any other value or does not exist, NodeD only reports the node information and does not determine whether the node is faulty.

Run the following command on the master node:

kubectl label nodes nodeName nodeDEnable=on --overwrite

In the preceding command, nodeName indicates the node whose information is to be reported by NodeD.

Volcano Configuration

In volcano-*.yaml, set the grace period for deleting the original pod as required. This option must be set to enable the dying gasp function. The setting takes effect globally and affects training jobs in the current environment. You are advised to configure this parameter when installing Volcano and not to modify it while the system is running. The default value and an example are as follows:

Table 1 Parameter for gracefully deleting the original pod

| Parameter       | Default Value | Value Range | Description |
|-----------------|---------------|-------------|-------------|
| grace-over-time | 900 (seconds) | [2, 3600]   | Interval between the time when a pod deletion is triggered and the time when the pod is forcibly deleted. After the interval expires, the original pod is forcibly deleted. |

volcano-*.yaml example:

...
apiVersion: v1
kind: ConfigMap
metadata:
  name: volcano-scheduler-configmap
  namespace: volcano-system
data:
  volcano-scheduler.conf: |
    actions: "enqueue, allocate, backfill"
    tiers:
    - plugins:
      - name: priority
      - name: gang
      - name: conformance
      - name: volcano-npu-v3.0.0
    - plugins:
      - name: drf
      - name: predicates
      - name: proportion
      - name: nodeorder
      - name: binpack
    configurations:
     ...
      - name: init-params
        arguments: {"grace-over-time":"900","presetVirtualDevice":"true"}
...
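
Because grace-over-time takes effect globally, it may be worth checking a candidate value against the documented range [2, 3600] before editing volcano-*.yaml. The following is a minimal sketch; the function name valid_grace_over_time is hypothetical and not part of Volcano.

```shell
# Hypothetical helper: succeeds if the argument is a whole number of
# seconds within the documented grace-over-time range [2, 3600].
valid_grace_over_time() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;   # empty, or contains a non-digit character
    esac
    [ "$1" -ge 2 ] && [ "$1" -le 3600 ]
}
```

Example use: valid_grace_over_time 900 && echo "900 is within range"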