(Optional) Configuring Components

If elastic training was already configured when Ascend Device Plugin and NodeD were installed, skip this section. Otherwise, configure MindCluster Ascend Device Plugin and MindCluster NodeD as described in this section to enable the feature.

Ascend Device Plugin Configuration

When the rescheduling policy is enabled, a fault in Ascend Device Plugin also triggers rescheduling.

  1. Modify the startup YAML file of Ascend Device Plugin. The parameters to set are the ones annotated with comments in the following example:
    ...
          containers:
          - image: ascend-k8sdeviceplugin:v{version}
            name: device-plugin-01
            resources:
              requests:
                memory: 500Mi
                cpu: 500m
              limits:
                memory: 500Mi
                cpu: 500m
            command: [ "/bin/bash", "-c", "--"]
            # -volcanoType=true: Volcano must be used in the rescheduling scenario.
            # -autoStowing=true: Whether to enable automatic management. The default value is true.
            #   If set to false, automatic management is disabled: after the processor health status
            #   changes from unhealthy to healthy, or the network fault on the processor parameter plane
            #   is recovered, the processor is not automatically added back to the schedulable resource
            #   pool. This parameter applies only to Atlas training product.
            # -listWatchPeriod=5: Health status check period, in seconds. The value range is [3, 1800].
            args: [ "device-plugin
                     -useAscendDocker=true
                     -volcanoType=true
                     -autoStowing=true
                     -listWatchPeriod=5
                     -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log
                     -logLevel=0" ]
            securityContext:
              privileged: true
              readOnlyRootFilesystem: true
    ...
  2. Run the following command on the Kubernetes management node to start Ascend Device Plugin:
    kubectl apply -f device-plugin-xxx-v{version}.yaml
    For example, to start the component on Atlas training product, run the following command:
    kubectl apply -f device-plugin-volcano-v7.3.0.yaml
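
Because -listWatchPeriod only accepts values in the range [3, 1800], it can help to check a candidate value before writing it into the YAML file. The following is a minimal sketch, not part of MindCluster: the function name check_list_watch_period is hypothetical, and the only fact taken from this section is the documented range.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with Ascend Device Plugin).
# Returns 0 when the argument is an integer inside the documented
# -listWatchPeriod range [3, 1800] (seconds), and non-zero otherwise.
check_list_watch_period() {
  local value="$1"
  # Reject empty input and anything that is not a plain non-negative integer.
  case "$value" in
    ''|*[!0-9]*) return 1 ;;
  esac
  # Apply the range stated for -listWatchPeriod.
  [ "$value" -ge 3 ] && [ "$value" -le 1800 ]
}
```

For instance, `check_list_watch_period 5` succeeds, while `check_list_watch_period 2` and `check_list_watch_period abc` both return a non-zero status.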

NodeD Configuration

You can manually modify the startup YAML file of NodeD to configure the node status reporting interval.

  1. Run the following command to edit the startup YAML file of NodeD:
    vi noded-v{version}.yaml
  2. Add the -reportInterval parameter to the args line of the YAML file, as shown in the following example:
    ...
              env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
              imagePullPolicy: Never
              command: [ "/bin/bash", "-c", "--"]
              args: [ "/home/hwMindX/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ]
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: [ "ALL" ]
                runAsUser: 9000
                runAsGroup: 9000
              volumeMounts:
                - name: log-noded
    ...
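
After editing, a quick check that the args line really carries the flag can catch a typo before the file is used. The sketch below is an assumption-laden illustration: the function name has_report_interval is hypothetical, and it assumes the flag is written on the same line as args, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with NodeD).
# Succeeds when the given YAML file has an args line that contains
# -reportInterval=<integer>; fails otherwise.
has_report_interval() {
  local file="$1"
  # "--" stops grep from treating the pattern's leading parts as options.
  grep -Eq -- '^[[:space:]]*args:.*-reportInterval=[0-9]+' "$file"
}
```

It could be called as `has_report_interval noded-v{version}.yaml`, with the file name replaced by the actual NodeD startup file. The check only confirms that the flag is present; it does not validate the value.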