(Optional) Configuring Components

If resumable training was already configured when Ascend Device Plugin and NodeD were installed, skip this section. Otherwise, configure both components as described below to enable it.

Ascend Device Plugin Configuration

Ascend Device Plugin can be started only in containerized mode.

  1. Modify the startup YAML file of Ascend Device Plugin according to the fault handling mode. The parameters to adjust are the ones annotated with comments in the args field of the following examples.
    1. Rescheduling mode

      In rescheduling mode, a fault of Ascend Device Plugin also triggers rescheduling.

      ...
            containers:
            - image: ascend-k8sdeviceplugin:v{version}
              name: device-plugin-01
              resources:
                requests:
                  memory: 500Mi
                  cpu: 500m
                limits:
                  memory: 500Mi
                  cpu: 500m
              command: [ "/bin/bash", "-c", "--"]
              args: [ "device-plugin  
                       -useAscendDocker=true 
                       -volcanoType=true                    # Volcano must be used in the rescheduling scenario.
                       -autoStowing=true                    # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled: after a processor's health status changes from unhealthy to healthy, or a network fault on the processor's parameter plane is recovered, the processor is not automatically added back to the schedulable resource pool. This parameter applies only to Atlas training products.
                       -listWatchPeriod=5                   # Set the health status check period. The value range is [3, 1800], in seconds.
                       -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log 
                       -logLevel=0" ]
              securityContext:
                privileged: true
                readOnlyRootFilesystem: true
      ...
    2. (Optional) Graceful fault tolerance mode: Add the -hotReset parameter to the rescheduling configuration.
      • The graceful fault tolerance function has been removed and is not supported in PyTorch versions later than 7.2.RC1 or MindSpore versions later than 7.1.RC1.
      • The function corresponding to -hotReset=1 has been removed.
      ...
            containers:
            - image: ascend-k8sdeviceplugin:v{version}
              name: device-plugin-01
              resources:
                requests:
                  memory: 500Mi
                  cpu: 500m
                limits:
                  memory: 500Mi
                  cpu: 500m
              command: [ "/bin/bash", "-c", "--"]
              args: [ "device-plugin  
                       -useAscendDocker=true 
                       -volcanoType=true                    # Volcano must be used in the rescheduling scenario.
                       -autoStowing=true                    # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled: after a processor's health status changes from unhealthy to healthy, or a network fault on the processor's parameter plane is recovered, the processor is not automatically added back to the schedulable resource pool. This parameter applies only to Atlas training products.
                       -hotReset=1                          # Enable graceful fault tolerance. The system will automatically reset the faulty processor.
                       -listWatchPeriod=5                   # Set the health status check period. The value range is [3, 1800], in seconds.
                       -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log 
                       -logLevel=0" ]
              securityContext:
                privileged: true
                readOnlyRootFilesystem: true
      ...
  2. Run the following command on the Kubernetes management node to start Ascend Device Plugin:
    kubectl apply -f device-plugin-xxx-v{version}.yaml
    For example, to start the component in an environment that uses Atlas training products, run the following command:
    kubectl apply -f device-plugin-volcano-v{version}.yaml
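
The range documented for -listWatchPeriod can be checked before the file is applied. The following is a minimal shell sketch, not part of Ascend Device Plugin: extract_flag and check_list_watch_period are hypothetical helper names, and the sketch assumes that each flag appears in the YAML file as -name=value, as in the examples above.

```shell
# Hypothetical helpers (not shipped with Ascend Device Plugin).

# Print the value of the first "-<flag>=<value>" occurrence in a file.
# Usage: extract_flag FILE FLAG
extract_flag() {
  grep -o -- "-$2=[^ \"]*" "$1" | head -n 1 | cut -d= -f2-
}

# Succeed only if the argument is an integer in [3, 1800],
# the range documented for -listWatchPeriod.
check_list_watch_period() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;  # empty or non-numeric value
  esac
  [ "$1" -ge 3 ] && [ "$1" -le 1800 ]
}
```

For example, the following command prints a warning if the value in the startup file is missing or outside the documented range:
    check_list_watch_period "$(extract_flag device-plugin-volcano-v{version}.yaml listWatchPeriod)" || echo "listWatchPeriod is missing or out of range"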

NodeD Configuration

To set the interval at which NodeD reports node status, manually modify the startup YAML file of NodeD as follows.

  1. Go to the directory where the component package was decompressed and run the following command to open the startup YAML file of NodeD:
    vi noded-v{version}.yaml
  2. Add the -reportInterval parameter to the args line in the YAML file, for example:
    ...
              env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
              imagePullPolicy: Never
              command: [ "/bin/bash", "-c", "--"]
              args: [ "/usr/local/bin/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ]
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: true
              volumeMounts:
                - name: log-noded
    ...
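
The edit in step 2 can also be scripted. The following shell sketch is one possible approach, not an official tool: add_report_interval is a hypothetical helper name, the interval is assumed to be a non-negative integer, and the NodeD args value is assumed to sit on a single line ending in `" ]`, as in the example above. The helper leaves the file unchanged if -reportInterval is already present, and sed keeps a backup copy with the .bak suffix.

```shell
# Hypothetical helper (not shipped with NodeD).
# Usage: add_report_interval FILE INTERVAL
add_report_interval() {
  case "$2" in
    ''|*[!0-9]*)
      echo "interval must be a non-negative integer" >&2
      return 1
      ;;
  esac
  # Do nothing if the flag is already configured.
  if grep -q -- '-reportInterval=' "$1"; then
    return 0
  fi
  # Insert the flag before the closing quote of the noded args line;
  # the original file is kept as FILE.bak.
  sed -i.bak "/\/usr\/local\/bin\/noded /s/\" *]/ -reportInterval=$2\" ]/" "$1"
}
```

For example:
    add_report_interval noded-v{version}.yaml 5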