(Optional) Configuring Components

If resumable training was already configured when Ascend Device Plugin and NodeD were installed, skip this section. Otherwise, configure both components as described below to enable the function.

Ascend Device Plugin Configuration

Ascend Device Plugin can be started only in containerized mode.

Modify the startup YAML file of Ascend Device Plugin according to the fault handling mode in use. The parameters to adjust are in the args field of the following examples.

Rescheduling mode

In rescheduling mode, a fault of Ascend Device Plugin also triggers rescheduling.

...
      containers:
      - image: ascend-k8sdeviceplugin:v{version}
        name: device-plugin-01
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin  
                 -useAscendDocker=true 
                 -volcanoType=true                    # Volcano must be used in the rescheduling scenario.
                 -autoStowing=true                    # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, after the processor health status changes from unhealthy to healthy, or the network fault on the processor parameter plane is recovered, the processor will not be automatically added to the schedulable resource pool. This parameter applies only to Atlas training product.
                 -listWatchPeriod=5                   # Set the health status check period. The value range is [3, 1800], in seconds.
                 -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log 
                 -logLevel=0" ]
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
...
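
The comment on -listWatchPeriod gives a valid range of [3, 1800] seconds. As a minimal sketch, a value could be checked against that range before the YAML file is edited. The helper name check_list_watch_period is hypothetical and is not part of Ascend Device Plugin; only the range itself comes from the example above.

```shell
# Hypothetical helper (not part of Ascend Device Plugin).
# Checks that a candidate -listWatchPeriod value is an integer in [3, 1800].
check_list_watch_period() {
    value="$1"
    # Reject empty input and anything containing a non-digit character.
    case "$value" in
        ''|*[!0-9]*)
            echo "invalid: not a non-negative integer"
            return 1
            ;;
    esac
    # Values longer than four digits are rejected outright, so very long
    # digit strings never reach the numeric comparison.
    if [ "${#value}" -le 4 ] && [ "$value" -ge 3 ] && [ "$value" -le 1800 ]; then
        echo "ok"
    else
        echo "invalid: outside [3, 1800]"
        return 1
    fi
}

check_list_watch_period 5    # value used in the example above; prints "ok"
```

The function only validates the number. It does not read or modify the startup YAML file.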

(Optional) Graceful fault tolerance mode: Add the -hotReset field to the rescheduling configuration.

The graceful fault tolerance function has been removed. It is not supported in PyTorch versions later than 7.2.RC1 or in MindSpore versions later than 7.1.RC1.
The function corresponding to -hotReset=1 has been removed.

...
      containers:
      - image: ascend-k8sdeviceplugin:v{version}
        name: device-plugin-01
        resources:
          requests:
            memory: 500Mi
            cpu: 500m
          limits:
            memory: 500Mi
            cpu: 500m
        command: [ "/bin/bash", "-c", "--"]
        args: [ "device-plugin  
                 -useAscendDocker=true 
                 -volcanoType=true                    # Volcano must be used in the rescheduling scenario.
                 -autoStowing=true                    # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, after the processor health status changes from unhealthy to healthy, or the network fault on the processor parameter plane is recovered, the processor will not be automatically added to the schedulable resource pool. This parameter applies only to Atlas training product.
                 -hotReset=1 # Enable graceful fault tolerance. The system will automatically reset the faulty processor.
                 -listWatchPeriod=5                   # Set the health status check period. The value range is [3, 1800], in seconds.
                 -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log 
                 -logLevel=0" ]
        securityContext:
          privileged: true
          readOnlyRootFilesystem: true
...

Run the following command on the Kubernetes management node to start Ascend Device Plugin:
```
kubectl apply -f device-plugin-xxx-v{version}.yaml
```
For example, to start the component in an environment using Atlas training product, run the following command:
```
kubectl apply -f device-plugin-volcano-v{version}.yaml
```

NodeD Configuration

To configure the interval at which NodeD reports node status, manually modify the startup YAML file of NodeD.

Go to the directory where the component package was decompressed and run the following command to open the startup YAML file of NodeD:
```
vi noded-v{version}.yaml
```

Add the -reportInterval parameter to the args line in the YAML file, for example:

...
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          imagePullPolicy: Never
          command: [ "/bin/bash", "-c", "--"]
          args: [ "/usr/local/bin/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ]
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: true
          volumeMounts:
            - name: log-noded
...

By default, Kubernetes sets the node status to NotReady if it receives no response from the node within 40 seconds.
If the load on the Kubernetes API server is high, increase the reporting interval as needed to reduce that load.
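
As a rough illustration of this trade-off, the following sketch counts how many status reports fit into the default 40-second window for a given interval. It rests on two assumptions that the text above does not confirm: that -reportInterval is expressed in seconds, and that the 40-second window is a useful yardstick for these reports. The helper name reports_per_window is hypothetical and is not part of NodeD.

```shell
# Hypothetical helper (not part of NodeD).
# Assumption: -reportInterval is expressed in seconds.
# Prints how many whole reports fit into the default 40-second window.
reports_per_window() {
    interval="$1"
    window=40
    # Reject empty input, non-digit characters, and values starting with 0
    # (which also excludes an interval of 0 and avoids division by zero).
    case "$interval" in
        ''|*[!0-9]*|0*)
            echo "invalid interval" >&2
            return 1
            ;;
    esac
    echo $(( window / interval ))
}

reports_per_window 5     # interval used in the example above; prints 8
reports_per_window 20    # prints 2
```

Under these assumptions, a larger interval means fewer requests to the API server but also fewer reports inside the window, so the interval is best increased gradually.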
