(Optional) Configuring Components
If the resumable training function has been configured during the installation of Ascend Device Plugin and NodeD, skip this section. Otherwise, you need to configure Ascend Device Plugin and NodeD to enable this function.
Ascend Device Plugin Configuration
Ascend Device Plugin can be started only in containerized mode.
- Modify the startup YAML file of Ascend Device Plugin based on fault handling modes. That is, modify the following information in bold.
- Rescheduling mode
In rescheduling mode, a fault of Ascend Device Plugin also triggers rescheduling.
... containers: - image: ascend-k8sdeviceplugin:v{version} name: device-plugin-01 resources: requests: memory: 500Mi cpu: 500m limits: memory: 500Mi cpu: 500m command: [ "/bin/bash", "-c", "--"] args: [ "device-plugin -useAscendDocker=true -volcanoType=true # Volcano must be used in the rescheduling scenario. -autoStowing=true # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, after the processor health status changes from unhealthy to healthy, or the network fault on the processor parameter plane is recovered, the processor will not be automatically added to the schedulable resource pool. This parameter applies only to Atlas training product. -listWatchPeriod=5 # Set the health status check period. The value range is [3, 1800], in seconds. -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ] securityContext: privileged: true readOnlyRootFilesystem: true ... - (Optional) Graceful fault tolerance mode: Add the -hotReset field to the rescheduling configuration.
- The graceful fault tolerance function has been removed. It will not be supported in PyTorch versions beyond 7.2.RC1 and MindSpore versions beyond 7.1.RC1.
- The function corresponding to -hotReset = 1 has been removed.
... containers: - image: ascend-k8sdeviceplugin:v{version} name: device-plugin-01 resources: requests: memory: 500Mi cpu: 500m limits: memory: 500Mi cpu: 500m command: [ "/bin/bash", "-c", "--"] args: [ "device-plugin -useAscendDocker=true -volcanoType=true # Volcano must be used in the rescheduling scenario. -autoStowing=true # Whether to enable automatic management. The default value is true. If this parameter is set to false, automatic management is disabled. In this case, after the processor health status changes from unhealthy to healthy, or the network fault on the processor parameter plane is recovered, the processor will not be automatically added to the schedulable resource pool. This parameter applies only to Atlas training product. -hotReset=1 # Enable graceful fault tolerance. The system will automatically reset the faulty processor. -listWatchPeriod=5 # Set the health status check period. The value range is [3, 1800], in seconds. -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log -logLevel=0" ] securityContext: privileged: true readOnlyRootFilesystem: true ...
- Rescheduling mode
- Run the following command on the Kubernetes management node to start Ascend Device Plugin:
kubectl apply -f device-plugin-xxx-v{version}.yamlFor example, to start the component in an environment using Atlas training product, run the following command:kubectl apply -f device-plugin-volcano-v{version}.yaml
NodeD Configuration
Configure the interval for sending node status. You can manually modify the startup YAML file of NodeD to configure the interval for reporting node status.
- Go to the directory where the component is decompressed and run the following command to open the startup YAML file of NodeD:
vi noded-v{version}.yaml - Add the reportInterval parameter to the args line in the YAML file as follows:
... env: - name: NODE_NAME valueFrom: fieldRef: fieldPath: spec.nodeName imagePullPolicy: Never command: [ "/bin/bash", "-c", "--"] args: [ "/usr/local/bin/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ] securityContext: readOnlyRootFilesystem: true allowPrivilegeEscalation: true volumeMounts: - name: log-noded ...
Parent topic: Using Resumable Training on the CLI