(Optional) Configuring Components
If the elastic training function was already configured when Ascend Device Plugin and NodeD were installed, skip this section. Otherwise, configure MindCluster Ascend Device Plugin and MindCluster NodeD as described below to enable the feature.
Ascend Device Plugin Configuration
When the rescheduling policy is enabled, a fault of Ascend Device Plugin also triggers rescheduling.
- Modify the startup YAML file of Ascend Device Plugin, setting the parameters in args as shown below. The "#" annotations describe the parameters and are not part of the parameter values.
    ...
          containers:
          - image: ascend-k8sdeviceplugin:v{version}
            name: device-plugin-01
            resources:
              requests:
                memory: 500Mi
                cpu: 500m
              limits:
                memory: 500Mi
                cpu: 500m
            command: [ "/bin/bash", "-c", "--"]
            args: [ "device-plugin -useAscendDocker=true
                     -volcanoType=true      # Volcano must be used in the rescheduling scenario.
                     -autoStowing=true      # Whether to enable automatic management. The default value is true.
                                            # If this parameter is set to false, automatic management is disabled.
                                            # In this case, after the processor health status changes from unhealthy
                                            # to healthy, or the network fault on the processor parameter plane is
                                            # recovered, the processor is not automatically added back to the
                                            # schedulable resource pool. This parameter applies only to Atlas
                                            # training products.
                     -listWatchPeriod=5     # Health status check period, in seconds. The value range is [3, 1800].
                     -logFile=/var/log/mindx-dl/devicePlugin/devicePlugin.log
                     -logLevel=0" ]
            securityContext:
              privileged: true
              readOnlyRootFilesystem: true
    ...
- Run the following command on the Kubernetes management node to start Ascend Device Plugin:
    kubectl apply -f device-plugin-xxx-v{version}.yaml
  For example, to start the component on Atlas training products, run the following command:
    kubectl apply -f device-plugin-volcano-v7.3.0.yaml
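Before editing the YAML file, a candidate -listWatchPeriod value can be checked against the documented range. The following is a minimal shell sketch and is not part of Ascend Device Plugin. The function name check_list_watch_period is hypothetical, and only the [3, 1800] range is taken from the parameter description above.

```shell
# Hypothetical helper (not shipped with Ascend Device Plugin):
# succeeds only if $1 is a whole number within the documented
# -listWatchPeriod range [3, 1800] (seconds).
check_list_watch_period() {
    case "$1" in
        # Reject empty input and anything containing a non-digit character.
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 3 ] && [ "$1" -le 1800 ]
}

# Example use:
#   check_list_watch_period 5 && echo "value is within range"
```

The check only guards against out-of-range or non-numeric input. It does not confirm that the component accepted the value.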
NodeD Configuration
You can manually modify the startup YAML file of NodeD to configure the node status reporting interval.
- Run the following command to edit the startup YAML file of NodeD:
    vi noded-v{version}.yaml
- Add the reportInterval parameter to the args line in the YAML file as follows:
    ...
              env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
              imagePullPolicy: Never
              command: [ "/bin/bash", "-c", "--"]
              args: [ "/home/hwMindX/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ]
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: false
                capabilities:
                  drop: [ "ALL" ]
                runAsUser: 9000
                runAsGroup: 9000
              volumeMounts:
                - name: log-noded
    ...
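After saving the file, the configured value can be read back as a quick sanity check. The following is a minimal shell sketch, assuming -reportInterval takes an integer value as in the example above. The helper name get_report_interval is hypothetical and is not part of NodeD.

```shell
# Hypothetical helper (not shipped with NodeD):
# prints the integer that follows -reportInterval= in the given file,
# or prints nothing if the parameter is absent.
get_report_interval() {
    sed -n 's/.*-reportInterval=\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Example use (substitute the actual file name for your version):
#   get_report_interval noded-v{version}.yaml
```

An empty result indicates that the args line does not yet contain the parameter.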
