Job Configuration

For details, see YAML Parameters of resumable training. In addition, add or modify the following fields in the YAML configuration file of the job.

...
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls-test   # Must be consistent with the name of the ConfigMap.
  namespace: vcjob       # Select a namespace as required. The ConfigMap and the job must be in the same namespace.
  labels:
    ring-controller.atlas: ascend-910  # HCCL-Controller uses this label to distinguish jobs that use Ascend 910 from jobs that use other processors.
    fault-scheduling: "grace"
    elastic-scheduling: "on"   # Enclose the value in double quotation marks.
  annotations:
    minReplicas: "1"
spec:
  minAvailable: 2           # The value must be the same as that of replicas.
...
  maxRetry: 0
...
          lifecycle:  # To use the dying gasp function, add this lifecycle.preStop block.
            preStop:
              exec:
                command: ["/bin/bash", "-c", "cd /job/code/resnet/scripts; bash pre_stop.sh"]
          resources:
            requests:
              huawei.com/Ascend910: 8   # Set to 8 for distributed training across servers. For standalone training, adjust the value to match the job.
            limits:
              huawei.com/Ascend910: 8   # Must be the same as the value in requests.
...
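
The constraints stated in the comments above can be checked before the job is submitted. The following is a minimal sketch, not part of the product: `check_job`, `NPU_RESOURCE`, and `example_job` are hypothetical names, and the manifest is assumed to be already loaded into a Python dictionary. The elided parts of the manifest are assumed to follow the usual Volcano layout (`spec.tasks[].replicas` and `spec.tasks[].template.spec.containers[].resources`); adjust the paths if your file differs.

```python
NPU_RESOURCE = "huawei.com/Ascend910"


def check_job(job):
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    metadata = job.get("metadata", {})
    spec = job.get("spec", {})

    # Label and annotation values must be strings, which is why "on" and "1" are quoted.
    for section in ("labels", "annotations"):
        for key, value in metadata.get(section, {}).items():
            if not isinstance(value, str):
                problems.append(f"metadata.{section}.{key}: value must be a quoted string")

    # minAvailable must equal replicas (assumption: summed over all tasks).
    tasks = spec.get("tasks", [])
    replicas = sum(task.get("replicas", 0) for task in tasks)
    if spec.get("minAvailable") != replicas:
        problems.append(
            f"spec.minAvailable is {spec.get('minAvailable')} but replicas is {replicas}"
        )

    # The NPU count in limits must match the count in requests.
    for task in tasks:
        containers = task.get("template", {}).get("spec", {}).get("containers", [])
        for container in containers:
            resources = container.get("resources", {})
            requested = resources.get("requests", {}).get(NPU_RESOURCE)
            limit = resources.get("limits", {}).get(NPU_RESOURCE)
            if requested != limit:
                problems.append(
                    f"{NPU_RESOURCE}: requests is {requested} but limits is {limit}"
                )
    return problems


# Hypothetical manifest mirroring the fields shown above; elided fields are omitted.
example_job = {
    "metadata": {
        "name": "mindx-dls-test",
        "namespace": "vcjob",
        "labels": {
            "ring-controller.atlas": "ascend-910",
            "fault-scheduling": "grace",
            "elastic-scheduling": "on",
        },
        "annotations": {"minReplicas": "1"},
    },
    "spec": {
        "minAvailable": 2,
        "maxRetry": 0,
        "tasks": [
            {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "resources": {
                                    "requests": {NPU_RESOURCE: 8},
                                    "limits": {NPU_RESOURCE: 8},
                                }
                            }
                        ]
                    }
                },
            }
        ],
    },
}

for problem in check_job(example_job):
    print(problem)
```

The sketch only covers the rules that can be verified from the job file alone. It does not check that the ConfigMap exists or that its name and namespace match the job.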