Configuring the Job
For details about the parameters, see YAML Parameters of resumable training. In addition, add or modify the following fields in the job's YAML configuration file.
...
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: mindx-dls-test                    # Must be consistent with the name of the ConfigMap.
  namespace: vcjob                        # Select a namespace as required. The ConfigMap and the job must be in the same namespace.
  labels:
    ring-controller.atlas: ascend-910     # HCCL-Controller uses this label to distinguish Ascend 910 jobs from jobs that use other processors.
    fault-scheduling: "grace"
    elastic-scheduling: "on"              # The value must be enclosed in double quotation marks.
  annotations:
    minReplicas: "1"
spec:
  minAvailable: 2                         # Must be the same as the value of replicas.
  ...
  maxRetry: 0
  ...
          lifecycle:                      # To use the dying gasp function, add this lifecycle section to the container.
            preStop:
              exec:
                command: ["/bin/bash", "-c", "cd /job/code/resnet/scripts; bash pre_stop.sh"]
          resources:
            requests:
              huawei.com/Ascend910: 8     # Set to 8 for distributed training across servers. For single-server training, adjust the value to suit the job.
            limits:
              huawei.com/Ascend910: 8     # Must be the same as the value in requests.
...
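The constraints noted in the comments above can be checked before the job is submitted. The following is a minimal sketch, not part of the product: check_job and example_job are hypothetical names, the job is assumed to be already loaded into a Python dict, and the location of replicas and containers (spec.tasks[].replicas and spec.tasks[].template.spec.containers[]) is an assumption based on the usual Volcano job layout, because those parts are elided in the example. Comparing minAvailable with the sum of replicas over all tasks is also an assumption. The label check reflects the fact that YAML 1.1 parsers such as PyYAML read an unquoted on as a boolean, which is why the value must be quoted.

```python
RESOURCE = "huawei.com/Ascend910"


def check_job(job, configmap_namespace):
    """Return a list of problems found in a job dict (empty if none)."""
    problems = []
    meta = job.get("metadata", {})
    spec = job.get("spec", {})

    # The ConfigMap and the job must be in the same namespace.
    if meta.get("namespace") != configmap_namespace:
        problems.append("job and ConfigMap namespaces differ")

    # Label values must be strings; an unquoted `on` is loaded as True.
    for key, value in meta.get("labels", {}).items():
        if not isinstance(value, str):
            problems.append(f"label {key} must be a quoted string")

    # Assumption: replicas is summed over all tasks.
    tasks = spec.get("tasks", [])
    replicas = sum(task.get("replicas", 0) for task in tasks)
    if spec.get("minAvailable") != replicas:
        problems.append("minAvailable differs from replicas")

    # requests and limits must request the same number of processors.
    for task in tasks:
        pod_spec = task.get("template", {}).get("spec", {})
        for container in pod_spec.get("containers", []):
            resources = container.get("resources", {})
            requested = resources.get("requests", {}).get(RESOURCE)
            limit = resources.get("limits", {}).get(RESOURCE)
            if requested != limit:
                problems.append(
                    f"container {container.get('name')}: requests and limits differ"
                )
    return problems


# Hypothetical job dict mirroring the fields shown in the example above.
example_job = {
    "metadata": {
        "name": "mindx-dls-test",
        "namespace": "vcjob",
        "labels": {
            "ring-controller.atlas": "ascend-910",
            "fault-scheduling": "grace",
            "elastic-scheduling": "on",
        },
        "annotations": {"minReplicas": "1"},
    },
    "spec": {
        "minAvailable": 2,
        "maxRetry": 0,
        "tasks": [
            {
                "replicas": 2,
                "template": {
                    "spec": {
                        "containers": [
                            {
                                "name": "train",  # hypothetical container name
                                "resources": {
                                    "requests": {RESOURCE: 8},
                                    "limits": {RESOURCE: 8},
                                },
                            }
                        ]
                    }
                },
            }
        ],
    },
}

print(check_job(example_job, configmap_namespace="vcjob"))  # prints []
```

A check like this only catches mismatches between fields within the file. It does not confirm that the ConfigMap exists or that the cluster has the requested processors available.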