Job Delivery
Procedure
- The sample a800_tensorflow_vcjob.yaml file deploys the training job in the vcjob namespace, so first run the following command on the management node to create that namespace. If your job uses a different non-default namespace, create that namespace instead.
kubectl create namespace vcjob
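Running the create command a second time returns an error if the namespace already exists. The following is a minimal sketch of a guard around it, assuming a POSIX shell on the management node. The function name ensure_namespace is hypothetical and not part of the product.

```shell
# Hypothetical helper: create a namespace only if it does not exist yet.
ensure_namespace() {
    ns="$1"
    # "kubectl get namespace" exits non-zero when the namespace is missing.
    if kubectl get namespace "$ns" >/dev/null 2>&1; then
        echo "namespace $ns already exists"
    else
        kubectl create namespace "$ns"
    fi
}
```

For the sample job, this would be called as ensure_namespace vcjob.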
- On the management node, go to the directory that contains the sample YAML file and run the following command to deliver the training job:
kubectl apply -f XXX.yaml
If the YAML file of a job is modified after the job is successfully delivered, run the kubectl delete -f XXX.yaml command to delete the original job and then deliver the job again.
- The following is an example of configuring resource information using environment variables:
kubectl apply -f tensorflow_standalone_acjob.yaml
Command output:
ascendjob.mindxdl.gitee.com/default-tensorflow-test created
- The following is an example of configuring resource information using a file:
kubectl apply -f a800_tensorflow_vcjob.yaml
Command output:
configmap/rings-config-mindx-dls-test created
job.batch.volcano.sh/mindx-dls-test created
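If the YAML file is modified after delivery, the delete-then-deliver sequence described earlier can be wrapped in a small helper. This is a sketch under assumptions, not a required procedure. The function name redeliver_job is hypothetical, and the --ignore-not-found flag of kubectl delete is added here so that the delete step does not fail when the job is already gone.

```shell
# Hypothetical helper: delete the job defined in a YAML file, then deliver it again.
redeliver_job() {
    yaml="$1"
    # Delete the original job; apply runs only if the delete step succeeds.
    kubectl delete -f "$yaml" --ignore-not-found &&
        kubectl apply -f "$yaml"
}
```

For example, redeliver_job a800_tensorflow_vcjob.yaml would remove the resources created from that file and then create them again from the modified file.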
- If a training job remains in the Pending state after being delivered, refer to Training Job Is in the Pending State Because "nodes are unavailable" or to A Job Is Pending Due to Insufficient Resources to rectify the fault.
- If the hccl.json file of a training job container remains in the Initializing state after the training job is started, refer to Failed to Generate the hccl.json File to rectify the fault.
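Before turning to the troubleshooting topics for a Pending job, it can help to see which pods are actually pending and what events they report. The following is a minimal sketch, assuming the job's pods run in the namespace passed as the first argument. The function name show_pending_pods is hypothetical.

```shell
# Hypothetical helper: list Pending pods in a namespace and describe each one.
show_pending_pods() {
    ns="$1"
    # "-o name" prints one "pod/<name>" entry per line.
    kubectl get pods -n "$ns" --field-selector=status.phase=Pending -o name |
    while read -r pod; do
        echo "== $pod"
        # The Events section at the end of the describe output usually
        # states why the pod has not been scheduled.
        kubectl describe -n "$ns" "$pod" < /dev/null
    done
}
```

For the sample job, show_pending_pods vcjob prints the description of every pending pod in the vcjob namespace.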