Ascend Operator
- Ascend Operator must be installed to use full NPU scheduling (training), static vNPU scheduling (training), resumable training, or elastic training. If Volcano is used as the scheduler, install Volcano first; otherwise, Ascend Operator fails to start.
- To use the full NPU scheduling (inference) and rescheduling upon inference card faults and to deliver distributed inference jobs of the acjob type, Ascend Operator must be installed.
- If you use only the functions of containerization, resource monitoring, recovery of inference card faults, or rescheduling upon inference card faults (single-server jobs), you do not need to install Ascend Operator. In this case, skip this section.
Ascend Operator allows a maximum of 20,000 replicas for a single Ascend Job.
Procedure
- Log in to the Kubernetes management node as the root user and check whether the Ascend Operator image and version number are correct.
docker images | grep ascend-operator
Command output:
ascend-operator   v7.3.0   c532e9d0889c   About an hour ago   137MB
- If they are correct, proceed to Step 2.
- If they are not correct, create the image and distribute it by referring to Preparing an Image.
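The image check in this step can also be scripted. The following is a minimal sketch, assuming the default docker images column layout (repository in the first column, tag in the second) and a repository name with no registry prefix. The helper name has_image_tag is hypothetical and not part of the Ascend Operator package, and v7.3.0 is simply the version from the sample output above.

```shell
#!/bin/sh
# has_image_tag REPO TAG
# Hypothetical helper: reads `docker images` output on stdin and succeeds
# if some row has repository column == REPO and tag column == TAG.
has_image_tag() {
    awk -v repo="$1" -v tag="$2" '
        $1 == repo && $2 == tag { found = 1 }
        END { exit (found ? 0 : 1) }
    '
}

# Usage on the management node (left commented so the file can be sourced):
# docker images | has_image_tag ascend-operator v7.3.0 \
#     && echo "image present" \
#     || echo "image missing or wrong version"
```

Matching on exact columns rather than with grep avoids false positives from images whose names merely contain the string ascend-operator.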
- Copy the YAML file in the directory where the Ascend Operator package is decompressed to any directory on the Kubernetes management node.
- Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the Ascend Operator startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./ascend-operator -h command to view the parameter description.
- (Optional) Use Ascend Operator to generate a collective communication configuration file (RankTable file, also called hccl.json) for training jobs under PyTorch or MindSpore to shorten the cluster communication link setup time. If you use other frameworks, skip this step.
- By default, the parent directory of the hccl.json file is mounted to the startup YAML file. You can change the directory as required.
...
        - name: ranktable-dir
          mountPath: /user/mindx-dl/ranktable    # Path in the container, which cannot be changed.
...
      volumes:
        - name: ascend-operator-log
          hostPath:
            path: /var/log/mindx-dl/ascend-operator
            type: Directory
        - name: ranktable-dir
          hostPath:
            path: /user/mindx-dl/ranktable    # Path on the host, which must be the same as the root directory of the path for saving the hccl.json file in the job YAML file.
            type: DirectoryOrCreate    # Checks whether the given folder exists. If it does not exist, an empty folder is created.
...
- The RankTable root directory is fixed in the container but can be modified on the host. When you deploy a job, the root directory of the path for saving the hccl.json file in the job YAML file must be the same as that on the host.
- The permission on the RankTable root directory must meet either of the following conditions:
- The user and user group are hwMindX (default running user of cluster scheduling components).
- The permission on the RankTable root directory is 777.
- Run the following command to create a mount path for the hccl.json file in the parent directory:
mkdir -m 777 /user/mindx-dl/ranktable/{Mount path}
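The permission requirement on the RankTable root directory can be verified before a job is deployed. The following is a minimal sketch, assuming a Linux host with GNU coreutils stat. The function name ranktable_dir_ok is hypothetical. It accepts a directory if its mode is exactly 777 or if both its owner and group are hwMindX.

```shell
#!/bin/sh
# ranktable_dir_ok DIR
# Hypothetical helper: succeeds if DIR exists and either
#   - its permission bits are 777, or
#   - its owner and group are both hwMindX.
# Assumes GNU coreutils `stat` (the -c option is not available on BSD/macOS).
ranktable_dir_ok() {
    dir="$1"
    [ -d "$dir" ] || return 1
    mode=$(stat -c '%a' "$dir") || return 1
    owner=$(stat -c '%U:%G' "$dir") || return 1
    [ "$mode" = "777" ] || [ "$owner" = "hwMindX:hwMindX" ]
}

# Usage (left commented so the file can be sourced):
# ranktable_dir_ok /user/mindx-dl/ranktable \
#     && echo "permissions OK" \
#     || echo "fix the owner or mode of the RankTable directory"
```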
- On the management node, run the following command in the directory where the YAML file is stored to start Ascend Operator.
kubectl apply -f ascend-operator-v{version}.yaml
Startup example:
deployment.apps/ascend-operator-manager created
serviceaccount/ascend-operator-manager created
clusterrole.rbac.authorization.k8s.io/ascend-operator-manager-role created
clusterrolebinding.rbac.authorization.k8s.io/ascend-operator-manager-rolebinding created
customresourcedefinition.apiextensions.k8s.io/ascendjobs.mindxdl.gitee.com created
...
- Run the following command to check whether the component is started successfully:
kubectl get pod -n mindx-dl
The following is a startup example. If Running is displayed, the component is started successfully.
NAME                               READY   STATUS    RESTARTS   AGE
...
ascend-operator-7667495b6b-hwmjw   1/1     Running   0          11s
- After the component is installed, if the pod status of the component is not Running, refer to Component pods Are Not in the Running State.
- After the component is installed, if the pod status of the component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
- If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
- If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
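The startup check can be automated by parsing the kubectl get pod output. The following is a minimal sketch, assuming the default column layout (NAME, READY, STATUS, ...). The helper name pod_running is hypothetical. It succeeds only if at least one pod whose name starts with the given prefix exists and every such pod reports Running, so both the "pod not displayed" and the "ContainerCreating" cases count as failures.

```shell
#!/bin/sh
# pod_running PREFIX
# Hypothetical helper: reads `kubectl get pod` output on stdin and succeeds
# only if at least one pod name starts with PREFIX and all such pods have
# STATUS (third column) equal to Running.
pod_running() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 {
            seen = 1
            if ($3 != "Running") bad = 1
        }
        END { exit ((seen && !bad) ? 0 : 1) }
    '
}

# Usage (left commented so the file can be sourced):
# kubectl get pod -n mindx-dl | pod_running ascend-operator \
#     && echo "Ascend Operator is running" \
#     || echo "Ascend Operator is not running; see the troubleshooting topics"
```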
Parameters