Ascend Operator

  • Install Ascend Operator if you need full NPU scheduling (training), static vNPU scheduling (training), resumable training, or elastic training. If Volcano is used as the scheduler, install Volcano first. Otherwise, Ascend Operator fails to start.
  • To use full NPU scheduling (inference) and rescheduling upon inference card faults, and to deliver distributed inference jobs of the acjob type, you must install Ascend Operator.
  • If you use only containerization, resource monitoring, recovery of inference card faults, or rescheduling upon inference card faults (single-server jobs), you do not need to install Ascend Operator and can skip this section.

Ascend Operator allows a maximum of 20,000 replicas for a single Ascend Job.

Procedure

  1. Log in to the Kubernetes management node as the root user and check whether the Ascend Operator image and version number are correct.
    docker images | grep ascend-operator
    Command output:
    ascend-operator                      v7.3.0              c532e9d0889c        About an hour ago         137MB
    
  2. Copy the YAML file in the directory where the Ascend Operator package is decompressed to any directory on the Kubernetes management node.
  3. Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the Ascend Operator startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./ascend-operator -h command to view the parameter description.
  4. (Optional) Use Ascend Operator to generate a collective communication configuration file (RankTable file, also called hccl.json) for training jobs under PyTorch or MindSpore to shorten the cluster communication link setup time. If you use other frameworks, skip this step.
    1. By default, the startup YAML file mounts the parent directory of the hccl.json file. You can change the directory as required.
      ...
              - name: ranktable-dir
                mountPath: /user/mindx-dl/ranktable       # Path in the container, which cannot be changed.
      ...
            volumes:
              - name: ascend-operator-log
                hostPath:
                  path: /var/log/mindx-dl/ascend-operator
                  type: Directory
              - name: ranktable-dir
                hostPath:
                  path: /user/mindx-dl/ranktable      # Path on the host, which must be the same as the root directory of the path for saving the hccl.json file in the job YAML file.
                  type: DirectoryOrCreate                                     # Checks whether a given folder exists. If it does not exist, an empty folder is created.
      ...
      • The RankTable root directory is fixed in the container but can be modified on the host. When you deploy a job, the root directory of the path for saving the hccl.json file in the job YAML file must be the same as that on the host.
      • The permission on the RankTable root directory must meet either of the following conditions:
        • The user and user group are hwMindX (default running user of cluster scheduling components).
        • The permission on the RankTable root directory is 777.
    2. Run the following command to create a mount path for the hccl.json file in the parent directory:
      mkdir -m 777 /user/mindx-dl/ranktable/{Mount path}
  5. On the management node, go to the directory where the YAML file is stored and run the following command to start Ascend Operator:
    kubectl apply -f ascend-operator-v{version}.yaml

    Startup example:

    deployment.apps/ascend-operator-manager created
    serviceaccount/ascend-operator-manager created
    clusterrole.rbac.authorization.k8s.io/ascend-operator-manager-role created
    clusterrolebinding.rbac.authorization.k8s.io/ascend-operator-manager-rolebinding created
    customresourcedefinition.apiextensions.k8s.io/ascendjobs.mindxdl.gitee.com created
    ...
  6. Run the following command to check whether the component has started successfully:
    kubectl get pod -n mindx-dl

    The following is an example of the command output. If the STATUS column shows Running, the component has started successfully.

    NAME                                         READY   STATUS    RESTARTS   AGE
    ...
    ascend-operator-7667495b6b-hwmjw      1/1    Running  0         11s
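
The check in step 6 can also be scripted. The following is a minimal sketch, not part of the Ascend Operator package: the function name operator_is_running is hypothetical, and it assumes the default column layout of kubectl get pod (NAME, READY, STATUS, ...) shown in the example output.

```shell
# Hypothetical helper: reads `kubectl get pod -n mindx-dl` output on stdin and
# succeeds only if a pod whose name starts with "ascend-operator" is in the
# Running state with all of its containers ready.
operator_is_running() {
    awk '
        $1 ~ /^ascend-operator/ && $3 == "Running" {
            split($2, ready, "/")              # READY column, for example "1/1"
            if (ready[1] == ready[2]) found = 1
        }
        END { exit(found ? 0 : 1) }
    '
}

# Typical use on the management node (requires kubectl and cluster access):
#   kubectl get pod -n mindx-dl | operator_is_running && echo "Ascend Operator is running"
```

The function only parses text, so it can be tried against saved command output before it is used on a live cluster.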
    

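The permission requirement on the RankTable root directory in step 4 can be checked before the component is started. The following is a rough sketch under stated assumptions: the function name ranktable_dir_ok is hypothetical, GNU stat (stat -c) is assumed to be available on the host, and only the plain mode 777 is accepted (a mode with extra special bits, such as 1777, is treated as not matching).

```shell
# Hypothetical helper: succeeds if the directory meets either condition from
# step 4, that is, its mode is 777, or its user and user group are both hwMindX.
ranktable_dir_ok() {
    dir=$1
    [ -d "$dir" ] || return 1
    # GNU stat format: %a = octal mode, %U = owner name, %G = group name
    set -- $(stat -c '%a %U %G' "$dir")
    [ "$1" = "777" ] || { [ "$2" = "hwMindX" ] && [ "$3" = "hwMindX" ]; }
}

# Typical use on the host, with the default path from the startup YAML file:
#   ranktable_dir_ok /user/mindx-dl/ranktable || echo "Fix the owner or permission first"
```
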
Parameters

Table 1 Ascend Operator startup parameters

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -version | Bool | false | Whether to query the Ascend Operator version number. true: queries the version. false: does not query the version. |
| -logLevel | Integer | 0 | Log level. The options are as follows: -1 (debug), 0 (info), 1 (warning), 2 (error), 3 (critical). |
| -maxAge | Integer | 7 | Retention period of backup logs, in days. The value ranges from 7 to 700. |
| -logFile | String | /var/log/mindx-dl/ascend-operator/ascend-operator.log | Log file path. NOTE: If the size of a log file exceeds 20 MB, the file is dumped automatically. The maximum size of a log file cannot be changed. Dumped files are named in the format ascend-operator-{dump trigger time}.log, for example, ascend-operator-2023-10-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -enableGangScheduling | Bool | true | Whether to enable gang scheduling. It is enabled by default, in which case jobs are scheduled by the specified scheduler. For details about the gang policy, see the official documentation of open source Volcano. true: enabled (must be set when job-level auto scaling is used). false: disabled. |
| -isCompress | Bool | false | Whether to compress log files when they are dumped after reaching the dump threshold. (This parameter will be deprecated in a later version.) true: enabled. false: disabled. |
| -kubeconfig | String | None | Path of the kubeconfig file. This parameter is mandatory when the program runs outside the cluster. |
| -kubeApiBurst | Integer | 100 | Burst value used for communication with Kubernetes. The value range is (0, 10000]. If the value is out of range, the default value 100 is used. |
| -kubeApiQps | Float32 | 50 | Queries per second (QPS) used for communication with Kubernetes. The value range is (0, 10000]. If the value is out of range, the default value 50 is used. |
| -h or --help | None | None | Displays help information. |
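
The out-of-range rule documented for -kubeApiBurst and -kubeApiQps can be illustrated as follows. This is only a sketch of the documented behavior, not the operator's actual code: the function name effective_value is hypothetical, and it handles integer input only, so it models -kubeApiBurst directly and -kubeApiQps only for whole-number values.

```shell
# Hypothetical helper mirroring the documented rule: keep the value if it is an
# integer in the range (0, 10000]; otherwise fall back to the given default.
effective_value() {
    value=$1
    default=$2
    case $value in
        ''|*[!0-9]*) echo "$default"; return ;;   # empty, negative, or non-numeric
    esac
    if [ "$value" -gt 0 ] && [ "$value" -le 10000 ]; then
        echo "$value"
    else
        echo "$default"
    fi
}

# Example: a requested burst of 20000 is out of range, so 100 is printed.
#   effective_value 20000 100
```

Because an out-of-range value is replaced by the default rather than rejected, it can be worth checking a planned value this way before editing the startup parameters in the YAML file.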