ClusterD

  • Install ClusterD if you need any of the following features: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery of inference card faults, or rescheduling upon inference card faults. ClusterD can provide complete information collection only when both Ascend Device Plugin and NodeD are deployed in the cluster.
  • Install Volcano before installing ClusterD. If ClusterD is installed first, the ClusterD pod may enter the CrashLoopBackOff state, and it recovers only after the Volcano pod has started.
  • If you need only the containerization and resource monitoring functions, you do not need to install ClusterD and can skip this section.
  • Before using slow node/network fault detection, install ClusterD by referring to Slow Node and Slow Network Faults.
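
The installation-order requirement above can be checked before ClusterD is applied. The following is a minimal sketch, not part of the product: pod_running is a hypothetical helper, and the volcano-system namespace and the volcano-scheduler pod name prefix are assumptions based on a default Volcano deployment, so adjust them to your cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the `kubectl get pod` table on stdin
# contains a pod whose name starts with the given prefix and whose
# STATUS column is Running.
pod_running() {
    awk -v p="$1" 'index($1, p) == 1 && $3 == "Running" { found = 1 }
                   END { exit !found }'
}

# Usage on the management node (requires kubectl; namespace and prefix
# are assumptions):
#   kubectl get pod -n volcano-system | pod_running volcano-scheduler \
#       || echo "Volcano is not running yet; install it before ClusterD"
```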

Procedure

  1. Log in to the Kubernetes management node as the root user and check whether the ClusterD image and version number are correct.
    docker images | grep clusterd
    Command output:
    clusterd                   v7.3.0              c532e9d0889c        About an hour ago         126MB
    
  2. Copy the YAML file in the directory where the ClusterD package is decompressed to any directory on the Kubernetes management node.
  3. If you need to modify the component startup parameters, edit the ClusterD startup parameters in the YAML file based on your requirements; otherwise, skip this step. For details about the startup parameters, see Table 1. You can also run the ./clusterd -h command in the directory of the ClusterD binary package to view the parameter descriptions.
  4. On the management node, run the following command in the directory where the YAML file is stored to start ClusterD:
    kubectl apply -f clusterd-v{version}.yaml
    Startup example:
    clusterrolebinding.rbac.authorization.k8s.io/pods-clusterd-rolebinding created
    lease.coordination.k8s.io/cluster-info-collector created
    deployment.apps/clusterd created
    service/clusterd-grpc-svc created
  5. Run the following command to check whether the component is started successfully:
    kubectl get pod -n mindx-dl
    The following is a startup example. If Running is displayed, the component is started successfully.
    NAME                          READY   STATUS    RESTARTS   AGE
    clusterd-7844cb867d-fwcj7     0/1     Running   0          45s
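
The image check in step 1 can be scripted so that a wrong version is caught before the YAML file is applied. This is a minimal sketch: image_tag is a hypothetical helper, and v7.3.0 is simply the tag shown in the example output above, so replace it with the version you actually installed.

```shell
#!/bin/sh
# Hypothetical helper: print the TAG column of the first row in the
# `docker images` table on stdin whose REPOSITORY equals the given name.
image_tag() {
    awk -v repo="$1" '$1 == repo { print $2; exit }'
}

# Usage on the management node (requires docker; the expected tag is an
# example value):
#   [ "$(docker images | image_tag clusterd)" = "v7.3.0" ] \
#       || echo "unexpected ClusterD image version"
```

The helper reads the table from standard input rather than calling docker itself, which keeps it usable against saved command output.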

Parameter Description

Table 1 ClusterD startup parameters

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -version | Bool | false | Whether to query the ClusterD version. true: queries the version. false: does not query the version. |
| -logLevel | Integer | 0 | Log level. -1: debug. 0: info. 1: warning. 2: error. 3: critical. |
| -maxAge | Integer | 7 | Retention period of backed-up logs, in days. The value ranges from 7 to 700. |
| -logFile | String | /var/log/mindx-dl/clusterd/clusterd.log | Log file path. NOTE: When a log file exceeds 20 MB, it is automatically dumped. The maximum log file size cannot be changed. The dump file is named in the format clusterd-{dump trigger time}.log, for example, clusterd-2024-06-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -useProxy | Bool | false | Whether to use a proxy to forward gRPC requests. true: yes. false: no. NOTE: You are advised to set this parameter to true in the startup YAML file and perform security hardening on ClusterD. For details, see Hardening ClusterD Security. |
| -h or --help | None | None | Help information. |
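
The value ranges in Table 1 can be checked before the YAML file is edited. The sketch below is illustrative only: in_range and check_clusterd_arg are hypothetical helpers, and it assumes each startup argument is written in -name=value form. It does not reflect how ClusterD itself validates its arguments.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 is an integer within [$2, $3].
in_range() {
    case "$1" in
        ''|-|*[!0-9-]*|?*-*) return 1 ;;   # reject empty or non-integer input
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Hypothetical helper: validate one "-name=value" argument against the
# types and ranges listed in Table 1.
check_clusterd_arg() {
    name=${1%%=*}
    value=${1#*=}
    case "$name" in
        -logLevel)          in_range "$value" -1 3 ;;
        -maxAge)            in_range "$value" 7 700 ;;
        -maxBackups)        in_range "$value" 1 30 ;;
        -useProxy|-version) [ "$value" = "true" ] || [ "$value" = "false" ] ;;
        -logFile)           [ -n "$value" ] ;;
        *)                  return 1 ;;    # not a Table 1 parameter
    esac
}

# Usage:
#   check_clusterd_arg -maxAge=30 || echo "invalid -maxAge value"
```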