ClusterD

Install ClusterD if you need any of the following features: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery of inference card faults, or rescheduling upon inference card faults. ClusterD can collect complete information only when both Ascend Device Plugin and NodeD are deployed in the cluster.
Install Volcano before installing ClusterD. If ClusterD is installed first, the ClusterD pod may enter the CrashLoopBackOff state, and it recovers only after the Volcano pod has started.
If you need only the containerization and resource monitoring features, you do not need to install ClusterD and can skip this section.
Before using slow node/network fault detection, install ClusterD by referring to Slow Node and Slow Network Faults.
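The Volcano-first ordering described above can be checked before ClusterD is applied. The following is a minimal sketch, not part of the product tooling: the helper name `all_pods_running` is hypothetical, and the `volcano-system` namespace in the usage comment is an assumption about where Volcano runs in your cluster. The helper treats any status other than Running as not ready.

```shell
#!/bin/sh
# all_pods_running: reads `kubectl get pods --no-headers` output on stdin.
# Succeeds only if at least one pod is listed and every STATUS is Running.
all_pods_running() {
    awk '
        NF { total++; if ($3 != "Running") bad++ }
        END { exit((total > 0 && bad == 0) ? 0 : 1) }
    '
}

# Intended use on the management node (namespace is an assumption):
#   kubectl get pods -n volcano-system --no-headers | all_pods_running \
#       && echo "Volcano is up; ClusterD can be installed"
```

Reading the pod listing from standard input keeps the check independent of how the listing is obtained, so it can also be run against saved output.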

Procedure

Log in to the Kubernetes management node as the root user and check whether the ClusterD image and version number are correct.
```
docker images | grep clusterd
```
Command output:
clusterd v7.3.0 c532e9d0889c About an hour ago 126MB
- If they are correct, proceed to Step 2.
- If they are not correct, build the image and distribute it by referring to Preparing an Image.
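The image check in this step can be scripted. The sketch below is illustrative only: `has_image` is a hypothetical helper, and the tag `v7.3.0` is taken from the sample output above, so substitute the version you are installing. It matches the repository name and tag exactly, which means an image pulled with a registry prefix would need that prefix in the name argument.

```shell
#!/bin/sh
# has_image NAME TAG: reads `docker images` output on stdin and succeeds
# only if a row with exactly that repository name and tag is present.
has_image() {
    awk -v name="$1" -v tag="$2" '
        $1 == name && $2 == tag { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Intended use on the management node:
#   docker images | has_image clusterd v7.3.0 \
#       && echo "image present" \
#       || echo "image missing: see Preparing an Image"
```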
Copy the YAML file in the directory where the ClusterD package is decompressed to any directory on the Kubernetes management node.
Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the ClusterD startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./clusterd -h command in the directory of the ClusterD binary package to view the parameter description.

On the management node, run the following command in the directory where the YAML file is stored to start ClusterD.

kubectl apply -f clusterd-v{version}.yaml

Startup example:

clusterrolebinding.rbac.authorization.k8s.io/pods-clusterd-rolebinding created
lease.coordination.k8s.io/cluster-info-collector created
deployment.apps/clusterd created
service/clusterd-grpc-svc created

Run the following command to check whether the component is started successfully:

kubectl get pod -n mindx-dl

The following is example output. If the STATUS column shows Running, the component has started successfully.

NAME                          READY   STATUS    RESTARTS   AGE
clusterd-7844cb867d-fwcj7     0/1     Running   0          45s
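To read the status programmatically instead of by eye, the STATUS column can be extracted from the listing. This is a minimal sketch under the assumption that the pod name starts with `clusterd`, as in the example output; `pod_status` is a hypothetical helper name.

```shell
#!/bin/sh
# pod_status PREFIX: reads `kubectl get pod` output on stdin and prints the
# STATUS column of the first pod whose name starts with PREFIX.
# Returns non-zero if no such pod is listed.
pod_status() {
    awk -v prefix="$1" '
        index($1, prefix) == 1 { print $3; found = 1; exit }
        END { exit(found ? 0 : 1) }
    '
}

# Intended use on the management node:
#   status=$(kubectl get pod -n mindx-dl | pod_status clusterd) || status=NotFound
#   [ "$status" = "Running" ] && echo "ClusterD started"
```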

If the pod is not in the Running state after installation, refer to Component pods Are Not in the Running State.
If the pod is in the ContainerCreating state after installation, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
If the component fails to start, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
If the component starts successfully but its pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
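The troubleshooting references above can be summarized as a lookup from the observed pod status to the topic to read. The sketch below is only an illustration: the helper names are hypothetical, and `NotFound` is a label invented here for the case where no ClusterD pod is listed. The deployment name `clusterd` in the usage comment comes from the startup example in this section.

```shell
#!/bin/sh
# troubleshooting_topic STATUS: prints the topic named in this section for a
# given pod status. "NotFound" is a hypothetical label for "no pod listed".
troubleshooting_topic() {
    case "$1" in
        Running) echo "none" ;;
        ContainerCreating)
            echo "Cluster Scheduling Component Pods Are in the ContainerCreating State" ;;
        NotFound)
            echo "YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed" ;;
        *) echo "Component pods Are Not in the Running State" ;;
    esac
}

# has_sem_error: reads component logs on stdin and succeeds if the
# "get sem errno =13" message quoted in this section appears.
has_sem_error() {
    grep -q 'get sem errno =13'
}

# Intended use on the management node:
#   troubleshooting_topic ContainerCreating
#   kubectl logs -n mindx-dl deployment/clusterd | has_sem_error \
#       && echo 'see: Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs'
```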

Parameter Description

**Table 1** ClusterD startup parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -version | Bool | false | ClusterD version query. true: queries the version; false: does not query the version. |
| -logLevel | Integer | 0 | Log level. -1: debug; 0: info; 1: warning; 2: error; 3: critical. |
| -maxAge | Integer | 7 | Time for backing up logs. The value ranges from 7 to 700, in days. |
| -logFile | String | /var/log/mindx-dl/clusterd/clusterd.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. The dump file is named in the format of "clusterd-dump triggering time.log", for example, clusterd-2024-06-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -useProxy | Bool | false | Whether to use a proxy to forward gRPC requests. true: yes; false: no. NOTE: You are advised to set the parameter value to true in the startup YAML file and perform security hardening on ClusterD. For details, see Hardening ClusterD Security. |
| -h or --help | None | None | Help information. |
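Before editing the startup parameters in the YAML file, the numeric values can be checked against the ranges in Table 1. The following is a minimal sketch with hypothetical helper names. It validates only -logLevel, -maxAge, and -maxBackups, and it assumes that the -logLevel values listed in the table (-1 to 3) are the only valid ones.

```shell
#!/bin/sh
# in_range VALUE MIN MAX: succeeds if VALUE is an integer within [MIN, MAX].
in_range() {
    case "$1" in
        # Reject empty input, a lone "-", non-digit characters, and any "-"
        # that is not the leading sign.
        ''|-|*[!0-9-]*|?*-*) return 1 ;;
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_clusterd_params LOGLEVEL MAXAGE MAXBACKUPS: validates the three
# numeric startup parameters against the ranges given in Table 1.
check_clusterd_params() {
    in_range "$1" -1 3  || { echo "logLevel must be between -1 and 3" >&2; return 1; }
    in_range "$2" 7 700 || { echo "maxAge must be between 7 and 700" >&2; return 1; }
    in_range "$3" 1 30  || { echo "maxBackups must be between 1 and 30" >&2; return 1; }
}

# Example: the defaults from Table 1 pass the check.
#   check_clusterd_params 0 7 30 && echo "parameters OK"
```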
