ClusterD
- Install ClusterD if you need any of the following features: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery of inference card faults, or rescheduling upon inference card faults. ClusterD can provide full information collection services only when both Ascend Device Plugin and NodeD are present in the cluster.
- Install Volcano before installing ClusterD. If ClusterD is installed before Volcano, the ClusterD pod may enter the CrashLoopBackOff state. The pod recovers only after the Volcano pod has started.
- If you need only containerization and resource monitoring functions, you do not need to install ClusterD. In this case, skip this section.
- Before using slow node/network fault detection, install ClusterD by referring to Slow Node and Slow Network Faults.
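The ordering requirement above (Volcano before ClusterD) can be checked before you apply the ClusterD YAML file. The following is a minimal sketch, not part of the product tooling. The helper name pod_running is hypothetical, and the namespace volcano-system and the pod name prefix volcano-scheduler are assumptions about a typical Volcano deployment, so adjust them to your cluster.

```shell
#!/usr/bin/env bash
# Sketch: confirm that a Volcano pod is Running before installing ClusterD.
# ASSUMPTIONS: namespace "volcano-system" and pod prefix "volcano-scheduler"
# are typical Volcano defaults, not values taken from this guide.

# pod_running PREFIX  (hypothetical helper)
# Reads `kubectl get pod` table output on stdin and succeeds if at least one
# pod whose NAME starts with PREFIX has STATUS "Running".
pod_running() {
  awk -v p="$1" '
    NR > 1 && index($1, p) == 1 && $3 == "Running" { found = 1 }
    END { exit !found }
  '
}

# Example use (requires kubectl access to the cluster):
#   if kubectl get pod -n volcano-system | pod_running volcano-scheduler; then
#     echo "Volcano is running; ClusterD can be installed."
#   else
#     echo "Volcano is not running; install or start it first." >&2
#   fi
```

The helper only inspects the text it receives, so it can be tried on saved command output without touching the cluster.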
Procedure
- Log in to the Kubernetes management node as the root user and check whether the ClusterD image and version number are correct.
docker images | grep clusterd
Command output:
clusterd v7.3.0 c532e9d0889c About an hour ago 126MB
- If they are correct, proceed to the next step.
- If they are not correct, create the image and distribute it by referring to Preparing an Image.
- Copy the YAML file in the directory where the ClusterD package is decompressed to any directory on the Kubernetes management node.
- Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the ClusterD startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./clusterd -h command in the directory of the ClusterD binary package to view the parameter description.
- On the management node, run the following command in the directory where the YAML file is stored to start ClusterD.
kubectl apply -f clusterd-v{version}.yaml
Startup example:
clusterrolebinding.rbac.authorization.k8s.io/pods-clusterd-rolebinding created
lease.coordination.k8s.io/cluster-info-collector created
deployment.apps/clusterd created
service/clusterd-grpc-svc created
- Run the following command to check whether the component is started successfully:
kubectl get pod -n mindx-dl
The following is a startup example. If Running is displayed, the component is started successfully.
NAME READY STATUS RESTARTS AGE
clusterd-7844cb867d-fwcj7 0/1 Running 0 45s
- After the component is installed, if the pod status of the component is not Running, refer to Component pods Are Not in the Running State.
- After the component is installed, if the pod status of the component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
- If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
- If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
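The two manual checks in the procedure (image tag and pod status) can be sketched as small shell helpers that read the command output shown above. This is an illustrative sketch only. The function names image_has_tag and clusterd_pod_status are hypothetical, and the sketch assumes the image repository is listed exactly as clusterd and that pod names start with clusterd-, as in the example output.

```shell
#!/usr/bin/env bash
# Sketch: script the image check and the pod status check from the procedure.
# ASSUMPTIONS: the repository column reads exactly "clusterd" and ClusterD pod
# names start with "clusterd-", as in the example output of this section.

# image_has_tag TAG  (hypothetical helper)
# Reads `docker images` output on stdin and succeeds if a row with
# repository "clusterd" and tag TAG is present.
image_has_tag() {
  awk -v tag="$1" '
    $1 == "clusterd" && $2 == tag { found = 1 }
    END { exit !found }
  '
}

# clusterd_pod_status  (hypothetical helper)
# Reads `kubectl get pod` output on stdin and prints "<pod-name> <status>"
# for every ClusterD pod, skipping the header row.
clusterd_pod_status() {
  awk 'NR > 1 && $1 ~ /^clusterd-/ { print $1, $3 }'
}

# Example use (requires docker and kubectl on the management node):
#   docker images | image_has_tag v7.3.0 || echo "image or version mismatch" >&2
#   kubectl get pod -n mindx-dl | clusterd_pod_status
```

A status other than Running in the second helper's output points to the troubleshooting references listed above.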
Parameter Description
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -version | Bool | false | ClusterD version query. |
| -logLevel | Integer | 0 | Log level. |
| -maxAge | Integer | 7 | Time for backing up logs. The value ranges from 7 to 700, in days. |
| -logFile | String | /var/log/mindx-dl/clusterd/clusterd.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. The dump file is named in the format of "clusterd-dump triggering time.log", for example, clusterd-2024-06-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -useProxy | Bool | false | Whether to use a proxy to forward gRPC requests. |
| -h or --help | None | None | Help information. |
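Before editing the startup parameters in the YAML file, it can help to check the planned values against the ranges in the table. The following sketch validates -maxAge (7 to 700 days) and -maxBackups (1 to 30). The helper names in_range and check_clusterd_args are hypothetical and are not part of ClusterD. ClusterD may perform its own validation at startup.

```shell
#!/usr/bin/env bash
# Sketch: validate planned -maxAge and -maxBackups values against the
# documented ranges. Helper names are hypothetical, not part of ClusterD.

# in_range VALUE MIN MAX
# Succeeds if VALUE is a non-negative decimal integer within [MIN, MAX].
in_range() {
  case "$1" in
    ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
  esac
  [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_clusterd_args MAX_AGE MAX_BACKUPS
# Prints a message to stderr and fails on the first out-of-range value.
check_clusterd_args() {
  in_range "$1" 7 700 || { echo "-maxAge must be 7 to 700 (days)" >&2; return 1; }
  in_range "$2" 1 30  || { echo "-maxBackups must be 1 to 30" >&2; return 1; }
}

# Example use:
#   check_clusterd_args 14 30 && echo "values are within the documented ranges"
```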