NodeD
- NodeD must be installed if you use any of the following features: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery of inference card faults, or rescheduling upon inference card faults.
- If you need only containerization and resource monitoring functions, you do not need to install NodeD. In this case, skip this section.
- Before using slow node/network fault detection, install NodeD by referring to Slow Node and Slow Network Faults.
Procedure
- Log in to each compute node as the root user and check whether the image and version are correct.
docker images | grep noded
Command output:
noded v7.3.0 ef801847acd2 29 minutes ago 133MB
- If the image and version are correct, proceed to the next step.
- If they are not correct, create the image and distribute it by referring to Preparing an Image.
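The image check in this step can also be scripted. The following is a minimal sketch, not part of the NodeD package: has_noded_image is a hypothetical helper name, and the sketch assumes the repository column reads exactly noded, as in the sample output above.

```shell
# Hypothetical helper: reads a `docker images` listing on stdin and succeeds
# only if it contains a row whose repository is "noded" and whose tag matches
# the expected tag passed as the first argument.
has_noded_image() {
    expected_tag="$1"
    awk -v tag="$expected_tag" '
        $1 == "noded" && $2 == tag { found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Intended use on a compute node (v7.3.0 is the tag from the sample output):
#   docker images | has_noded_image v7.3.0 && echo "noded image found"
```

If the function returns a non-zero status, create and distribute the image as described in Preparing an Image.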
- Copy the YAML file in the directory where the NodeD package is decompressed to any directory on the Kubernetes management node.
- Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the NodeD startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./noded -h command to view the parameter descriptions.
- (Optional) If resumable training or elastic training is used, configure the interval for reporting the node status. Add the -reportInterval parameter to the args line in the NodeD startup YAML file as follows:
...
env:
  - name: NODE_NAME
    valueFrom:
      fieldRef:
        fieldPath: spec.nodeName
imagePullPolicy: Never
command: [ "/bin/bash", "-c", "--"]
args: [ "/usr/local/bin/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ]
securityContext:
  readOnlyRootFilesystem: true
  allowPrivilegeEscalation: true
volumeMounts:
  - name: log-noded
...
- Run the following command in the directory where the YAML file of the management node is stored to start NodeD.
- If the DPC fault detection function is not used, run the following command:
kubectl apply -f noded-v{version}.yaml
- If Scale-Out Storage DPC 24.2.0 or later has been deployed in the environment and the DPC fault detection function is used, run the following command to start NodeD:
kubectl apply -f noded-dpc-v{version}.yaml
Startup example:
serviceaccount/noded created
clusterrole.rbac.authorization.k8s.io/pods-noded-role created
clusterrolebinding.rbac.authorization.k8s.io/pods-noded-rolebinding created
daemonset.apps/noded created
- Run the following command to check whether the component is started successfully:
kubectl get pod -n mindx-dl
The following is a startup example. If Running is displayed, the component is started successfully.
NAME          READY   STATUS    RESTARTS   AGE
...
noded-fd6t8   1/1     Running   0          74s
...
- After the component is installed, if the pod status of the component is not Running, refer to Component pods Are Not in the Running State.
- After the component is installed, if the pod status of the component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
- If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
- If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
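The status check above can be automated when many nodes are involved. The following is a minimal sketch under stated assumptions: noded_pods_not_running is a hypothetical helper name, and the sketch assumes that NodeD pod names start with noded- and that STATUS is the third column, as in the startup example.

```shell
# Hypothetical helper: reads `kubectl get pod -n mindx-dl` output on stdin and
# prints the name and status of every noded pod that is not Running.
# Returns 0 only if at least one noded pod was listed and all are Running.
noded_pods_not_running() {
    awk '
        $1 ~ /^noded-/ {
            seen = 1
            if ($3 != "Running") { bad = 1; print $1, $3 }
        }
        END { exit (seen && !bad) ? 0 : 1 }
    '
}

# Intended use on the management node:
#   kubectl get pod -n mindx-dl | noded_pods_not_running \
#       && echo "all noded pods are Running"
```

A non-zero status with no printed lines means that no NodeD pod was listed at all, which corresponds to the last troubleshooting case above.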
Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -reportInterval | Integer | 5 | |
| -monitorPeriod | Integer | 60 | Interval for checking node hardware faults. The value ranges from 60 to 600, in seconds. |
| -version | Bool | false | Whether to query the NodeD version number. |
| -logLevel | Integer | 0 | Log level: |
| -maxAge | Integer | 7 | Time for backing up logs. The value ranges from 7 to 700, in days. |
| -resultMaxAge | Integer | 7 | Number of days for storing pingmesh result backup files. The value ranges from 7 to 700, in days. NOTE: This parameter is supported only on Atlas 900 A3 SuperPoD. The driver version must be 24.1.RC1 or later. |
| -logFile | String | /var/log/mindx-dl/noded/noded.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format of "noded-dump triggering time.log", for example, noded-2023-10-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -deviceResetTimeout | Integer | 60 | Maximum wait time for the driver to report complete processor information if the detected processor count is insufficient at component startup. The value ranges from 10 to 600, in seconds. |
| -h or --help | None | None | Help information. |
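Before editing the args line in the YAML file, it may help to check a planned value against the ranges listed in the table. The following is a minimal sketch, not a NodeD tool: check_noded_flag is a hypothetical helper name, and it covers only the parameters for which the table states a range. All other parameters are accepted without checking.

```shell
# Hypothetical helper: checks a NodeD flag value against the documented range.
# Usage: check_noded_flag <flag> <value>
# Returns 0 if the value is in range or the flag has no documented range.
check_noded_flag() {
    flag="$1"
    value="$2"
    # Ranges copied from the parameter table.
    case "$flag" in
        -monitorPeriod)      min=60; max=600 ;;
        -maxAge)             min=7;  max=700 ;;
        -resultMaxAge)       min=7;  max=700 ;;
        -maxBackups)         min=1;  max=30 ;;
        -deviceResetTimeout) min=10; max=600 ;;
        *) return 0 ;;   # no documented range: nothing to check
    esac
    # Reject empty or non-numeric values before comparing.
    case "$value" in
        ''|*[!0-9]*)
            printf '%s: "%s" is not a non-negative integer\n' "$flag" "$value"
            return 1 ;;
    esac
    if [ "$value" -lt "$min" ] || [ "$value" -gt "$max" ]; then
        printf '%s: %s is outside %s-%s\n' "$flag" "$value" "$min" "$max"
        return 1
    fi
}

# Example: check_noded_flag -monitorPeriod 120 && echo "value accepted"
```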