NodeD

Install NodeD if you need any of the following functions: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery from inference card faults, or rescheduling upon inference card faults.
If you need only the containerization and resource monitoring functions, you do not need to install NodeD and can skip this section.
Before using slow node/network fault detection, install NodeD by referring to Slow Node and Slow Network Faults.

Procedure

Log in to each compute node as the root user and check whether the image and version are correct.
```
docker images | grep noded
```
Command output:
noded v7.3.0 ef801847acd2 29 minutes ago 133MB
- If correct, proceed to Step 2.
- If not correct, create the image and distribute it by referring to Preparing an Image.
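
The image check above can be scripted when many compute nodes must be verified. The following is a minimal sketch, not part of the NodeD package: `has_noded_image` is a hypothetical helper name, and it assumes the default `docker images` column layout (repository in column 1, tag in column 2) and a repository named exactly `noded`, as in the sample output.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with NodeD): reads `docker images` output
# on stdin and succeeds only if a "noded" image with the expected tag is listed.
has_noded_image() {
  local expected_tag="$1"
  # Assumes the default layout: column 1 is REPOSITORY, column 2 is TAG.
  awk -v tag="$expected_tag" '
    $1 == "noded" && $2 == tag { found = 1 }
    END { exit (found ? 0 : 1) }
  '
}

# Typical use on a compute node (v7.3.0 is the tag from the sample output):
#   docker images | has_noded_image v7.3.0 && echo "image OK" || echo "image missing"
```

A non-zero exit status means that the image is missing or has a different tag, in which case the image must be created and distributed as described in Preparing an Image.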
Copy the YAML file in the directory where the NodeD package is decompressed to any directory on the Kubernetes management node.
Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the NodeD startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./noded -h command to view the parameter descriptions.

(Optional) If resumable training or elastic training is used, configure the interval for reporting the node status. Add the -reportInterval parameter to the args line in the NodeD startup YAML file as follows:

...
          env:
            - name: NODE_NAME
              valueFrom:
                fieldRef:
                  fieldPath: spec.nodeName
          imagePullPolicy: Never
          command: [ "/bin/bash", "-c", "--"]
          args: [ "/usr/local/bin/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ]
          securityContext:
            readOnlyRootFilesystem: true
            allowPrivilegeEscalation: true
          volumeMounts:
            - name: log-noded
...

By default, if Kubernetes receives no response from a node within 40 seconds, it sets the node status to NotReady.
When the load on the Kubernetes API server increases, increase the reporting interval as needed to reduce that load.
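
Before applying the YAML file, it can help to confirm which reporting interval it actually sets. The following is a minimal sketch: `get_report_interval` is a hypothetical helper name, and it assumes that the flag is written as `-reportInterval=<integer>` on the args line, as in the example above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the -reportInterval value found in a NodeD YAML
# file. Prints nothing if the flag is absent, in which case NodeD uses its default.
get_report_interval() {
  local yaml_file="$1"
  # Capture the digits that follow "-reportInterval=" and keep the first match.
  sed -n 's/.*-reportInterval=\([0-9][0-9]*\).*/\1/p' "$yaml_file" | head -n 1
}

# Typical use, with the YAML file name from the startup step:
#   get_report_interval noded-v{version}.yaml
```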

On the management node, go to the directory where the YAML file is stored and run one of the following commands to start NodeD.
- If the DPC fault detection function is not used, run the following command:
```
kubectl apply -f noded-v{version}.yaml
```
- If Scale-Out Storage DPC 24.2.0 or later has been deployed in the environment and the DPC fault detection function is used, run the following command to start NodeD:
```
kubectl apply -f noded-dpc-v{version}.yaml
```
  Startup example:
```
serviceaccount/noded created
clusterrole.rbac.authorization.k8s.io/pods-noded-role created
clusterrolebinding.rbac.authorization.k8s.io/pods-noded-rolebinding created
daemonset.apps/noded created
```

Run the following command to check whether the component is started successfully:

kubectl get pod -n mindx-dl

The following is a startup example. If Running is displayed, the component is started successfully.

NAME                              READY   STATUS    RESTARTS   AGE
...
noded-fd6t8                  1/1    Running  0        74s
...
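
The status check above can also be automated. The following is a minimal sketch: `noded_pods_not_running` is a hypothetical helper name, and it assumes the default `kubectl get pod` column layout (NAME in column 1, STATUS in column 3) and pod names that start with `noded-`, as in the example output.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads `kubectl get pod -n mindx-dl` output on stdin and
# prints the name of every noded pod whose STATUS column is not "Running".
noded_pods_not_running() {
  # The header line is skipped implicitly because its first field is "NAME".
  awk '$1 ~ /^noded-/ && $3 != "Running" { print $1 }'
}

# Typical use on the management node:
#   kubectl get pod -n mindx-dl | noded_pods_not_running
```

Empty output means that every listed NodeD pod is in the Running state. Any pod name that is printed needs troubleshooting.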

After the component is installed, if the pod status of the component is not Running, refer to Component pods Are Not in the Running State.
After the component is installed, if the pod status of the component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.

Parameters

**Table 1** NodeD startup parameters
| Parameter | Type | Default Value | Description |
| --- | --- | --- | --- |
| -reportInterval | Integer | 5 | Minimum interval for reporting the node fault information. If the node status changes, it will be reported within 5 seconds. If the node status has not changed for a long time, the reporting interval is 30 minutes. The value ranges from 1 to 300, in seconds. When the request pressure of Kubernetes API server increases, increase the interval based on the actual situation to reduce the API server stress. |
| -monitorPeriod | Integer | 60 | Interval for checking node hardware faults. The value ranges from 60 to 600, in seconds. |
| -version | Bool | false | Whether to query the NodeD version number. true: queries the version. false: does not query the version. |
| -logLevel | Integer | 0 | Log level. -1: debug; 0: info; 1: warning; 2: error; 3: critical. |
| -maxAge | Integer | 7 | Time for backing up logs. The value ranges from 7 to 700, in days. |
| -resultMaxAge | Integer | 7 | Number of days for storing pingmesh result backup files. The value ranges from 7 to 700, in days. NOTE: This parameter is supported only on Atlas 900 A3 SuperPoD. The driver version must be 24.1.RC1 or later. |
| -logFile | String | /var/log/mindx-dl/noded/noded.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format of "noded-dump triggering time.log", for example, noded-2023-10-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -deviceResetTimeout | Integer | 60 | Maximum wait time for the driver to report complete processor information if the detected processor count is insufficient at component startup. The value ranges from 10 to 600, in seconds. For the Atlas A2 training product, Atlas 800I A2 inference server, and A200I A2 Box heterogeneous component, the recommended value is 150 seconds. For the Atlas A3 training product, A200T A3 Box8 SuperPoD Server, and Atlas 800I A3 SuperPoD Server, the recommended value is 360 seconds. |
| -h or --help | None | None | Help information. |
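
The value ranges in Table 1 can be checked before the YAML file is edited. The following is a minimal sketch: `check_noded_flag` is a hypothetical helper name, it is not part of NodeD, and it covers only the integer parameters whose ranges are stated in the table. The `-logLevel` parameter is left out because its valid values include -1.

```shell
#!/usr/bin/env bash
# Hypothetical helper: check one integer NodeD parameter against the range
# listed in Table 1. Returns 0 if valid, 1 if out of range or not an integer,
# and 2 if the parameter is not covered by this sketch.
check_noded_flag() {
  local name="$1" value="$2" min max
  case "$name" in
    reportInterval)     min=1;  max=300 ;;
    monitorPeriod)      min=60; max=600 ;;
    maxAge)             min=7;  max=700 ;;
    resultMaxAge)       min=7;  max=700 ;;
    maxBackups)         min=1;  max=30  ;;
    deviceResetTimeout) min=10; max=600 ;;
    *) echo "not covered by this sketch: $name" >&2; return 2 ;;
  esac
  # Reject empty strings and anything that contains a non-digit character.
  case "$value" in
    ''|*[!0-9]*) echo "$name: not a non-negative integer: $value" >&2; return 1 ;;
  esac
  if [ "$value" -lt "$min" ] || [ "$value" -gt "$max" ]; then
    echo "$name=$value is outside the range $min to $max" >&2
    return 1
  fi
}

# Typical use:
#   check_noded_flag deviceResetTimeout 360 && echo "value accepted"
```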
