NodeD

  • Install NodeD if you need any of the following functions: full NPU scheduling, static vNPU scheduling, dynamic vNPU scheduling, resumable training, elastic training, recovery of inference card faults, or rescheduling upon inference card faults.
  • If you need only containerization and resource monitoring functions, you do not need to install NodeD. In this case, skip this section.
  • Before using slow node/network fault detection, install NodeD by referring to Slow Node and Slow Network Faults.

Procedure

  1. Log in to each compute node as the root user and check that the NodeD image exists and that its version is correct.
    docker images | grep noded

    Command output:

    noded                               v7.3.0              ef801847acd2        29 minutes ago      133MB
    
  2. Copy the YAML file from the directory where the NodeD package was decompressed to any directory on the Kubernetes management node.
  3. (Optional) To change the component startup parameters, modify the NodeD startup parameters in the YAML file as required; otherwise, skip this step. For details about the startup parameters, see Table 1. You can also run the ./noded -h command to view the parameter descriptions.
  4. (Optional) If resumable training or elastic training is used, configure the interval for reporting the node status. Add the -reportInterval parameter to the args line in the NodeD startup YAML file as follows:
    ...
              env:
                - name: NODE_NAME
                  valueFrom:
                    fieldRef:
                      fieldPath: spec.nodeName
              imagePullPolicy: Never
              command: [ "/bin/bash", "-c", "--"]
              args: [ "/usr/local/bin/noded -logFile=/var/log/mindx-dl/noded/noded.log -logLevel=0 -reportInterval=5" ]
              securityContext:
                readOnlyRootFilesystem: true
                allowPrivilegeEscalation: true
              volumeMounts:
                - name: log-noded
    ...
  5. On the management node, go to the directory where the YAML file is stored and start NodeD.
    • If the DPC fault detection function is not used, run the following command:
      kubectl apply -f noded-v{version}.yaml
    • If Scale-Out Storage DPC 24.2.0 or later has been deployed in the environment and the DPC fault detection function is used, run the following command to start NodeD:
      kubectl apply -f noded-dpc-v{version}.yaml

      Startup example:

      serviceaccount/noded created
      clusterrole.rbac.authorization.k8s.io/pods-noded-role created
      clusterrolebinding.rbac.authorization.k8s.io/pods-noded-rolebinding created
      daemonset.apps/noded created
  6. Run the following command to check whether the component has started successfully:
    kubectl get pod -n mindx-dl

    The following is example output. If the STATUS column shows Running for the noded pod, the component has started successfully.

    NAME                              READY   STATUS    RESTARTS   AGE
    ...
    noded-fd6t8                       1/1     Running   0          74s
    ...
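
The check in step 6 can also be scripted. The following is a minimal sketch, not part of the NodeD package: the function name noded_all_running is hypothetical, and it assumes the default column layout of kubectl get pod (NAME, READY, STATUS, ...) and pod names that begin with noded-, as in the example output above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with NodeD).
# Reads the table printed by `kubectl get pod` on stdin and succeeds only if
# at least one noded-* pod is listed and every such pod has STATUS "Running".
noded_all_running() {
  awk '
    $1 ~ /^noded-/ {
      seen = 1                      # found a NodeD pod
      if ($3 != "Running") bad = 1  # third column is STATUS
    }
    END {
      if (seen && !bad) exit 0
      exit 1
    }
  '
}

# Intended use on the management node:
#   kubectl get pod -n mindx-dl | noded_all_running && echo "NodeD is running"
```

Because NodeD is deployed as a DaemonSet, the listing normally contains one noded pod per node, so the function checks every matching line rather than only the first.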
    

Parameters

Table 1 NodeD startup parameters

-reportInterval
  Type: Integer
  Default value: 5
  Description: Minimum interval for reporting node fault information.
    • If the node status changes, the change is reported within 5 seconds. If the node status remains unchanged for a long time, it is reported every 30 minutes.
    • The value ranges from 1 to 300, in seconds.
    • If the load on the Kubernetes API server is high, increase the interval as needed to reduce the load.

-monitorPeriod
  Type: Integer
  Default value: 60
  Description: Interval for checking node hardware faults. The value ranges from 60 to 600, in seconds.

-version
  Type: Bool
  Default value: false
  Description: Whether to query the NodeD version number.
    • true: queries the version.
    • false: does not query the version.

-logLevel
  Type: Integer
  Default value: 0
  Description: Log level.
    • -1: debug
    • 0: info
    • 1: warning
    • 2: error
    • 3: critical

-maxAge
  Type: Integer
  Default value: 7
  Description: Retention period of backup logs. The value ranges from 7 to 700, in days.

-resultMaxAge
  Type: Integer
  Default value: 7
  Description: Retention period of pingmesh result backup files. The value ranges from 7 to 700, in days.
  NOTE: This parameter is supported only on Atlas 900 A3 SuperPoD. The driver version must be 24.1.RC1 or later.

-logFile
  Type: String
  Default value: /var/log/mindx-dl/noded/noded.log
  Description: Path of the log file.
  NOTE: When a log file exceeds 20 MB, it is dumped automatically. The maximum size of a log file cannot be changed. Dumped files are named in the format noded-{dump trigger time}.log, for example, noded-2023-10-07T03-38-24.402.log.

-maxBackups
  Type: Integer
  Default value: 30
  Description: Maximum number of dumped log files that can be retained. The value ranges from 1 to 30.

-deviceResetTimeout
  Type: Integer
  Default value: 60
  Description: Maximum time to wait for the driver to report complete processor information when fewer processors than expected are detected at component startup. The value ranges from 10 to 600, in seconds.
    • For the Atlas A2 training product, Atlas 800I A2 inference server, and A200I A2 Box heterogeneous component, the recommended value is 150 seconds.
    • For the Atlas A3 training product, A200T A3 Box8 SuperPoD Server, and Atlas 800I A3 SuperPoD Server, the recommended value is 360 seconds.

-h or --help
  Type: None
  Default value: None
  Description: Displays help information.
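
Before adding an integer parameter to the args line of the YAML file, its value can be checked against the ranges listed in Table 1. The following shell function is an illustrative sketch only: noded_param_in_range is a hypothetical helper, not a NodeD command, and the ranges are copied from the table. NodeD itself may apply additional checks.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not shipped with NodeD).
# Usage: noded_param_in_range <parameter name without the leading "-"> <value>
# Returns 0 if the value is an integer inside the range given in Table 1,
# 1 if it is not, and 2 if the parameter is not a ranged integer parameter.
noded_param_in_range() {
  local name=$1 value=$2 min max
  case $name in
    reportInterval)      min=1;  max=300 ;;  # seconds
    monitorPeriod)       min=60; max=600 ;;  # seconds
    logLevel)            min=-1; max=3   ;;  # -1 debug ... 3 critical
    maxAge|resultMaxAge) min=7;  max=700 ;;  # days
    maxBackups)          min=1;  max=30  ;;  # files
    deviceResetTimeout)  min=10; max=600 ;;  # seconds
    *) echo "unknown or non-integer parameter: $name" >&2; return 2 ;;
  esac
  # Accept only plain decimal integers (optional minus sign, no leading zeros).
  [[ $value =~ ^-?(0|[1-9][0-9]*)$ ]] || return 1
  (( value >= min && value <= max ))
}

# Example:
#   noded_param_in_range reportInterval 5 && echo "value accepted"
```

Keeping the ranges in a single case statement makes the sketch easy to update if a later NodeD version changes them.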