Slow Network Diagnosis

Description

This feature provides parameter plane connectivity checks, real-time monitoring, and proactive risk warnings. By streamlining fault diagnosis and demarcation, it gives early warning of network issues and sub-healthy faults and helps keep the cluster network stable over the long term.

Currently, the feature supports only online deployment integrated with ClusterD and NodeD. For details about how to deploy ClusterD and NodeD, see Installation and Deployment.

  • Slow network algorithm: Analyzes network probe (dialing test) data collected between nodes and outputs network diagnosis results.
  • Slow network scheduling: Manages the lifecycle of detection tasks, reports fault results, and schedules the overall slow network process.

Usage Example

  1. Configure shared storage.
    ClusterD and NodeD exchange data through shared storage, so both must be configured with the same root path. The root path of the shared directory must be owned by UID 9000, which is the user that ClusterD runs as.
    1. Configure the server.

    2. Modify NodeD configurations.

    3. Modify ClusterD configurations.

    4. Run the kubectl get pods -o wide -A command. If information similar to the following is displayed, the shared storage configuration is complete.

  2. Enable fault detection.
    1. Log in to the environment and navigate to the NodeD extraction directory.
    2. Create a ConfigMap named pingmesh-config. pingmesh-config.yaml is the pingmesh configuration file and can be obtained from the NodeD installation package.
      kubectl apply -f pingmesh-config.yaml

      Command output:

      configmap/pingmesh-config created
    3. Edit the pingmesh-config ConfigMap. Table 1 describes the parameters it contains.
      kubectl edit cm -n cluster-system pingmesh-config
      Table 1 Parameters in the pingmesh-config file

      Parameter      Value                       Description
      -------------  --------------------------  ------------------------------------------
      app            pingmesh                    Key of a label in the ConfigMap
      global         -                           Cluster configuration
      "1"            SuperPoD ID                 1 is used as an example. You can modify or add the configuration as required. After a SuperPoD is configured, NodeD will use its configuration, instead of the global configuration.
      activate       on: enabled; off: disabled  Whether to enable pingmesh
      task_interval  [1–60]                      Interval for executing a pingmesh task, in seconds.
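
The value ranges in Table 1 can be checked before the ConfigMap is applied. The following Python snippet is a minimal sketch of such a check and of the rule that a SuperPoD entry takes precedence over the global entry. It assumes the configuration has already been parsed into a dictionary keyed by global and by SuperPoD ID. That layout and the helper names validate_entry and effective_config are illustrative assumptions, not part of NodeD.

```python
"""Sketch: validate pingmesh-config values against the ranges in Table 1.

Assumption: the configuration has been parsed into a dict such as
{"global": {...}, "1": {...}}. The real layout of the ConfigMap data
may differ; only the parameter names and ranges come from Table 1.
"""

VALID_ACTIVATE = {"on", "off"}
TASK_INTERVAL_RANGE = range(1, 61)  # [1, 60] seconds, inclusive


def validate_entry(entry: dict) -> list:
    """Return a list of human-readable problems found in one entry."""
    problems = []
    activate = entry.get("activate")
    if activate not in VALID_ACTIVATE:
        problems.append(f"activate must be 'on' or 'off', got {activate!r}")
    interval = entry.get("task_interval")
    is_plain_int = isinstance(interval, int) and not isinstance(interval, bool)
    if not is_plain_int or interval not in TASK_INTERVAL_RANGE:
        problems.append(f"task_interval must be an integer in [1, 60], got {interval!r}")
    return problems


def effective_config(config: dict, super_pod_id: str) -> dict:
    """Use the SuperPoD entry if one is configured, otherwise the global entry."""
    return config.get(super_pod_id, config["global"])


if __name__ == "__main__":
    # Hypothetical values chosen only to exercise the checks.
    config = {
        "global": {"activate": "off", "task_interval": 5},
        "1": {"activate": "on", "task_interval": 10},
    }
    print(effective_config(config, "1"))   # SuperPoD "1" has its own entry
    print(effective_config(config, "2"))   # SuperPoD "2" falls back to global
    print(validate_entry({"activate": "yes", "task_interval": 0}))
```

Running the check before kubectl apply catches out-of-range values early. How NodeD itself reacts to invalid values is not covered here.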

Checking the Results

The pingmesh network detection results are logged to the <nodename>.log file. The following table describes the fields in the file.

Table 2 Parameters in <nodename>.log

Parameter       Value Range                                        Description
--------------  -------------------------------------------------  -----------------------------------------------
uid             A string of 64 characters                          ID of the pingmesh task
config          String                                             User configuration of the pingmesh task
physicID        [0–15]                                             Physical ID of the NPU
taskID          0: intra-node task; 1: inter-node task             Task ID
DestNum         [0–47]                                             Number of target addresses in the pingmesh task
source_addr     IPv4 network address                               Source address
target_addr     IPv4 network address                               Destination address
suc_pkt_num     -                                                  Number of packets that are successfully sent
fail_pkt_num    -                                                  Number of packets that fail to be sent
max_time        Non-negative in normal cases; -1 for ping failure  Maximum response time
min_time        Non-negative in normal cases; -1 for ping failure  Minimum response time
avg_time        Non-negative in normal cases; -1 for ping failure  Average response time
tp95_time       Non-negative in normal cases; -1 for ping failure  Response time at TP 95
reply_stat_num  -                                                  Number of responses found in this query
ping_total_num  -                                                  Total number of responses in this task
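
The fields in Table 2 can be combined into a quick per-link summary. The following Python snippet is a minimal sketch that assumes one log record has already been parsed into a dictionary keyed by the field names in Table 2. The on-disk format of <nodename>.log is not described here, so parsing is left out. The function name summarize_record and the sample values are hypothetical.

```python
"""Sketch: summarize one pingmesh record using the fields in Table 2.

Assumption: the record is already available as a dict keyed by the
field names from Table 2. The sample values below are invented.
"""

TASK_SCOPE = {0: "intra-node", 1: "inter-node"}
TIMING_FIELDS = ("min_time", "avg_time", "max_time", "tp95_time")


def summarize_record(record: dict) -> dict:
    """Derive a send-failure ratio and a ping-failure flag for one link."""
    sent_ok = record["suc_pkt_num"]
    sent_failed = record["fail_pkt_num"]
    total = sent_ok + sent_failed
    # Avoid dividing by zero when no packets were attempted.
    failure_ratio = sent_failed / total if total else None
    # Table 2 defines -1 in any timing field as a ping failure.
    ping_failed = any(record[name] == -1 for name in TIMING_FIELDS)
    return {
        "link": f'{record["source_addr"]} -> {record["target_addr"]}',
        "scope": TASK_SCOPE.get(record["taskID"], "unknown"),
        "failure_ratio": failure_ratio,
        "ping_failed": ping_failed,
    }


if __name__ == "__main__":
    sample = {
        "taskID": 1,
        "source_addr": "192.0.2.10",
        "target_addr": "192.0.2.20",
        "suc_pkt_num": 3,
        "fail_pkt_num": 1,
        "min_time": 12,
        "avg_time": 20,
        "max_time": 41,
        "tp95_time": 40,
    }
    print(summarize_record(sample))
```

The thresholds that the slow network algorithm applies to these values are not stated in this section, so the sketch only reports the derived figures and does not classify a link as slow.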

Checking gRPC Reporting Results

If a slow network fault is detected, the fault is reported to the public fault management center of ClusterD through gRPC.

The ConfigMap displays the related fault information, which is automatically cleared after 5 seconds.

Known Slow Network Faults

Fault Code  Fault Description                                         Fault Level
----------  --------------------------------------------------------  --------------
200001010   Slow network detected/recovered in a node                 NotHandleFault
200001011   Inter-node slow network detected/recovered in a SuperPoD  NotHandleFault
200001012   Slow network not caused by a card fault                   NotHandleFault
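
When post-processing reported faults, the fault codes listed above can be kept in a small lookup table. The following Python snippet is a minimal sketch. The codes, descriptions, and levels come from the table above, while the function name describe_fault and the output format are illustrative assumptions. How the codes are received from ClusterD is not covered.

```python
"""Sketch: look up a slow network fault code from the table above."""

SLOW_NETWORK_FAULTS = {
    "200001010": ("Slow network detected/recovered in a node", "NotHandleFault"),
    "200001011": ("Inter-node slow network detected/recovered in a SuperPoD", "NotHandleFault"),
    "200001012": ("Slow network not caused by a card fault", "NotHandleFault"),
}


def describe_fault(code) -> str:
    """Return a one-line description for a fault code (int or str)."""
    key = str(code)
    if key not in SLOW_NETWORK_FAULTS:
        return f"{key}: unknown fault code"
    description, level = SLOW_NETWORK_FAULTS[key]
    return f"{key}: {description} (level: {level})"


if __name__ == "__main__":
    print(describe_fault(200001011))
    print(describe_fault("999999999"))
```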