Slow Network Diagnosis
Description
This feature provides parameter plane connectivity checks, real-time monitoring, and proactive risk warnings. By streamlining fault diagnosis and demarcation, it gives early warning of network issues and sub-healthy faults and helps keep the cluster network stable over the long term.
Currently, only integration with ClusterD and NodeD for online deployment is supported. For details about how to deploy ClusterD and NodeD, see Installation and Deployment.
- Slow network algorithm: Analyzes network probe (dialing test) data collected between nodes and outputs network diagnosis results.
- Slow network scheduling: Manages the lifecycle of detection tasks, reports fault results, and schedules the overall slow network process.
Usage Example
- Configure shared storage. ClusterD and NodeD interact with each other through shared storage, so both must use the same root path. The owner of the shared directory's root path is UID 9000, the same user that runs ClusterD.
- Enable fault detection.
- Log in to the environment and navigate to the NodeD extraction directory.
- Create a ConfigMap file named pingmesh-config. pingmesh-config.yaml is the configuration file of pingmesh, which can be obtained from the NodeD installation package.
kubectl apply -f pingmesh-config.yaml
Command output:
configmap/pingmesh-config created
- Edit the pingmesh-config file. The following table describes the parameters in the file.
kubectl edit cm -n cluster-system pingmesh-config
Table 1 Parameters in the pingmesh-config file
| Parameter | Value | Description |
|---|---|---|
| app | pingmesh | Key of a label in the ConfigMap |
| global | - | Cluster configuration |
| "1" | SuperPoD ID | 1 is used as an example. You can modify or add the configuration as required. After a SuperPoD is configured, NodeD will use its configuration, instead of the global configuration. |
| activate | on: enabled; off: disabled | Whether to enable pingmesh |
| task_interval | [1–60] | Interval for executing a pingmesh task, in seconds. |
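The override rule and value ranges in Table 1 can be illustrated with a minimal Python sketch. It assumes the ConfigMap content has already been parsed into a dictionary keyed by "global" and by SuperPoD ID. That layout, the function names effective_settings and validate_settings, and the sample values are hypothetical and are not part of NodeD.

```python
from typing import Dict, List

Settings = Dict[str, object]


def effective_settings(config: Dict[str, Settings], superpod_id: str) -> Settings:
    """Return the SuperPoD-specific block if one exists, else the global block."""
    return config.get(superpod_id, config["global"])


def validate_settings(settings: Settings) -> List[str]:
    """Return a list of problems; an empty list means the block looks valid."""
    problems: List[str] = []

    activate = settings.get("activate")
    if activate not in ("on", "off"):
        problems.append(f"activate must be 'on' or 'off', got {activate!r}")

    interval = settings.get("task_interval")
    # bool is a subclass of int in Python, so reject it explicitly.
    is_int = isinstance(interval, int) and not isinstance(interval, bool)
    if not is_int or not 1 <= interval <= 60:
        problems.append(f"task_interval must be an integer in [1, 60], got {interval!r}")

    return problems


# Hypothetical parsed configuration (assumed layout, made-up values).
example = {
    "global": {"activate": "on", "task_interval": 5},
    "1": {"activate": "off", "task_interval": 10},
}

print(effective_settings(example, "1"))  # SuperPoD "1" has its own block
print(effective_settings(example, "2"))  # no block for "2", falls back to global
print(validate_settings({"activate": "yes", "task_interval": 0}))
```

Running a check like this before applying the file with kubectl can catch out-of-range values early. The actual validation performed by NodeD may differ.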
Checking the Results
The pingmesh network detection results are logged to the <nodename>.log file. The following table describes the fields in the file.
| Parameter | Value Range | Description |
|---|---|---|
| uid | A string of 64 characters | ID of the pingmesh task |
| config | String | User configuration of the pingmesh task |
| physicID | [0–15] | Physical ID of the NPU |
| taskID | | Task ID |
| DestNum | [0–47] | Number of target addresses in the pingmesh task |
| source_addr | IPv4 network address | Source address |
| target_addr | IPv4 network address | Destination address |
| suc_pkt_num | - | Number of packets that are successfully sent |
| fail_pkt_num | - | Number of packets that fail to be sent |
| max_time | | Maximum response time |
| min_time | | Minimum response time |
| avg_time | | Average response time |
| tp95_time | | Response time at TP 95 |
| reply_stat_num | - | Number of responses found in this query |
| ping_total_num | - | Total number of responses in this task |
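As one way to use these fields, the following Python sketch derives the share of failed packets for a single result record. It assumes a log entry has already been parsed into a dictionary keyed by the field names above. The on-disk log format is not described here, so parsing is out of scope, and the function name failed_packet_ratio and the sample values are hypothetical. The ratio is a derived metric for illustration, not a value reported by pingmesh.

```python
from typing import Any, Mapping, Optional


def failed_packet_ratio(record: Mapping[str, Any]) -> Optional[float]:
    """Return fail_pkt_num / (suc_pkt_num + fail_pkt_num), or None if no packets."""
    sent_ok = int(record["suc_pkt_num"])
    failed = int(record["fail_pkt_num"])
    total = sent_ok + failed
    if total == 0:
        # No packets were counted, so the ratio is undefined.
        return None
    return failed / total


# Hypothetical record with made-up values (addresses from the documentation range).
sample = {
    "source_addr": "192.0.2.1",
    "target_addr": "192.0.2.2",
    "suc_pkt_num": 8,
    "fail_pkt_num": 2,
}

ratio = failed_packet_ratio(sample)
print(f"{sample['source_addr']} -> {sample['target_addr']}: failed ratio {ratio}")
```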
Checking gRPC Reporting Results
If a slow network fault is detected, the fault is reported to the public fault management center of ClusterD through gRPC.
The ConfigMap file displays related information, which is automatically cleared 5 seconds later.

Known Slow Network Faults
| Fault Code | Fault Description | Fault Level |
|---|---|---|
| 200001010 | Slow network detected/recovered in a node | NotHandleFault |
| 200001011 | Inter-node slow network detected/recovered in a SuperPoD | NotHandleFault |
| 200001012 | Slow network not caused by a card fault | NotHandleFault |
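When post-processing reported faults, the codes above can be kept in a small lookup table. The following Python sketch only restates the table contents. The names SLOW_NETWORK_FAULTS and describe_fault are hypothetical, and the sketch does not reflect how ClusterD represents faults internally.

```python
# Fault code -> (description, level), copied from the table above.
SLOW_NETWORK_FAULTS = {
    "200001010": ("Slow network detected/recovered in a node", "NotHandleFault"),
    "200001011": ("Inter-node slow network detected/recovered in a SuperPoD", "NotHandleFault"),
    "200001012": ("Slow network not caused by a card fault", "NotHandleFault"),
}


def describe_fault(code) -> str:
    """Return a one-line summary for a fault code, or mark it as unknown."""
    entry = SLOW_NETWORK_FAULTS.get(str(code))
    if entry is None:
        return f"{code}: unknown fault code"
    description, level = entry
    return f"{code}: {description} (level: {level})"


print(describe_fault(200001011))
print(describe_fault("999"))
```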



