Slow Network Diagnosis

Description

This feature provides connectivity checks, real-time monitoring, and proactive risk warnings for the parameter plane. By simplifying fault diagnosis and demarcation, it gives early warning of network issues and sub-healthy faults and helps keep the cluster network stable over the long term.

Currently, only integration with ClusterD and NodeD for online deployment is supported. For details about how to deploy ClusterD and NodeD, see Installation and Deployment.

Slow network algorithm: Analyzes the network dialing test data collected between nodes, detects slow network conditions, and outputs network diagnosis results.
Slow network scheduling: Manages the lifecycle of detection tasks, reports fault results, and schedules the overall slow network process.

Usage Example

Configure shared storage.
ClusterD and NodeD exchange data through shared storage, so both must be configured with the same root path. The root path of the shared directory must be owned by UID 9000, which is the user that ClusterD runs as.
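The ownership requirement can be checked before ClusterD and NodeD are reconfigured. The following is a minimal sketch, not part of the product: the helper name `owned_by` and the example path are hypothetical, and only the UID value 9000 comes from the text above.

```python
import os

CLUSTERD_UID = 9000  # UID of the ClusterD running user, as stated above


def owned_by(path: str, uid: int = CLUSTERD_UID) -> bool:
    """Return True if `path` exists and its owner UID equals `uid`."""
    try:
        return os.stat(path).st_uid == uid
    except FileNotFoundError:
        return False


if __name__ == "__main__":
    # Hypothetical placeholder; replace it with the actual shared root path.
    root = "/path/to/shared/root"
    print(f"{root} owned by UID {CLUSTERD_UID}: {owned_by(root)}")
```

Run the check on every host that mounts the shared directory, because the same root path must be visible to both ClusterD and NodeD.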
1. Configure the server.
2. Modify NodeD configurations.
3. Modify ClusterD configurations.
4. Run the kubectl get pods -o wide -A command. If information similar to the following is displayed, the shared storage configuration is complete.

Enable fault detection.

Log in to the environment and go to the directory where the NodeD installation package was extracted.
Create a ConfigMap named pingmesh-config from pingmesh-config.yaml. This file is the pingmesh configuration file and is included in the NodeD installation package.
```
kubectl apply -f pingmesh-config.yaml
```
Command output:
```
configmap/pingmesh-config created
```

Edit the pingmesh-config ConfigMap. Table 1 describes its parameters.

kubectl edit cm -n cluster-system pingmesh-config

**Table 1** Parameters in the pingmesh-config file
Parameter	Value	Description
app	pingmesh	Key of a label in the ConfigMap
global	-	Cluster configuration
"1"	SuperPoD ID	1 is used as an example. You can modify or add the configuration as required. After a SuperPoD is configured, NodeD will use its configuration, instead of the global configuration.
activate	on: enabled; off: disabled	Whether to enable pingmesh
task_interval	[1–60]	Interval for executing a pingmesh task, in seconds.
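The override rule and value ranges in Table 1 can be illustrated with a short sketch. It assumes the ConfigMap data has already been parsed into a dictionary keyed by "global" and by SuperPoD ID, each holding `activate` and `task_interval`. That layout, and the function names `validate` and `effective_config`, are assumptions made for illustration and are not NodeD's actual implementation.

```python
from typing import Any, Mapping

VALID_ACTIVATE = {"on", "off"}
MIN_INTERVAL, MAX_INTERVAL = 1, 60  # seconds, per Table 1


def validate(cfg: Mapping[str, Any]) -> None:
    """Raise ValueError if a configuration entry violates Table 1."""
    if cfg.get("activate") not in VALID_ACTIVATE:
        raise ValueError(f"activate must be 'on' or 'off': {cfg.get('activate')!r}")
    interval = cfg.get("task_interval")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(interval, bool) or not isinstance(interval, int):
        raise ValueError(f"task_interval must be an integer: {interval!r}")
    if not MIN_INTERVAL <= interval <= MAX_INTERVAL:
        raise ValueError(f"task_interval out of range [1, 60]: {interval}")


def effective_config(
    configs: Mapping[str, Mapping[str, Any]], superpod_id: str
) -> Mapping[str, Any]:
    """Pick the SuperPoD entry if one exists, otherwise the global entry."""
    cfg = configs.get(superpod_id, configs["global"])
    validate(cfg)
    return cfg


if __name__ == "__main__":
    configs = {
        "global": {"activate": "on", "task_interval": 5},
        "1": {"activate": "off", "task_interval": 10},
    }
    print(effective_config(configs, "1"))  # SuperPoD "1" has its own entry
    print(effective_config(configs, "2"))  # no entry, falls back to global
```

The sketch replaces the global entry as a whole rather than merging individual keys, which follows the wording in Table 1 that a configured SuperPoD uses its own configuration instead of the global one.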

Checking the Results

Pingmesh network detection results are logged to the <nodename>.log file. The following table describes the fields in the file.

**Table 2** Fields in <nodename>.log
Field	Value Range	Description
uid	A string of 64 characters	ID of the pingmesh task
config	String	User configuration of the pingmesh task
physicID	[0–15]	Physical ID of the NPU
taskID	0: intra-node task; 1: inter-node task	Task ID
DestNum	[0–47]	Number of target addresses in the pingmesh task
source_addr	IPv4 network address	Source address
target_addr	IPv4 network address	Destination address
suc_pkt_num	-	Number of packets that are successfully sent
fail_pkt_num	-	Number of packets that fail to be sent
max_time	Non-negative value in normal cases; -1 for ping failure	Maximum response time
min_time	Non-negative value in normal cases; -1 for ping failure	Minimum response time
avg_time	Non-negative value in normal cases; -1 for ping failure	Average response time
tp95_time	Non-negative value in normal cases; -1 for ping failure	95th-percentile (TP95) response time
reply_stat_num	-	Number of responses found in this query
ping_total_num	-	Total number of responses in this task
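The fields in Table 2 can be used to flag problem links. The sketch below assumes each log record has already been parsed into a dictionary whose keys are the field names in Table 2. The on-disk format of <nodename>.log is not described here, so parsing is left out, and the function name `summarize` and the derived `fail_ratio` value are hypothetical.

```python
from typing import Any, Dict, Mapping, Optional

TIME_FIELDS = ("max_time", "min_time", "avg_time", "tp95_time")


def summarize(record: Mapping[str, Any]) -> Dict[str, Any]:
    """Summarize one pingmesh record (already parsed into a mapping)."""
    sent_ok = record["suc_pkt_num"]
    sent_failed = record["fail_pkt_num"]
    total = sent_ok + sent_failed
    # Share of packets that failed to be sent; None when nothing was sent.
    fail_ratio: Optional[float] = sent_failed / total if total else None
    # Table 2: a time field equal to -1 marks a ping failure.
    ping_failed = any(record[name] == -1 for name in TIME_FIELDS)
    return {
        "source_addr": record["source_addr"],
        "target_addr": record["target_addr"],
        "fail_ratio": fail_ratio,
        "ping_failed": ping_failed,
    }


if __name__ == "__main__":
    # Illustrative values only; not taken from a real log.
    sample = {
        "source_addr": "192.0.2.1", "target_addr": "192.0.2.2",
        "suc_pkt_num": 7, "fail_pkt_num": 1,
        "max_time": 12, "min_time": 3, "avg_time": 6, "tp95_time": 11,
    }
    print(summarize(sample))
```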

Checking gRPC Reporting Results

If a slow network fault is detected, the fault is reported to the public fault management center of ClusterD through gRPC.

The ConfigMap file displays related information, which is automatically cleared 5 seconds later.

Known Slow Network Faults

Fault Code	Fault Description	Fault Level
200001010	Slow network detected/recovered in a node	NotHandleFault
200001011	Inter-node slow network detected/recovered in a SuperPoD	NotHandleFault
200001012	Slow network not caused by a card fault	NotHandleFault
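When reported faults are processed in scripts, the table above can be kept as a lookup. This is a minimal sketch: the codes, descriptions, and levels are copied from the table, while the names `SLOW_NETWORK_FAULTS` and `describe_fault` are hypothetical and not part of the ClusterD API.

```python
from typing import Dict, Optional, Tuple

# fault code -> (description, fault level), copied from the table above
SLOW_NETWORK_FAULTS: Dict[str, Tuple[str, str]] = {
    "200001010": ("Slow network detected/recovered in a node", "NotHandleFault"),
    "200001011": (
        "Inter-node slow network detected/recovered in a SuperPoD",
        "NotHandleFault",
    ),
    "200001012": ("Slow network not caused by a card fault", "NotHandleFault"),
}


def describe_fault(code: str) -> Optional[str]:
    """Return 'description (level)' for a known code, else None."""
    entry = SLOW_NETWORK_FAULTS.get(str(code))
    if entry is None:
        return None
    description, level = entry
    return f"{description} ({level})"


if __name__ == "__main__":
    print(describe_fault("200001011"))
```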
