Pingmesh UnifiedBus Network Faults
Pingmesh UnifiedBus network faults are NPU network faults detected on the HCCS network, either within a SuperPoD or across SuperPoDs.
Reporting Mechanism
NodeD calls the DCMI to start a pingmesh task, periodically queries the pingmesh result, and writes the result to the <nodename>.log file. By default, the file is stored in /user/mind-cluster/pingmesh both within the container and on the physical machine. However, the path on the physical machine can be changed as follows.
- <nodename> is not a fixed value; it is the node name as reported by Kubernetes.
- To change the path of <nodename>.log on the physical machine, modify the host mount path of the volume named pingmesh-result in the NodeD startup YAML file.
After the pingmesh result is obtained, ClusterD preliminarily analyzes the result and writes the fault information to the ConfigMap file named pingmesh-fault-<nodename>. ClusterD listens to the HCCS information, and summarizes and reports the fault information to Volcano for scheduling.
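The naming conventions above can be captured in two small helpers. This is a minimal sketch, not part of NodeD or ClusterD: the function names are hypothetical, and the directory is the default stated above, which will differ on the physical machine if the pingmesh-result mount path was changed.

```python
from pathlib import PurePosixPath

# Default result directory taken from the text above; the host path may differ.
DEFAULT_RESULT_DIR = "/user/mind-cluster/pingmesh"


def pingmesh_result_path(nodename: str, result_dir: str = DEFAULT_RESULT_DIR) -> str:
    """Return the path of the result log that NodeD writes for a node."""
    return str(PurePosixPath(result_dir) / f"{nodename}.log")


def pingmesh_fault_configmap(nodename: str) -> str:
    """Return the name of the ConfigMap that holds the node's fault information."""
    return f"pingmesh-fault-{nodename}"
```

Here the node name passed in would be the one reported by Kubernetes for the node.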
Prerequisites
- (Required) You have created a namespace.
- You have installed NodeD (mandatory), Ascend Device Plugin (optional), and ClusterD (optional) on the corresponding node.
- (Required) You have set the resultMaxAge parameter of NodeD.
Restrictions
This function is supported only by the Atlas 900 A3 SuperPoD.
Configuring UnifiedBus Network Detection
To enable or disable UnifiedBus network detection, perform the following steps:
- Configure shared storage. ClusterD and NodeD interact through shared storage and must use identical root paths. The root path of the shared directory is owned by UID 9000, which matches the user that ClusterD runs as.
- Enable or disable UnifiedBus network detection.
- (Recommended) Ascend Device Plugin and ClusterD installed
- Log in to the environment and go to the directory generated after NodeD decompression.
- Create the ConfigMap named pingmesh-config. pingmesh-config.yaml is the pingmesh configuration file and can be obtained from the NodeD installation package.
kubectl apply -f pingmesh-config.yaml
Command output:
configmap/pingmesh-config created
- Edit the pingmesh-config file. Table 1 describes the parameters in the file.
kubectl edit cm -n cluster-system pingmesh-config
Table 1 pingmesh-config cm

| Parameter | Description | Value |
|---|---|---|
| app | Key of one of the labels of ConfigMap | pingmesh |
| global | Cluster configuration | - |
| "1" | Configuration example when the SuperPoD ID is 1. You can modify or add the configuration as required. After a SuperPoD is configured, NodeD uses the SuperPoD configuration information and ignores the information specified by global. | SuperPoD ID |
| activate | Whether to enable pingmesh | on or off |
| task_interval | Pingmesh task interval, in seconds. | [1–60] |
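The override rule and value ranges in Table 1 can be sketched as follows. This is an illustration only: it assumes the configuration has already been loaded into a dictionary keyed by "global" and by SuperPoD ID, with activate and task_interval inside each entry. That layout and both function names are assumptions, not the documented format of pingmesh-config.

```python
def resolve_pingmesh_config(config: dict, super_pod_id: str) -> dict:
    """Pick the SuperPoD-specific entry if present, otherwise fall back to global."""
    entry = config.get(super_pod_id, config.get("global"))
    if entry is None:
        raise KeyError(f"no entry for SuperPoD {super_pod_id!r} and no global entry")
    return entry


def validate_entry(entry: dict) -> None:
    """Check one entry against the value ranges listed in Table 1."""
    if entry.get("activate") not in ("on", "off"):
        raise ValueError("activate must be 'on' or 'off'")
    interval = entry.get("task_interval")
    # bool is a subclass of int in Python, so reject it explicitly.
    if not isinstance(interval, int) or isinstance(interval, bool):
        raise ValueError("task_interval must be an integer number of seconds")
    if not 1 <= interval <= 60:
        raise ValueError("task_interval must be in the range 1 to 60")
```

With an assumed configuration such as `{"global": {"activate": "off", "task_interval": 5}, "1": {"activate": "on", "task_interval": 10}}`, SuperPoD 1 resolves to its own entry and any other SuperPoD ID falls back to global.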
- Ascend Device Plugin and ClusterD not installed
Create the cluster-system namespace and a ConfigMap named super-pod-<superPodID> with the label app=pingmesh. Set the fields in the ConfigMap according to Table 5. Example:
apiVersion: v1
data:
  superPodDevice: '{"SuperPodID":"0","NodeDeviceMap":{"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"62914560","1":"62980097","10":"64225290","11":"64290827","12":"64487436","13":"64552973","14":"64749582","15":"64815119","2":"63176706","3":"63242243","4":"63438852","5":"63504389","6":"63700998","7":"63766535","8":"63963144","9":"64028681"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"67108864","1":"67174401","10":"68419594","11":"68485131","12":"68681740","13":"68747277","14":"68943886","15":"69009423","2":"67371010","3":"67436547","4":"67633156","5":"67698693","6":"67895302","7":"67960839","8":"68157448","9":"68222985"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"104857600","1":"104923137","10":"106168330","11":"106233867","12":"106430476","13":"106496013","14":"106692622","15":"106758159","2":"105119746","3":"105185283","4":"105381892","5":"105447429","6":"105644038","7":"105709575","8":"105906184","9":"105971721"}},"node-**-*":{"NodeName":"node-**-*","DeviceMap":{"0":"4194304","1":"4259841","10":"5505034","11":"5570571","12":"5767180","13":"5832717","14":"6029326","15":"6094863","2":"4456450","3":"4521987","4":"4718596","5":"4784133","6":"4980742","7":"5046279","8":"5242888","9":"5308425"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"142606336","1":"142671873","10":"143917066","11":"143982603","12":"144179212","13":"144244749","14":"144441358","15":"144506895","2":"142868482","3":"142934019","4":"143130628","5":"143196165","6":"143392774","7":"143458311","8":"143654920","9":"143720457"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"146800640","1":"146866177","10":"148111370","11":"148176907","12":"148373516","13":"148439053","14":"148635662","15":"148701199","2":"147062786","3":"147128323","4":"147324932","5":"147390469","6":"147587078","7":"147652615","8":"147849224","9":"147914761"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"83886080","1":"83951617","10":"85196810","11":"85262347","12":"85458956","13":"85524493","14":"85721102","15":"85786639","2":"84148226","3":"84213763","4":"84410372","5":"84475909","6":"84672518","7":"84738055","8":"84934664","9":"85000201"}}}}'
kind: ConfigMap
metadata:
  labels:
    app: pingmesh
  name: super-pod-0 # 0 indicates the SuperPoD ID.
  namespace: cluster-system
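When building or checking a superPodDevice value by hand, it can help to parse the JSON string and list each node's devices in numeric order, because the DeviceMap keys in the example sort as strings ("0", "1", "10", ...). The following is a minimal sketch; the function name is hypothetical, and it only relies on the SuperPodID, NodeDeviceMap, and DeviceMap keys visible in the example. It does not interpret the device values.

```python
import json


def devices_by_node(super_pod_device: str) -> dict:
    """Map each node name to its device values, ordered by numeric device index."""
    doc = json.loads(super_pod_device)
    result = {}
    for node_name, node in doc["NodeDeviceMap"].items():
        device_map = node["DeviceMap"]
        # Keys are strings, so sort them as integers to get 0, 1, 2, ..., 15.
        result[node_name] = [device_map[key] for key in sorted(device_map, key=int)]
    return result


# Hypothetical node name; device values are copied from the example above.
sample = (
    '{"SuperPodID":"0","NodeDeviceMap":{"node-a":{"NodeName":"node-a",'
    '"DeviceMap":{"0":"62914560","1":"62980097","10":"64225290","2":"63176706"}}}}'
)
print(devices_by_node(sample))
# {'node-a': ['62914560', '62980097', '63176706', '64225290']}
```

Node names must be unique in a real ConfigMap: the names in the example are masked, and a JSON parser keeps only the last entry when keys repeat.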
Viewing Detection Results
The detection result query period is 10 times the value of task_interval.
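As a quick check of the arithmetic, the query period can be derived from task_interval as shown below. This is a sketch with a hypothetical function name; the 1 to 60 second range is the one given for task_interval in Table 1.

```python
def result_query_period(task_interval: int) -> int:
    """Return the result query period in seconds for a given task_interval."""
    if not 1 <= task_interval <= 60:
        raise ValueError("task_interval must be in the range 1 to 60 seconds")
    return 10 * task_interval


print(result_query_period(1))   # 10
print(result_query_period(60))  # 600
```

Across the allowed range, the query period therefore falls between 10 and 600 seconds.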
The pingmesh detection results of UnifiedBus network are written into the <nodename>.log file. The following table describes the fields in the file.
| Parameter | Description | Value |
|---|---|---|
| uid | ID of the pingmesh task | A string of 64 characters |
| config | User configuration of the pingmesh task | String |
| physicID | Physical ID of the NPU | [0–15] |
| taskID | Task ID. The value 0 indicates intra-node, and the value 1 indicates inter-node. | 0 or 1 |
| DestNum | Number of target addresses of the pingmesh task | [0–47] |
| source_addr | Source address | IPv4 network address |
| target_addr | Target address | IPv4 network address |
| suc_pkt_num | Number of successfully sent packets | - |
| fail_pkt_num | Number of packets that fail to be sent | - |
| max_time | Maximum response time | |
| min_time | Minimum response time | |
| avg_time | Average response time | |
| tp95_time | Response time at the 95th percentile | |
| reply_stat_num | Number of responses returned in a query | - |
| ping_total_num | Total number of responses received during the task | - |
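The packet counters can be combined into a loss ratio per source and target pair. The sketch below assumes that each line of <nodename>.log is a standalone JSON object carrying the fields listed above. The on-disk layout is not specified here, so treat that as an assumption and adapt the parsing to the actual file. The function names and sample addresses are hypothetical.

```python
import json


def packet_loss_ratio(record: dict) -> float:
    """Return fail_pkt_num / (suc_pkt_num + fail_pkt_num), or 0.0 if nothing was sent."""
    succeeded = int(record["suc_pkt_num"])
    failed = int(record["fail_pkt_num"])
    total = succeeded + failed
    return failed / total if total else 0.0


def lossy_targets(lines, threshold: float = 0.0) -> list:
    """Return (source_addr, target_addr, ratio) for records whose loss exceeds threshold."""
    flagged = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)  # assumed format: one JSON object per line
        ratio = packet_loss_ratio(record)
        if ratio > threshold:
            flagged.append((record["source_addr"], record["target_addr"], ratio))
    return flagged


# Made-up sample records for illustration only.
sample_lines = [
    '{"source_addr": "10.0.0.1", "target_addr": "10.0.0.2", "suc_pkt_num": 3, "fail_pkt_num": 1}',
    '{"source_addr": "10.0.0.1", "target_addr": "10.0.0.3", "suc_pkt_num": 4, "fail_pkt_num": 0}',
]
print(lossy_targets(sample_lines))
# [('10.0.0.1', '10.0.0.2', 0.25)]
```

This is only a reading aid for the log file; fault determination itself is performed by ClusterD as described under Reporting Mechanism.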
Viewing Fault Information
Run the following command on the management node to view the fault information of UnifiedBus network detection:
kubectl describe cm -n cluster-system pingmesh-fault-<nodename>
The following table describes the fields in fault information.
| Parameter | Description | Value |
|---|---|---|
| mc-consumer-publicfault | label key required for ClusterD listening | true |
| PublicFault | Key of the public fault information | For details, see Table 2. |
Known UnifiedBus Network Fault
| Fault Code | Fault Description | Fault Level |
|---|---|---|
| 220001001 | NPU HCCS network fault | SeparateNPU. NOTE: This fault level cannot be configured. |
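For scripts that post-process fault information, the known fault code can be kept in a small lookup table. This is a sketch: the dictionary and function names are hypothetical, and the only entry is the one listed in the table above.

```python
# Known UnifiedBus network fault codes: code -> (description, fault level).
KNOWN_FAULTS = {
    "220001001": ("NPU HCCS network fault", "SeparateNPU"),
}


def describe_fault(code) -> str:
    """Return a one-line description for a fault code, or mark it as unknown."""
    key = str(code)
    if key not in KNOWN_FAULTS:
        return f"{key}: unknown fault code"
    description, level = KNOWN_FAULTS[key]
    return f"{key}: {description} (level {level})"


print(describe_fault(220001001))
# 220001001: NPU HCCS network fault (level SeparateNPU)
```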



