Pingmesh UnifiedBus Network Faults

Pingmesh UnifiedBus network faults are NPU network faults detected on the HCCS network within or across SuperPoDs.

Reporting Mechanism

NodeD calls the DCMI to start a pingmesh task, periodically queries the pingmesh result, and writes the result to the <nodename>.log file. By default, the file is stored in /user/mind-cluster/pingmesh both within the container and on the physical machine. However, the path on the physical machine can be changed as follows.

  • <nodename> is not a fixed value and is the node name queried in Kubernetes.
  • The path of <nodename>.log on the physical machine is configurable: in the NodeD startup YAML file, change the physical-machine mount path of the volume named pingmesh-result.

After the pingmesh result is obtained, ClusterD preliminarily analyzes the result and writes the fault information to the ConfigMap file named pingmesh-fault-<nodename>. ClusterD listens to the HCCS information, and summarizes and reports the fault information to Volcano for scheduling.
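The naming conventions described above can be sketched as a small helper. This is only an illustration and not NodeD or ClusterD code: the function names are hypothetical, and the default directory is the path stated in this section.

```python
from pathlib import PurePosixPath

# Default result directory as stated in this section.
DEFAULT_RESULT_DIR = "/user/mind-cluster/pingmesh"


def pingmesh_log_path(node_name, result_dir=DEFAULT_RESULT_DIR):
    """Return the path of the <nodename>.log result file (hypothetical helper)."""
    if not node_name:
        raise ValueError("node_name must not be empty")
    return str(PurePosixPath(result_dir) / f"{node_name}.log")


def pingmesh_fault_configmap(node_name):
    """Return the name of the fault ConfigMap for a node (hypothetical helper)."""
    if not node_name:
        raise ValueError("node_name must not be empty")
    return f"pingmesh-fault-{node_name}"
```

Pass the node name exactly as Kubernetes reports it, because both the log file name and the ConfigMap name are derived from it.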

Prerequisites

Restrictions

This function is supported only by the Atlas 900 A3 SuperPoD.

Configuring UnifiedBus Network Detection

To enable or disable UnifiedBus network detection, perform the following steps:

  1. Configure shared storage.
    ClusterD and NodeD interact with each other through shared storage, requiring identical root paths. UID 9000 owns the root path of the shared directory, matching the running user of ClusterD.
    1. Configure the server.

    2. Modify NodeD configurations.

    3. If ClusterD exists, modify ClusterD configurations.

    4. Run the kubectl get pods -o wide -A command. If information similar to the following is displayed, the shared storage configuration is complete.

  2. Enable or disable UnifiedBus network detection.
    • (Recommended) Ascend Device Plugin and ClusterD installed
      1. Log in to the environment and go to the directory generated after NodeD decompression.
      2. Create the ConfigMap file named pingmesh-config.
        pingmesh-config.yaml is the configuration file of pingmesh, which can be obtained from the NodeD installation package.
        kubectl apply -f pingmesh-config.yaml  
        Command output:
        configmap/pingmesh-config created
        
      3. Edit the pingmesh-config file. Table 1 describes the parameters in the file.
        kubectl edit cm -n cluster-system pingmesh-config
        Table 1 pingmesh-config ConfigMap

        Parameter | Description | Value
        ---------- | ---------- | ----------
        app | Key of one of the labels of ConfigMap | pingmesh
        global | Cluster configuration | -
        "1" | Configuration example when the SuperPoD ID is 1. You can modify or add the configuration as required. After a SuperPoD is configured, NodeD uses the SuperPoD configuration information and ignores the information specified by global. | SuperPoD ID
        activate | Whether to enable pingmesh | on or off
        task_interval | Pingmesh task interval, in seconds. | [1–60]

    • Ascend Device Plugin and ClusterD not installed

      Create a namespace named cluster-system and a ConfigMap named super-pod-<superPodID> with the label app=pingmesh. Set the fields in the ConfigMap according to Table 5. Example:

      apiVersion: v1
      data:
        superPodDevice: '{"SuperPodID":"0","NodeDeviceMap":{"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"62914560","1":"62980097","10":"64225290","11":"64290827","12":"64487436","13":"64552973","14":"64749582","15":"64815119","2":"63176706","3":"63242243","4":"63438852","5":"63504389","6":"63700998","7":"63766535","8":"63963144","9":"64028681"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"67108864","1":"67174401","10":"68419594","11":"68485131","12":"68681740","13":"68747277","14":"68943886","15":"69009423","2":"67371010","3":"67436547","4":"67633156","5":"67698693","6":"67895302","7":"67960839","8":"68157448","9":"68222985"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"104857600","1":"104923137","10":"106168330","11":"106233867","12":"106430476","13":"106496013","14":"106692622","15":"106758159","2":"105119746","3":"105185283","4":"105381892","5":"105447429","6":"105644038","7":"105709575","8":"105906184","9":"105971721"}},"node-**-*":{"NodeName":"node-**-*","DeviceMap":{"0":"4194304","1":"4259841","10":"5505034","11":"5570571","12":"5767180","13":"5832717","14":"6029326","15":"6094863","2":"4456450","3":"4521987","4":"4718596","5":"4784133","6":"4980742","7":"5046279","8":"5242888","9":"5308425"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"142606336","1":"142671873","10":"143917066","11":"143982603","12":"144179212","13":"144244749","14":"144441358","15":"144506895","2":"142868482","3":"142934019","4":"143130628","5":"143196165","6":"143392774","7":"143458311","8":"143654920","9":"143720457"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"146800640","1":"146866177","10":"148111370","11":"148176907","12":"148373516","13":"148439053","14":"148635662","15":"148701199","2":"147062786","3":"147128323","4":"147324932","5":"147390469","6":"147587078","7":"147652615","8":"147849224","9":"147914761"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"83886080","1":"83951617","10":"85196810","11":"85262347","12":"85458956","13":"85524493","14":"85721102","15":"85786639","2":"84148226","3":"84213763","4":"84410372","5":"84475909","6":"84672518","7":"84738055","8":"84934664","9":"85000201"}}}}'
      kind: ConfigMap
      metadata:
        labels:
          app: pingmesh
        name: super-pod-0       # 0 indicates the SuperPoD ID.
        namespace: cluster-system
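To make the shape of the superPodDevice value easier to see, the following sketch parses a shortened copy of it. It is an illustration only: the function name and the node name node-a are placeholders, the two device entries are copied from the example above, and reading the DeviceMap keys as per-node device indexes is an assumption based on their 0 to 15 range.

```python
import json

# Shortened superPodDevice value; "node-a" is a placeholder node name and the
# two DeviceMap entries are copied from the example in this section.
SUPER_POD_DEVICE = (
    '{"SuperPodID":"0","NodeDeviceMap":{"node-a":{"NodeName":"node-a",'
    '"DeviceMap":{"0":"62914560","1":"62980097"}}}}'
)


def list_devices(super_pod_device):
    """Map each node name to {device index: device value string}."""
    data = json.loads(super_pod_device)
    result = {}
    for node_name, node in data["NodeDeviceMap"].items():
        # DeviceMap keys are strings ("0", "1", "10", ...); convert them to
        # integers so that they sort numerically instead of lexically.
        devices = {int(key): value for key, value in node["DeviceMap"].items()}
        result[node_name] = dict(sorted(devices.items()))
    return result


devices = list_devices(SUPER_POD_DEVICE)
print(devices)
```

Because the whole value is a JSON string embedded in YAML, it must stay on a single line inside the quotes; a line break inside a key or value changes the parsed content.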

Viewing Detection Results

The detection result query period is 10 times the value of task_interval.
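The configuration rules in this section (a SuperPoD entry takes precedence over global, activate is on or off, task_interval is between 1 and 60 seconds, and the query period is 10 times task_interval) can be sketched as follows. This is a minimal sketch, not NodeD code: the function name is hypothetical, and the dictionary layout is an assumed in-memory form of pingmesh-config that may differ from the layout of the actual file.

```python
def resolve_pingmesh_config(config, super_pod_id):
    """Pick the SuperPoD-specific entry if present, otherwise fall back to global."""
    entry = config.get(super_pod_id, config.get("global"))
    if entry is None:
        raise KeyError(f"no entry for SuperPoD {super_pod_id!r} and no global entry")

    activate = entry["activate"]
    interval = int(entry["task_interval"])
    if activate not in ("on", "off"):
        raise ValueError(f"activate must be 'on' or 'off', got {activate!r}")
    if not 1 <= interval <= 60:
        raise ValueError(f"task_interval must be in [1, 60], got {interval}")

    return {
        "activate": activate,
        "task_interval": interval,
        # The result query period is 10 times task_interval, in seconds.
        "query_period": 10 * interval,
    }


# Assumed in-memory form of pingmesh-config; the values are examples only.
example = {
    "global": {"activate": "off", "task_interval": 1},
    "1": {"activate": "on", "task_interval": 5},
}
print(resolve_pingmesh_config(example, "1"))  # SuperPoD 1 has its own entry
print(resolve_pingmesh_config(example, "2"))  # SuperPoD 2 falls back to global
```

With these example values, SuperPoD 1 is queried every 50 seconds, while any SuperPoD without its own entry uses the global interval and is queried every 10 seconds.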

The pingmesh detection results of the UnifiedBus network are written into the <nodename>.log file. The following table describes the fields in the file.

Table 2 <nodename>.log

Parameter | Description | Value
---------- | ---------- | ----------
uid | ID of the pingmesh task | A string of 64 characters
config | User configuration of the pingmesh task | String
physicID | Physical ID of the NPU | [0–15]
taskID | Task ID. The value 0 indicates intra-node, and the value 1 indicates inter-node. | 0 or 1
DestNum | Number of target addresses of the pingmesh task | [0–47]
source_addr | Source address | IPv4 network address
target_addr | Target address | IPv4 network address
suc_pkt_num | Number of successfully sent packets | -
fail_pkt_num | Number of packets that fail to be sent | -
max_time | Maximum response time | -1 when the ping operation fails; a non-negative number in normal cases
min_time | Minimum response time | -1 when the ping operation fails; a non-negative number in normal cases
avg_time | Average response time | -1 when the ping operation fails; a non-negative number in normal cases
tp95_time | Response time at the 95th percentile | -1 when the ping operation fails; a non-negative number in normal cases
reply_stat_num | Number of responses returned in a query | -
ping_total_num | Total number of responses received during the task | -
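The fields above can be interpreted along the following lines. This is a hedged sketch: it assumes that one result for a source/target pair has already been parsed into a dictionary keyed by the field names in Table 2 (the on-disk format of <nodename>.log is not described here), and the function name, the sample values, and the derived fail_ratio metric are illustrative only.

```python
TIME_FIELDS = ("max_time", "min_time", "avg_time", "tp95_time")


def summarize_result(record):
    """Summarize one pingmesh result for a source/target pair."""
    sent_ok = int(record["suc_pkt_num"])
    sent_fail = int(record["fail_pkt_num"])
    total = sent_ok + sent_fail
    # A time field of -1 marks a failed ping operation.
    ping_failed = any(float(record[name]) == -1 for name in TIME_FIELDS)
    return {
        "source_addr": record["source_addr"],
        "target_addr": record["target_addr"],
        "ping_failed": ping_failed,
        # Share of packets that failed to be sent (derived, not a log field).
        "fail_ratio": sent_fail / total if total else 0.0,
    }


# Made-up sample values for illustration.
sample = {
    "source_addr": "192.168.0.1",
    "target_addr": "192.168.0.2",
    "suc_pkt_num": 8,
    "fail_pkt_num": 2,
    "max_time": 12,
    "min_time": 3,
    "avg_time": 6,
    "tp95_time": 11,
}
print(summarize_result(sample))
```

Checking all four time fields rather than only avg_time keeps the check robust if a single field carries the -1 marker.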

Viewing Fault Information

Run the following command on the management node to view the fault information of UnifiedBus network detection:

kubectl describe cm -n cluster-system pingmesh-fault-<nodename>

The following table describes the fields in fault information.

Table 3 pingmesh-fault-<nodename>

Parameter | Description | Value
---------- | ---------- | ----------
mc-consumer-publicfault | Label key required for ClusterD listening | true
PublicFault | Key of the public fault information | For details, see Table 2.

Known UnifiedBus Network Fault

Fault Code | Fault Description | Fault Level
---------- | ---------- | ----------
220001001 | NPU HCCS network fault | SeparateNPU

NOTE: This fault level cannot be configured.
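For tooling that post-processes fault information, the known fault listed above can be kept in a small lookup table. The sketch below is illustrative only: the function name is hypothetical, and the table contains just the one fault code documented in this section.

```python
# Known UnifiedBus network faults from this section: code -> (description, level).
KNOWN_FAULTS = {
    "220001001": ("NPU HCCS network fault", "SeparateNPU"),
}


def describe_fault(code):
    """Return a one-line description for a fault code given as str or int."""
    key = str(code)
    if key not in KNOWN_FAULTS:
        return f"{key}: unknown fault code"
    description, level = KNOWN_FAULTS[key]
    return f"{key}: {description} (level {level})"


print(describe_fault(220001001))
```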