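Bus Device Network Faults

Bus device network fault detection is an NPU network fault detection capability for the HCCS network inside a supernode, covering both intra-node and inter-node links.

Reporting Mechanism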

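NodeD calls the DCMI interface to start a pingmesh task, periodically queries the pingmesh result, and writes the result to the file <nodename>.log. Inside the container, this file is located in the fixed directory /user/mind-cluster/pingmesh. On the physical machine, the default directory is also /user/mind-cluster/pingmesh, and this host path can be changed as described below.

  • <nodename> is not a fixed value. It is the node name queried from K8s.
  • The host path of <nodename>.log can be configured as needed: in the NodeD startup YAML, change the host mount path of the volume named pingmesh-result.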

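After the pingmesh result is obtained, ClusterD performs a preliminary analysis of it and writes the fault information to a ConfigMap named pingmesh-fault-<nodename>. ClusterD watches this ConfigMap, aggregates the faults, and reports them to Volcano, which then handles scheduling.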
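
As a quick illustration of the naming conventions above, the following minimal Python sketch derives the result-file paths and the fault ConfigMap name from a node name. The helper name `pingmesh_artifacts` and its `host_dir` parameter are hypothetical and exist only for this example. The default directory and the name patterns are taken from the description above.

```python
from pathlib import PurePosixPath

# Default directory from the description above. On the host it can be
# changed through the pingmesh-result volume in the NodeD startup YAML.
DEFAULT_DIR = "/user/mind-cluster/pingmesh"


def pingmesh_artifacts(nodename: str, host_dir: str = DEFAULT_DIR) -> dict:
    """Return the names derived from a K8s node name (hypothetical helper)."""
    if not nodename:
        raise ValueError("nodename must be a non-empty K8s node name")
    log_name = f"{nodename}.log"
    return {
        # The path inside the container is fixed.
        "container_log": str(PurePosixPath(DEFAULT_DIR) / log_name),
        # The host path follows the configured mount path.
        "host_log": str(PurePosixPath(host_dir) / log_name),
        "fault_configmap": f"pingmesh-fault-{nodename}",
    }
```

For example, `pingmesh_artifacts("worker-1", host_dir="/data/pingmesh")` yields a host log path under `/data/pingmesh`, while the container path stays under the fixed directory.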

Prerequisites

Usage Constraints

This feature is supported only on the following product models:

  • Atlas 900 A3 SuperPoD supernode
  • Atlas 9000 A3 SuperPoD cluster computing system

Configuring Bus Device Network Detection

To enable or disable bus device network detection, perform the following steps.

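  • (Recommended) Ascend Device Plugin and ClusterD are installed
    1. Log in to the environment and go to the directory where NodeD was extracted.
    2. Run the following command to create a ConfigMap named pingmesh-config.
      pingmesh-config.yaml is the pingmesh configuration file and can be obtained from the NodeD installation package.
      kubectl apply -f pingmesh-config.yaml
      Example output:
      configmap/pingmesh-config created

    3. Run the following command to edit pingmesh-config. Table 1 describes the parameters in this ConfigMap.
      kubectl edit cm -n cluster-system pingmesh-config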
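      Table 1 pingmesh-config ConfigMap

      | Parameter | Description | Value |
      |---|---|---|
      | app | Key of one of the ConfigMap labels. | pingmesh |
      | global | Cluster-wide configuration. | - |
      | "1" | Example configuration for the supernode whose ID is 1. Modify it or add entries as needed. Once a supernode has its own entry, NodeD uses that entry and ignores the global configuration. | Supernode ID |
      | activate | Whether to enable the pingmesh function. | on or off |
      | task_interval | Interval between pingmesh tasks, in seconds. | [1~60] |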

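  • Ascend Device Plugin and ClusterD are not installed

    Manually create a ConfigMap in the cluster-system namespace, with the name super-pod-<superPodID> and the label app=pingmesh. The fields in this ConfigMap must be filled in according to Table 5. An example is shown below.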

    apiVersion: v1
    data:
      superPodDevice: '{"SuperPodID":"0","NodeDeviceMap":{"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"62914560","1":"62980097","10":"64225290","11":"64290827","12":"64487436","13":"64552973","14":"64749582","15":"64815119","2":"63176706","3":"63242243","4":"63438852","5":"63504389","6":"63700998","7":"63766535","8":"63963144","9":"64028681"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"67108864","1":"67174401","10":"68419594","11":"68485131","12":"68681740","13":"68747277","14":"68943886","15":"69009423","2":"67371010","3":"67436547","4":"67633156","5":"67698693","6":"67895302","7":"67960839","8":"68157448","9":"68222985"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"104857600","1":"104923137","10":"106168330","11":"106233867","12":"106430476","13":"106496013","14":"106692622","15":"106758159","2":"105119746","3":"105185283","4":"105381892","5":"105447429","6":"105644038","7":"105709575","8":"105906184","9":"105971721"}},"node-**-*":{"NodeName":"node-**-*","DeviceMap":{"0":"4194304","1":"4259841","10":"5505034","11":"5570571","12":"5767180","13":"5832717","14":"6029326","15":"6094863","2":"4456450","3":"4521987","4":"4718596","5":"4784133","6":"4980742","7":"5046279","8":"5242888","9":"5308425"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"142606336","1":"142671873","10":"143917066","11":"143982603","12":"144179212","13":"144244749","14":"144441358","15":"144506895","2":"142868482","3":"142934019","4":"143130628","5":"143196165","6":"143392774","7":"143458311","8":"143654920","9":"143720457"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"146800640","1":"146866177","10":"148111370","11":"148176907","12":"148373516","13":"148439053","14":"148635662","15":"148701199","2":"147062786","3":"147128323","4":"147324932","5":"147390469","6":"147587078","7":"147652615","8":"147849224","9":"147914761"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"83886080","1":"83951617","10":"85196810","11":"85262347","12":"85458956","13":"85524493","14":"85721102","15":"85786639","2":"84148226","3":"84213763","4":"84410372","5":"84475909","6":"84672518","7":"84738055","8":"84934664","9":"85000201"}}}}'
    kind: ConfigMap
    metadata:
      labels:
        app: pingmesh
      name: super-pod-0       # 0 is the supernode ID
      namespace: cluster-system
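
For the pingmesh-config parameters described earlier in this section, the precedence rule (a supernode-specific entry is used and `global` is ignored) and the documented value ranges can be sketched in Python as follows. This is a minimal illustration, not NodeD's actual implementation: the function name `resolve_pingmesh_config` is hypothetical, and representing each entry as a dict with `activate` and `task_interval` keys is an assumption about how the ConfigMap data looks once parsed.

```python
def resolve_pingmesh_config(config: dict, super_pod_id: str) -> dict:
    """Pick the entry that applies to a supernode and validate it.

    `config` is assumed to look like {"global": {...}, "1": {...}}
    after the ConfigMap data has been parsed (assumption, see lead-in).
    """
    # A supernode-specific entry takes precedence; global is ignored then.
    entry = config.get(super_pod_id, config.get("global"))
    if entry is None:
        raise KeyError(f"no entry for supernode {super_pod_id!r} and no global entry")

    activate = entry.get("activate")
    if activate not in ("on", "off"):
        raise ValueError(f"activate must be 'on' or 'off', got {activate!r}")

    try:
        interval = int(entry.get("task_interval"))
    except (TypeError, ValueError):
        raise ValueError("task_interval must be an integer number of seconds")
    # Documented range is [1~60] seconds.
    if not 1 <= interval <= 60:
        raise ValueError(f"task_interval must be within 1..60, got {interval}")

    return {"activate": activate, "task_interval": interval}
```

With a configuration that contains both `global` and `"1"`, a lookup for supernode `"1"` returns the `"1"` entry, while a lookup for any supernode without its own entry falls back to `global`.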

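Viewing Detection Results

The pingmesh results of bus device network detection are written to the file <nodename>.log. The following table describes the fields in this file.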

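Table 2 <nodename>.log

| Parameter | Description | Value |
|---|---|---|
| uid | ID of this pingmesh task. | String of length 64 |
| config | User configuration of this pingmesh task. | String |
| physicID | Physical ID of the NPU card. | [0~15] |
| taskID | Task ID. 0 indicates intra-node and 1 indicates inter-node. | 0 or 1 |
| DestNum | Number of destination addresses in this pingmesh task. | [0~47] |
| source_addr | Source address. | IPv4 network address |
| target_addr | Destination address. | IPv4 network address |
| suc_pkt_num | Number of packets sent successfully. | - |
| fail_pkt_num | Number of packets that failed to be sent. | - |
| max_time | Maximum response time. | -1 if the ping fails; otherwise a non-negative value |
| min_time | Minimum response time. | -1 if the ping fails; otherwise a non-negative value |
| avg_time | Average response time. | -1 if the ping fails; otherwise a non-negative value |
| tp95_time | Response time at the 95th percentile. | -1 if the ping fails; otherwise a non-negative value |
| reply_stat_num | Number of responses obtained in this query. | - |
| ping_total_num | Cumulative number of responses for this task. | - |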
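
To show how the fields above relate to each other, here is a minimal Python sketch that summarizes one result record. The on-disk layout of `<nodename>.log` is not specified in this section, so the sketch assumes that a record has already been parsed into a dict keyed by the field names above and that each record carries one source/destination pair. `summarize_record` is a hypothetical helper. Treating a response time of -1 as a failed ping follows the field descriptions.

```python
TIME_FIELDS = ("max_time", "min_time", "avg_time", "tp95_time")
SCOPES = {0: "intra-node", 1: "inter-node"}  # meaning of taskID


def summarize_record(record: dict) -> dict:
    """Summarize one already-parsed pingmesh record (assumed dict layout)."""
    sent_ok = record["suc_pkt_num"]
    sent_fail = record["fail_pkt_num"]
    total = sent_ok + sent_fail
    # A value of -1 in any response-time field marks a failed ping.
    ping_failed = any(record[name] == -1 for name in TIME_FIELDS)
    return {
        "link": (record["source_addr"], record["target_addr"]),
        "scope": SCOPES[record["taskID"]],
        "ping_failed": ping_failed,
        # Share of packets that failed to be sent; None if nothing was sent.
        "fail_ratio": sent_fail / total if total else None,
    }
```

A summary like this makes it easier to spot links whose `ping_failed` flag is set or whose `fail_ratio` is unexpectedly high before looking at the fault ConfigMap.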

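Viewing Fault Information

On the management node, run the following command to view the fault information of bus device network detection.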

kubectl describe cm -n cluster-system  pingmesh-fault-<nodename>

The fields in the fault information are described below.

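Table 3 pingmesh-fault-<nodename>

| Parameter | Description | Value |
|---|---|---|
| mc-consumer-publicfault | Label key required for ClusterD to watch the ConfigMap. | true |
| PublicFault | Key of the public fault information. | For details, see Table 2 |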

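Supported Bus Device Network Faults

| Fault code | Fault description | Fault level |
|---|---|---|
| 220001001 | NPU card HCCS network fault | SeparateNPU |

Note:

This fault level cannot be customized.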