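Bus Device Network Faults
Bus device network fault detection covers NPU network faults on the HCCS network inside a super node, both within a node and between nodes.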
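Reporting Mechanism
NodeD calls the DCMI interface to start a pingmesh task, periodically queries the pingmesh results, and writes them to the file <nodename>.log. Inside the container, the file's directory is the fixed path /user/mind-cluster/pingmesh. On the physical machine, the default directory is also /user/mind-cluster/pingmesh, and this path can be changed as described below.
 - <nodename> is not a fixed value; it is the node name as queried from K8s.
 - To change the physical machine path of <nodename>.log, edit the NodeD startup YAML and modify the physical machine mount path of the volume named pingmesh-result.
 
After the pingmesh results are obtained, ClusterD performs a preliminary analysis and writes the fault information to a ConfigMap named pingmesh-fault-<nodename>. ClusterD watches this ConfigMap, aggregates the faults, and reports them to Volcano, which then handles scheduling.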
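Prerequisites
 - (Required) The namespace has been created.
 - The following components have been installed on the relevant nodes: NodeD (required), Ascend Device Plugin (optional), ClusterD (optional).
 - (Required) The NodeD startup parameter resultMaxAge has been configured.
 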
Constraints
This feature is supported only on the following product models: Atlas 900 A3 SuperPoD super node and Atlas 9000 A3 SuperPoD cluster computing system.
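Configuring Bus Device Network Detection
To enable or disable bus device network detection, follow the steps for the scenario that matches your environment.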
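 - (Recommended) Ascend Device Plugin and ClusterD are installed
   - Log in to the environment and go to the directory where NodeD was extracted.
   - Run the following command to create a ConfigMap named pingmesh-config.
     Example output:
     configmap/pingmesh-config created
   - Run the following command to edit pingmesh-config. Table 1 describes how to fill in each parameter.
     kubectl edit cm -n cluster-system pingmesh-config
 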
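 - Ascend Device Plugin and ClusterD are not installed
Manually create a ConfigMap in the cluster-system namespace, with the name super-pod-<superPodID> and the label app=pingmesh. The fields of this ConfigMap must be filled in according to Table 5. An example follows.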
apiVersion: v1
data:
  superPodDevice: '{"SuperPodID":"0","NodeDeviceMap":{"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"62914560","1":"62980097","10":"64225290","11":"64290827","12":"64487436","13":"64552973","14":"64749582","15":"64815119","2":"63176706","3":"63242243","4":"63438852","5":"63504389","6":"63700998","7":"63766535","8":"63963144","9":"64028681"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"67108864","1":"67174401","10":"68419594","11":"68485131","12":"68681740","13":"68747277","14":"68943886","15":"69009423","2":"67371010","3":"67436547","4":"67633156","5":"67698693","6":"67895302","7":"67960839","8":"68157448","9":"68222985"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"104857600","1":"104923137","10":"106168330","11":"106233867","12":"106430476","13":"106496013","14":"106692622","15":"106758159","2":"105119746","3":"105185283","4":"105381892","5":"105447429","6":"105644038","7":"105709575","8":"105906184","9":"105971721"}},"node-**-*":{"NodeName":"node-**-*","DeviceMap":{"0":"4194304","1":"4259841","10":"5505034","11":"5570571","12":"5767180","13":"5832717","14":"6029326","15":"6094863","2":"4456450","3":"4521987","4":"4718596","5":"4784133","6":"4980742","7":"5046279","8":"5242888","9":"5308425"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"142606336","1":"142671873","10":"143917066","11":"143982603","12":"144179212","13":"144244749","14":"144441358","15":"144506895","2":"142868482","3":"142934019","4":"143130628","5":"143196165","6":"143392774","7":"143458311","8":"143654920","9":"143720457"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"146800640","1":"146866177","10":"148111370","11":"148176907","12":"148373516","13":"148439053","14":"148635662","15":"148701199","2":"147062786","3":"147128323","4":"147324932","5":"147390469","6":"147587078","7":"147652615","8":"147849224","9":"147914761"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"83886080","1":"83951617","10":"85196810","11":"85262347","12":"85458956","13":"85524493","14":"85721102","15":"85786639","2":"84148226","3":"84213763","4":"84410372","5":"84475909","6":"84672518","7":"84738055","8":"84934664","9":"85000201"}}}}'
kind: ConfigMap
metadata:
  labels:
    app: pingmesh
  name: super-pod-0 # 0 is the super node ID
  namespace: cluster-system
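Because the superPodDevice value is a JSON document embedded as a single string, assembling it by hand is error-prone. The following is a minimal Python sketch, not part of the product, that builds a manifest with the same layout as the example above. The function name build_super_pod_configmap and the node name node-a are hypothetical; the field names (SuperPodID, NodeDeviceMap, NodeName, DeviceMap) are taken from the example, and the authoritative field definitions remain those in Table 5.

```python
import json


def build_super_pod_configmap(super_pod_id, node_devices):
    """Build a ConfigMap manifest (as a dict) laid out like the example above.

    node_devices maps a node name to {device index: device value}; both the
    index and the value are written as strings, as in the example.
    """
    payload = {
        "SuperPodID": str(super_pod_id),
        "NodeDeviceMap": {
            node: {
                "NodeName": node,
                "DeviceMap": {str(k): str(v) for k, v in devices.items()},
            }
            for node, devices in node_devices.items()
        },
    }
    return {
        "apiVersion": "v1",
        "kind": "ConfigMap",
        "metadata": {
            "name": f"super-pod-{super_pod_id}",
            "namespace": "cluster-system",
            "labels": {"app": "pingmesh"},
        },
        # The whole device description is stored as one compact JSON string.
        "data": {"superPodDevice": json.dumps(payload, separators=(",", ":"))},
    }


if __name__ == "__main__":
    # "node-a" is a hypothetical node name; the two values come from the example.
    manifest = build_super_pod_configmap(0, {"node-a": {0: 62914560, 1: 62980097}})
    print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl apply -f, since kubectl accepts JSON manifests as well as YAML.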
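Viewing Detection Results
The pingmesh results of bus device network detection are written to the file <nodename>.log. The following table describes each field in this file.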
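| Parameter | Description | Value |
|---|---|---|
| uid | ID of this pingmesh task. | String of length 64 |
| config | User configuration of this pingmesh task. | String |
| physicID | Physical ID of the NPU card. | 0 to 15 |
| taskID | Task ID. 0 indicates intra-node; 1 indicates inter-node. | 0 or 1 |
| DestNum | Number of target addresses in this pingmesh task. | 0 to 47 |
| source_addr | Source address. | IPv4 address |
| target_addr | Target address. | IPv4 address |
| suc_pkt_num | Number of packets sent successfully. | - |
| fail_pkt_num | Number of packets that failed to be sent. | - |
| max_time | Maximum response time. |  |
| min_time | Minimum response time. |  |
| avg_time | Average response time. |  |
| tp95_time | Response time at the 95th percentile. |  |
| reply_stat_num | Number of responses obtained in this query. | - |
| ping_total_num | Cumulative number of responses for this task. | - |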
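As an illustration of how these fields can be used, the sketch below derives a send-failure rate from suc_pkt_num and fail_pkt_num. This rate is not a field in the file; it is a derived figure shown only as an example. The on-disk format of <nodename>.log is not described here, so the sketch assumes a record has already been parsed into a Python dict keyed by the field names in the table. The function name and the sample values are hypothetical.

```python
def send_failure_rate(record):
    """Return fail_pkt_num / (suc_pkt_num + fail_pkt_num), or None if no packets."""
    succeeded = int(record["suc_pkt_num"])
    failed = int(record["fail_pkt_num"])
    total = succeeded + failed
    if total == 0:
        # No packets were counted, so a rate is undefined.
        return None
    return failed / total


if __name__ == "__main__":
    # Hypothetical record; real values come from <nodename>.log.
    sample = {"suc_pkt_num": 98, "fail_pkt_num": 2}
    print(send_failure_rate(sample))  # prints 0.02
```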
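Viewing Fault Information
On the management node, run the following command to view the fault information from bus device network detection.
kubectl describe cm -n cluster-system pingmesh-fault-<nodename>
The following table describes each field in the fault information.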
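| Parameter | Description | Value |
|---|---|---|
| mc-consumer-publicfault | Label key that ClusterD requires in order to watch the ConfigMap. | true |
| PublicFault | Key of the public fault information. | See Table 2 for details. |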
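For scripted checks, the same ConfigMap can be fetched as JSON with kubectl get cm -n cluster-system pingmesh-fault-<nodename> -o json and inspected programmatically. The sketch below is a minimal example that takes that JSON text and returns whether the mc-consumer-publicfault label is set to true, together with the raw PublicFault value. It assumes that mc-consumer-publicfault is stored under metadata.labels and PublicFault under data, which matches the table above but is not spelled out in it. The function names and the sample payload are hypothetical, and the inner structure of PublicFault (Table 2) is deliberately not parsed.

```python
import json

LABEL_KEY = "mc-consumer-publicfault"
DATA_KEY = "PublicFault"


def fault_configmap_name(nodename):
    """Name of the per-node fault ConfigMap, as given in this section."""
    return f"pingmesh-fault-{nodename}"


def extract_public_fault(configmap_json):
    """Return (label_is_true, raw PublicFault string or None) from ConfigMap JSON."""
    cm = json.loads(configmap_json)
    labels = cm.get("metadata", {}).get("labels") or {}
    data = cm.get("data") or {}
    return labels.get(LABEL_KEY) == "true", data.get(DATA_KEY)


if __name__ == "__main__":
    # Hypothetical ConfigMap; the PublicFault payload is only a placeholder.
    sample = json.dumps({
        "metadata": {
            "name": fault_configmap_name("node-a"),
            "labels": {LABEL_KEY: "true"},
        },
        "data": {DATA_KEY: "<fault payload>"},
    })
    print(extract_public_fault(sample))
```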
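Supported Bus Device Network Faults
| Fault code | Description | Fault level |
|---|---|---|
| 220001001 | HCCS network fault on the NPU card | SeparateNPU. Note: this fault level cannot be configured by the user. |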