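Bus device network fault detection provides NPU network fault detection for the HCCS network inside a super node, covering both intra-node and inter-node links.
NodeD calls the DCMI interface to start a pingmesh task, periodically queries the pingmesh result, and writes that result to the file <nodename>.log. Inside the container, the directory that holds this file is the fixed path /user/mind-cluster/pingmesh; on the physical machine, the default directory is /user/mind-cluster/pingmesh. The physical machine path can be changed, as described in the instructions below.
After obtaining the pingmesh result, ClusterD performs a preliminary analysis and writes the fault information to a ConfigMap named pingmesh-fault-<nodename>. ClusterD listens for this ConfigMap, aggregates the faults, and reports them to Volcano, which then handles scheduling.
This feature is supported only on the following product models: Atlas 900 A3 SuperPoD super node and Atlas 9000 A3 SuperPoD cluster computing system.
To enable or disable bus device network detection, perform the following steps.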
configmap/pingmesh-config created
kubectl edit cm -n cluster-system pingmesh-config
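Create a ConfigMap yourself in the cluster-system namespace, with the name super-pod-<superPodID> and the label app=pingmesh. The fields of this ConfigMap must be filled in according to Table 5. An example follows.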
apiVersion: v1
data:
  superPodDevice: '{"SuperPodID":"0","NodeDeviceMap":{"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"62914560","1":"62980097","10":"64225290","11":"64290827","12":"64487436","13":"64552973","14":"64749582","15":"64815119","2":"63176706","3":"63242243","4":"63438852","5":"63504389","6":"63700998","7":"63766535","8":"63963144","9":"64028681"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"67108864","1":"67174401","10":"68419594","11":"68485131","12":"68681740","13":"68747277","14":"68943886","15":"69009423","2":"67371010","3":"67436547","4":"67633156","5":"67698693","6":"67895302","7":"67960839","8":"68157448","9":"68222985"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"104857600","1":"104923137","10":"106168330","11":"106233867","12":"106430476","13":"106496013","14":"106692622","15":"106758159","2":"105119746","3":"105185283","4":"105381892","5":"105447429","6":"105644038","7":"105709575","8":"105906184","9":"105971721"}},"node-**-*":{"NodeName":"node-**-*","DeviceMap":{"0":"4194304","1":"4259841","10":"5505034","11":"5570571","12":"5767180","13":"5832717","14":"6029326","15":"6094863","2":"4456450","3":"4521987","4":"4718596","5":"4784133","6":"4980742","7":"5046279","8":"5242888","9":"5308425"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"142606336","1":"142671873","10":"143917066","11":"143982603","12":"144179212","13":"144244749","14":"144441358","15":"144506895","2":"142868482","3":"142934019","4":"143130628","5":"143196165","6":"143392774","7":"143458311","8":"143654920","9":"143720457"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"146800640","1":"146866177","10":"148111370","11":"148176907","12":"148373516","13":"148439053","14":"148635662","15":"148701199","2":"147062786","3":"147128323","4":"147324932","5":"147390469","6":"147587078","7":"147652615","8":"147849224","9":"147914761"}},"node-**-**":{"NodeName":"node-**-**","DeviceMap":{"0":"83886080","1":"83951617","10":"85196810","11":"85262347","12":"85458956","13":"85524493","14":"85721102","15":"85786639","2":"84148226","3":"84213763","4":"84410372","5":"84475909","6":"84672518","7":"84738055","8":"84934664","9":"85000201"}}}}'
kind: ConfigMap
metadata:
  labels:
    app: pingmesh
  name: super-pod-0 # 0 is the super node ID
  namespace: cluster-system
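Because superPodDevice is a JSON document stored as a single string, a quoting or nesting mistake is easy to miss before the ConfigMap is applied. The following is a minimal sketch of a pre-check, assuming only the structure visible in the example above (SuperPodID, NodeDeviceMap, NodeName, DeviceMap). The function name check_super_pod_device and the node names node-a and node-b are hypothetical and chosen for illustration.

```python
import json

# Shortened, hypothetical stand-in for the superPodDevice string in the example.
SUPER_POD_DEVICE = (
    '{"SuperPodID":"0","NodeDeviceMap":{'
    '"node-a":{"NodeName":"node-a","DeviceMap":{"0":"62914560","1":"62980097"}},'
    '"node-b":{"NodeName":"node-b","DeviceMap":{"0":"67108864","1":"67174401"}}}}'
)


def check_super_pod_device(raw: str) -> dict:
    """Parse the JSON string and return {node key: sorted DeviceMap keys as ints}."""
    doc = json.loads(raw)  # raises json.JSONDecodeError on malformed JSON
    result = {}
    for key, node in doc["NodeDeviceMap"].items():
        # In the example the map key and NodeName are identical; flag a mismatch.
        if node["NodeName"] != key:
            raise ValueError(f"NodeName {node['NodeName']!r} does not match key {key!r}")
        # Keys and values are decimal strings in the example; int() rejects anything else.
        for value in node["DeviceMap"].values():
            int(value)
        result[key] = sorted(int(k) for k in node["DeviceMap"])
    return result


print(check_super_pod_device(SUPER_POD_DEVICE))
```

Note that the documentation example masks node names, so several keys look identical there. With real data each node name must be distinct, because json.loads keeps only the last entry when a key is repeated.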
The pingmesh results of bus device network detection are written to the file <nodename>.log. The fields in this file are described in the following table.
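Parameter | Description | Value |
---|---|---|
uid | ID of this pingmesh task. | String of length 64 |
config | User configuration of this pingmesh task. | String |
physicID | Physical ID of the NPU card. | [0~15] |
taskID | Task ID. 0 means intra-node, 1 means inter-node. | 0 or 1 |
DestNum | Number of destination addresses in this pingmesh task. | [0~47] |
source_addr | Source address. | IPv4 network address |
target_addr | Destination address. | IPv4 network address |
suc_pkt_num | Number of packets sent successfully. | - |
fail_pkt_num | Number of packets that failed to send. | - |
max_time | Maximum response time. | |
min_time | Minimum response time. | |
avg_time | Average response time. | |
tp95_time | Response time at the 95th percentile. | |
reply_stat_num | Number of responses returned by this query. | - |
ping_total_num | Cumulative number of responses for this task. | - |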
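To illustrate how the fields in the table above relate to one another, here is a small sketch that summarizes one result record. The on-disk layout of <nodename>.log is not described in this section, so the record is assumed to be already parsed into a dictionary keyed by the field names in the table. The helper names, the sample values, and the loss ratio itself are illustrative assumptions, not the criterion the product uses to raise a fault.

```python
# Mapping taken from the taskID row of the table: 0 = intra-node, 1 = inter-node.
TASK_SCOPE = {0: "intra-node", 1: "inter-node"}


def packet_loss_ratio(record: dict) -> float:
    """Return fail_pkt_num / (suc_pkt_num + fail_pkt_num), or 0.0 if nothing was sent."""
    succeeded = int(record["suc_pkt_num"])
    failed = int(record["fail_pkt_num"])
    total = succeeded + failed
    return failed / total if total else 0.0


def summarize(record: dict) -> str:
    """Build a one-line, human-readable summary of a single record."""
    scope = TASK_SCOPE[int(record["taskID"])]
    loss = packet_loss_ratio(record)
    return (f"NPU {record['physicID']} {scope} "
            f"{record['source_addr']} -> {record['target_addr']}: loss {loss:.1%}")


# Hypothetical record; the addresses come from the documentation-only range 192.0.2.0/24.
sample = {
    "physicID": 0, "taskID": 1,
    "source_addr": "192.0.2.1", "target_addr": "192.0.2.2",
    "suc_pkt_num": 97, "fail_pkt_num": 3,
}
print(summarize(sample))
```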
Run the following command on the management node to view the fault information of bus device network detection.
kubectl describe cm -n cluster-system pingmesh-fault-<nodename>
The fields in the fault information are described below.
Parameter | Description | Value |
---|---|---|
mc-consumer-publicfault | Label key that ClusterD listens for. | true |
PublicFault | Key of the public fault information. | For details, see Table 2. |
Fault code | Fault description | Fault level |
---|---|---|
220001001 | HCCS network fault on an NPU card | SeparateNPU. Note: this fault level cannot be configured by the user. |
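The fields above can also be checked programmatically. The sketch below assumes the ConfigMap has been exported with `kubectl get cm -n cluster-system pingmesh-fault-<nodename> -o json` and loaded into a dictionary, that mc-consumer-publicfault sits under metadata.labels, and that PublicFault is a key under data. These locations are assumptions based on the table, and the PublicFault payload is treated as an opaque string because its structure is defined in Table 2, not here. The function and variable names are hypothetical.

```python
# Fault code and level copied from the fault code table; the level is not configurable.
FAULT_LEVELS = {"220001001": "SeparateNPU"}


def extract_public_fault(configmap: dict):
    """Return the raw PublicFault string, or None if the ConfigMap lacks the expected label."""
    labels = configmap.get("metadata", {}).get("labels", {})
    # The label value listed in the table is the string "true".
    if labels.get("mc-consumer-publicfault") != "true":
        return None
    return configmap.get("data", {}).get("PublicFault")


# Hypothetical ConfigMap; the PublicFault value is a placeholder, not real output.
sample_cm = {
    "metadata": {
        "name": "pingmesh-fault-node-a",
        "namespace": "cluster-system",
        "labels": {"mc-consumer-publicfault": "true"},
    },
    "data": {"PublicFault": "<payload as described in Table 2>"},
}

print(extract_public_fault(sample_cm))
print(FAULT_LEVELS["220001001"])
```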