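When public faults are detected, the fault sender has already sent a fault recovery message, but the faulty chip remains isolated, so tasks cannot be scheduled onto that chip.
For cause 1: check the error messages in the ClusterD log file. The default location is /var/log/mindx-dl/clusterd/clusterd.log.
For cause 2: the informer relies on the Watch mechanism of the K8s API Server. If the network is unstable or the API Server is under heavy load, events may be lost. Improving API Server performance reduces the probability of missed messages.
For cause 3: query the statistic-fault-info ConfigMap and use the faultId of the fault to be recovered to obtain the fault details. Then manually call the public fault interface with a constructed fault recovery message.
The following describes the solution for cause 3 in detail, using the recovery of the fault with faultId 14715779 as an example.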
kubectl describe cm -n cluster-system statistic-fault-info
The content of PublicFaults is:
{"node173":[{"resource":"CCAE","devIds":[0,1,2],"faultId":"14715582","type":"Storage","faultCode":"010001002","level":"SubHealthFault","faultTime":1736928806},{"resource":"CCAE","devIds":[2,3,4],"faultId":"14715779","type":"Network","faultCode":"010001001","level":"SubHealthFault","faultTime":1736926605}]}
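Locate the entry whose faultId is 14715779. Together with the name of the node it belongs to, its details are:
{ "nodeName": "node173", "resource": "CCAE", "devIds": [2,3,4], "faultId": "14715779", "type": "Network", "faultCode": "010001001", "level": "SubHealthFault", "faultTime": 1736926605 }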
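The lookup by faultId can be sketched in Python. This is a minimal sketch, assuming the PublicFaults value has already been copied out of the ConfigMap as a JSON string; the function name find_fault is hypothetical and is not part of ClusterD.

```python
import json


def find_fault(public_faults_json, fault_id):
    """Return the fault entry with the given faultId, with its node name added.

    Hypothetical helper: the input is the PublicFaults JSON string, keyed by node name.
    Returns None when no entry matches.
    """
    faults_by_node = json.loads(public_faults_json)
    for node_name, faults in faults_by_node.items():
        for fault in faults:
            if fault.get("faultId") == fault_id:
                # Attach the node name, since the recovery message needs it later.
                return {"nodeName": node_name, **fault}
    return None


# PublicFaults content taken from the example in this section.
public_faults = (
    '{"node173":[{"resource":"CCAE","devIds":[0,1,2],"faultId":"14715582",'
    '"type":"Storage","faultCode":"010001002","level":"SubHealthFault",'
    '"faultTime":1736928806},{"resource":"CCAE","devIds":[2,3,4],'
    '"faultId":"14715779","type":"Network","faultCode":"010001001",'
    '"level":"SubHealthFault","faultTime":1736926605}]}'
)

detail = find_fault(public_faults, "14715779")
print(json.dumps(detail))
```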
1. Run the following command to create a YAML file named recover.yaml.
vi recover.yaml
2. Edit the file and write the following content into it.
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: mindx-dl
  name: mindx-dl-publicinfo
  labels:
    mc-consumer-publicfault: "true"
data:
  PublicFault: |
    {
      "id":"11937763019444715778",
      "timestamp": 1741159983000,
      "version": "1.0",
      "resource": "CCAE",
      "faults": [
        {
          "faultId": "14715779",
          "faultType": "Network",
          "faultCode": "010001001",
          "faultTime": 1736926605000,
          "assertion": "recover",
          "influence": [
            {
              "nodeName": "node173",
              "deviceIds": [2,3,4]
            }
          ]
        }
      ]
    }
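The field mapping between the fault details and the recovery message can be sketched in Python. This is a minimal sketch, not the official tooling. It assumes, based only on the example values in this section, that type maps to faultType, devIds maps to deviceIds, and faultTime is converted from seconds to milliseconds. The function name build_recover_message is hypothetical, and how the message id should be generated is not stated here, so it is passed in by the caller.

```python
import json
import time


def build_recover_message(detail, message_id, timestamp_ms=None):
    """Build the PublicFault recovery payload from one fault detail record.

    Hypothetical helper. `detail` is the fault entry plus its nodeName.
    """
    if timestamp_ms is None:
        # Assumption: the message timestamp is the current time in milliseconds.
        timestamp_ms = int(time.time() * 1000)
    return {
        "id": message_id,
        "timestamp": timestamp_ms,
        "version": "1.0",
        "resource": detail["resource"],
        "faults": [
            {
                "faultId": detail["faultId"],
                "faultType": detail["type"],
                "faultCode": detail["faultCode"],
                # Assumption: seconds in the ConfigMap, milliseconds in the message.
                "faultTime": detail["faultTime"] * 1000,
                "assertion": "recover",
                "influence": [
                    {
                        "nodeName": detail["nodeName"],
                        "deviceIds": detail["devIds"],
                    }
                ],
            }
        ],
    }


# Fault details taken from the example in this section.
detail = {
    "nodeName": "node173",
    "resource": "CCAE",
    "devIds": [2, 3, 4],
    "faultId": "14715779",
    "type": "Network",
    "faultCode": "010001001",
    "level": "SubHealthFault",
    "faultTime": 1736926605,
}

message = build_recover_message(detail, "11937763019444715778", 1741159983000)
print(json.dumps(message, indent=2))
```

The printed JSON is the value to place under data.PublicFault in the ConfigMap.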
3. Run the following command to create the ConfigMap.
kubectl apply -f recover.yaml
{"node173":[{"resource":"CCAE","devIds":[0,1,2],"faultId":"14715582","type":"Storage","faultCode":"010001002","level":"SeparateNPU","faultTime":1736928806}]}
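In the content above, fault 14715779 is no longer listed for node173. That check can be sketched in Python, assuming the PublicFaults value is available as a JSON string; the function name remaining_fault_ids is hypothetical.

```python
import json


def remaining_fault_ids(public_faults_json, node_name):
    """Return the faultIds still recorded for a node (hypothetical helper)."""
    faults_by_node = json.loads(public_faults_json)
    return [fault["faultId"] for fault in faults_by_node.get(node_name, [])]


# PublicFaults content taken from the example in this section.
after_recovery = (
    '{"node173":[{"resource":"CCAE","devIds":[0,1,2],"faultId":"14715582",'
    '"type":"Storage","faultCode":"010001002","level":"SeparateNPU",'
    '"faultTime":1736928806}]}'
)

ids = remaining_fault_ids(after_recovery, "node173")
recovered = "14715779" not in ids
print(ids, recovered)
```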