Recovery Message of a Public Fault Is Missing, Causing Faulty Chip Isolation
Symptom
During public fault detection, the fault sender sends a fault recovery message. However, the faulty chip remains isolated, so jobs cannot be scheduled to it.
Cause Analysis
- Cause 1: The fault sender fails to send the fault recovery message to ClusterD.
- Cause 2: The fault recovery message is successfully sent, but some messages are missing in the Kubernetes informer message queue.
- Cause 3: ClusterD is restarted after it receives the fault recovery message. As a result, the in-memory fault cache is cleared before the change is written to the statistic-fault-info ConfigMap.
Solution
Cause 1: Check the error information in the ClusterD log file (/var/log/mindx-dl/clusterd/clusterd.log by default).
Cause 2: The informer relies on the watch mechanism of the Kubernetes API server. If the network is unstable or the API server is overloaded, events may be lost. Optimize API server performance to reduce the probability of message loss.
Cause 3: Query statistic-fault-info ConfigMap to obtain the fault details based on the fault ID of the fault to be rectified. Then, manually call the Public Fault APIs to construct a fault recovery message.
The following uses faultId:14715779 as an example to describe the detailed procedure for addressing cause 3.
- Query statistic-fault-info ConfigMap.
kubectl describe cm -n cluster-system statistic-fault-info
The content corresponding to PublicFaults is as follows:
{"node173":[{"resource":"CCAE","devIds":[0,1,2],"faultId":"14715582","type":"Storage","faultCode":"010001002","level":"SubHealthFault","faultTime":1736928806},{"resource":"CCAE","devIds":[2,3,4],"faultId":"14715779","type":"Network","faultCode":"010001001","level":"SubHealthFault","faultTime":1736926605}]}
- Record the fault information for faultId 14715779 from the PublicFaults content queried in step 1.
{
  "nodeName": "node173",
  "resource": "CCAE",
  "devIds": [2,3,4],
  "faultId": "14715779",
  "type": "Network",
  "faultCode": "010001001",
  "level": "SubHealthFault",
  "faultTime": 1736926605
}
- Call the public fault interface to construct a fault recovery message. The ConfigMap API is used as an example.
- Create a YAML file named recover.yaml.
vi recover.yaml
- Edit the YAML file and add the following content to the file.
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: mindx-dl
  name: mindx-dl-publicinfo
  labels:
    mc-consumer-publicfault: "true"
data:
  PublicFault: |
    {
      "id": "11937763019444715778",
      "timestamp": 1741159983000,
      "version": "1.0",
      "resource": "CCAE",
      "faults": [
        {
          "faultId": "14715779",
          "faultType": "Network",
          "faultCode": "010001001",
          "faultTime": 1736926605000,
          "assertion": "recover",
          "influence": [
            {
              "nodeName": "node173",
              "deviceIds": [2,3,4]
            }
          ]
        }
      ]
    }
Note: The faultTime value in the ConfigMap queried in step 1 is a 10-digit timestamp (seconds). Manually change it to a 13-digit timestamp (milliseconds) in this file, for example, from 1736926605 to 1736926605000.
- Create the ConfigMap.
kubectl apply -f recover.yaml
- Query the statistic-fault-info ConfigMap again and check the content of PublicFaults. The fault with ID 14715779 has been rectified and is no longer displayed in cluster-info-cm.
{"node173":[{"resource":"CCAE","devIds":[0,1,2],"faultId":"14715582","type":"Storage","faultCode":"010001002","level":"SeparateNPU","faultTime":1736928806}]}
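The manual conversion from the stored fault record to the recovery message can also be scripted. The following is a minimal sketch, not part of ClusterD: the function name build_recovery_message is hypothetical, the field mapping is inferred only from the example above (type becomes faultType, devIds becomes deviceIds, the node key becomes nodeName), and the message id and timestamp are supplied by the caller because this section does not describe how they are generated.

```python
import json


def build_recovery_message(public_faults, fault_id, msg_id, now_ms):
    """Build a PublicFault recovery payload for one stored fault record.

    public_faults: parsed PublicFaults content from statistic-fault-info,
                   keyed by node name (as shown in step 1).
    fault_id:      ID of the fault to be rectified.
    msg_id/now_ms: message ID and send time in milliseconds; both are
                   caller-supplied assumptions in this sketch.
    """
    for node_name, faults in public_faults.items():
        for fault in faults:
            if fault["faultId"] != fault_id:
                continue
            return {
                "id": msg_id,
                "timestamp": now_ms,
                "version": "1.0",
                "resource": fault["resource"],
                "faults": [
                    {
                        "faultId": fault["faultId"],
                        "faultType": fault["type"],
                        "faultCode": fault["faultCode"],
                        # Stored value is a 10-digit timestamp (seconds);
                        # the message needs a 13-digit one (milliseconds).
                        "faultTime": fault["faultTime"] * 1000,
                        "assertion": "recover",
                        "influence": [
                            {
                                "nodeName": node_name,
                                "deviceIds": fault["devIds"],
                            }
                        ],
                    }
                ],
            }
    raise KeyError(f"faultId {fault_id} not found in PublicFaults")


# PublicFaults content copied from the step 1 query result.
stored = json.loads(
    '{"node173":[{"resource":"CCAE","devIds":[0,1,2],"faultId":"14715582",'
    '"type":"Storage","faultCode":"010001002","level":"SubHealthFault",'
    '"faultTime":1736928806},{"resource":"CCAE","devIds":[2,3,4],'
    '"faultId":"14715779","type":"Network","faultCode":"010001001",'
    '"level":"SubHealthFault","faultTime":1736926605}]}'
)

message = build_recovery_message(
    stored, "14715779", "11937763019444715778", 1741159983000
)
print(json.dumps(message, indent=2))
```

The printed JSON can be pasted under data.PublicFault in recover.yaml. Computing faultTime in code avoids the easy mistake of copying the 10-digit value unchanged.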