(Optional) Configuring Processor Fault Frequencies and Durations

The faultCode.json and faultCustomization.json files are embedded in the Ascend Device Plugin image. When Ascend Device Plugin is started, it reads the default configurations of both files for fault handling. For details about faultCode.json and faultCustomization.json, see Configuration File Description.

If you want to customize the processor fault frequency and duration, you can create a ConfigMap file (mindx-dl-fault-config) in the cluster.

- If mindx-dl-fault-config already exists in the cluster when Ascend Device Plugin starts, Ascend Device Plugin preferentially uses its content for fault handling.
- If mindx-dl-fault-config still exists in the cluster after Ascend Device Plugin is reinstalled, mindx-dl-fault-config is used and the default faultCustomization.json in the image does not take effect. To use the default faultCustomization.json, delete mindx-dl-fault-config so that Ascend Device Plugin reads the built-in file.
- If the format of the ConfigMap file is incorrect, Ascend Device Plugin falls back to the default configuration files built into the image for fault handling.

Procedure

For example, if the 80CB8002 fault repeatedly occurs on a processor and causes the training service to be rescheduled again and again, you can manually set the maximum number of resumable training times for a job within 24 hours to 2. When this maximum is reached, the fault handling policy ManuallySeparateNPU is applied.
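
The frequency rule in this example can be pictured as a sliding-window counter. The following is only a minimal sketch of that idea, not the Ascend Device Plugin implementation: the class `FaultFrequencyCounter` and its `record` method are hypothetical names, and the assumption is that the policy applies once the number of occurrences inside the window reaches `Times`.

```python
from collections import deque


class FaultFrequencyCounter:
    """Hypothetical helper that counts fault occurrences in a sliding time window."""

    def __init__(self, time_window, times, fault_handling):
        self.time_window = time_window        # window length in seconds (TimeWindow)
        self.times = times                    # occurrence threshold (Times)
        self.fault_handling = fault_handling  # policy applied at the threshold (FaultHandling)
        self._occurrences = deque()           # timestamps still inside the window

    def record(self, timestamp):
        """Record one occurrence; return the policy if the threshold is reached, else None."""
        self._occurrences.append(timestamp)
        # Discard occurrences that have fallen out of the window.
        while self._occurrences and timestamp - self._occurrences[0] >= self.time_window:
            self._occurrences.popleft()
        if len(self._occurrences) >= self.times:
            return self.fault_handling
        return None


# Values taken from the example: 24 hours (86400 s), threshold 2, ManuallySeparateNPU.
counter = FaultFrequencyCounter(time_window=86400, times=2, fault_handling="ManuallySeparateNPU")
first = counter.record(0)       # first occurrence: below the threshold
second = counter.record(3600)   # second occurrence within 24 hours: threshold reached
```

In this sketch, two occurrences that are more than 24 hours apart never reach the threshold, because the older one is dropped from the window before the count is checked.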

Log in to the environment and go to the directory where the Ascend Device Plugin package is decompressed.
Check whether mindx-dl-fault-config has been created based on faultCode.json.
```
kubectl describe cm -n kube-system mindx-dl-fault-config
```
- If mindx-dl-fault-config exists and contains the fields related to faultCustomization.json, go to step 4 (editing mindx-dl-fault-config).
- If mindx-dl-fault-config exists but does not contain the fields related to faultCustomization.json, save the content of mindx-dl-fault-config, delete the ConfigMap, and go to step 3 to recreate it.
- If mindx-dl-fault-config does not exist, go to step 3 to create it.

Create the ConfigMap file (mindx-dl-fault-config) required for configuring the processor fault frequency and duration.

```
kubectl create cm mindx-dl-fault-config -n kube-system --from-literal="PollInterval=300" --from-file=./faultCode.json --from-file=./faultCustomization.json
```

Command output:

```
configmap/mindx-dl-fault-config created
```

**Table 1** Parameters

| Parameter | Required (Yes/No) | Description |
| --- | --- | --- |
| mindx-dl-fault-config | Yes | Name of the ConfigMap file required for dynamically configuring fault codes. The file name cannot be changed. |
| kube-system | Yes | Namespace where mindx-dl-fault-config is located. The namespace name cannot be changed. |
| PollInterval | No | Interval (in seconds) for checking updates to the mindx-dl-fault-config file. The value ranges from 30 to 3600. If this parameter is not specified, 300s is used by default. A change to PollInterval takes effect in the next interval. |
| faultCode.json | Yes | Used to store fault codes. The parameter value must be the same as the name of the faultCode.json file. |
| faultCustomization.json | No | Used to customize the graceful fault tolerance time, fault frequency, fault duration (only for parameter plane network faults), and other configurations. If this parameter is not specified, no fault frequency configuration is available, and the remaining configurations use default values. The parameter value must be the same as the name of the faultCustomization.json file. |
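
The PollInterval rule in the table (default 300, valid range 30 to 3600) can be checked before the ConfigMap is created. The following is a minimal sketch: `resolve_poll_interval` is a hypothetical helper, and raising an error for out-of-range values is a choice made for this sketch, since the table does not say how Ascend Device Plugin treats such values.

```python
DEFAULT_POLL_INTERVAL = 300   # seconds, used when PollInterval is not specified
MIN_POLL_INTERVAL = 30        # lower bound from the parameter table
MAX_POLL_INTERVAL = 3600      # upper bound from the parameter table


def resolve_poll_interval(raw=None):
    """Return the polling interval in seconds for a raw PollInterval value."""
    if raw is None or raw == "":
        return DEFAULT_POLL_INTERVAL
    value = int(raw)  # ConfigMap values are strings, for example "300"
    if not MIN_POLL_INTERVAL <= value <= MAX_POLL_INTERVAL:
        # Assumption of this sketch: reject the value instead of guessing a fallback.
        raise ValueError(
            f"PollInterval must be between {MIN_POLL_INTERVAL} and {MAX_POLL_INTERVAL}, got {value}"
        )
    return value


default_interval = resolve_poll_interval()
custom_interval = resolve_poll_interval("600")
```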

Edit the mindx-dl-fault-config file.

```
kubectl edit cm -n kube-system mindx-dl-fault-config
```

Change the fault frequency and duration of the processor as required.

```
# Please edit the object below. Lines beginning with a '#' will be ignored,
# and an empty file will abort the edit. If an error occurs while saving this file will be
# reopened with the relevant failures.
#
apiVersion: v1
data:
  PollInterval: "300"
  # Change the processor fault level.
  faultCode.json: |
    {
      "NotHandleFaultCodes":[
      ...
    }
  # Change the processor fault frequency and duration.
  faultCustomization.json: |
    {
      "GraceTolerance": {
        "WaitProcessReadCMTime": 30,
        "WaitDeviceResetTime": 150,
        "WaitFaultSelfHealingTime": 15
      },
      "FaultFrequency": [
        {
          "EventId": [
            "80C98000","80B78000","80B58000","80A18008","80A38008","80A58008","80B98000","80B98008","80BB8000",
            "80BB8008","80BD8000","80BD8008","80C78008","80C98008","80CB8008","80CD8008","80CF8008","80D98008",
            "80DF8008","80DE1801","80E01801","80E18008","80E38008","80E39200","80E3A202","80E3A203","80E78000",
            "80E78008","80F18000","80F18008","80F38008","80F78008","81318008","81338008","813B8008","81478008",
            "81578008","815F8008","81938008","81958008","81978008"
          ],
          "TimeWindow": 86400,
          "Times": 2,
          "FaultHandling": "ManuallySeparateNPU"
        },
        {
          "EventId": ["80E18005"],
          "TimeWindow": 86400,
          "Times": 3,
          "FaultHandling": "ManuallySeparateNPU"
        }
      ],
      "FaultDuration": [
        {
          "EventId": ["81078603"],
          "FaultTimeout": 20,
          "RecoverTimeout": 60,
          "FaultHandling": "PreSeparateNPU"
        }
      ]
    }
kind: ConfigMap
metadata:
  creationTimestamp: "2024-06-20T10:12:07Z"
  name: mindx-dl-fault-config
  namespace: kube-system
  resourceVersion: "52893696"
  selfLink: /api/v1/namespaces/kube-system/configmaps/mindx-dl-fault-config
  uid: bba9e17f-41dd-43b3-848e-3d29cb8c595a
```

In the mindx-dl-fault-config file, add the following entry to the FaultFrequency field. For the 80CB8002 fault, it sets the maximum number of resumable training times for a job within 24 hours to 2 and applies the fault handling policy ManuallySeparateNPU when that number is reached.
```
{
  "EventId": ["80CB8002"],
  "TimeWindow": 86400,
  "Times": 2,      
  "FaultHandling": "ManuallySeparateNPU"
}
```
After the modification, press Esc and enter :wq! to save the configuration and exit.
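
Because an incorrectly formatted ConfigMap makes Ascend Device Plugin fall back to the configuration built into the image, it can help to build and parse the edited JSON offline before pasting it into the editor. The following is a minimal sketch under that assumption. `add_fault_frequency` is a hypothetical helper, and the check covers JSON syntax only, not every rule the plugin may enforce.

```python
import json


def add_fault_frequency(customization_text, event_ids, time_window, times, fault_handling):
    """Append one FaultFrequency entry to a faultCustomization.json document."""
    # json.loads raises json.JSONDecodeError if the existing text is malformed.
    config = json.loads(customization_text)
    entry = {
        "EventId": list(event_ids),
        "TimeWindow": time_window,
        "Times": times,
        "FaultHandling": fault_handling,
    }
    # Create the FaultFrequency list if the document does not have one yet.
    config.setdefault("FaultFrequency", []).append(entry)
    return json.dumps(config, indent=2)


# Shortened input for illustration; a real file holds more entries.
original = '{"FaultFrequency": [{"EventId": ["80E18005"], "TimeWindow": 86400, "Times": 3, "FaultHandling": "ManuallySeparateNPU"}]}'
updated = add_fault_frequency(original, ["80CB8002"], 86400, 2, "ManuallySeparateNPU")
updated_config = json.loads(updated)  # parsing again confirms the result is valid JSON
```

The returned text can then be pasted under the faultCustomization.json key, indented to match the surrounding YAML block.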
After the updated mindx-dl-fault-config file takes effect (within one PollInterval, which is 300s if not specified), check whether the operation is successful.
1. Query the name of the Ascend Device Plugin pod.
```
kubectl get pods -A | grep ascend-device-plugin
```
  Command output:
  kube-system ascend-device-plugin-daemonset-910-jmlf5 1/1 Running 0 6h34m
2. Query the Ascend Device Plugin log based on the obtained pod name.
```
kubectl logs -n kube-system ascend-device-plugin-daemonset-910-jmlf5
```
  - If "load fault customization from configmap complete" is displayed in the log, the manually configured fault frequency has been loaded.
  - If "modify xxx success" is displayed in the log, the xxx parameter in faultCustomization.json in the ConfigMap file is set successfully.
  - If "insert fault frequency success" is displayed in the log, the occurrence time of a frequency fault has been recorded. When the number of recorded occurrences within the time window reaches the preset threshold, the fault level configured for that frequency fault is reported.
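
The three log checks above can also be scripted against saved log text. The following is a minimal sketch: `check_plugin_log` is a hypothetical helper, and the pattern used for "modify xxx success" assumes that xxx is a single word without spaces, which the description above does not guarantee.

```python
import re


def check_plugin_log(log_text):
    """Report which of the expected fault-customization messages appear in the log."""
    return {
        # The custom configuration was loaded from the ConfigMap.
        "customization_loaded": "load fault customization from configmap complete" in log_text,
        # Names of parameters reported as modified (assumes one word per name).
        "modified_params": re.findall(r"modify (\S+) success", log_text),
        # At least one frequency fault occurrence was recorded.
        "frequency_recorded": "insert fault frequency success" in log_text,
    }


# Made-up sample lines for illustration only; real log lines carry extra fields.
sample_log = (
    "load fault customization from configmap complete\n"
    "modify WaitDeviceResetTime success\n"
)
result = check_plugin_log(sample_log)
```

The log text itself can be captured by redirecting the output of the kubectl logs command to a file and reading that file.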
(Optional) Manually recover the forcibly isolated processor. When ManuallySeparateNPU is used, the processor remains isolated even after the fault is rectified, so you must recover it manually.
1. Query the device-info-cm reported by Ascend Device Plugin on the node.
```
kubectl get cm -n kube-system | grep deviceinfo | grep {nodeName}
```
2. Edit the device-info-cm.
```
kubectl edit cm -n kube-system {configMapName}
```
3. Under data, delete the name of the recovered processor from the value of ManuallySeparateNPU.
```
apiVersion: v1
kind: ConfigMap
data:
  DeviceInfoCfg: '{"DeviceInfo":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-Fault":"[]","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1718702470},"CheckCode":"4f00cf1d220da26a8fdbeb5ba163a751d4b264c48b81d22149257e272ae3b413"}'
  ManuallySeparateNPU: Ascend910-0  
```
  If you delete all processor names following the ManuallySeparateNPU field, set the value to an empty string ("").
4. After the modification, press Esc and enter :wq! to save the configuration and exit.
5. After the next reporting period, check whether the deleted processor name still exists in ManuallySeparateNPU in device-info-cm. Device information is reported within the health status check interval if it changes, or every 5 minutes if it remains unchanged. If the name no longer exists, the processor is recovered and can be used.
```
kubectl describe cm -n kube-system {configMapName}
```
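
The edit to the ManuallySeparateNPU value can be sketched as a small string operation. This is only an illustration: `remove_separated_processor` is a hypothetical helper, and it assumes that multiple processor names are separated by commas, as they are in DeviceList, which this procedure does not state explicitly.

```python
def remove_separated_processor(value, recovered):
    """Return the ManuallySeparateNPU value with one recovered processor removed."""
    # Assumption: names are comma-separated, as in the DeviceList values.
    names = [name.strip() for name in value.split(",") if name.strip()]
    remaining = [name for name in names if name != recovered]
    # An empty result corresponds to setting the field to "".
    return ",".join(remaining)


single = remove_separated_processor("Ascend910-0", "Ascend910-0")
several = remove_separated_processor("Ascend910-0,Ascend910-3", "Ascend910-0")
```

When the last name is removed, the helper returns an empty string, which matches the instruction to set the field to "" once no processor names remain.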
