(Optional) Configuring Interconnect Device Fault Levels

The fault level configuration file SwitchFaultCode.json is embedded in the Ascend Device Plugin image. When Ascend Device Plugin starts, it uses the default configuration in this file for fault handling.

If you want to customize the fault level or graceful fault tolerance configuration, you can create a ConfigMap file (mindx-dl-fault-config) in the cluster.

  • If mindx-dl-fault-config already exists in the cluster when Ascend Device Plugin is started, Ascend Device Plugin preferentially uses the content configured in the file for fault handling.
  • If mindx-dl-fault-config already exists in the cluster after Ascend Device Plugin is reinstalled, the default SwitchFaultCode.json of Ascend Device Plugin does not take effect, and Ascend Device Plugin uses mindx-dl-fault-config.
  • If mindx-dl-fault-config already exists in the cluster and contains the SwitchFaultCode.json field after Ascend Device Plugin is reinstalled, the default SwitchFaultCode.json file of Ascend Device Plugin does not take effect, and Ascend Device Plugin uses mindx-dl-fault-config.
  • If you want to use the default configuration of SwitchFaultCode.json, delete mindx-dl-fault-config so that Ascend Device Plugin reads SwitchFaultCode.json by default.
  • If the format of the ConfigMap file is incorrect, Ascend Device Plugin falls back to the default configuration file built into the image for fault handling.
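
The precedence rules above can be sketched in a few lines of Python. This is only an illustration of the documented behavior, not the Ascend Device Plugin implementation. The function name select_switch_fault_config and its arguments are hypothetical. The source does not say what happens when the ConfigMap exists but lacks the SwitchFaultCode.json field, so falling back to the built-in file in that case is an assumption.

```python
import json
from typing import Optional, Tuple

SWITCH_KEY = "SwitchFaultCode.json"


def select_switch_fault_config(
    configmap_data: Optional[dict], builtin_json: str
) -> Tuple[str, dict]:
    """Return (source, parsed config), where source is "configmap" or "builtin"."""
    builtin = json.loads(builtin_json)

    # No mindx-dl-fault-config in the cluster: use the file embedded in the image.
    if configmap_data is None:
        return "builtin", builtin

    raw = configmap_data.get(SWITCH_KEY)
    # ASSUMPTION: a ConfigMap without the SwitchFaultCode.json field
    # is treated like a missing ConfigMap.
    if raw is None:
        return "builtin", builtin

    # Incorrectly formatted content: fall back to the built-in defaults.
    try:
        parsed = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return "builtin", builtin
    if not isinstance(parsed, dict):
        return "builtin", builtin

    return "configmap", parsed
```

Passing None models a cluster without the ConfigMap. Passing the ConfigMap's data section as a dictionary models the case where it exists.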

Configuring Fault Levels Using SwitchFaultCode.json

The following uses the interconnect device fault with code [0x00f1ff09,155913,cpu,na] as an example. A fault code consists of four parts: alarm ID, fault ID, peer device type, and port number, as shown in Table 1 Fault code description.

Table 1 Fault code description

| Parameter | Description | Value |
|---|---|---|
| Alarm ID | In the preceding example, the alarm ID is 0x00f1ff09. | Same between in-band and out-of-band |
| Fault ID | In the preceding example, the fault ID is 155913. | Same between in-band and out-of-band |
| Peer device type | Type of the peer device corresponding to the fault. In the preceding example, the peer device type is cpu. | na: The fault is a processor fault and does not involve the peer device; cpu: The peer device corresponding to the fault is CPU; npu: The peer device corresponding to the fault is NPU; L2: The peer device corresponding to the fault is L2. |
| Port number | In the preceding example, the port number is na. | The value can only be na. |
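
As a minimal sketch of the four-part structure in Table 1, the following Python snippet splits a fault code string into its fields and checks the documented value constraints. The class and function names are hypothetical. Reading the alarm ID as hexadecimal and the fault ID as decimal is an assumption based on the example code, not something the source states.

```python
from dataclasses import dataclass

# Peer device types listed in Table 1.
PEER_DEVICE_TYPES = {"na", "cpu", "npu", "L2"}


@dataclass(frozen=True)
class SwitchFaultCode:
    alarm_id: int
    fault_id: int
    peer_device_type: str
    port: str


def parse_switch_fault_code(text: str) -> SwitchFaultCode:
    """Parse a code such as "[0x00f1ff09,155913,cpu,na]" into its four parts."""
    body = text.strip()
    # Strip the surrounding square brackets if present.
    if body.startswith("[") and body.endswith("]"):
        body = body[1:-1]

    parts = [p.strip() for p in body.split(",")]
    if len(parts) != 4:
        raise ValueError(f"expected 4 parts, got {len(parts)}: {text!r}")

    alarm, fault, peer, port = parts
    if peer not in PEER_DEVICE_TYPES:
        raise ValueError(f"unknown peer device type: {peer!r}")
    if port != "na":
        raise ValueError(f"port number can only be 'na', got {port!r}")

    # ASSUMPTION: alarm ID is hexadecimal, fault ID is decimal.
    return SwitchFaultCode(int(alarm, 16), int(fault), peer, port)
```

Calling parse_switch_fault_code("[0x00f1ff09,155913,cpu,na]") yields the alarm ID, fault ID, peer device type cpu, and port number na as separate fields. A code with an unlisted peer device type raises ValueError.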

To change the fault handling policy from NotHandleFaultCodes to SeparateFaultCodes, perform the following steps:

  1. Log in to the environment and go to the directory generated after Ascend Device Plugin decompression.
  2. Run the following command to check whether mindx-dl-fault-config has been created based on SwitchFaultCode.json:
    kubectl describe cm -n kube-system mindx-dl-fault-config
    • If mindx-dl-fault-config exists and the fields related to SwitchFaultCode.json exist, go to step 4 to edit mindx-dl-fault-config.
    • If mindx-dl-fault-config exists but the fields related to SwitchFaultCode.json do not exist, save the content of mindx-dl-fault-config, delete the ConfigMap, and go to step 3 to recreate it.
    • If mindx-dl-fault-config does not exist, go to step 3 to create it.
  3. Run the following command to create the ConfigMap file (mindx-dl-fault-config) required for dynamically configuring fault codes:
    kubectl create cm mindx-dl-fault-config -n kube-system --from-file=./faultCode.json --from-file=./SwitchFaultCode.json --from-literal="PollInterval=300"
    Command output:
    configmap/mindx-dl-fault-config created
    
    Table 2 Parameters

    | Parameter | Mandatory (Yes/No) | Description |
    |---|---|---|
    | mindx-dl-fault-config | Yes | Name of the ConfigMap file required for dynamically configuring fault codes. The file name cannot be changed. |
    | kube-system | Yes | Namespace where mindx-dl-fault-config is located. The namespace name cannot be changed. |
    | SwitchFaultCode.json | Yes | Used to store fault codes. The parameter value must be the same as the name of the SwitchFaultCode.json file. |

  4. Run the following command to edit the mindx-dl-fault-config file:
    kubectl edit cm -n kube-system mindx-dl-fault-config
  5. Find the fault code [0x00f1ff09,155913,cpu,na] in the mindx-dl-fault-config file.
    Data
    ====
    SwitchFaultCode.json:
    ----
    {"NotHandleFaultCodes":["[0x00f1ff09,155913,cpu,na]"],
    ...
  6. Delete the fault code from NotHandleFaultCodes and add it to SeparateFaultCodes.
    Data
    ====
    SwitchFaultCode.json:
    ----
    {"NotHandleFaultCodes":[],
    ...
    "SeparateFaultCodes":["[0x00f1ff09,155913,cpu,na]","[0x00f103b0,155907,na,na]"…]
    }
  7. After the modification, press Esc and enter :wq! to save the configuration and exit.
  8. Wait for the updated mindx-dl-fault-config file to take effect (PollInterval defaults to 300s if it is not specified), and then check whether the operation is successful.
    1. Run the following command to query the pod name of Ascend Device Plugin:
      kubectl get pods -A | grep ascend-device-plugin
      Command output:
      kube-system      ascend-device-plugin-daemonset-910-jmlf5   1/1     Running   0              6h34m
      
    2. Query the Ascend Device Plugin log based on the obtained pod name.
      kubectl logs -n kube-system ascend-device-plugin-daemonset-910-jmlf5

      If "load switch fault code from configmap success" is displayed in the log, the manually configured fault codes have been loaded from the ConfigMap.
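
The edit made in steps 5 and 6 amounts to moving one entry between two lists in the SwitchFaultCode.json content. The following Python sketch shows that transformation on a JSON string, for example when preparing the file offline before creating the ConfigMap. The function name move_fault_code is hypothetical. The sketch assumes each fault code is stored as a quoted string such as "[0x00f1ff09,155913,cpu,na]". It does not contact the cluster.

```python
import json


def move_fault_code(
    config_text: str,
    code: str,
    src: str = "NotHandleFaultCodes",
    dst: str = "SeparateFaultCodes",
) -> str:
    """Move one fault code from the src level to the dst level and return new JSON."""
    config = json.loads(config_text)

    src_list = config.get(src, [])
    if code not in src_list:
        raise ValueError(f"{code!r} not found in {src}")
    src_list.remove(code)

    # Create the destination level if it is missing and avoid duplicate entries.
    dst_list = config.setdefault(dst, [])
    if code not in dst_list:
        dst_list.append(code)

    return json.dumps(config)
```

The returned string can be written back to SwitchFaultCode.json. The other fault levels in the file are left unchanged.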