(Optional) Configuring Processor Fault Levels

The faultCode.json and faultCustomization.json files are embedded in the Ascend Device Plugin image. When Ascend Device Plugin is started, it reads the default configurations of both files for fault handling. For details about faultCode.json and faultCustomization.json, see Configuration File Description.

If you want to customize the fault level or graceful fault tolerance configuration, you can create a ConfigMap file (mindx-dl-fault-config) in the cluster.

  • If mindx-dl-fault-config already exists in the cluster when Ascend Device Plugin is started, Ascend Device Plugin preferentially uses the content configured in the file for fault handling.
  • If mindx-dl-fault-config already exists in the cluster after Ascend Device Plugin is reinstalled, the default faultCode.json of Ascend Device Plugin does not take effect, and Ascend Device Plugin uses mindx-dl-fault-config.
  • If you want to use the default configuration of faultCode.json or faultCustomization.json, delete mindx-dl-fault-config so that Ascend Device Plugin reads faultCode.json, SwitchFaultCode.json, or faultCustomization.json by default.
  • If the format of the ConfigMap file is incorrect, Ascend Device Plugin falls back to the default configuration files built into the image for fault handling.
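
The precedence rules above can be summarized in a small shell sketch. The function name fault_config_source and its yes/no arguments are hypothetical and exist only for illustration; this is not how Ascend Device Plugin is implemented, only a restatement of the rules as stated in this section.

```shell
# Hypothetical helper that restates the precedence rules listed above.
#   $1 = "yes" if the mindx-dl-fault-config ConfigMap exists in the cluster, else "no"
#   $2 = "yes" if the ConfigMap content is correctly formatted, else "no"
# Prints "configmap" when the ConfigMap is used, "built-in" when the
# default files embedded in the image are used.
fault_config_source() {
    if [ "$1" = "yes" ] && [ "$2" = "yes" ]; then
        echo "configmap"
    else
        echo "built-in"
    fi
}

# Example: ConfigMap exists but is malformed, so the built-in defaults apply.
fault_config_source yes no
```

On a real cluster, whether the ConfigMap exists can be checked with kubectl get cm mindx-dl-fault-config -n kube-system; deleting it returns the plugin to the built-in defaults, as described above.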

Configuring Fault Levels Using faultCode.json

Take a node status detection fault whose name is dmp_daemon and fault code is 80E21007 as an example. To change the fault handling policy from NotHandleFaultCodes to RestartNPUCodes, perform the following steps:

  1. Log in to the environment and go to the directory generated after Ascend Device Plugin decompression.
  2. Run the following command to create the ConfigMap file (mindx-dl-fault-config) required for dynamically configuring fault codes:
    kubectl create cm mindx-dl-fault-config -n kube-system --from-literal="PollInterval=300" --from-file=./faultCode.json
    Command output:
    configmap/mindx-dl-fault-config created
    
    Table 1 Parameters

    | Parameter | Required (Yes/No) | Description |
    |---|---|---|
    | mindx-dl-fault-config | Yes | Name of the ConfigMap file required for dynamically configuring fault codes. The file name cannot be changed. |
    | kube-system | Yes | Namespace where mindx-dl-fault-config is located. The namespace name cannot be changed. |
    | PollInterval | No | If this parameter is not specified, 300s is used by default. This parameter specifies the interval (in seconds) for checking updates to the mindx-dl-fault-config file. The value ranges from 30 to 3600. The modification to PollInterval takes effect in the next interval. |
    | faultCode.json | Yes | Used to store fault codes. The parameter value must be the same as the name of the faultCode.json file. |

  3. Run the following command to edit the mindx-dl-fault-config file:
    kubectl edit cm -n kube-system mindx-dl-fault-config
  4. Find the fault code 80E21007 in the mindx-dl-fault-config file.
    "NotHandleFaultCodes":[
        "80E21007","80E38003","80F78006","80C98006","80CB8006","81318006","80A18006","80A18005","80FB8000","8C1F8609",
        ...
    ],
    ...

    You can assign a fault code to multiple fault levels. In that case, the fault is handled at the highest of the configured levels.

  5. Delete the fault code 80E21007 from NotHandleFaultCodes and add it to RestartNPUCodes.
    "NotHandleFaultCodes":[
        "80E38003","80F78006","80C98006","80CB8006","81318006","80A18006","80A18005","80FB8000","8C1F8609",
        ...
    ],
    ...
    "RestartNPUCodes":[
        "8C204E00","A8028802","A4302003","A4302004","A4302005","A4302006","A4302009","A430200A","80CF8009","80CF8008","80E21007",
        ...
    ],
  6. After the modification, press Esc and enter :wq! to save the configuration and exit.
  7. Wait for the updated mindx-dl-fault-config file to take effect (the update is picked up at the next PollInterval check, 300s if PollInterval is not specified), and then check whether the operation is successful.
    1. Run the following command to query the name of the Ascend Device Plugin pod:
      kubectl get pods -A | grep ascend-device-plugin
      Command output:
      kube-system      ascend-device-plugin-daemonset-910-jmlf5   1/1     Running   0              6h34m
      
    2. Query the Ascend Device Plugin log based on the obtained pod name.
      kubectl logs -n kube-system ascend-device-plugin-daemonset-910-jmlf5

      If "load fault code from configmap success" is displayed in the log, the custom fault code configuration has been loaded from the ConfigMap.
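
Two checks from the procedure above lend themselves to small shell helpers: validating a PollInterval value against the documented range before creating the ConfigMap, and scanning the plugin log for the success message. The following is a minimal sketch. The function names valid_poll_interval and fault_config_loaded are hypothetical and are not part of Ascend Device Plugin or kubectl; only the range (30 to 3600) and the log message come from this section.

```shell
# Hypothetical helper: succeeds if $1 is an integer within the documented
# PollInterval range (30 to 3600 seconds).
valid_poll_interval() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge 30 ] && [ "$1" -le 3600 ]
}

# Hypothetical helper: reads log text on stdin and succeeds if the success
# message described in step 7 is present.
fault_config_loaded() {
    grep -q "load fault code from configmap success"
}

# Example: check a candidate PollInterval value.
if valid_poll_interval 300; then
    echo "PollInterval 300 is within range"
fi
```

Assuming the pod name obtained in step 7, the log check could be used as kubectl logs -n kube-system ascend-device-plugin-daemonset-910-jmlf5 | fault_config_loaded && echo "custom fault configuration loaded". The pod name differs per cluster, so substitute the one returned by kubectl get pods.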