(Optional) Customizing Fault Entities

You can extend the fault types supported by MindCluster Ascend FaultDiag by adding, querying, or deleting custom fault entities. By default, custom faults are saved in the ${HOME}/.ascend_faultdiag/custom-ascend-kg-config.json file. During log cleaning, dump, and fault diagnosis, MindCluster Ascend FaultDiag automatically loads both the custom fault file and its built-in fault files from the corresponding paths.

If you need to customize a path for storing fault files, see Customizing the Home Directory of MindCluster Ascend FaultDiag.

Procedure

  1. Add or modify a custom fault entity using a JSON file.
    ascend-fd entity --update updated_entity.json
    If the following information is displayed, the operation is successful:
    Updated entity successfully.
    The following JSON file is an example for reference only. Modify the custom fault information as required. A JSON file supports a maximum of 1,000 custom faults; faults beyond this limit are not saved. The # annotations in the example are explanatory only, because standard JSON does not allow comments. For details about the parameters in the file, see Table 1.
    {
        "41001": {# Fault code. You need to customize the fault code based on the actual situation. The fault code must be different from those supported by MindCluster Ascend FaultDiag.
            "attribute.class": "Software",
            "attribute.component": "AI Framework",
            "attribute.module": "Compiler",
            "attribute.cause_zh": "Failed to merge abstract types",
            "attribute.description_zh": "During gradient calculation for the function output, the abstract types do not match. As a result, the abstract types fail to be merged.",
            "attribute.suggestion_zh": [
                   "1. Check whether the output type of the gradient calculation function is the same as that of sens_param. If they are different, change them to the same type;",
                   "2. Error \"Type Join Failed\" is displayed during automatic differentiation."
               ],
            "attribute.cause_en": "Abstract type merge failed",
            "attribute.description_en": "When computing the gradient of a function output, the abstract types do not match, leading to a failure in abstract type merging.",
            "attribute.suggestion_en": [
                   "1. Check whether the output type of the gradient calculation function matches the type of sens_param. If they do not match, modify them to be of the same type.",
                   "2. Automatic differentiation reports an error: Type Join Failed."
               ],
            "attribute.error_case": [
                "grad = ops.GradOperation(sens_param=True)",
                "# Output type of test_net: tuple (Tensor, Tensor)",
                "def test_net(a, b):",
                "    return a, b"
                  ],
            "attribute.fixed_case": [
                "grad = ops.GradOperation(sens_param=True)",
                "# Output type of test_net: tuple (Tensor, Tensor)",
                "def test_net(a, b):",
                "    return a, b"
                ],
            "rule": [
                {
                    "dst_code": "20106"
                }
            ],
            "source_file": "TrainLog",
            "regex.in": [
                "Abstract type", "cannot join with"
                ]
        },
        "41002": {                #Fault code. You need to customize the fault code based on the actual situation. The fault code must be different from those supported by MindCluster Ascend FaultDiag.
            "attribute.class": "",
            "attribute.component": "",
            "attribute.module": "",
            "attribute.cause_zh": "",
            "attribute.description_zh": "",
            "attribute.suggestion_zh": "",
            "attribute.cause_en": "",
            "attribute.description_en": "",
            "attribute.suggestion_en": "",
            "attribute.error_case": "",
            "attribute.fixed_case": "",
            "rule": [
                {
                    "dst_code": "20107"
                }
            ],
            "source_file": "CANN_Plog",
            "regex.in": [
                    "tsd client wait response fail"
                ]
        }
    ...
    }

    In the JSON file example, 41001 and 41002 are custom fault codes. Each fault code must be a string of 1 to 50 characters consisting of letters, digits, underscores (_), and hyphens (-), and must differ from the fault codes supported by MindCluster Ascend FaultDiag.
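The fault code rule above can be checked before a file is submitted. The following is a minimal sketch, not the tool's own validation: the function names are hypothetical, "letters" is assumed to mean ASCII letters, and builtin_codes stands in for a set of built-in fault codes that the caller would have to supply.

```python
import re

# Pattern derived from the documented rule: 1 to 50 characters, limited to
# letters (assumed ASCII), digits, underscores, and hyphens.
FAULT_CODE_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,50}")

# Documented limit: a JSON file supports at most 1,000 custom faults.
MAX_CUSTOM_FAULTS = 1000


def is_valid_fault_code(code):
    """Return True if code is a string that satisfies the documented format."""
    return isinstance(code, str) and FAULT_CODE_PATTERN.fullmatch(code) is not None


def find_invalid_codes(entities, builtin_codes=frozenset()):
    """List fault codes that are malformed or collide with built-in codes.

    builtin_codes is a hypothetical, caller-supplied set; this sketch does
    not know which codes MindCluster Ascend FaultDiag reserves.
    """
    return [
        code
        for code in entities
        if not is_valid_fault_code(code) or code in builtin_codes
    ]


def within_fault_limit(entities):
    """Return True if the number of custom faults does not exceed the limit."""
    return len(entities) <= MAX_CUSTOM_FAULTS
```

For example, find_invalid_codes(entities, {"20106"}) returns every key of a loaded JSON object that breaks the format rule or reuses the code 20106.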

    Table 1 Parameters

    Parameter

    Value Type

    Description

    Required (Yes/No)

    Value Description

    attribute.class

    String

    Fault type

    Yes

    The value is a string of 1 to 50 characters, including letters, digits, English symbols, and spaces.

    attribute.component

    String

    Faulty component

    Yes

    attribute.module

    String

    Faulty module

    Yes

    attribute.cause_zh

    String

    Fault cause (Chinese)

    Yes

    The value can contain 1 to 200 characters, including letters, digits, English symbols, Chinese characters, Chinese symbols, and spaces.

    attribute.cause_en

    String

    Fault cause (English)

    No

    The value is a string of 1 to 200 characters, including letters, digits, English symbols, and spaces.

    attribute.description_zh

    String

    Fault description (Chinese)

    Yes

    Character strings or lists are supported. A character string is a complete piece of information and supports line breaks. In a list, each element is a line of information, and the combination of the elements forms the entire piece of information.
    • Character string: The value can contain 1 to 2000 characters, including letters, digits, English characters, Chinese characters, spaces, and "\n".
    • List: The value of each element in a list contains 1 to 200 characters, including letters, digits, English symbols, Chinese characters, Chinese symbols, and spaces.

    attribute.description_en

    String

    Fault description (English)

    No

    attribute.suggestion_zh

    String

    Handling suggestions (Chinese)

    Yes

    attribute.suggestion_en

    String

    Handling suggestions (English)

    No

    attribute.error_case

    String

    Error example

    No

    attribute.fixed_case

    String

    Correction example

    No

    rule

    List

    Fault chain, which stores all lower-level fault entities triggered by a fault.

    No

    The list contains the following fields:

    • dst_code (required): fault code of a lower-level fault entity triggered by a fault. The fault code must be supported by MindCluster Ascend FaultDiag or be a custom fault code.
    • expression (optional) (reserved): fault triggering constraint. The value is a string of 1 to 200 characters, including letters, digits, English symbols, and spaces.

    source_file

    String

    Fault log file

    Yes

    Log file name for each log type.

    You can define a custom log type or use a default type. To configure multiple log types, separate them with vertical bars (|), for example, "TrainLog|CANN_Plog". A maximum of 10 log types can be configured, and each log type is a string of 1 to 50 characters, including letters, digits, English symbols, and spaces.

    The following lists the default log file types. (For details about the corresponding storage directories, see Table 1.)

    • TrainLog: training or inference console logs
    • CANN_Plog: App logs on the host
    • CANN_Device: App logs on the device
    • NPU_OS: system logs of Ctrl CPU on the device and event-level system logs of Ctrl CPU on the device
    • NPU_Device: system logs of non-Ctrl CPUs on the device
    • NPU_History: Black Box logs
    • OS: host OS log file
    • OS-dmesg: kernel message file on the host
    • OS-vmcore-dmesg: host kernel message file saved when the system breaks down
    • OS-sysmon: system monitoring file on the host
    • NodeDLog: AI server logs
    • DL_DevicePlugin: SuperPoD device logs and Ascend Device Plugin logs
    • DL_Volcano_Scheduler: volcano-scheduler logs
    • DL_Volcano_Controller: volcano-controller logs
    • DL_Docker_Runtime: Ascend Docker Runtime logs
    • DL_Npu_Exporter: NPU Exporter logs
    • MindIE: MindIE logs
    • CANN_Amct: AMCT logs

    regex.in

    String

    Fault keyword

    Yes

    Level-1 and level-2 lists are supported.
    • Level-1 list
      • Each element is a character string. The value can contain 1 to 200 characters, including letters, digits, English symbols, Chinese characters, Chinese symbols, and spaces.
      • A match requires every keyword in the list to be present and to appear in the listed order.
    • Level-2 list
      • Each sublist meets the value requirements of the level-1 list.
      • Each sublist is evaluated by the same rule as a level-1 list. The sublists are ORed: a match requires only one sublist to be satisfied.

    • When adding a custom fault entity, include all required fields in the JSON file and ensure that each value meets the requirements.
    • When modifying a custom fault entity, ensure that each modified value meets the requirements.
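The regex.in matching rules in Table 1 can be illustrated with a short sketch. It is not the tool's implementation: it assumes that keywords are compared as literal substrings of a single piece of log text, and that "sequence requirements" means the keywords appear in the listed order. Whether the tool treats keywords as regular expressions or matches across lines is not stated here. The function names are hypothetical.

```python
def keywords_in_order(text, keywords):
    """Return True if every keyword occurs in text, in the listed order."""
    position = 0
    for keyword in keywords:
        index = text.find(keyword, position)
        if index == -1:
            return False
        # Continue searching after the end of the keyword just found.
        position = index + len(keyword)
    return True


def matches_regex_in(text, regex_in):
    """Evaluate a level-1 or level-2 regex.in value against text."""
    if not regex_in:
        # regex.in is a required field, so an empty value never matches.
        return False
    if all(isinstance(item, list) for item in regex_in):
        # Level-2 list: the sublists are ORed, so one satisfied sublist is enough.
        return any(keywords_in_order(text, sublist) for sublist in regex_in)
    # Level-1 list: all keywords must be present, in order.
    return keywords_in_order(text, regex_in)
```

Under these assumptions, the level-1 list ["Abstract type", "cannot join with"] from the example matches a log line only when "Abstract type" appears before "cannot join with".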
  2. Query the custom fault entity information. You can view fault information by fault code. If no fault code is specified, information about all custom fault entities is queried.
    ascend-fd entity --show entity_code_1 entity_code_2
  3. (Optional) Delete the custom fault entity information corresponding to a specified fault code.
    ascend-fd entity --delete entity_code_1 entity_code_2
  4. (Optional) Verify the custom-ascend-kg-config.json file. If you directly modify the custom fault entity information in the custom-ascend-kg-config.json file, run the following command to verify the integrity and availability of the modified file.

    Directly modifying the custom-ascend-kg-config.json file is not recommended, because it may cause MindCluster Ascend FaultDiag to behave abnormally.

    ascend-fd entity --check custom-ascend-kg-config.json
    If the following information is displayed, the file verification is successful:
    Custom entity verification passed.
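As a complement to the verification command in step 4, a local pre-check can catch obvious omissions before a file is passed to ascend-fd. The sketch below is an assumption-laden illustration and does not replace the built-in check. It covers only the required fields listed in Table 1 and the documented source_file limits, its function names are hypothetical, and it expects plain JSON with the # annotations from the example removed.

```python
import json

# Fields marked "Yes" in the Required column of Table 1.
REQUIRED_FIELDS = (
    "attribute.class",
    "attribute.component",
    "attribute.module",
    "attribute.cause_zh",
    "attribute.description_zh",
    "attribute.suggestion_zh",
    "source_file",
    "regex.in",
)
MAX_LOG_TYPES = 10        # documented maximum number of log types
MAX_LOG_TYPE_LENGTH = 50  # documented maximum length of one log type


def check_entity(entity):
    """Return a list of problems found in one custom fault entity."""
    problems = []
    for field in REQUIRED_FIELDS:
        if not entity.get(field):
            problems.append(f"missing or empty required field: {field}")
    source_file = entity.get("source_file")
    if isinstance(source_file, str) and source_file:
        # Multiple log types are separated by vertical bars.
        log_types = source_file.split("|")
        if len(log_types) > MAX_LOG_TYPES:
            problems.append("source_file lists more than 10 log types")
        for log_type in log_types:
            if not 1 <= len(log_type) <= MAX_LOG_TYPE_LENGTH:
                problems.append(f"log type must be 1 to 50 characters: {log_type!r}")
    return problems


def check_file_content(text):
    """Parse JSON text and map each fault code to its list of problems."""
    report = {}
    for code, entity in json.loads(text).items():
        problems = check_entity(entity)
        if problems:
            report[code] = problems
    return report
```

An empty result from check_file_content means only that this sketch found nothing. The ascend-fd entity --check command remains the authoritative verification.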