Cleaning and Diagnosing Fault Events

Procedure

  1. Import the fault event cleaning and diagnosis APIs from MindCluster Ascend FaultDiag.
    from ascend_fd import parse_knowledge_graph
    from ascend_fd import diag_knowledge_graph
  2. Clean fault events.
    # Fault event cleaning result and errors that occur during the cleaning.
    kg_parse_results, kg_parse_err_msg = parse_knowledge_graph(input_log_list, custom_entity)
  3. Diagnose the cleaned fault events.
    # Fault event diagnosis result and errors that occur during the diagnosis.
    results, err_msg_list = diag_knowledge_graph(kg_parse_results)
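
The three steps above can be combined into one helper. The following is a minimal sketch, not part of MindCluster Ascend FaultDiag: clean_and_diagnose is a hypothetical wrapper that receives the two APIs as arguments, so the flow can be exercised without the package installed. It assumes only what the steps show, namely that each API returns a (result, errors) pair. Continuing to the diagnosis step after cleaning errors have been reported is an assumption, not documented behavior.

```python
import logging

logger = logging.getLogger(__name__)


def clean_and_diagnose(parse_fn, diag_fn, input_log_list, custom_entity):
    """Run cleaning, then diagnosis, and return (results, all_errors).

    parse_fn and diag_fn stand for parse_knowledge_graph and
    diag_knowledge_graph. They are passed in (hypothetical design choice)
    so that stand-in functions can be used for testing.
    """
    all_errors = []

    # Step 2: clean fault events.
    kg_parse_results, kg_parse_err_msg = parse_fn(input_log_list, custom_entity)
    if kg_parse_err_msg:
        logger.warning("Cleaning reported errors: %s", kg_parse_err_msg)
        all_errors.append(("clean", kg_parse_err_msg))

    # Step 3: diagnose the cleaned fault events.
    results, err_msg_list = diag_fn(kg_parse_results)
    if err_msg_list:
        logger.warning("Diagnosis reported errors: %s", err_msg_list)
        all_errors.append(("diagnose", err_msg_list))

    return results, all_errors
```

With the package installed, the call would look like clean_and_diagnose(parse_knowledge_graph, diag_knowledge_graph, input_log_list, custom_entity).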

The following is an example of the input_log_list input format, for reference only. Modify the input information about root cause node cleaning as required.
[
  {
    "log_domain": {
      "server": "10.1.1.1"
    },
    "log_items": [
      {
        "item_type": "MindIE",
        "path": "/log/debug/mindie-ms_11_202411061400.log",
        "device_id": 0,
        "modification_time": "2025-08-21 23:50:59.999999",
        "component": "Controller",
        "log_lines": [
          "[ERROR] xxx."
        ]
      }
    ]
  }
]
Table 1 input_log_list parameters

| Field | Parameter Type | Required (Yes/No) | Description |
| --- | --- | --- | --- |
| log_domain | Dictionary | Yes | Log domain |
| server | String | Yes | Server IP address |
| log_items | List | Yes | Log items |
| item_type | String | Yes | Log type |
| path | String | No | Log file path. This parameter is required when the npu_info_before.txt or npu_info_after.txt file is cleaned. |
| device_id | Integer | No | Device ID |
| modification_time | String | No | Log modification time. When training or inference console logs and MindIE component logs are cleaned, this time is used as the fault occurrence time. |
| component | String | No | Component name. Currently, only MindIE MS Coordinator and MindIE MS Controller are supported. |
| log_lines | List | Yes | Log lines to be parsed |
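
Before calling the cleaning API, it can help to check the input against Table 1. The following is a minimal sketch, not part of FaultDiag: validate_input_log_list is a hypothetical helper, and the nesting it checks (server inside log_domain, the remaining fields inside each entry of log_items) is taken from the example above rather than from a formal schema.

```python
# Field names and types as listed in Table 1.
REQUIRED_ITEM_FIELDS = {"item_type": str, "log_lines": list}
OPTIONAL_ITEM_FIELDS = {
    "path": str,
    "device_id": int,
    "modification_time": str,
    "component": str,
}


def validate_input_log_list(input_log_list):
    """Return a list of problems; an empty list means no problem was found."""
    if not isinstance(input_log_list, list):
        return ["input_log_list must be a list"]

    problems = []
    for i, entry in enumerate(input_log_list):
        where = f"entry {i}"
        if not isinstance(entry, dict):
            problems.append(f"{where}: must be a dictionary")
            continue

        domain = entry.get("log_domain")
        if not isinstance(domain, dict):
            problems.append(f"{where}: log_domain must be a dictionary")
        elif not isinstance(domain.get("server"), str):
            problems.append(f"{where}: log_domain.server must be a string")

        items = entry.get("log_items")
        if not isinstance(items, list):
            problems.append(f"{where}: log_items must be a list")
            continue

        for j, item in enumerate(items):
            item_where = f"{where}, item {j}"
            if not isinstance(item, dict):
                problems.append(f"{item_where}: must be a dictionary")
                continue
            for name, expected in REQUIRED_ITEM_FIELDS.items():
                if not isinstance(item.get(name), expected):
                    problems.append(
                        f"{item_where}: {name} is required and must be {expected.__name__}"
                    )
            for name, expected in OPTIONAL_ITEM_FIELDS.items():
                if name in item and not isinstance(item[name], expected):
                    problems.append(
                        f"{item_where}: {name} must be {expected.__name__}"
                    )
    return problems
```

The helper checks structure and types only. It does not check conditional rules such as path being required for npu_info_before.txt or npu_info_after.txt.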

The following is an example of the custom_entity input format, which is for reference only. You need to modify the information about the custom fault entity as required.

{
    "41001": {       # Fault code. Customize fault codes based on the actual situation and ensure fault codes are different from those supported by MindCluster Ascend FaultDiag.
        "attribute.class": "Software",
        "attribute.component": "AI Framework",
        "attribute.module": "Compiler",
        "attribute.cause_zh": "Failed to merge abstract types",
        "attribute.description_zh": "During gradient calculation for the function output, the abstract types do not match. As a result, the abstract types fail to be merged.",
        "attribute.suggestion_zh": [
            "1. Check whether the output type of the gradient calculation function is the same as that of sens_param. If they are different, change them to the same type;",
            "2. Error \"Type Join Failed\" is displayed during automatic derivation."
        ],
        "attribute.error_case": [
            "grad = ops.GradOperation(sens_param=True)",
            "# Output type of test_net: tuple (Tensor, Tensor)",
            "def test_net(a, b):",
            "    return a, b"
        ],
        "attribute.fixed_case": [
            "grad = ops.GradOperation(sens_param=True)",
            "# Output type of test_net: tuple (Tensor, Tensor)",
            "def test_net(a, b):",
            "    return a, b"
        ],
        "rule": [
            {
                "dst_code": "20106"
            }
        ],
        "source_file": "TrainLog",
        "regex.in": [
            "Abstract type", "cannot join with"
        ]
    }
}

For details about the parameters in custom_entity, see Table 1.
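
The comment in the example requires custom fault codes to differ from those supported by MindCluster Ascend FaultDiag. The following is a minimal sketch of such a check. find_conflicting_codes is a hypothetical helper, and reserved_codes is an assumption: the caller must supply the list of built-in codes, because this section does not show how to obtain it.

```python
def find_conflicting_codes(custom_entity, reserved_codes):
    """Return the custom fault codes that also appear in reserved_codes.

    custom_entity: dictionary keyed by fault code, as in the example above.
    reserved_codes: iterable of built-in codes supplied by the caller
                    (hypothetical input, not provided by this helper).
    """
    custom = {str(code) for code in custom_entity}
    reserved = {str(code) for code in reserved_codes}
    return sorted(custom & reserved)
```

An empty return value means no collision was found against the codes supplied. It does not prove that the custom codes are otherwise valid.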

Table 2 err_msg_list parameters

| Field | Parameter Type | Description |
| --- | --- | --- |
| Error message | List | Error messages generated during interface execution |

The following is an example of the results output format:

[
    {
        'analyze_success': True,
        'version_info': {},
        'note': '',
        'fault': [
            {
                'code': 'NORMAL_OR_UNSUPPORTED',
                'component': '',
                'module': '',
                'cause_zh': 'The fault event analysis module outputs no result.',
                'description_zh': 'No result is displayed in the fault event analysis module. The possible cause is that the training job is normal. If the training job is interrupted unexpectedly and the fault cannot be rectified, contact Huawei technical support.',
                'suggestion_zh': '1. If the problem persists, contact Huawei technical support.',
                'class': '',
                'fault_source': ['1.1.1.1 device-Unknown'],
                'fault_chains': []
            }
        ]
    }
]

Table 3 results parameters

| Field | Parameter Type | Description |
| --- | --- | --- |
| analyze_success | Bool | Whether the diagnosis is successful. True: diagnosis succeeded. False: diagnosis failed. |
| version_info | Dictionary | Version information |
| note | String | Remarks |
| fault | List | Fault event list |
| code | String | Fault code |
| component | String | Faulty component |
| module | String | Faulty module |
| cause_zh | String | Fault cause |
| description_zh | String | Fault description |
| suggestion_zh | String | Suggestion |
| class | String | Fault type |
| fault_source | List | Fault source |
| fault_chains | List | Fault propagation chain |
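
The results structure described in Table 3 can be reduced to a short summary for logging or alerting. The following is a minimal sketch: summarize_results is a hypothetical helper, and it assumes the layout of the example output above, with the fault fields nested inside each entry of the fault list.

```python
def summarize_results(results):
    """Return (failed_indexes, faults) extracted from the diagnosis results.

    failed_indexes: positions in results where analyze_success is not True.
    faults: one small dictionary per fault event, in input order.
    """
    failed_indexes = []
    faults = []
    for index, result in enumerate(results):
        if result.get("analyze_success") is not True:
            failed_indexes.append(index)
        for fault in result.get("fault", []):
            faults.append({
                "code": fault.get("code", ""),
                "cause": fault.get("cause_zh", ""),
                "sources": list(fault.get("fault_source", [])),
            })
    return failed_indexes, faults
```

Missing keys fall back to empty values instead of raising, which is a design choice of this sketch, not documented behavior of the API.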