Cleaning and Diagnosing Fault Events
Procedure
- Import the fault event cleaning and diagnosis APIs from MindCluster Ascend FaultDiag.
    from ascend_fd import parse_knowledge_graph
    from ascend_fd import diag_knowledge_graph
- Clean fault events.
    # Fault event cleaning result and errors that occur during the cleaning.
    kg_parse_results, kg_parse_err_msg = parse_knowledge_graph(input_log_list, custom_entity)
- Diagnose the cleaned fault events.
    # Fault event diagnosis result and errors that occur during the diagnosis.
    results, err_msg_list = diag_knowledge_graph(kg_parse_results)
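The cleaning and diagnosis calls can be chained into one helper. The following is a minimal sketch and is not part of MindCluster Ascend FaultDiag: the name `clean_and_diagnose` is hypothetical, and the two APIs are passed in as arguments so that the sketch runs without the package installed. It assumes that both error values are lists that are empty when no error occurred, and it stops after cleaning if any error is reported, which is a design choice of this sketch rather than documented behavior.

```python
def clean_and_diagnose(input_log_list, custom_entity, parse_fn, diag_fn):
    """Clean fault events, then diagnose them (hypothetical helper).

    parse_fn and diag_fn stand in for parse_knowledge_graph and
    diag_knowledge_graph from ascend_fd.
    """
    kg_parse_results, kg_parse_err_msg = parse_fn(input_log_list, custom_entity)
    if kg_parse_err_msg:
        # Assumption: a non-empty list means that cleaning reported errors.
        return None, list(kg_parse_err_msg)
    results, err_msg_list = diag_fn(kg_parse_results)
    return results, list(err_msg_list)


# Stand-in functions for illustration only. They do not reproduce
# the behavior of the real APIs.
def fake_parse(input_log_list, custom_entity):
    return {"events": len(input_log_list)}, []


def fake_diag(kg_parse_results):
    return [{"analyze_success": True, "fault": []}], []


results, errors = clean_and_diagnose([{"log_items": []}], {}, fake_parse, fake_diag)
print(results, errors)
```

With the package installed, the real functions would be passed instead: `clean_and_diagnose(input_log_list, custom_entity, parse_knowledge_graph, diag_knowledge_graph)`.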
The following is an example of the input_log_list input format:

    [
        {
            "log_domain": {
                "server": "10.1.1.1"
            },
            "log_items": [
                {
                    "item_type": "MindIE",
                    "path": "/log/debug/mindie-ms_11_202411061400.log",
                    "device_id": 0,
                    "modification_time": "2025-08-21 23:50:59.999999",
                    "component": "Controller",
                    "log_lines": [
                        '[ERROR] xxx.'
                    ]
                }
            ]
        }
    ]

| Field | Parameter Type | Required (Yes/No) | Description |
|---|---|---|---|
| log_domain | Dictionary | Yes | Log domain |
| server | String | Yes | Server IP address |
| log_items | List | Yes | Log item |
| item_type | String | Yes | Log type |
| path | String | No | Log file path. This parameter is required when the npu_info_before.txt or npu_info_after.txt file is cleaned. |
| device_id | Integer | No | Device ID |
| modification_time | String | No | Log modification time. The time indicates the fault occurrence time when the training or inference console logs and MindIE component logs are cleaned. |
| component | String | No | Component name. Currently, only MindIE MS Coordinator and MindIE MS Controller are supported. |
| log_lines | List | Yes | Log line to be parsed |
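
Before calling the cleaning API, it can help to check that the fields marked as required are present and have the expected type. The following is a minimal sketch under that assumption. The function `check_input_log_list` is hypothetical and is not provided by MindCluster Ascend FaultDiag, and it checks only the fields marked Yes in the Required column.

```python
def check_input_log_list(input_log_list):
    """Return a list of problems found, or an empty list if there are none.

    Hypothetical helper: it checks only the required fields
    (log_domain.server, log_items, item_type, log_lines).
    """
    problems = []
    for i, entry in enumerate(input_log_list):
        domain = entry.get("log_domain")
        if not isinstance(domain, dict) or not isinstance(domain.get("server"), str):
            problems.append(f"entry {i}: log_domain.server must be a string")
        items = entry.get("log_items")
        if not isinstance(items, list):
            problems.append(f"entry {i}: log_items must be a list")
            continue
        for j, item in enumerate(items):
            if not isinstance(item.get("item_type"), str):
                problems.append(f"entry {i}, item {j}: item_type must be a string")
            if not isinstance(item.get("log_lines"), list):
                problems.append(f"entry {i}, item {j}: log_lines must be a list")
    return problems


good = [{
    "log_domain": {"server": "10.1.1.1"},
    "log_items": [{"item_type": "MindIE", "log_lines": ["[ERROR] xxx."]}],
}]
bad = [{"log_domain": {}, "log_items": [{"path": "/tmp/x.log"}]}]

print(check_input_log_list(good))
for problem in check_input_log_list(bad):
    print(problem)
```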
The following is an example of the custom_entity input format, which is for reference only. You need to modify the information about the custom fault entity as required.

    {
        "41001": {  # Fault code. Customize fault codes based on the actual situation and ensure fault codes are different from those supported by MindCluster Ascend FaultDiag.
            "attribute.class": "Software",
            "attribute.component": "AI Framework",
            "attribute.module": "Compiler",
            "attribute.cause_zh": "Failed to merge abstract types",
            "attribute.description_zh": "During gradient calculation for the function output, the abstract types do not match. As a result, the abstract types fail to be merged.",
            "attribute.suggestion_zh": [
                "1. Check whether the output type of the gradient calculation function is the same as that of sens_param. If they are different, change them to the same type;",
                "2. Error \"Type Join Failed\" is displayed during automatic derivation."
            ],
            "attribute.error_case": [
                "grad = ops.GradOperation(sens_param=True)",
                "# Output type of test_net: tuple (Tensor, Tensor)",
                "def test_net(a, b):",
                " return a, b"
            ],
            "attribute.fixed_case": [
                "grad = ops.GradOperation(sens_param=True)",
                "# Output type of test_net: tuple (Tensor, Tensor)",
                "def test_net(a, b):",
                " return a, b"
            ],
            "rule": [
                {
                    "dst_code": "20106"
                }
            ],
            "source_file": "TrainLog",
            "regex.in": [
                "Abstract type", "cannot join with"
            ]
        }
    }

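
Because custom fault codes must differ from the codes that MindCluster Ascend FaultDiag already supports, a quick collision check before passing custom_entity to the cleaning API may be useful. The following is a minimal sketch. The function `find_code_conflicts` is hypothetical, and the set of built-in codes used here is a placeholder, because the actual list of supported codes is not given in this section and must be supplied by the caller.

```python
def find_code_conflicts(custom_entity, builtin_codes):
    """Return the custom fault codes that clash with built-in ones, sorted."""
    return sorted(set(custom_entity) & set(builtin_codes))


custom_entity = {"41001": {"attribute.class": "Software"}}

# Placeholder values only: replace them with the fault codes that
# your MindCluster Ascend FaultDiag version actually supports.
builtin_codes = {"20106", "41001"}

conflicts = find_code_conflicts(custom_entity, builtin_codes)
print(conflicts)
```

An empty list means that none of the custom codes collide with the supplied built-in codes.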
| Field | Parameter Type | Description |
|---|---|---|
| Error message | List | Error message generated during interface execution |
The following is an example of the results output format:

    [
        {
            'analyze_success': True,
            'version_info': {},
            'note': '',
            'fault': [
                {
                    'code': 'NORMAL_OR_UNSUPPORTED',
                    'component': '',
                    'module': '',
                    'cause_zh': 'The fault event analysis module outputs no result.',
                    'description_zh': 'No result is displayed in the fault event analysis module. The possible cause is that the training job is normal. If the training job is interrupted unexpectedly and the fault cannot be rectified, contact Huawei technical support.',
                    'suggestion_zh': '1. If the problem persists, contact Huawei technical support.',
                    'class': '',
                    'fault_source': ['1.1.1.1 device-Unknown'],
                    'fault_chains': []
                }
            ]
        }
    ]

| Field | Parameter Type | Description |
|---|---|---|
| analyze_success | Bool | Whether the diagnosis is successful. |
| version_info | Dictionary | Version information |
| note | String | Remarks |
| fault | List | Fault event list |
| code | String | Fault code |
| component | String | Faulty component |
| module | String | Faulty module |
| cause_zh | String | Failure cause |
| description_zh | String | Fault description |
| suggestion_zh | String | Suggestion |
| class | String | Fault type |
| fault_source | List | Fault source |
| fault_chains | List | Fault propagation chain |
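
The results structure can be flattened into one record per fault for logging or alerting. The following is a minimal sketch that relies only on the fields listed in the table above. The function `summarize_results` is hypothetical, and skipping entries whose analyze_success is not true is an assumption of this sketch, not documented behavior.

```python
def summarize_results(results):
    """Return one (code, cause, sources) tuple per reported fault."""
    summary = []
    for result in results:
        if not result.get("analyze_success"):
            continue  # Skip entries whose diagnosis did not succeed.
        for fault in result.get("fault", []):
            summary.append((
                fault.get("code"),
                fault.get("cause_zh"),
                tuple(fault.get("fault_source", [])),
            ))
    return summary


# Sample data shortened from the results output example.
sample_results = [{
    "analyze_success": True,
    "version_info": {},
    "note": "",
    "fault": [{
        "code": "NORMAL_OR_UNSUPPORTED",
        "cause_zh": "The fault event analysis module outputs no result.",
        "fault_source": ["1.1.1.1 device-Unknown"],
        "fault_chains": [],
    }],
}]

summary = summarize_results(sample_results)
for code, cause, sources in summary:
    print(code, "|", cause, "|", ", ".join(sources))
```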
