Volcano
Volcano collects internal processor faults, parameter-plane network faults, and node faults, and stores them in a Kubernetes ConfigMap so that external components can query and use them.
Name: vcjob-fault-npu-cm
Namespace: volcano-system
Labels: <none>
Annotations: <none>
Data
====
fault-node:
----
[{"FaultDeviceList":[{"fault_type":"CardNetworkUnhealthy","npu_name":"Ascend910-0","fault_level":"PreSeparateNPU","fault_handling":"PreSeparateNPU","large_model_fault_level":"PreSeparateNPU","fault_code":"81078603"},{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","fault_handling":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A8028801,A4028801,80E18402,80E18401"}],"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":["Ascend910-0"],"NodeDEnable":true,"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]
remain-retry-times:
----
BinaryData
====
Events: <none>
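The `fault-node` value shown above is a JSON array with one object per faulty node. As a minimal sketch, assuming the value has already been read from the ConfigMap into a string, it can be decoded with Python's standard `json` module. The helper name `summarize_fault_nodes` is hypothetical and not part of Volcano, and the sample payload is abridged from the output above.

```python
import json

# Sample `fault-node` value, abridged from the ConfigMap output above.
FAULT_NODE_JSON = """
[{"FaultDeviceList": [
    {"fault_type": "CardNetworkUnhealthy", "npu_name": "Ascend910-0",
     "fault_level": "PreSeparateNPU", "fault_code": "81078603"},
    {"fault_type": "CardUnhealthy", "npu_name": "Ascend910-4",
     "fault_level": "SeparateNPU",
     "fault_code": "A8028801,A4028801,80E18402,80E18401"}],
  "NodeName": "node133",
  "UnhealthyNPU": ["Ascend910-4"],
  "NetworkUnhealthyNPU": ["Ascend910-0"],
  "NodeDEnable": true,
  "NodeHealthState": "CardUnhealthy",
  "UpdateTime": 1744182212}]
"""


def summarize_fault_nodes(raw: str) -> dict:
    """Map each node name to its node-level fault fields.

    Hypothetical helper: assumes `raw` is the string stored under the
    `fault-node` key and that it uses the field names shown in the sample.
    """
    summary = {}
    for node in json.loads(raw):
        summary[node["NodeName"]] = {
            "state": node.get("NodeHealthState", ""),
            # Treat a missing or null list as "no faulty processors".
            "unhealthy_npus": node.get("UnhealthyNPU") or [],
            "network_unhealthy_npus": node.get("NetworkUnhealthyNPU") or [],
        }
    return summary


if __name__ == "__main__":
    for name, info in summarize_fault_nodes(FAULT_NODE_JSON).items():
        print(name, info["state"],
              info["unhealthy_npus"], info["network_unhealthy_npus"])
```

How the string is fetched (for example through a Kubernetes client library) is left out here, because it depends on the environment.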
| Parameter | Description | Value | Remarks |
|---|---|---|---|
| fault-node | Information about the faulty node | - | - |
| - NodeName | Node name | String | - |
| - UpdateTime | - | 64-bit integer | - |
| - UnhealthyNPU | Faulty processors on the faulty node | String slice | - |
| - NetworkUnhealthyNPU | Processors with network faults on the faulty node | String slice | - |
| - NodeDEnable | Whether to detect node status | | - |
| - NodeHealthState | Node health status | String | - |
| FaultDeviceList | - | - | - |
| - fault_type | Fault type object | | - |
| - npu_name | Name of the faulty processor. This parameter is left empty if the node is faulty. | String | - |
| - fault_level | Fault handling type. This parameter is left empty for node faults. | | NOTE: |
| - fault_handling | | | |
| - large_model_fault_level | | | |
| - fault_code | Fault codes, as a single string with the codes separated by commas (,). | String | |
| remain-retry-times | Remaining number of times that jobs can be rescheduled | - | - |
| - UUID | Job UID | String | - |
| - Times | Remaining number of times that the job can be rescheduled | Integer | - |
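As described above, `fault_code` packs several codes into one comma-separated string. The following is a minimal sketch of turning one `FaultDeviceList` entry into a typed record, assuming the field names shown in the sample output. The `FaultDevice` class and the `parse_fault_device` helper are hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class FaultDevice:
    """Hypothetical typed view of one FaultDeviceList entry."""
    fault_type: str
    npu_name: str
    fault_level: str
    fault_codes: list = field(default_factory=list)


def parse_fault_device(entry: dict) -> FaultDevice:
    """Build a FaultDevice, splitting `fault_code` on commas."""
    raw_codes = entry.get("fault_code", "")
    # Drop empty fragments so an empty string yields an empty list.
    codes = [c.strip() for c in raw_codes.split(",") if c.strip()]
    return FaultDevice(
        fault_type=entry.get("fault_type", ""),
        npu_name=entry.get("npu_name", ""),  # empty for node faults
        fault_level=entry.get("fault_level", ""),
        fault_codes=codes,
    )


# Entry copied from the sample ConfigMap output (fields abridged).
entry = {
    "fault_type": "CardUnhealthy",
    "npu_name": "Ascend910-4",
    "fault_level": "SeparateNPU",
    "fault_code": "A8028801,A4028801,80E18402,80E18401",
}
device = parse_fault_device(entry)
print(device.npu_name, device.fault_codes)
```

Splitting the codes once at parse time keeps later checks, such as looking for a specific code, free of string handling.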