Volcano

Volcano collects internal processor faults, parameter-plane network faults, and node faults, and stores them in a Kubernetes ConfigMap so that external components can query and use them.

To query the fault information, run kubectl describe cm -n volcano-system vcjob-fault-npu-cm. The following is sample command output. For details about the key parameters, see Table 1.

Name:         vcjob-fault-npu-cm
Namespace:    volcano-system
Labels:       <none>
Annotations:  <none>

Data
====
fault-node:
----
[{"FaultDeviceList":[{"fault_type":"CardNetworkUnhealthy","npu_name":"Ascend910-0","fault_level":"PreSeparateNPU","fault_handling":"PreSeparateNPU","large_model_fault_level":"PreSeparateNPU","fault_code":"81078603"},{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","fault_handling":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A8028801,A4028801,80E18402,80E18401"}],"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":["Ascend910-0"],"NodeDEnable":true,"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]
remain-retry-times:
----


BinaryData
====

Events:  <none>
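
The fault-node value shown above is a JSON array stored as a string, so a consumer can decode it with any JSON library. The following is a minimal Python sketch, assuming the ConfigMap has been dumped as JSON (for example with kubectl get cm -n volcano-system vcjob-fault-npu-cm -o json). The function name parse_fault_nodes and the shortened sample data are illustrative and are not part of Volcano. Treating an empty value as "no faulty nodes" is also an assumption.

```python
import json


def parse_fault_nodes(configmap_json: str) -> list:
    """Decode the fault-node entry of a ConfigMap dumped as JSON.

    Hypothetical helper for illustration; not a Volcano API.
    """
    cm = json.loads(configmap_json)
    # ConfigMap data values are strings; fault-node holds a JSON array.
    raw = (cm.get("data") or {}).get("fault-node", "")
    # Assumption: an empty or missing value means no faulty node is reported.
    return json.loads(raw) if raw.strip() else []


# Shortened sample modeled on the command output above.
sample_cm = json.dumps({
    "data": {
        "fault-node": '[{"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],'
                      '"NetworkUnhealthyNPU":["Ascend910-0"],'
                      '"NodeHealthState":"CardUnhealthy"}]',
        "remain-retry-times": "",
    }
})

for node in parse_fault_nodes(sample_cm):
    print(node["NodeName"], node["NodeHealthState"], node["UnhealthyNPU"])
```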

**Table 1** Description of the vcjob-fault-npu-cm fields
Parameter	Description	Value	Remarks
fault-node	Information about the faulty node	-	-
- NodeName	Node name	String	-
- UpdateTime	-	64-bit integer	-
- UnhealthyNPU	Faulty processors on the faulty node	String slice	-
- NetworkUnhealthyNPU	Processors with faulty network on the faulty node	String slice	-
- NodeDEnable	Whether node status detection is enabled	true or false	-
- NodeHealthState	Node health status	String	-
FaultDeviceList	-	-	-
- fault_type	Fault type	CardUnhealthy: processor fault; CardNetworkUnhealthy: processor network fault; NodeUnhealthy: node fault	-
- npu_name	Name of the faulty processor. This parameter is left empty if the node is faulty.	String	-
- fault_level	Fault handling type. This parameter is left empty for node faults.	NotHandleFault: requires no handling; RestartRequest: re-executes inference requests in the inference scenario, or re-executes training services in the training scenario; RestartBusiness: re-executes services; FreeRestartNPU: resets an idle processor when faults affect service execution; RestartNPU: directly resets processors and re-executes services; SeparateNPU: isolates processors; PreSeparateNPU: pre-isolates processors and determines whether to reschedule based on the actual running status of the training job.	NOTE: large_model_fault_level, fault_level, and fault_handling serve the same function. fault_handling is recommended. When an inference job subscribes to fault information, a RestartRequest fault on its inference card does not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the affected processor is isolated and the job is rescheduled.
- fault_handling	Same as fault_level	Same as fault_level	Same as fault_level
- large_model_fault_level	Same as fault_level	Same as fault_level	Same as fault_level
- fault_code	Fault codes, as a comma-separated string	String	Special values: Disconnected indicates that the processor network is disconnected; heartbeatTimeOut indicates that the node status is lost.
remain-retry-times	Remaining rescheduling attempts for jobs	-	-
- UUID	Job UID	String	-
- Times	Remaining number of times that the job can be rescheduled	Integer	-
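
The field descriptions in Table 1 can be applied when processing one FaultDeviceList entry. The following is a rough Python sketch under stated assumptions: it prefers fault_handling (the recommended field) and falls back to the two equivalent fields, splits fault_code on commas, and restates the 60-second rule for RestartRequest faults on inference jobs. The names summarize_device and restart_request_triggers_reschedule are hypothetical helpers and are not provided by Volcano.

```python
RESTART_REQUEST_GRACE_SECONDS = 60  # threshold stated in the Table 1 note


def summarize_device(device: dict) -> dict:
    """Condense one FaultDeviceList entry (hypothetical helper)."""
    # fault_handling is recommended; the other two fields carry the same
    # information, so they are used only as fallbacks.
    handling = (device.get("fault_handling")
                or device.get("fault_level")
                or device.get("large_model_fault_level")
                or "")
    # fault_code is a comma-separated string; drop empty fragments.
    codes = [c.strip() for c in device.get("fault_code", "").split(",") if c.strip()]
    return {
        "npu": device.get("npu_name", ""),  # empty for node faults
        "type": device.get("fault_type", ""),
        "handling": handling,
        "codes": codes,
    }


def restart_request_triggers_reschedule(fault_duration_seconds: float) -> bool:
    """For an inference job: a RestartRequest fault leads to rescheduling
    only if it persists for more than 60 seconds."""
    return fault_duration_seconds > RESTART_REQUEST_GRACE_SECONDS


device = {
    "fault_type": "CardUnhealthy",
    "npu_name": "Ascend910-4",
    "fault_level": "SeparateNPU",
    "fault_handling": "SeparateNPU",
    "large_model_fault_level": "SeparateNPU",
    "fault_code": "A8028801,A4028801,80E18402,80E18401",
}
summary = summarize_device(device)
print(summary["npu"], summary["handling"], summary["codes"])
```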
