Volcano

Volcano collects internal processor faults, parameter plane network faults, and node faults, and stores them in a Kubernetes ConfigMap so that external components can query and use the information.

To query the ConfigMap, run kubectl describe cm -n volcano-system vcjob-fault-npu-cm. The following is sample command output. For details about the key parameters, see Table 1.
Name:         vcjob-fault-npu-cm
Namespace:    volcano-system
Labels:       <none>
Annotations:  <none>

Data
====
fault-node:
----
[{"FaultDeviceList":[{"fault_type":"CardNetworkUnhealthy","npu_name":"Ascend910-0","fault_level":"PreSeparateNPU","fault_handling":"PreSeparateNPU","large_model_fault_level":"PreSeparateNPU","fault_code":"81078603"},{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4","fault_level":"SeparateNPU","fault_handling":"SeparateNPU","large_model_fault_level":"SeparateNPU","fault_code":"A8028801,A4028801,80E18402,80E18401"}],"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],"NetworkUnhealthyNPU":["Ascend910-0"],"NodeDEnable":true,"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]
remain-retry-times:
----


BinaryData
====

Events:  <none>
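
The fault-node value shown above is a JSON array stored as a string, so it can be post-processed by a script. The following is a minimal sketch, not part of Volcano: the function names fetch_fault_node and summarize_faults are hypothetical, and the sketch assumes kubectl is on the PATH with access to the cluster. The sample string is copied from the command output above.

```python
import json
import subprocess

# Sample value of the fault-node key, copied from the command output above.
SAMPLE_FAULT_NODE = (
    '[{"FaultDeviceList":['
    '{"fault_type":"CardNetworkUnhealthy","npu_name":"Ascend910-0",'
    '"fault_level":"PreSeparateNPU","fault_handling":"PreSeparateNPU",'
    '"large_model_fault_level":"PreSeparateNPU","fault_code":"81078603"},'
    '{"fault_type":"CardUnhealthy","npu_name":"Ascend910-4",'
    '"fault_level":"SeparateNPU","fault_handling":"SeparateNPU",'
    '"large_model_fault_level":"SeparateNPU",'
    '"fault_code":"A8028801,A4028801,80E18402,80E18401"}],'
    '"NodeName":"node133","UnhealthyNPU":["Ascend910-4"],'
    '"NetworkUnhealthyNPU":["Ascend910-0"],"NodeDEnable":true,'
    '"NodeHealthState":"CardUnhealthy","UpdateTime":1744182212}]'
)


def fetch_fault_node(namespace="volcano-system", name="vcjob-fault-npu-cm"):
    """Return the raw fault-node string by calling kubectl (needs cluster access)."""
    result = subprocess.run(
        ["kubectl", "get", "cm", "-n", namespace, name, "-o", "json"],
        check=True, capture_output=True, text=True,
    )
    return json.loads(result.stdout)["data"].get("fault-node", "")


def summarize_faults(raw):
    """Map each node name to its faulty devices, with fault codes split into a list."""
    summary = {}
    # An empty value means that no faulty node is currently recorded.
    for node in json.loads(raw or "[]") or []:
        devices = []
        for dev in node.get("FaultDeviceList", []):
            devices.append({
                "npu": dev.get("npu_name", ""),
                "type": dev.get("fault_type", ""),
                "handling": dev.get("fault_handling", ""),
                "codes": [c for c in dev.get("fault_code", "").split(",") if c],
            })
        summary[node["NodeName"]] = devices
    return summary


if __name__ == "__main__":
    # Uses the sample; replace with fetch_fault_node() on a live cluster.
    for node_name, devices in summarize_faults(SAMPLE_FAULT_NODE).items():
        for d in devices:
            print(node_name, d["npu"], d["type"], d["handling"], d["codes"])
```

Keeping the kubectl call separate from the parsing step allows the parsing logic to be checked against the sample without a cluster.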
Table 1 Description of the vcjob-fault-npu-cm fields

| Parameter | Description | Value | Remarks |
| --- | --- | --- | --- |
| fault-node | Information about the faulty node | - | - |
| - NodeName | Node name | String | - |
| - UpdateTime | - | 64-bit integer | - |
| - UnhealthyNPU | Faulty processors on the faulty node | String slice | - |
| - NetworkUnhealthyNPU | Processors with a faulty network on the faulty node | String slice | - |
| - NodeDEnable | Whether to detect node status | true or false | - |
| - NodeHealthState | Node health status | String | - |
| FaultDeviceList | - | - | - |
| - fault_type | Fault type | CardUnhealthy: processor fault; CardNetworkUnhealthy: processor network fault; NodeUnhealthy: node fault | - |
| - npu_name | Name of the faulty processor. This parameter is left empty if the node is faulty. | String | - |
| - fault_level | Fault handling type. This parameter is left empty for node faults. | NotHandleFault: requires no handling; RestartRequest: re-executes inference requests in the inference scenario, or re-executes training services in the training scenario; RestartBusiness: re-executes services; FreeRestartNPU: resets an idle processor when faults affect service execution; RestartNPU: directly resets processors and re-executes services; SeparateNPU: isolates processors; PreSeparateNPU: pre-isolates processors and determines whether to perform rescheduling based on the actual running status of the training job | NOTE: large_model_fault_level, fault_level, and fault_handling serve the same purpose; fault_handling is recommended. When an inference job subscribes to fault information, a RestartRequest fault on its inference card does not trigger rescheduling if the fault lasts 60 seconds or less. If the fault persists for more than 60 seconds, the affected processor is isolated and the job is rescheduled. |
| - fault_handling | Same as fault_level | Same as fault_level | Same as fault_level |
| - large_model_fault_level | Same as fault_level | Same as fault_level | Same as fault_level |
| - fault_code | Fault codes, given as a comma-separated string | String | Disconnected: The processor network is disconnected. heartbeatTimeOut: The node status is lost. |
| remain-retry-times | Remaining number of times that jobs can be rescheduled | - | - |
| - UUID | Job UID | String | - |
| - Times | Remaining number of times that the job can be rescheduled | Integer | - |
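
The handling levels and the 60-second rule described in Table 1 can be expressed in a few lines of code. The following is a minimal sketch under stated assumptions, not Volcano's implementation: the names HANDLING_LEVELS, handling_of, and restart_request_reschedules are hypothetical, and the fallback order in handling_of is an assumption based only on the note that fault_handling is the recommended field.

```python
# Handling levels and their meanings, as listed in Table 1.
HANDLING_LEVELS = {
    "NotHandleFault": "requires no handling",
    "RestartRequest": "re-executes inference requests or training services",
    "RestartBusiness": "re-executes services",
    "FreeRestartNPU": "resets an idle processor when faults affect service execution",
    "RestartNPU": "directly resets processors and re-executes services",
    "SeparateNPU": "isolates processors",
    "PreSeparateNPU": "pre-isolates processors; rescheduling depends on job status",
}

# Threshold taken from the note in Table 1 (inference jobs, RestartRequest faults).
RESTART_REQUEST_GRACE_SECONDS = 60


def handling_of(device):
    """Return the handling level of one FaultDeviceList entry.

    fault_handling is preferred because Table 1 recommends it. Falling back
    to the two equivalent fields is an assumption of this sketch.
    """
    for key in ("fault_handling", "fault_level", "large_model_fault_level"):
        value = device.get(key)
        if value:
            return value
    return ""  # the field is left empty for node faults


def restart_request_reschedules(fault_seconds):
    """Apply the 60-second rule: only faults lasting longer than 60 s reschedule."""
    return fault_seconds > RESTART_REQUEST_GRACE_SECONDS


if __name__ == "__main__":
    device = {"fault_type": "CardUnhealthy", "fault_handling": "SeparateNPU"}
    level = handling_of(device)
    print(level, "-", HANDLING_LEVELS.get(level, "unknown level"))
    print(restart_request_reschedules(45), restart_request_reschedules(90))
```

A fault that lasts exactly 60 seconds does not trigger rescheduling in this sketch, which matches the "60 seconds or less" wording in the note.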