Recovery of Inference Card Faults

Recovery of inference card faults must be used together with the full NPU scheduling feature. To enable it, set the -hotReset startup parameter of Ascend Device Plugin to 0 or 2. The default value is -1, which means that fault recovery is not supported. For details, see Full NPU Scheduling or Static vNPU Scheduling (Inference).
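
As a quick illustration of the parameter values described above, the following minimal sketch checks whether a given -hotReset value enables inference card fault recovery. The function and constant names are hypothetical and are not part of Ascend Device Plugin. Values other than -1, 0, and 2 are not described in this section, so the sketch simply treats them as not enabling the feature, which is an assumption.

```python
# Hypothetical helper, not part of Ascend Device Plugin.
# Values taken from the text above: -1 is the default (recovery not supported),
# 0 and 2 enable inference card fault recovery.
DEFAULT_HOT_RESET = -1
INFERENCE_RECOVERY_VALUES = {0, 2}


def inference_recovery_enabled(hot_reset: int = DEFAULT_HOT_RESET) -> bool:
    """Return True if this -hotReset value enables inference card fault recovery.

    Assumption: any value not listed in this section is treated as
    "not enabled", because its meaning is not described here.
    """
    return hot_reset in INFERENCE_RECOVERY_VALUES
```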

When this feature is enabled on an Atlas 800I A2 inference server or an A200I A2 Box heterogeneous component, only single-server single-processor jobs can be delivered; distributed jobs are not supported. In addition, jobs must be delivered using infer-vcjob-910-hotreset.yaml.

The Atlas 800I A2 inference server supports two fault recovery modes. Each server uses only one mode, which the cluster scheduling components identify automatically.

  • Mode 1: If no HCCS ring exists on the server and an NPU becomes faulty during inference, Ascend Device Plugin waits until that NPU is idle and then resets it.
  • Mode 2: If an HCCS ring exists on the server and one or more NPUs become faulty during inference, Ascend Device Plugin waits until all NPUs on the ring are idle and then resets them all at once.
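
The difference between the two modes can be sketched as a small decision function. This is only an illustrative model of the behavior described above, not the actual Ascend Device Plugin implementation. The Npu class, its fields, and the npus_to_reset function are hypothetical names, and the sketch assumes that in Mode 2 the list passed in contains exactly the NPUs on the HCCS ring.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Npu:
    """Hypothetical view of one NPU's state (not a real plugin type)."""
    npu_id: int
    faulty: bool = False
    busy: bool = False  # True while an inference job is still using the NPU


def npus_to_reset(npus: List[Npu], has_hccs_ring: bool) -> List[int]:
    """Return the IDs of the NPUs that may be reset right now.

    Assumption: when has_hccs_ring is True, `npus` holds exactly the
    NPUs that belong to the HCCS ring.
    """
    faulty = [n for n in npus if n.faulty]
    if not faulty:
        return []  # nothing to recover

    if has_hccs_ring:
        # Mode 2: wait until every NPU on the ring is idle,
        # then reset all of them together.
        if any(n.busy for n in npus):
            return []
        return [n.npu_id for n in npus]

    # Mode 1: each faulty NPU is reset on its own once it is idle.
    return [n.npu_id for n in faulty if not n.busy]
```

In this model, a faulty NPU in Mode 1 is eligible for reset as soon as it is idle, regardless of the other NPUs, whereas in Mode 2 a single busy NPU on the ring delays the reset of the whole ring.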