Configuring Offline Reset for an Inference Job

Offline reset is currently supported on the Atlas 800I A2 inference server and the Atlas 800I A3 SuperPoD Server. When this function is enabled and a processor becomes faulty, Ascend Device Plugin performs a hot reset to recover it.

To enable offline reset for MindIE Motor inference jobs, set the startup parameter -hotReset of Ascend Device Plugin to 0 or 2.

Table 1 Parameter description

Parameter: -hotReset
Type: Integer
Default Value: -1
Description: Whether to enable hot reset. After this function is enabled, if a processor is faulty, Ascend Device Plugin conducts a hot reset to restore it.
  • -1: disables processor reset.
  • 0: resets inference devices.
  • 1: resets training devices online.
  • 2: resets training/inference devices offline.
NOTE:

The value 1 can no longer be used because the corresponding function is unavailable. Set this parameter to a different value.

Supported training devices:
  • Atlas 800 training server (model 9000) (fully populated with NPUs)
  • Atlas 800 training server (model 9010) (fully populated with NPUs)
  • Atlas 900T PoD Lite
  • Atlas 900 PoD (model 9000)
  • Atlas 800T A2 training server
  • Atlas 900 A2 PoD cluster basic unit
  • Atlas 900 A3 SuperPoD
  • Atlas 800T A3 SuperPoD Server
Supported inference devices:
  • Atlas 300I Pro inference card
  • Atlas 300V video analysis card
  • Atlas 300V Pro video analysis card
  • Atlas 300I Duo inference card
  • Atlas 300I inference card (model 3000) (entire card)
  • Atlas 300I inference card (model 3010)
  • Atlas 800I A2 inference server
  • A200I A2 Box heterogeneous component
  • Atlas 800I A3 SuperPoD Server
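
The value rules in Table 1 can be summarized in a short Python sketch. This is illustrative only and is not part of Ascend Device Plugin: the names HOT_RESET_MODES, check_hot_reset, and hot_reset_flag are hypothetical, and the rendered form -hotReset=<value> is an assumption about how the startup parameter is written on the command line.

```python
# Meanings of the -hotReset values, copied from Table 1.
HOT_RESET_MODES = {
    -1: "disables processor reset",
    0: "resets inference devices",
    1: "resets training devices online (function unavailable)",
    2: "resets training/inference devices offline",
}

# Values that enable offline reset for MindIE Motor inference jobs.
INFERENCE_OFFLINE_RESET_VALUES = {0, 2}


def check_hot_reset(value: int) -> bool:
    """Return True if the value enables offline reset for inference jobs.

    Raises ValueError for values that Table 1 does not list, and for 1,
    which the NOTE says cannot be used.
    """
    if value not in HOT_RESET_MODES:
        raise ValueError(f"-hotReset must be one of {sorted(HOT_RESET_MODES)}")
    if value == 1:
        raise ValueError("-hotReset=1 is unavailable; use another value")
    return value in INFERENCE_OFFLINE_RESET_VALUES


def hot_reset_flag(value: int) -> str:
    """Hypothetical helper: render the parameter as -hotReset=<value>."""
    check_hot_reset(value)
    return f"-hotReset={value}"
```

For example, check_hot_reset(-1) returns False because the default value leaves reset disabled, while check_hot_reset(1) raises an error instead of silently accepting the unavailable value.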

The Atlas 800I A2 inference server has two fault recovery modes. A given server uses only one of them, and the cluster scheduling components identify the applicable mode automatically.

  • Mode 1: If no HCCS ring exists on the server and an NPU becomes faulty during inference, Ascend Device Plugin waits until that NPU is idle and then resets it.
  • Mode 2: If an HCCS ring exists on the server and one or more NPUs become faulty during inference, Ascend Device Plugin waits until all NPUs on the ring are idle and then resets them all at once.
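
The difference between the two modes can be modeled with a small Python sketch. It is not the Ascend Device Plugin implementation: the function npus_to_reset and its parameters are hypothetical, NPUs are represented by plain integer IDs, and Mode 2 is assumed to reset every NPU on the ring rather than only the faulty ones.

```python
from typing import Iterable, Optional, Set


def npus_to_reset(
    faulty: Set[int],
    busy: Set[int],
    hccs_ring: Optional[Iterable[int]] = None,
) -> Set[int]:
    """Return the NPUs that may be reset now; an empty set means keep waiting.

    faulty: IDs of NPUs currently reported as faulty.
    busy: IDs of NPUs that are still running inference work.
    hccs_ring: IDs of the NPUs on the HCCS ring, or None if there is no ring.
    """
    if not faulty:
        return set()

    if hccs_ring is None:
        # Mode 1: each faulty NPU is reset on its own as soon as it is idle.
        return {npu for npu in faulty if npu not in busy}

    ring = set(hccs_ring)
    if not (faulty & ring):
        return set()
    # Mode 2: wait until every NPU on the ring is idle, then reset them together
    # (assumption: the whole ring is reset, not only the faulty NPUs).
    if ring & busy:
        return set()
    return ring
```

In this model, a server without an HCCS ring can recover a faulty NPU while other NPUs keep serving requests, whereas a server with a ring must first drain all NPUs on that ring.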