Configuring Offline Reset for an Inference Job
Currently, offline reset is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server. After this function is enabled, if a processor is faulty, Ascend Device Plugin conducts a hot reset to restore it.
To enable offline reset for MindIE Motor inference jobs, you only need to set the startup parameter -hotReset of Ascend Device Plugin to 0 or 2.
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -hotReset | Integer | -1 | Whether to enable hot reset. After this function is enabled, if a processor is faulty, Ascend Device Plugin conducts a hot reset to restore it. NOTE: The value 1 cannot be used because the function has become unavailable. Set this parameter to other values. Supported training devices: Supported inference devices: |
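As a quick sanity check on the values above, the following minimal Python sketch inspects a startup argument string and reports whether offline reset would be enabled. The helper names (`parse_hot_reset`, `offline_reset_enabled`) are hypothetical, and the `-hotReset=<value>` spelling is an assumption about how the startup parameter is written in your deployment; only the values -1, 0, 1, and 2 come from this section.

```python
# Hypothetical helper, not part of Ascend Device Plugin.
# Assumption: the startup parameter is written as "-hotReset=<integer>".

OFFLINE_RESET_VALUES = {0, 2}  # values that enable offline reset for inference jobs
UNAVAILABLE_VALUES = {1}       # value 1 can no longer be used
DEFAULT_HOT_RESET = -1         # default value when the parameter is absent


def parse_hot_reset(args: str) -> int:
    """Return the -hotReset value found in a space-separated argument string."""
    for token in args.split():
        name, sep, value = token.partition("=")
        if name == "-hotReset" and sep:
            return int(value)
    return DEFAULT_HOT_RESET


def offline_reset_enabled(args: str) -> bool:
    """Return True if the arguments enable offline reset (value 0 or 2)."""
    value = parse_hot_reset(args)
    if value in UNAVAILABLE_VALUES:
        raise ValueError("-hotReset=1 is no longer available; use another value")
    return value in OFFLINE_RESET_VALUES
```

For example, `offline_reset_enabled("device-plugin -hotReset=2")` returns `True`, while an argument string without the parameter falls back to the default of -1 and returns `False`.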
There are two fault recovery modes for the Atlas 800I A2 inference server. One Atlas 800I A2 inference server can use only one fault recovery mode, which is automatically identified by cluster scheduling components.
- Mode 1: If no HCCS ring exists on the server, when an NPU is faulty during inference, Ascend Device Plugin waits until the NPU is idle and resets the NPU.
- Mode 2: If an HCCS ring exists on the server, when one or more NPUs are faulty during inference, Ascend Device Plugin waits until all NPUs on the ring are idle and then resets them all at once.
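The difference between the two modes can be illustrated with a small Python model. This is only a sketch of the decision described above, not the Ascend Device Plugin implementation: the `Npu` class and `npus_to_reset` function are hypothetical, and the model assumes that in Mode 2 every NPU passed in belongs to the same HCCS ring.

```python
from dataclasses import dataclass


@dataclass
class Npu:
    """Hypothetical model of one NPU's state."""
    npu_id: int
    faulty: bool = False
    busy: bool = False


def npus_to_reset(npus: list[Npu], has_hccs_ring: bool) -> list[int]:
    """Return the IDs of the NPUs that could be reset right now."""
    faulty = [n for n in npus if n.faulty]
    if not faulty:
        return []
    if has_hccs_ring:
        # Mode 2: wait until every NPU on the ring is idle,
        # then reset all of them together.
        if any(n.busy for n in npus):
            return []
        return [n.npu_id for n in npus]
    # Mode 1: reset each faulty NPU as soon as it is idle.
    return [n.npu_id for n in faulty if not n.busy]
```

With one faulty idle NPU and one healthy busy NPU, Mode 1 resets only the faulty NPU immediately, whereas Mode 2 resets nothing until the busy NPU becomes idle and then resets both.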