Fault Handling

Container Manager places a faulty processor and its associated processors in the buffer of processors to be reset when a RestartRequest or RestartBusiness fault persists for 60 seconds, or when a FreeRestartNPU or RestartNPU fault is detected. Container Manager periodically attempts to reset the processors in the buffer. When the processors meet all of the following conditions, Container Manager calls the DCMI to reset them.

  • No job process exists on the faulty processor and its associated processors.
  • No running container occupies the faulty processor and its associated processors.
  • The faulty processor or its associated processors still report a fault at the RestartRequest, RestartBusiness, FreeRestartNPU, or RestartNPU level.

After a reset attempt, Container Manager behaves as follows:

  • After Container Manager successfully resets a processor and obtains its startup result within the specified period, it suspends the fault reset function for 30 seconds to allow processor initialization to complete.
  • After a processor fails to be reset three consecutive times, Container Manager stops attempting to reset it.
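
The rules above can be read as a small decision procedure. The following Python sketch is illustrative only and is not Container Manager's actual implementation: the Processor type, its field names, and the functions should_buffer, may_reset, and pause_deadline are hypothetical, and the DCMI call itself is not shown. It also assumes that the three-failure limit is checked for every processor in the group, which the text does not state explicitly. Only the 60-second, 30-second, and three-attempt values and the fault level names come from the description.

```python
from dataclasses import dataclass
from typing import List, Optional

# Values taken from the description above.
FAULT_HOLD_SECONDS = 60        # RestartRequest/RestartBusiness must persist this long
POST_RESET_PAUSE_SECONDS = 30  # pause after a successful reset
MAX_CONSECUTIVE_FAILURES = 3   # give up after this many failed resets

TIMED_LEVELS = {"RestartRequest", "RestartBusiness"}
IMMEDIATE_LEVELS = {"FreeRestartNPU", "RestartNPU"}
RESET_LEVELS = TIMED_LEVELS | IMMEDIATE_LEVELS


@dataclass
class Processor:
    """Hypothetical snapshot of one processor's state (not a real API type)."""
    fault_level: Optional[str] = None   # None means no fault is reported
    fault_since: float = 0.0            # time (seconds) the fault was first seen
    has_job_process: bool = False
    in_running_container: bool = False
    consecutive_reset_failures: int = 0


def should_buffer(p: Processor, now: float) -> bool:
    """Decide whether a faulty processor enters the to-be-reset buffer."""
    if p.fault_level in IMMEDIATE_LEVELS:
        return True
    if p.fault_level in TIMED_LEVELS:
        return now - p.fault_since >= FAULT_HOLD_SECONDS
    return False


def may_reset(group: List[Processor], now: float, paused_until: float = 0.0) -> bool:
    """Check the reset conditions for a faulty processor and its associated processors."""
    if now < paused_until:
        return False  # reset function is suspended after a recent successful reset
    if any(p.consecutive_reset_failures >= MAX_CONSECUTIVE_FAILURES for p in group):
        return False  # assumption: the limit applies to any processor in the group
    if any(p.has_job_process or p.in_running_container for p in group):
        return False  # a job process or running container still uses the group
    # At least one processor in the group must still report a reset-level fault.
    return any(p.fault_level in RESET_LEVELS for p in group)


def pause_deadline(reset_done_at: float) -> float:
    """Time until which resets stay suspended after a successful reset."""
    return reset_done_at + POST_RESET_PAUSE_SECONDS
```

In this sketch, a periodic loop would call may_reset for each buffered group and, when it returns True, trigger the reset. It would then either store pause_deadline(now) after a success or increment consecutive_reset_failures after a failure.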