Scenario Description

If a resource managed by the MindX DL cluster scheduling components (for example, a node with an Ascend AI Processor installed and NodeD enabled) becomes faulty, the cluster scheduling components isolate the faulty resource (processor or node), then automatically reschedule and resume the training job that was running when the fault occurred. Resumable training requires script adaptation.
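
The script adaptation mentioned above usually means that the training script saves checkpoints periodically and, when it starts, looks for the latest checkpoint before training from scratch. The following is a minimal, framework-independent sketch of that pattern, not the MindX DL adaptation procedure itself. The function names (latest_checkpoint, save_checkpoint, train), the step_N.json file naming, and the fail_at parameter used to simulate a fault are all hypothetical and exist only for illustration.

```python
import json
import os
import tempfile


def latest_checkpoint(ckpt_dir):
    """Return the path of the newest checkpoint file, or None if none exists."""
    if not os.path.isdir(ckpt_dir):
        return None
    names = [n for n in os.listdir(ckpt_dir)
             if n.startswith("step_") and n.endswith(".json")]
    if not names:
        return None
    # Pick the file with the highest step number in its name.
    newest = max(names, key=lambda n: int(n[len("step_"):-len(".json")]))
    return os.path.join(ckpt_dir, newest)


def save_checkpoint(ckpt_dir, step, state):
    """Write the checkpoint to a temporary file, then rename it into place,
    so a fault in the middle of a save never leaves a half-written checkpoint."""
    os.makedirs(ckpt_dir, exist_ok=True)
    final_path = os.path.join(ckpt_dir, f"step_{step}.json")
    fd, tmp_path = tempfile.mkstemp(dir=ckpt_dir, suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump({"step": step, "state": state}, f)
    os.replace(tmp_path, final_path)


def train(ckpt_dir, total_steps, save_every=10, fail_at=None):
    """Toy training loop that resumes from the latest checkpoint if one exists.

    fail_at is a hypothetical knob that simulates a hardware fault at a step.
    """
    step, state = 0, {"weight": 0.0}
    path = latest_checkpoint(ckpt_dir)
    if path is not None:
        with open(path) as f:
            data = json.load(f)
        step, state = data["step"], data["state"]

    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated fault")
        state["weight"] += 1.0  # stand-in for one real training step
        step += 1
        if step % save_every == 0:
            save_checkpoint(ckpt_dir, step, state)
    return step, state
```

In this sketch, a job that fails partway through loses only the steps completed since the last checkpoint. When the job is rescheduled and the script is started again with the same checkpoint directory, train continues from the last saved step. The checkpoint directory is assumed to be on storage that remains reachable from whichever node the job is rescheduled to.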