Before You Start

If an inference processor resource managed by the cluster scheduling components becomes faulty, the components can isolate the faulty resource and automatically reschedule the affected job.

Prerequisites

  • To enable rescheduling upon inference card faults, ensure that the following components have been installed:
    • Volcano (This feature supports only Volcano as the scheduler.)
    • Ascend Device Plugin
    • Ascend Docker Runtime
    • Ascend Operator
    • ClusterD
    • NodeD
  • If the preceding components are not installed, refer to Installation and Deployment for further operations.
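
The prerequisite list above can be turned into a simple pre-flight check. The following is a minimal sketch that compares a set of component names against that list. How the installed set is gathered (for example, from your own cluster inventory) is left to the caller, and the function and variable names are hypothetical.

```python
# Components listed above as required for rescheduling upon inference card faults.
REQUIRED_COMPONENTS = (
    "Volcano",
    "Ascend Device Plugin",
    "Ascend Docker Runtime",
    "Ascend Operator",
    "ClusterD",
    "NodeD",
)


def missing_components(installed):
    """Return the required components that are absent from `installed`, in list order."""
    installed_set = set(installed)
    return [name for name in REQUIRED_COMPONENTS if name not in installed_set]


# Hypothetical inventory used only for illustration.
installed = ["Volcano", "Ascend Device Plugin", "NodeD"]
print(missing_components(installed))
```

An empty result means every listed component is present. Otherwise, install the reported components before enabling the feature.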

Usage Modes

Rescheduling upon inference card faults can be used in the following modes:

  • Use on the CLI: Install cluster scheduling components and enable rescheduling upon inference card faults through the CLI.
  • Use after integration: Integrate the cluster scheduling components into an existing third-party AI platform or an AI platform developed based on the cluster scheduling components.

Instructions

  • Resource monitoring can be used together with all features in inference scenarios.
  • Multiple inference jobs can be run in a cluster at the same time. Each job can use different features, but jobs that support static vNPUs and jobs that support dynamic vNPUs cannot coexist.
  • By default, full NPU scheduling is used for rescheduling upon inference card faults. Static vNPU scheduling is not supported, while dynamic vNPU scheduling is supported by the Atlas inference product.
  • Rescheduling upon inference card faults supports delivering single-server jobs with a single replica or multiple replicas. Each replica works independently. Only distributed jobs of the acjob type can be deployed on the inference server (equipped with Atlas 300I Duo inference cards), A200I A2 Box heterogeneous component, and Atlas 800I A2 inference server.
  • This feature supports vcjob and Deployment jobs. To enable fault rescheduling for a job, add the fault-scheduling label to the job and set it to grace or force.
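
As a minimal sketch of the labeling requirement above, the following Python helper adds the fault-scheduling label to a job manifest held as a dictionary and rejects any value other than grace or force. Placing the label under metadata.labels is an assumption made for illustration, as are the function name and the trimmed sample manifest. Check the job YAML reference for your component version for the exact location.

```python
import copy

# Values that the text above allows for the fault-scheduling label.
ALLOWED_MODES = {"grace", "force"}


def with_fault_scheduling(manifest: dict, mode: str) -> dict:
    """Return a copy of `manifest` with the fault-scheduling label set to `mode`.

    Assumption: the label is placed under metadata.labels of the job object.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(
            f"fault-scheduling must be one of {sorted(ALLOWED_MODES)}, got {mode!r}"
        )
    labeled = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    labels = labeled.setdefault("metadata", {}).setdefault("labels", {})
    labels["fault-scheduling"] = mode
    return labeled


# Hypothetical, heavily trimmed Deployment manifest used only for illustration.
deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "inference-demo"},
}

labeled = with_fault_scheduling(deployment, "grace")
print(labeled["metadata"]["labels"])
```

Validating the value before the manifest is submitted catches a misspelled mode early, rather than leaving a job that silently lacks fault rescheduling.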

Supported Products

The following products support rescheduling upon inference card faults:
  • Inference server (equipped with Atlas 300I inference cards)
  • Atlas inference product
  • Atlas 800I A2 inference server
  • A200I A2 Box heterogeneous component
  • Atlas 800I A3 SuperPoD Server

Usage Process

For details about how to enable rescheduling upon inference card faults through the CLI, see Figure 1.

Figure 1 Usage process