MindCluster Overview
MindCluster (AI cluster system software) is a deep learning component designed for NPUs (Ascend AI Processors) that provides cluster-level solutions for training and inference tasks. By simplifying the development of underlying resource scheduling software, it enables partners to quickly build their own deep learning platforms on top of MindCluster.

Resilience Controller and Elastic Agent have reached end of life (EOL). The Resilience Controller content will be deleted on September 30, 2026, and the Elastic Agent content will be deleted on December 30, 2026.

MindCluster Features

| Key Feature | Description | Link |
|---|---|---|
| Installation and deployment | Provides methods of downloading and installing Ascend software and dependencies online, and verifying their signatures. | |
| Performance test | Provides functions such as Atlas hardware compatibility check, performance test, and fault diagnostics. | |
| Fault diagnostics | Provides log cleaning and fault diagnostics for training and inference jobs to locate root causes of failures. | |
| Cluster scheduling | Schedules and manages NPU resources, generates collective communication configurations for distributed training, and supports resumable training. | |

MindCluster Components

| Component | Function Description |
|---|---|
| MindCluster Ascend Deployer | Supports automatic download and one-click installation of Ascend software and dependencies, and network configuration on the parameter plane. |


| Component | Function Description |
|---|---|
| Ascend DMI | Provides functions for Atlas hardware such as compatibility checks, bandwidth tests, computing power tests, power consumption tests, and diagnostic stress tests. |
| Ascend Cert | Provides functions such as digital signature verification of software packages and CRL updates, to ensure the security of software packages and the validity of CRL files. |


| Component | Function Description |
|---|---|
| MindCluster Ascend FaultDiag | Provides log cleaning and fault diagnosis functions. It extracts key information from logs related to training and inference, and then uses the cleaned information from all nodes in the cluster to identify the root-cause node and the fault. |

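
The log cleaning and root-cause idea described above can be illustrated with a minimal Python sketch. It is not the FaultDiag implementation: the log line format, the `clean_logs` and `earliest_error` helpers, and the "earliest error across nodes is the most likely root cause" heuristic are all assumptions made for illustration only.

```python
import re
from datetime import datetime
from typing import Dict, List, NamedTuple, Optional

# Hypothetical log line format (an assumption, not a real MindCluster format):
#   "2025-01-01 12:00:00 [ERROR] message text"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (.*)$")


class Event(NamedTuple):
    node: str
    time: datetime
    message: str


def clean_logs(node: str, lines: List[str]) -> List[Event]:
    """'Cleaning' step: keep only well-formed ERROR lines as structured events."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "ERROR":
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append(Event(node, stamp, match.group(3)))
    return events


def earliest_error(logs_by_node: Dict[str, List[str]]) -> Optional[Event]:
    """Illustrative heuristic: treat the earliest error in the cluster as the root-cause candidate."""
    events = [
        event
        for node, lines in logs_by_node.items()
        for event in clean_logs(node, lines)
    ]
    return min(events, key=lambda e: e.time) if events else None


if __name__ == "__main__":
    sample = {
        "node-a": ["2025-01-01 12:00:05 [ERROR] communication timeout"],
        "node-b": ["2025-01-01 12:00:02 [ERROR] device fault"],
    }
    print(earliest_error(sample))
```

Cleaning each node's logs first and comparing only the structured events keeps the cross-node analysis small, which is the general motivation for a two-stage design like the one described in the table.
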

| Component | Function Description |
|---|---|
| Ascend Docker Runtime | Provides containerization support for training and inference jobs and automatically mounts the required files and device dependencies. |
| Ascend Device Plugin | Provides device discovery, allocation, and health status reporting for Ascend AI Processors based on the Kubernetes device plugin mechanism, enabling Kubernetes to manage Ascend AI Processor resources. |
| NPU Exporter | Monitors resource metrics of Ascend AI Processors in real time, such as usage, temperature, and voltage. |
| Volcano | Builds on the plugin mechanism of the open source Volcano scheduler to add features such as affinity scheduling and rescheduling upon faults, to maximize the computing performance of Ascend AI Processors. |
| ClusterD | Provides information about available resources at the cluster level, and collects and analyzes information about cluster tasks, resources, faults, and impact scope. |
| Ascend Operator | Manages training jobs, provides environment variables for distributed training jobs of different AI frameworks, and generates the collective communication configurations on which distributed training jobs depend. |
| NodeD | Reports node status, such as node health status and CPU and memory faults. |
| Resilience Controller | Provides elastic scale-in training services. When hardware used by a training job becomes faulty, the faulty hardware can be removed so that training continues. |
| Elastic Agent | Saves the last checkpoint when a training job encounters a fault. |
| TaskD | Provides status monitoring and control capabilities for training and inference jobs on Ascend devices. |
| MindIO ACP | Uses training server memory as a cache in foundation model training to accelerate checkpoint saving and loading. |
| MindIO TFT | Provides the Try to Persist (TTP), Uncorrectable Memory Error (UCE), and Air Refueling (ARF) functions. |
| Container Manager | Provides service container restoration in scenarios where Kubernetes is not available. This function is mainly used in the all-in-one appliance. |
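
Because Ascend Device Plugin relies on the Kubernetes device plugin mechanism, a workload would typically request NPUs as an extended resource in its container limits. The following Python sketch builds such a Pod manifest. The resource name `huawei.com/Ascend910`, the image name, and the `build_npu_pod` helper are assumptions for illustration; check the resource name that your deployment actually advertises on its nodes.

```python
import json

# Assumption: illustrative extended resource name. Use the name that the
# device plugin in your cluster actually registers.
NPU_RESOURCE = "huawei.com/Ascend910"


def build_npu_pod(name: str, image: str, npu_count: int) -> dict:
    """Return a Pod manifest that requests `npu_count` NPUs as an extended resource."""
    if npu_count < 1:
        raise ValueError("npu_count must be at least 1")
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            "containers": [
                {
                    "name": "train",
                    "image": image,
                    # Extended resources are requested as whole integers
                    # under resources.limits.
                    "resources": {"limits": {NPU_RESOURCE: npu_count}},
                }
            ],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name, used only to make the example printable.
    pod = build_npu_pod("npu-demo", "example.com/train:latest", 8)
    print(json.dumps(pod, indent=2))
```

The printed JSON could be saved to a file and submitted with `kubectl apply -f`. The scheduler then places the Pod only on a node that reports enough free devices of that resource.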