MindCluster Overview

MindCluster (AI cluster system software) is a deep learning component designed for NPUs (Ascend AI Processors), providing cluster-level solutions for training and inference tasks. By streamlining the development of underlying resource scheduling software, it enables partners to quickly build deep learning platforms based on MindCluster.

Figure 1 MindCluster stack

Resilience Controller and Elastic Agent have reached end of life. The Resilience Controller content will be removed on September 30, 2026, and the Elastic Agent content will be removed on December 30, 2026.

MindCluster Features

| Key Feature | Description | Link |
| --- | --- | --- |
| Installation and deployment | Provides methods of downloading and installing Ascend software and dependencies online, and verifying their signatures. | Installation and Deployment |
| Performance test | Provides functions such as Atlas hardware compatibility check, performance test, and fault diagnostics. | Performance Test |
| Fault diagnostics | Provides log cleaning and fault diagnostics for training and inference jobs to locate root causes of failures. | Fault Diagnosis |
| Cluster scheduling | Schedules and manages NPU resources, generates collective communication configurations for distributed training, and supports resumable training. | Cluster Scheduling |

MindCluster Components

Table 1 Installation and deployment components

| Component | Function Description |
| --- | --- |
| MindCluster Ascend Deployer | Supports automatic download and one-click installation of Ascend software and dependencies, and network configuration on the parameter plane. |

Table 2 ToolBox components

| Component | Function Description |
| --- | --- |
| Ascend DMI | Provides functions for Atlas hardware such as compatibility check, bandwidth test, computing power test, power consumption test, and stress test diagnostics. |
| Ascend Cert | Provides functions such as digital signature verification of software packages and CRL update, to ensure that software packages are secure and CRL files are valid. |

Table 3 Fault diagnosis components

| Component | Function Description |
| --- | --- |
| MindCluster Ascend FaultDiag | Provides log cleaning and fault diagnosis. It extracts key information from logs related to training and inference, and then uses the cleaned information from all nodes in a cluster to identify the root cause node and the fault. |
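
The clean-then-diagnose flow described for MindCluster Ascend FaultDiag can be illustrated with a toy sketch. Everything specific here is an assumption made up for illustration: the log line format, the `ERROR` level keyword, the names `Event`, `clean`, and `root_cause`, and the "earliest error wins" heuristic. None of it reflects FaultDiag's actual log schema, interface, or algorithm.

```python
import re
from datetime import datetime
from typing import Dict, List, NamedTuple, Optional

# Hypothetical log line format: "2024-01-01 12:00:00 LEVEL message"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (\w+) (.*)$")


class Event(NamedTuple):
    node: str
    time: datetime
    message: str


def clean(node: str, lines: List[str]) -> List[Event]:
    """Log cleaning: keep only well-formed ERROR lines as structured events."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and match.group(2) == "ERROR":
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            events.append(Event(node, stamp, match.group(3)))
    return events


def root_cause(logs_by_node: Dict[str, List[str]]) -> Optional[Event]:
    """Diagnosis (toy heuristic): the earliest error across all nodes."""
    events = [
        event
        for node, lines in logs_by_node.items()
        for event in clean(node, lines)
    ]
    return min(events, key=lambda e: e.time) if events else None
```

The earliest-error heuristic rests on the observation that in distributed training one failing node often causes later timeout errors on the others; a real diagnostic tool would need far more than timestamps.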

Table 4 Cluster scheduling components

| Component | Function Description |
| --- | --- |
| Ascend Docker Runtime | Provides container support for training and inference jobs and automatically mounts the required files and device dependencies. |
| Ascend Device Plugin | Provides device discovery, allocation, and health status reporting for Ascend AI Processors based on the Kubernetes device plugin mechanism, so that Kubernetes can manage Ascend AI Processor resources. |
| NPU Exporter | Monitors resource metrics of Ascend AI Processors in real time to obtain information such as usage, temperature, and voltage. |
| Volcano | Adds features such as affinity scheduling and rescheduling upon faults based on the open source Volcano scheduling plugin mechanism, to maximize the computing performance of Ascend AI Processors. |
| ClusterD | Provides information about available resources at the cluster level, and collects and analyzes information about cluster tasks, resources, faults, and fault impact scope. |
| Ascend Operator | Manages training jobs, provides environment variables for distributed training jobs of different AI frameworks, and generates the collective communication configurations on which distributed training jobs depend. |
| NodeD | Reports node status, such as node health status and CPU and memory faults. |
| Resilience Controller | Provides elastic scale-in training services. When hardware used by a training job becomes faulty, the faulty hardware can be removed so that training continues. |
| Elastic Agent | Saves the last checkpoint when a training job fails. |
| TaskD | Provides status monitoring and control capabilities for training and inference jobs on Ascend devices. |
| MindIO ACP | Uses the training server memory as a cache in foundation model training to accelerate checkpoint saving and loading. |
| MindIO TFT | Provides the Try to Persist (TTP), Uncorrectable Memory Error (UCE), and Air Refueling (ARF) functions. |
| Container Manager | Provides the service container restoration capability in scenarios where Kubernetes is not available. This function is mainly used in the all-in-one appliance. |
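
To make the device plugin and scheduler rows more concrete, the following minimal sketch builds a Kubernetes Pod manifest that requests NPUs as an extended resource and hands the Pod to a named scheduler. The resource name `huawei.com/Ascend910`, the scheduler name `volcano`, the image name, and the helper `build_npu_pod` are assumptions for illustration; check the Ascend Device Plugin and Volcano documentation for the values used in a given deployment.

```python
import json

# Assumption: extended resource name advertised by the device plugin.
NPU_RESOURCE = "huawei.com/Ascend910"


def build_npu_pod(name: str, image: str, npu_count: int,
                  resource: str = NPU_RESOURCE,
                  scheduler: str = "volcano") -> dict:
    """Return a Pod manifest (as a dict) that requests `npu_count` NPUs."""
    if npu_count < 1:
        raise ValueError("npu_count must be at least 1")
    # Kubernetes extended resources are set under limits; if requests
    # are also given, they must equal the limits.
    npu = {resource: str(npu_count)}
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "schedulerName": scheduler,  # assumption: scheduler deployed under this name
            "restartPolicy": "Never",
            "containers": [{
                "name": "train",
                "image": image,
                "resources": {"limits": npu, "requests": dict(npu)},
            }],
        },
    }


if __name__ == "__main__":
    # Hypothetical image name; prints JSON, which kubectl also accepts.
    pod = build_npu_pod("npu-demo", "example.com/train:latest", 8)
    print(json.dumps(pod, indent=2))
```

In practice, distributed training jobs would more likely be submitted through a job-level resource managed by Ascend Operator or Volcano rather than as a bare Pod; the bare Pod is used here only because it is the smallest object that shows the resource request.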