MindX DL Architecture

Figure 1 shows the system architecture of MindX DL.

Figure 1 MindX DL architecture
Table 1 Functions of components

| Feature | Component | Function Description |
|---|---|---|
| Cluster Scheduling | Ascend Device Plugin | Provides device discovery, allocation, and health status reporting for Ascend AI Processors based on the Kubernetes device plugin mechanism, enabling Kubernetes to manage Ascend AI Processor resources. |
| | HCCL-Controller | A Huawei-developed component for Ascend AI Processor training jobs. It uses the Kubernetes informer mechanism to continuously track training jobs and pod events, reads the Ascend AI Processor information of pods, and generates the corresponding ConfigMap. The ConfigMap contains the HCCL configuration that training jobs depend on, enabling better collaboration and scheduling of the underlying Ascend AI Processors without manual configuration. |
| | Volcano | Based on the plugin mechanism of the open source Volcano scheduler, adds features such as affinity scheduling and fault rescheduling to maximize the computing performance of Ascend AI Processors. |
| | NPU-Exporter | A component in the Prometheus ecosystem that exposes metrics of Ascend AI Processor resources, reporting information such as usage, temperature, voltage, memory, and in-container allocation in real time. |
| | NodeD | Reports node status information, such as node heartbeats. |
| | Elastic-Agent | Provides functions such as the dying gasp checkpoint (CKPT) for resumable training and restoration policies in data parallel and hybrid parallel scenarios. To use the dying gasp function, install this component in the training container. |
| | Resilience-Controller | Provides resilience control for the minimum training system. When hardware used by a training job becomes faulty, the faulty hardware is removed so that training can continue. |
| | Ascend Docker Runtime | A container engine plugin that provides NPU-based containerization support for AI jobs so that they can run smoothly on Ascend devices as Docker containers. |
| Model Protection | AI-GUARD | An encrypted model is transmitted and stored in ciphertext and can be deployed directly in the production environment. Inference can be performed on the ciphertext model, ensuring model confidentiality during transmission and storage. |
| | Crypto_fs | Provides transparent decryption and model access control. Applications can access ciphertext models through Crypto_fs without being aware of the decryption process. Crypto_fs enforces access control on decrypted models to prevent unauthorized access by other programs. |
| | AI-VAULT | Manages master keys and pre-shared keys, and supports high-concurrency and large-scale clusters. When an application accesses a model, AI-VAULT provides a key decryption API for Crypto_fs to automatically decrypt the model. |
| Toolbox | Ascend-DMI | Provides bandwidth, computing power, and power consumption tests for standard PCIe cards, boards, and modules of Atlas products. |
| | AtlasCert | Verifies software package digital signatures, and compares and updates CRLs to ensure the security of software packages and the validity of CRL files. |
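To illustrate the Ascend Device Plugin described above: once the plugin is deployed, pods request NPUs as Kubernetes extended resources in their resource limits. A minimal sketch, assuming the resource name `huawei.com/Ascend910` and a placeholder image (the actual resource name depends on the processor type and plugin version):

```yaml
# Hypothetical pod spec requesting one Ascend NPU via the device plugin.
# huawei.com/Ascend910 is an assumption; check the resource names
# advertised by your Ascend Device Plugin deployment (kubectl describe node).
apiVersion: v1
kind: Pod
metadata:
  name: ascend-training-pod
spec:
  containers:
  - name: trainer
    image: my-training-image:latest   # placeholder image
    resources:
      limits:
        huawei.com/Ascend910: 1       # one Ascend AI Processor
```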
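The HCCL configuration that HCCL-Controller writes into a ConfigMap is a ranktable-style JSON document mapping each device to a rank. The sketch below builds such a document in Python; the field names follow the commonly documented single-server ranktable layout, but the exact schema and ConfigMap key depend on the HCCL-Controller version, so treat this as illustrative only.

```python
import json

def build_ranktable(server_ip, device_ips):
    """Build a single-server, ranktable-style HCCL document (illustrative;
    the exact schema used by HCCL-Controller may differ across versions)."""
    devices = [
        # Each Ascend device gets an id, its NIC IP, and a global rank.
        {"device_id": str(i), "device_ip": ip, "rank_id": str(i)}
        for i, ip in enumerate(device_ips)
    ]
    return {
        "version": "1.0",
        "server_count": "1",
        "server_list": [{"server_id": server_ip, "device": devices}],
        "status": "completed",
    }

# Example: one server with two Ascend devices (addresses are placeholders).
ranktable = build_ranktable("10.0.0.1", ["192.168.1.10", "192.168.1.11"])
print(json.dumps(ranktable, indent=2))
```

Training frameworks that use HCCL read a file like this (often mounted from the ConfigMap) to discover their communication peers.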
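NPU-Exporter serves its metrics in the standard Prometheus text exposition format, so any Prometheus-compatible consumer can scrape them. A small illustrative parser for that format follows; the metric names in the sample are placeholders, not confirmed NPU-Exporter names.

```python
# Minimal parser for the Prometheus text exposition format that
# NPU-Exporter serves on its metrics endpoint. Assumes label values
# contain no spaces, which holds for the simple sample below.
def parse_metrics(text):
    """Return {metric_name: [(labels_str, value), ...]} from exposition text."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = "{" + labels
        else:
            name, labels = name_part, ""
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics

# Placeholder sample; real NPU-Exporter metric names may differ.
sample = """\
# HELP npu_chip_info_utilization NPU utilization (placeholder name)
npu_chip_info_utilization{id="0"} 35
npu_chip_info_utilization{id="1"} 80
npu_chip_info_temperature{id="0"} 47
"""
parsed = parse_metrics(sample)
print(parsed["npu_chip_info_utilization"])
# → [('{id="0"}', 35.0), ('{id="1"}', 80.0)]
```

In practice, Prometheus itself scrapes this endpoint on a schedule; the parser above is only to show what the scraped payload looks like.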