MindX DL Architecture

Figure 1 shows the system architecture of MindX DL.

Figure 1 MindX DL architecture
Table 1 Functions of components

| Feature | Component | Function Description |
|---|---|---|
| Cluster Scheduling | Ascend Device Plugin | Provides device discovery, allocation, and health status reporting for Ascend AI Processors based on the Kubernetes device plugin mechanism, enabling Kubernetes to manage Ascend AI Processor resources. |
| | HCCL-Controller | A Huawei-developed component for Ascend AI Processor training jobs. It uses the Kubernetes informer mechanism to continuously track training jobs and pod events, reads the Ascend AI Processor information of pods, and generates the corresponding ConfigMap. The ConfigMap contains the HCCL configuration that training jobs depend on, enabling better collaboration and scheduling of the underlying Ascend AI Processors without manual configuration. |
| | Volcano | Based on the plugin mechanism of the open source Volcano scheduler, adds features such as affinity scheduling and fault rescheduling to maximize the computing performance of Ascend AI Processors. |
| | NPU-Exporter | A component in the Prometheus ecosystem that exposes metrics of Ascend AI Processor resources, reporting information such as usage, temperature, voltage, memory, and in-container allocation in real time. |
| | NodeD | Reports node status information, such as node heartbeats. |
| | Elastic-Agent | Provides functions such as the dying gasp checkpoint (CKPT) for resumable training and restoration policies in data parallel and hybrid parallel scenarios. To use the dying gasp function, install this component in the training container. |
| | Resilience-Controller | Provides resilience control for the minimum training system. When hardware used by a training job becomes faulty, the faulty hardware is removed so that training can continue. |
| | Ascend Docker Runtime | A container engine plugin that provides NPU-based containerization support for AI jobs so that they can run smoothly on Ascend devices as Docker containers. |
| Model Protection | AI-GUARD | An encrypted model is transmitted and stored in ciphertext and can be deployed directly in the production environment. Inference can be performed on the ciphertext model, ensuring model confidentiality during transmission and storage. |
| | Crypto_fs | Provides transparent decryption and model access control. Applications can access ciphertext models through Crypto_fs without being aware of the decryption process. Crypto_fs enforces access control on decrypted models to prevent unauthorized access by other programs. |
| | AI-VAULT | Manages master keys and pre-shared keys, and supports high-concurrency and large-scale clusters. When an application accesses a model, AI-VAULT provides a key decryption API for Crypto_fs to automatically decrypt the model. |
| Toolbox | Ascend-DMI | Provides bandwidth, computing power, and power consumption tests for standard PCIe cards, boards, and modules of Atlas products. |
| | AtlasCert | Verifies software package digital signatures, and compares and updates CRLs to ensure the security of software packages and the validity of CRL files. |
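To illustrate the Ascend Device Plugin described above: once the plugin is deployed, pods request NPUs as Kubernetes extended resources in their resource limits. A minimal sketch, assuming the resource name `huawei.com/Ascend910` and a placeholder image (the actual resource name depends on the processor type and plugin version):

```yaml
# Hypothetical pod spec requesting one Ascend NPU via the device plugin.
# huawei.com/Ascend910 is an assumption; check the resource names
# advertised by your Ascend Device Plugin deployment (kubectl describe node).
apiVersion: v1
kind: Pod
metadata:
  name: ascend-training-pod
spec:
  containers:
  - name: trainer
    image: my-training-image:latest   # placeholder image
    resources:
      limits:
        huawei.com/Ascend910: 1       # one Ascend AI Processor
```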
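The HCCL configuration that HCCL-Controller writes into a ConfigMap is a ranktable-style JSON document mapping each device to a rank. The sketch below builds such a document in Python; the field names follow the commonly documented single-server ranktable layout, but the exact schema and ConfigMap key depend on the HCCL-Controller version, so treat this as illustrative only.

```python
import json

def build_ranktable(server_ip, device_ips):
    """Build a single-server, ranktable-style HCCL document (illustrative;
    the exact schema used by HCCL-Controller may differ across versions)."""
    devices = [
        # Each Ascend device gets an id, its NIC IP, and a global rank.
        {"device_id": str(i), "device_ip": ip, "rank_id": str(i)}
        for i, ip in enumerate(device_ips)
    ]
    return {
        "version": "1.0",
        "server_count": "1",
        "server_list": [{"server_id": server_ip, "device": devices}],
        "status": "completed",
    }

# Example: one server with two Ascend devices (addresses are placeholders).
ranktable = build_ranktable("10.0.0.1", ["192.168.1.10", "192.168.1.11"])
print(json.dumps(ranktable, indent=2))
```

Training frameworks that use HCCL read a file like this (often mounted from the ConfigMap) to discover their communication peers.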
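NPU-Exporter serves its metrics in the standard Prometheus text exposition format, so any Prometheus-compatible consumer can scrape them. A small illustrative parser for that format follows; the metric names in the sample are placeholders, not confirmed NPU-Exporter names.

```python
# Minimal parser for the Prometheus text exposition format that
# NPU-Exporter serves on its metrics endpoint. Assumes label values
# contain no spaces, which holds for the simple sample below.
def parse_metrics(text):
    """Return {metric_name: [(labels_str, value), ...]} from exposition text."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comments and blank lines
        name_part, value = line.rsplit(" ", 1)
        if "{" in name_part:
            name, labels = name_part.split("{", 1)
            labels = "{" + labels
        else:
            name, labels = name_part, ""
        metrics.setdefault(name, []).append((labels, float(value)))
    return metrics

# Placeholder sample; real NPU-Exporter metric names may differ.
sample = """\
# HELP npu_chip_info_utilization NPU utilization (placeholder name)
npu_chip_info_utilization{id="0"} 35
npu_chip_info_utilization{id="1"} 80
npu_chip_info_temperature{id="0"} 47
"""
parsed = parse_metrics(sample)
print(parsed["npu_chip_info_utilization"])
# → [('{id="0"}', 35.0), ('{id="1"}', 80.0)]
```

In practice, Prometheus itself scrapes this endpoint on a schedule; the parser above is only to show what the scraped payload looks like.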