MindX DL Architecture
Figure 1 shows the system architecture of MindX DL.
| Feature | Component | Function Description |
|---|---|---|
| Cluster Scheduling | Ascend Device Plugin | Provides device discovery, allocation, and health status reporting for Ascend AI Processors based on the Kubernetes device plugin mechanism, enabling Kubernetes to manage Ascend AI Processor resources. |
| Cluster Scheduling | HCCL-Controller | A Huawei-developed component for Ascend AI Processor training jobs. It uses the Kubernetes informer mechanism to continuously track training jobs and pod events, read the Ascend AI Processor information of pods, and generate the corresponding ConfigMap. The ConfigMap contains the HCCL configuration that training jobs depend on, enabling better collaboration and scheduling of the underlying Ascend AI Processors without manual configuration. |
| Cluster Scheduling | Volcano | Based on the open-source Volcano scheduling plugin mechanism, adds features such as affinity scheduling and fault rescheduling to maximize the computing performance of Ascend AI Processors. |
| Cluster Scheduling | NPU-Exporter | As a component of the Prometheus ecosystem, collects metrics of Ascend AI Processor resources, obtaining information such as usage, temperature, voltage, memory, and container allocation in real time. |
| Cluster Scheduling | NodeD | Reports node status, such as the node heartbeat. |
| Cluster Scheduling | Elastic-Agent | Provides functions such as the dying gasp checkpoint (CKPT) for resumable training and the restoration policy in data-parallel and hybrid-parallel scenarios. To use the dying gasp function, install this component in the training container. |
| Cluster Scheduling | Resilience-Controller | Provides resilience control for the minimum training system. When hardware used by a training job becomes faulty, the faulty hardware is removed so that training can continue. |
| Cluster Scheduling | Ascend Docker Runtime | As a container engine plugin, provides NPU-based containerization support for AI jobs so that they can run smoothly on Ascend devices as Docker containers. |
| Model Protection | AI-GUARD | An encrypted model is transmitted and stored in ciphertext and can be deployed directly in the production environment. Inference can be performed on the ciphertext model, ensuring model confidentiality during transmission and storage. |
| Model Protection | Crypto_fs | Provides transparent decryption and model access control. Applications access ciphertext models through Crypto_fs without being aware of the decryption process. Crypto_fs also performs access control on decrypted models to prevent unauthorized access by other programs. |
| Model Protection | AI-VAULT | Manages master keys and pre-shared keys, supporting high-concurrency specifications and large-scale clusters. When an application accesses a model, AI-VAULT provides a key decryption API for Crypto_fs to decrypt the model automatically. |
| Toolbox | Ascend-DMI | Provides bandwidth, computing power, and power consumption tests for standard PCIe cards, boards, and modules of Atlas products. |
| Toolbox | AtlasCert | Verifies software package digital signatures and compares and updates CRLs to ensure the security of software packages and the validity of CRL files. |
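To illustrate how the Ascend Device Plugin surfaces processors to Kubernetes, a workload requests them as an extended resource in its pod spec, and the plugin handles discovery, allocation, and health reporting. The sketch below is a minimal example, not an official manifest: the resource name `huawei.com/Ascend910` follows the naming the device plugin uses for Atlas training processors, but resource names vary by product and plugin version, and the image name is hypothetical.

```yaml
# Minimal sketch: a pod requesting one Ascend AI Processor through the
# device plugin's extended resource. The resource name and image are
# assumptions; check what your Ascend Device Plugin version advertises.
apiVersion: v1
kind: Pod
metadata:
  name: ascend-demo
spec:
  containers:
  - name: trainer
    image: training-image:latest          # hypothetical training image
    resources:
      limits:
        huawei.com/Ascend910: 1           # allocated by Ascend Device Plugin
```

Once the pod is scheduled, Volcano can apply affinity-aware placement to such requests, and HCCL-Controller generates the ConfigMap with the HCCL configuration the training job reads at startup.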
