Before You Start
Before installing components, read Introduction to understand what each cluster scheduling component does, and select the components to install based on their features.
Elastic Agent and TaskD must be deployed in a container. For details about the installation procedure, see Image Creation.
Resilience Controller and Elastic Agent have reached end of life. The Resilience Controller documentation will be deleted on September 30, 2026, and the Elastic Agent documentation will be deleted on December 30, 2026.
Restrictions
- Ensure that the root directory has sufficient disk space. If the disk usage of the root directory exceeds 85%, the kubelet resource eviction mechanism is triggered and services become unavailable. For details about the disk space requirements, see Table 1. For details about the eviction policy, see the official Kubernetes documentation.
- To ensure that MindCluster cluster scheduling components install and run properly, the system time of all training servers in a cluster must be synchronized.
- Cluster scheduling component images built for the ARM architecture and for the x86_64 architecture are not interchangeable.
- The default validity period of the Kubernetes certificate is 365 days. Update the certificate before it expires.
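The disk space restriction above can be checked before installation with a short script. The following is a minimal sketch, not part of MindCluster: the function names and the constant are hypothetical, and the 85% limit is taken from the restriction text. It computes usage as used space divided by total space, which may differ slightly from the figure that tools such as df or the kubelet report.

```python
import shutil

# Limit taken from the restriction above (85% usage of the root directory).
# The constant name is an assumption made for this sketch.
ROOT_USAGE_LIMIT_PERCENT = 85.0


def usage_percent(total_bytes: int, used_bytes: int) -> float:
    """Return used space as a percentage of total space."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return 100.0 * used_bytes / total_bytes


def root_usage_too_high(path: str = "/", limit: float = ROOT_USAGE_LIMIT_PERCENT) -> bool:
    """Return True if the file system holding `path` is above the limit."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free
    return usage_percent(usage.total, usage.used) > limit


if __name__ == "__main__":
    usage = shutil.disk_usage("/")
    percent = usage_percent(usage.total, usage.used)
    print(f"Root directory usage: {percent:.1f}%")
    if percent > ROOT_USAGE_LIMIT_PERCENT:
        print("Warning: usage is above the limit; free up space before installing.")
```

Running the script on each node before installation gives an early warning. It does not replace the disk space requirements listed in Table 1.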
Component Deployment Description
When deploying cluster scheduling components, refer to Figure 1 to determine which components and third-party software to install on each node. Most components are deployed as containers. Ascend Docker Runtime is deployed as a binary. NPU Exporter is the only component that can be deployed either as a container or as a binary.
MindCluster Volcano is built on the open-source Volcano and integrates Ascend-volcano-plugin.
Log Path Description
- The log file path of Ascend Docker Runtime is /var/log/ascend-docker-runtime/.
- For details about the log paths of other cluster scheduling components, see Creating a Log Directory.
