Installing and Deploying a Large-Scale Cluster

MindCluster Ascend Deployer can install and deploy a large-scale cluster with more than 100,000 cards in one day through multi-instance deployment.

To improve deployment efficiency, a large-scale cluster is divided into multiple sub-clusters. One node in each sub-cluster is selected as the instance node, which serves as the distributed deployment node for that sub-cluster. The master node collects the deployment statistics of all instance nodes.

Multi-instance deployment cannot be performed on two OSs at the same time.

If MindCluster Ascend Deployer was installed by decompressing the downloaded ZIP package, run the bash large_scale_install.sh --install command in the ascend_deployer directory:

bash large_scale_install.sh --install=<package_name_1>,<package_name_2>

See Table 1 for example commands.

Install the packages in the following sequence: sys_pkg > Python > NPU > CANN, MindCluster (performance test, fault diagnosis, and cluster scheduling components). During the installation, the version of the CANN package in the resources directory must match that of the NPU.

**Table 1** Installation commands
Type	Installation Command
OS initialization (sys_pkg installation)	bash large_scale_install.sh --install=sys_pkg # When running the --install command, do not install sys_pkg repeatedly.
OS initialization (Python installation)	bash large_scale_install.sh --install=python
NPU firmware and driver (either of the listed commands)	bash large_scale_install.sh --install=npu
NPU firmware and driver (either of the listed commands)	bash large_scale_install.sh --install=driver,firmware
CANN (training, inference, and development and debugging)	bash large_scale_install.sh --install=kernels,toolkit
CANN (edge inference)	bash large_scale_install.sh --install=nnrt,kernels
CANN (training and inference)	bash large_scale_install.sh --install=nnae,kernels
MindCluster cluster scheduling	bash large_scale_install.sh --install=ascend-device-plugin,ascend-docker-runtime,noded,npu-exporter,volcano,ascend-operator,clusterd,resilience-controller
MindCluster performance test	bash large_scale_install.sh --install=toolbox
MindCluster fault diagnosis	bash large_scale_install.sh --install=fault-diag
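
The installation sequence can be scripted by generating one command per stage. The following is a minimal Python sketch, not part of MindCluster Ascend Deployer. The `STAGES` list and the `install_commands` helper are hypothetical names, the package names are taken from Table 1, and the sketch assumes that the MindCluster components are installed after CANN.

```python
import shlex

# Stage order follows the sequence stated above; package names come from Table 1.
# Placing the MindCluster component last is an assumption of this sketch.
STAGES = [
    ["sys_pkg"],             # OS initialization
    ["python"],              # OS initialization (Python)
    ["npu"],                 # NPU firmware and driver
    ["kernels", "toolkit"],  # CANN (training, inference, development and debugging)
    ["toolbox"],             # MindCluster performance test
]


def install_commands(stages):
    """Return one argv list per stage, in the order given."""
    return [
        ["bash", "large_scale_install.sh", "--install=" + ",".join(packages)]
        for packages in stages
    ]


# Print the commands instead of running them, so the order can be reviewed first.
for argv in install_commands(STAGES):
    print(shlex.join(argv))
```

Each argv list could be passed to `subprocess.run(argv, check=True)` from the ascend_deployer directory. Stages that require accepting the EULA prompt for input, so they would still need an operator at the terminal.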

(Optional) If CANN and ToolBox need to be installed, you must accept the Huawei Enterprise End User License Agreement (EULA) before the installation proceeds. Enter y or Y to accept the agreement; entering any other character means a refusal. After you accept the agreement, the installation starts automatically.
If the current language environment does not meet the requirements, run one of the following commands to configure the default language environment:
- Chinese
```
export LANG=zh_CN.UTF-8
```
- English
```
export LANG=en_US.UTF-8
```

Viewing the Installation Reports, Test Report, and Status Information

If the installation fails, the report directory is generated in the ~/.ascend_deployer/large_scale_deploy/ directory, containing the installation report files large_scale_deploy.json and host_deploy_report.csv. The report files record the IP address and status of each server.

The installation progress file deployer_progress_output.json is generated in /root/.ascend_deployer/large_scale_deploy/remote_host_data/{IP}/. You can view the installation process and status information in the file.
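
To check many hosts at once, the per-host progress files can be collected with a short script. The following Python sketch assumes only the directory layout described above. The `load_progress` helper is a hypothetical name, and because the internal structure of deployer_progress_output.json is not described here, the sketch parses each file and returns it without interpreting any fields.

```python
import json
from pathlib import Path


def load_progress(base_dir):
    """Map each host directory name (the IP) to its parsed progress file."""
    base = Path(base_dir).expanduser()
    progress = {}
    # One subdirectory per host IP, each holding deployer_progress_output.json.
    for path in sorted(base.glob("*/deployer_progress_output.json")):
        with path.open(encoding="utf-8") as handle:
            progress[path.parent.name] = json.load(handle)
    return progress
```

For example, `load_progress("/root/.ascend_deployer/large_scale_deploy/remote_host_data")` returns a dictionary keyed by IP address that can be printed or filtered as needed.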

Run the following command to generate the test report test_report.csv in the report directory:

bash large_scale_install.sh --test=all

The report file records the IP address and Ascend software version of each server.

Configuring the Large-Scale Deployment File

Configure the IP addresses and usernames of target devices on the MindCluster Ascend Deployer executor.

Go to the ascend-deployer/ascend_deployer directory, open large_scale_inventory.ini in a text editor such as vi, add the configuration, and then save the file and exit (in vi, run the :wq command).

Set variables for the master, worker, deploy_node (optional), and npu_node (optional) host groups based on Table 2.

The Kubernetes version must be 1.28 or later.
When configuring MindCluster cluster scheduling for a Kubernetes cluster that contains different NPU hardware forms, you must provide a feature server for each hardware form. Use npu_node to specify these feature servers. If npu_node is not configured, the NPU hardware type of the first node listed under worker in large_scale_inventory.ini is used by default. Each npu_node host must also belong to the worker group, and only its IP address needs to be provided. For a given IP address, the parameters set in the worker group overwrite those set in npu_node.

**Table 2** Parameters
Parameter	Required or Not	Description
IP	Yes	Server IP address.
ansible_ssh_user	Yes	Account for logging in to a remote server using SSH. The account must be root.
ansible_ssh_pass	No	Password for logging in to a remote server using SSH. If SSH key-based authentication is configured and the root user is allowed to log in, you do not need to set this parameter. NOTE: In large-scale deployment, password-free SSH login must be configured between the nodes in the cluster.
ansible_ssh_port	No	Port for SSH connection. You do not need to set this parameter when the default port number 22 is used. If a non-default port is used, you need to configure this parameter.
set_hostname	No (required when there are multiple master nodes or worker nodes; optional when there is only one node)	Name of a node in a Kubernetes cluster. You are advised to name nodes in sequence using the master-1 or worker-1 format. If a Kubernetes cluster already exists, the parameter value must be the name of the node in that cluster. The name must be in lowercase and cannot be chosen arbitrarily.
npu_num	No	Number of NPUs.
index	No	Sequence number (string type) of the server within the IP address segment. For example, with 1.1.1.1-1.1.1.3 set_hostname="master-{index}", 1.1.1.1 corresponds to set_hostname=master-1.
step_len	No	IP address step. In actual deployments, IP address sequences may contain gaps. For example, if the IP addresses are 1.1.1.1, 1.1.1.3, and 1.1.1.5, the step is 2. If 1.1.1.1-1.1.1.5 step_len=2 is set, 1.1.1.1,1.1.1.3,1.1.1.5 is output. If 1.1.1.1-1.1.1.6 step_len=2 is set, 1.1.1.1,1.1.1.3,1.1.1.5,1.1.1.6 is output. That is, the last IP address is retained even if it is not within the step.
Notes:
- Configure the host of the master node as the default Kubernetes controller node. The number of master nodes must be an odd number.
- Expressions enclosed in {} in a batch input configuration are parsed. Basic mathematical operations and conversion to the string or integer type are supported, and the final output of the {} parsing is a string. For example, if 1.1.1.1-1.1.1.3 set_hostname="master-{ str(int(index)+int('20')) + 'x'}" is entered, the parsed host information is 1.1.1.1 set_hostname="master-21x", 1.1.1.2 set_hostname="master-22x", and so on.
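
The step_len and {} rules in Table 2 can be illustrated with a small Python sketch. This is not the deployer's own parser. `expand_range` and `render_hostname` are hypothetical helpers, only IPv4 ranges are handled, and the use of `eval` with a restricted namespace is an assumption made for illustration that should not be applied to untrusted input. How index is numbered when step_len is also set is not stated in the table, so the sketch treats the two rules separately.

```python
import ipaddress
import re


def expand_range(spec, step_len=1):
    """Expand 'first-last' by step_len, keeping the last address even if off-step."""
    first_text, last_text = spec.split("-")
    first = int(ipaddress.IPv4Address(first_text.strip()))
    last = int(ipaddress.IPv4Address(last_text.strip()))
    if first > last or step_len < 1:
        raise ValueError("invalid range or step: " + spec)
    numbers = list(range(first, last + 1, step_len))
    if numbers[-1] != last:
        numbers.append(last)  # the last IP address is retained
    return [str(ipaddress.IPv4Address(number)) for number in numbers]


def render_hostname(template, index):
    """Evaluate each {...} expression with index bound as a string."""
    names = {"index": str(index), "str": str, "int": int}

    def evaluate(match):
        # Illustrative only: evaluate with no builtins beyond str and int.
        return str(eval(match.group(1).strip(), {"__builtins__": {}}, names))

    return re.sub(r"\{([^{}]*)\}", evaluate, template)


print(expand_range("1.1.1.1-1.1.1.6", step_len=2))
print(render_hostname("master-{ str(int(index)+int('20')) + 'x'}", 1))
```

With the inputs shown, the first call reproduces the 1.1.1.1, 1.1.1.3, 1.1.1.5, 1.1.1.6 result from the step_len description, and the second reproduces master-21x from the notes.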

Configure the large-scale deployment parameter in the [large_scale] field.

**Table 3** Parameter
Parameter	Required or Not	Description
SUB_GROUP_MAX_SIZE	No	Maximum size of a sub-cluster. Ensure that the size of each sub-cluster is less than or equal to the value of this parameter. The value is of the integer type. The default value is 200.

Example:

[master]

[worker]
xx.xxx.xx.x1-xx.xxx.xx.x9 ansible_ssh_user="root" ansible_ssh_pass="xxxxxxx"  step_len=3 set_hostname="master-{ip}-{int(index)+1}-y"

[deploy_node]
10.1.1.1
  
[npu_node]

[large_scale]
SUB_GROUP_MAX_SIZE=5
[all:vars]

Configure deploy_node. Two modes are supported:
1. Manual setting: Specify the hosts in the [deploy_node] host group. This mode takes priority.
2. Automatic selection: If no host is specified in the [deploy_node] host group, the tool selects the nodes automatically.
  The selection principle is as follows:
  
  The tool sorts the IP addresses in sequence, divides them into sub-clusters of at most SUB_GROUP_MAX_SIZE addresses each (as set in the [large_scale] field in large_scale_inventory.ini), and uses the first server in each sub-cluster as the instance node to install and deploy the Ascend software. For example, if SUB_GROUP_MAX_SIZE is 200, a sub-cluster contains a maximum of 200 IP addresses.
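
The automatic selection rule can be sketched in a few lines of Python. This is an illustration of the described behavior, not the tool's implementation. `split_into_subclusters` is a hypothetical helper, and sorting by the numeric value of each address is an assumption about what "in sequence" means.

```python
import ipaddress


def split_into_subclusters(hosts, sub_group_max_size=200):
    """Return (instance_node, members) pairs, one per sub-cluster."""
    # Assumption: "in sequence" means ascending numeric order of the addresses.
    ordered = sorted(hosts, key=lambda host: int(ipaddress.ip_address(host)))
    groups = [
        ordered[start:start + sub_group_max_size]
        for start in range(0, len(ordered), sub_group_max_size)
    ]
    # The first server of each sub-cluster acts as its instance node.
    return [(group[0], group) for group in groups]


hosts = ["10.0.0.10", "10.0.0.2", "10.0.0.1", "10.0.0.3", "10.0.0.4"]
for instance_node, members in split_into_subclusters(hosts, sub_group_max_size=2):
    print(instance_node, members)
```

Numeric sorting places 10.0.0.10 after 10.0.0.4, which a plain string sort would not do. Whether the tool orders addresses the same way is not stated here.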

For Atlas A2 training products, both IPv4 and IPv6 addresses are supported. The type of IP addresses used by an SSH client such as PuTTY to connect to the execution device must be the same as that configured in large_scale_inventory.ini, which should be either IPv4 or IPv6. On other devices, only IPv4 addresses are supported.
