Quick Start
This section uses a single Atlas 800T A2 training server, which acts as both the management node and the compute node, as an example to describe how to install NodeD, Ascend Device Plugin, Ascend Docker Runtime, Volcano, ClusterD, and Ascend Operator, and then use the full-NPU scheduling feature to deliver a training job.
Operation Description
| Procedure | Operation Description | Reference |
|---|---|---|
| Installing Components | This step uses an Atlas 800T A2 training server as an example to describe how to quickly install cluster scheduling components on Ascend devices. | For more details, see Installation and Deployment. |
| Delivering a Training Job | This step uses a simple PyTorch training job as an example to describe how to deliver a training job. | For more details, see Basic Scheduling. |
Environment Setup
Before installing components, ensure that a cluster environment has been set up.
- Kubernetes has been installed on all nodes. The supported versions are 1.17.x to 1.34.x. If Volcano is required, install Kubernetes 1.19.x or later. For details about the Kubernetes version, refer to Kubernetes compatibility on the Volcano official website. To obtain the software package, visit the Kubernetes community.
- Docker has been installed on all nodes. The supported versions are 18.09.x to 28.5.1. To obtain the software package, visit the Docker community or the official website.
- The firmware and drivers have been installed on all nodes.
- Check whether npu-smi and hccn_tool can run properly on the host.
- Check whether the firmware and driver versions match cluster scheduling components. For details, see Ascend Training Solution Version Mapping.
- To query the NPU driver and firmware version, run the npu-smi info -t board -i NPU ID command. In the command output, the value of Software Version is the NPU driver version, and the value of Firmware Version is the NPU firmware version.
- You can run the npu-smi info command to query the number in the chip model name, which is shown in the Name field of the command output. In the example used in this section, the value of {xxx} is 910.
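The version ranges listed above can be checked with a short script before installation. The following is a minimal sketch and not part of the product: the helper names version_ge and version_in_range are hypothetical, the script assumes GNU sort (for the -V option), and the bounds 1.17.0 and 1.34.999 are stand-ins chosen here for the "1.17.x to 1.34.x" range.

```shell
# Hypothetical helpers for comparing dotted version strings.
# Assumes GNU sort, which provides -V (version sort).

# version_ge A B: succeeds when version A is greater than or equal to version B.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# version_in_range V MIN MAX: succeeds when MIN <= V <= MAX.
version_in_range() {
    version_ge "$1" "$2" && version_ge "$3" "$1"
}

# Example use on a node (not executed here):
#   version_in_range "$(docker version --format '{{.Server.Version}}')" 18.09.0 28.5.1 \
#       || echo "Docker version is outside the supported range"
```

The functions only compare strings, so the version to test can come from any source, for example the VERSION column of kubectl get node with the leading "v" removed.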
Installing Components
This part uses an Atlas 800T A2 training server as an example. For details about the installation procedure and parameter description of all components, see Installation.
- Log in to the compute or management node as the root user and create component installation directories.
- Run the following commands in sequence to create installation directories on the compute node. The following directories are only examples.
mkdir /home/noded
mkdir /home/devicePlugin
mkdir /home/Ascend-docker-runtime
- Run the following commands in sequence to create installation directories on the management node. The following directories are only examples.
mkdir /home/ascend-volcano
mkdir /home/ascend-operator
mkdir /home/clusterd
mkdir /home/noded
mkdir /home/devicePlugin
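On a single server that is both the management node and the compute node, the two directory lists overlap (noded and devicePlugin appear in both), and a plain mkdir fails on a directory that already exists. The sketch below shows one way to make the step repeatable. The function name make_install_dirs is hypothetical, and the directory names are the examples used in this section.

```shell
# make_install_dirs BASE NAME...: create BASE/NAME for every NAME given.
# mkdir -p does not fail when a directory already exists, so the
# function can be run more than once on the same node.
make_install_dirs() {
    base=$1
    shift
    for name in "$@"; do
        mkdir -p "$base/$name" || return 1
    done
}

# Example use (not executed here):
#   make_install_dirs /home noded devicePlugin Ascend-docker-runtime \
#       ascend-volcano ascend-operator clusterd
```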
- Download the software packages of the corresponding architecture as required. The following uses the AArch64 architecture as an example.
- Run the following commands in sequence on the compute node to download the NodeD, Ascend Device Plugin, and Ascend Docker Runtime installation packages and decompress them:
cd /home/noded
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-noded_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-noded_7.3.0_linux-aarch64.zip
cd /home/devicePlugin
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-device-plugin_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-device-plugin_7.3.0_linux-aarch64.zip
cd /home/Ascend-docker-runtime
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-docker-runtime_7.3.0_linux-aarch64.run
- Run the following commands in sequence on the management node to download Volcano, ClusterD, and Ascend Operator installation packages:
cd /home/ascend-volcano
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-volcano_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-volcano_7.3.0_linux-aarch64.zip
cd /home/ascend-operator
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-ascend-operator_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-ascend-operator_7.3.0_linux-aarch64.zip
cd /home/clusterd
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-clusterd_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-clusterd_7.3.0_linux-aarch64.zip
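The .zip download URLs in this step follow one naming pattern, so they can be generated from the component name, version, and architecture. The sketch below reproduces that pattern. The function name package_url is hypothetical, and only the aarch64 URLs shown above are confirmed by this section; the file names for other architectures are an assumption. The Ascend Docker Runtime package uses a different name (a .run file without the mindxdl prefix) and is not covered.

```shell
# package_url COMPONENT VERSION ARCH: print the release URL of a
# MindCluster component package, following the pattern of the URLs
# used in this section.
package_url() {
    component=$1
    version=$2
    arch=$3
    printf 'https://gitcode.com/Ascend/mind-cluster/releases/download/v%s/Ascend-mindxdl-%s_%s_linux-%s.zip\n' \
        "$version" "$component" "$version" "$arch"
}

# Example use (not executed here):
#   wget "$(package_url clusterd 7.3.0 aarch64)"
```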
- Build component images.
- Run the following command to pull the base image on the compute node:
docker pull ubuntu:22.04
- Run the following commands in sequence to pull the base image on the management node:
docker pull arm64v8/alpine:latest
docker tag arm64v8/alpine:latest alpine:latest
docker pull ubuntu:22.04
- Run the following commands in sequence to build component images on the compute node:
cd /home/noded
docker build --no-cache -t noded:v7.3.0 ./
cd /home/devicePlugin
docker build --no-cache -t ascend-k8sdeviceplugin:v7.3.0 ./
- Run the following commands in sequence to build component images on the management node:
cd /home/ascend-volcano/volcano-v1.7.0
docker build --no-cache -t volcanosh/vc-scheduler:v1.7.0 ./ -f ./Dockerfile-scheduler
docker build --no-cache -t volcanosh/vc-controller-manager:v1.7.0 ./ -f ./Dockerfile-controller
cd /home/ascend-operator
docker build --no-cache -t ascend-operator:v7.3.0 ./
cd /home/clusterd
docker build --no-cache -t clusterd:v7.3.0 ./
- Create a node label.
- Run the following command on the Kubernetes management node to query the node name:
kubectl get node
Command output:
NAME       STATUS   ROLES    AGE   VERSION
worker01   Ready    worker   23h   v1.17.3
- Run the following commands in sequence to label the compute node (worker01 in this example):
kubectl label nodes worker01 node-role.kubernetes.io/worker=worker
kubectl label nodes worker01 workerselector=dls-worker-node
kubectl label nodes worker01 host-arch=huawei-arm
kubectl label nodes worker01 accelerator=huawei-Ascend910
kubectl label nodes worker01 accelerator-type=module-{xxx}b-8    # Enter the number that indicates the processor model.
kubectl label nodes worker01 nodeDEnable=on
- Run the following command to label the management node (master01 in this example):
kubectl label nodes master01 masterselector=dls-master-node
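Because the compute-node labels differ only in the node name and the {xxx} model number, the label commands can be generated and reviewed before they are run. The sketch below prints the commands instead of executing them. The function name worker_label_commands is hypothetical; the label keys and values are the ones listed in this step.

```shell
# worker_label_commands NODE MODEL: print the kubectl label commands
# for one compute node. MODEL is the number that replaces {xxx},
# for example 910. Nothing is executed; the output is plain text.
worker_label_commands() {
    node=$1
    model=$2
    for label in \
        "node-role.kubernetes.io/worker=worker" \
        "workerselector=dls-worker-node" \
        "host-arch=huawei-arm" \
        "accelerator=huawei-Ascend910" \
        "accelerator-type=module-${model}b-8" \
        "nodeDEnable=on"
    do
        printf 'kubectl label nodes %s %s\n' "$node" "$label"
    done
}

# Example use (not executed here): review the output, then pipe it to sh.
#   worker_label_commands worker01 910
#   worker_label_commands worker01 910 | sh
```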
- Create a user account.
- Run the following commands in sequence to create a user account on the compute node:
useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
usermod -a -G HwHiAiUser hwMindX
- Run the following command to create a user account on the management node:
useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
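When one server is both the compute node and the management node, the useradd command appears in both lists, and the second run fails because the account already exists. The sketch below guards the command with an existence check. The function names user_exists and ensure_hwmindx_user are hypothetical; the useradd arguments are the ones from this step.

```shell
# user_exists NAME: succeeds when the account NAME exists on this host.
user_exists() {
    id -u "$1" >/dev/null 2>&1
}

# ensure_hwmindx_user: create hwMindX only if it is missing.
# Must be run as root, like the original useradd command.
ensure_hwmindx_user() {
    if user_exists hwMindX; then
        echo "hwMindX already exists, skipping useradd"
    else
        useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
    fi
}
```

On a compute node, the usermod -a -G HwHiAiUser hwMindX command is still required after the account is created.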
- Create a log directory. Custom log directories are not supported.
- Run the following commands in sequence to create log directories on the compute node:
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
mkdir -m 750 /var/log/mindx-dl/devicePlugin
chown root:root /var/log/mindx-dl/devicePlugin
mkdir -m 750 /var/log/mindx-dl/noded
chown hwMindX:hwMindX /var/log/mindx-dl/noded
- Run the following commands in sequence to create log directories on the management node:
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
mkdir -m 750 /var/log/mindx-dl/volcano-controller
chown hwMindX:hwMindX /var/log/mindx-dl/volcano-controller
mkdir -m 750 /var/log/mindx-dl/volcano-scheduler
chown hwMindX:hwMindX /var/log/mindx-dl/volcano-scheduler
mkdir -m 750 /var/log/mindx-dl/ascend-operator
chown hwMindX:hwMindX /var/log/mindx-dl/ascend-operator
mkdir -m 750 /var/log/mindx-dl/clusterd
chown hwMindX:hwMindX /var/log/mindx-dl/clusterd
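After the log directories are created, their permissions can be verified against the modes used in the commands above (755 for the parent directory, 750 for each component directory). The following is a minimal sketch: the function name dir_mode_is is hypothetical, and it assumes GNU stat (for the -c option).

```shell
# dir_mode_is DIR MODE: succeeds when DIR exists and its octal
# permission bits equal MODE (for example 750). Assumes GNU stat.
dir_mode_is() {
    [ -d "$1" ] && [ "$(stat -c '%a' "$1")" = "$2" ]
}

# Example use (not executed here):
#   dir_mode_is /var/log/mindx-dl 755       || echo "unexpected mode on /var/log/mindx-dl"
#   dir_mode_is /var/log/mindx-dl/noded 750 || echo "unexpected mode on noded log directory"
```

The owner can be inspected in the same way with stat -c '%U:%G'.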
- Run the following command on any node to create a namespace:
kubectl create ns mindx-dl
- Install components.
- Run the following commands in sequence to install Ascend Docker Runtime on the host of the compute node:
cd /home/Ascend-docker-runtime
chmod u+x Ascend-docker-runtime_7.3.0_linux-aarch64.run
./Ascend-docker-runtime_7.3.0_linux-aarch64.run --install
systemctl daemon-reload && systemctl restart docker
- Run the following commands in sequence to copy the component startup YAML files of the compute node to the installation directory of the corresponding component on the management node:
cd /home/noded
scp noded-v7.3.0.yaml root@{IP_address_of_the_management_node}:/home/noded
cd /home/devicePlugin
scp device-plugin-volcano-v7.3.0.yaml root@{IP_address_of_the_management_node}:/home/devicePlugin
- Run the following commands in sequence on the management node to install components:
cd /home/ascend-operator
kubectl apply -f ascend-operator-v7.3.0.yaml
cd /home/ascend-volcano/volcano-v1.7.0    # Change v1.7.0 to v1.9.0 if Volcano 1.9.0 is required.
kubectl apply -f volcano-v1.7.0.yaml
cd /home/noded
kubectl apply -f noded-v7.3.0.yaml
cd /home/clusterd
kubectl apply -f clusterd-v7.3.0.yaml
cd /home/devicePlugin
kubectl apply -f device-plugin-volcano-v7.3.0.yaml
Take NodeD as an example. If the following information is displayed, the component is successfully installed.
serviceaccount/noded created
clusterrole.rbac.authorization.k8s.io/pods-noded-role created
clusterrolebinding.rbac.authorization.k8s.io/pods-noded-rolebinding created
daemonset.apps/noded created
- Run the following command on the management node to check whether the component is started:
kubectl get pod -n mindx-dl
Take NodeD as an example. If Running is displayed in the command output, the component is started successfully.
NAME          READY   STATUS    RESTARTS   AGE
...
noded-fd6t8   1/1     Running   0          74s
...
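The status check above can be automated by reading the STATUS column of the command output. The sketch below is one way to do this with awk. The function name all_pods_running is hypothetical, and it assumes the default column layout shown above (NAME, READY, STATUS, RESTARTS, AGE) with a header line.

```shell
# all_pods_running: read "kubectl get pod" output on standard input and
# succeed only if at least one pod is listed and every pod is Running.
# Pods in any other state are printed.
all_pods_running() {
    awk '
        BEGIN { bad = 0; total = 0 }
        NR > 1 {
            total++
            if ($3 != "Running") {
                bad = 1
                print "not running: " $1 " (" $3 ")"
            }
        }
        END { exit (bad || total == 0) }
    '
}

# Example use (not executed here):
#   kubectl get pod -n mindx-dl | all_pods_running && echo "all components started"
```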
Delivering a Training Job
- Build an image.
Download the ascend-pytorch training image of the 24.0.X version from the Ascend image repository based on the system architecture (ARM/x86_64). Change the default user in the container to root based on the training base image. The image does not contain files such as the training script and code. During training, those files are directly mapped to the container.
- Perform script adaptation.
- Download ResNet50_ID4149_for_PyTorch from the master branch in the PyTorch code repository and use it as the training code.
- Prepare a dataset corresponding to ResNet-50, and comply with corresponding specifications when using the dataset.
- Upload the dataset to the storage node as an administrator. Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet.
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# pwd
- Decompress the downloaded training code on the local host, and upload the ModelZoo-PyTorch/PyTorch/built-in/cv/classification/ResNet50_ID4149_for_PyTorch directory in the decompressed training code to a directory in the environment, for example, /data/atlas_dls/public/code/.
- In the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch directory, comment out the following code in main.py:
def main():
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = args.addr
    #os.environ['MASTER_PORT'] = '29501'    # Comment out this line of code.
    if os.getenv('ALLOW_FP32', False) and os.getenv('ALLOW_HF32', False):
        raise RuntimeError('ALLOW_FP32 and ALLOW_HF32 cannot be set at the same time!')
    elif os.getenv('ALLOW_HF32', False):
        torch.npu.conv.allow_hf32 = True
    elif os.getenv('ALLOW_FP32', False):
        torch.npu.conv.allow_hf32 = False
        torch.npu.matmul.allow_hf32 = False
- Go to the mindcluster-deploy repository, select a matched branch based on mindcluster-deploy Version Description, obtain the train_start.sh file from the samples/train/basic-training/without-ranktable/pytorch directory, and construct the following directory structure in the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts directory.
root@ubuntu:/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts#
scripts/
├── train_start.sh
- Prepare the job YAML file.
- Go to the mindcluster-deploy repository, select a matched branch based on mindcluster-deploy Version Description, and obtain the pytorch_standalone_acjob_{xxx}b.yaml file in the samples/train/basic-training/without-ranktable/pytorch directory. ({xxx} indicates the processor model.) A single-server single-processor job is presented in the example file by default.
- Modify the example YAML file and upload it to any file path. For details about the parameters in the YAML file, see Table 1.
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
...
spec:
  ...
  replicaSpecs:
    Master:
      ...
        spec:
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: module-{xxx}b-8    # card-{xxx}b-2 is changed to module-{xxx}b-8, where {xxx} indicates the processor model.
          containers:
          - name: ascend
            image: pytorch-test:latest    # Change the value to the image name obtained in Step 1.
            ...
            resources:
              limits:
                huawei.com/Ascend910: 1
              requests:
                huawei.com/Ascend910: 1
            ...
          volumes:
          - name: code
            nfs:    # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
              server: 127.0.0.1
              path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
          - name: data
            nfs:    # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
              server: 127.0.0.1
              path: "/data/atlas_dls/public/dataset/"
          - name: output
            nfs:    # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
              server: 127.0.0.1
              path: "/data/atlas_dls/output/"
...
- Run the following command to deliver a single-server single-processor job:
kubectl apply -f pytorch_standalone_acjob_{xxx}b.yaml
- Run the following command to check the pod running status:
kubectl get pod --all-namespaces -o wide
A sample command output is as follows. If "Running" is displayed, the job is running properly.
NAMESPACE   NAME                            READY   STATUS    RESTARTS   AGE   IP                NODE       NOMINATED NODE   READINESS GATES
default     default-test-pytorch-master-0   1/1     Running   0          6s    192.168.244.xxx   worker01   <none>           <none>
If the training job is always in the Pending state after being delivered, refer to Training Job Is in the Pending State Because "nodes are unavailable" or A Job Is Pending Due to Insufficient Resources to rectify the fault.
- View the training result.
- Run the following command on any node to view the training result:
kubectl logs -n Namespace_name Pod_name
Example:
kubectl logs -n default default-test-pytorch-master-0
- View the training logs. If the following information is displayed, the training is successful:
[20251218-20:31:57] [MindXDL Service Log]server id is: 0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=7 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=6 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=5 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=4 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=3 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=2 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=1 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2025-12-18 20:32:02] [WARNING] [470] profiler.py: Invalid parameter export_type: None, reset it to text.
/job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:201: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
  warnings.warn('You have chosen to seed training. '
/job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:208: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
Use GPU: 0 for training
=> creating model 'resnet50'