Quick Start

This section uses a single Atlas 800T A2 training server, which serves as both the management node and the compute node, as an example. It describes how to install NodeD, Ascend Device Plugin, Ascend Docker Runtime, Volcano, ClusterD, and Ascend Operator, and then use the full-NPU scheduling feature to deliver a training job.

Operation Description

**Table 1** Key steps

| Procedure | Operation Description | Reference |
| --- | --- | --- |
| Installing Components | This step uses an Atlas 800T A2 training server as an example to describe how to quickly install cluster scheduling components on Ascend devices. | For more details, see Installation and Deployment. |
| Delivering a Training Job | This step uses a simple PyTorch training job as an example to describe how to deliver a training job. | For more details, see Basic Scheduling. |

Environment Setup

Before installing components, ensure that a cluster environment has been set up.

- Kubernetes has been installed on all nodes. The supported versions are 1.17.x to 1.34.x. If Volcano is required, install Kubernetes 1.19.x or later. For details about version compatibility, see Kubernetes compatibility on the Volcano official website. To obtain the software package, visit the Kubernetes community.
- Docker has been installed on all nodes. The supported versions are 18.09.x to 28.5.1. To obtain the software package, visit the Docker community or the official website.
- The firmware and drivers have been installed on all nodes.
- Check whether npu-smi and hccn_tool can run properly on the host.
  - Check whether the firmware and driver versions match the cluster scheduling components. For details, see Ascend Training Solution Version Mapping.
  - To query the NPU driver and firmware versions, run the npu-smi info -t board -i NPU ID command. In the command output, the value of Software Version is the NPU driver version, and the value of Firmware Version is the NPU firmware version.
  - To find the number in the chip model name, run the npu-smi info command and check the Name field in the command output. This number is the value of {xxx} used later in this section, for example, 910.
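
The Kubernetes version rule above can be checked mechanically. The following is a minimal sketch, not part of the product tooling: the function name k8s_version_supported is hypothetical, and the version string is passed in by hand (for example, copied from the VERSION column of kubectl get node).

```shell
#!/bin/sh
# Return 0 when a Kubernetes version such as "v1.23.4" lies in the
# 1.17.x-1.34.x range stated above. The optional second argument raises the
# minimum minor version, for example 19 when Volcano is required.
# (Hypothetical helper written for this example.)
k8s_version_supported() {
    ver=${1#v}               # drop an optional leading "v"
    major=${ver%%.*}         # text before the first dot
    rest=${ver#*.}
    minor=${rest%%.*}        # text between the first and second dots
    min_minor=${2:-17}
    [ "$major" -eq 1 ] && [ "$minor" -ge "$min_minor" ] && [ "$minor" -le 34 ]
}

if k8s_version_supported v1.17.3 19; then
    echo "v1.17.3: usable with Volcano"
else
    echo "v1.17.3: too old for Volcano"
fi
```

The same comparison style can be applied to the Docker version range, which is left out here to keep the sketch short.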

Installing Components

This part uses an Atlas 800T A2 training server as an example. For details about the installation procedure and parameter description of all components, see Installation.

Log in to the compute or management node as the root user and create component installation directories.
1. Run the following commands in sequence to create installation directories on the compute node. The following directories are only examples.
```
mkdir /home/noded
mkdir /home/devicePlugin
mkdir /home/Ascend-docker-runtime
```
2. Run the following commands in sequence to create installation directories on the management node. The following directories are only examples.
```
mkdir /home/ascend-volcano
mkdir /home/ascend-operator
mkdir /home/clusterd
mkdir /home/noded
mkdir /home/devicePlugin
```
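
Because the example server acts as both the compute node and the management node, /home/noded and /home/devicePlugin appear in both lists, and a plain mkdir reports an error when the directory already exists. A minimal sketch of a repeatable variant follows; the function name make_install_dirs is hypothetical, and the demonstration runs under a temporary directory instead of /home.

```shell
#!/bin/sh
# Create each named directory under a base path. mkdir -p succeeds even when
# the directory already exists, so the function can be run more than once.
# (Hypothetical helper written for this example.)
make_install_dirs() {
    base=$1
    shift
    for name in "$@"; do
        mkdir -p "$base/$name" || return 1
    done
}

# Demonstration under a throwaway base directory rather than /home.
demo_base=$(mktemp -d)
make_install_dirs "$demo_base" noded devicePlugin Ascend-docker-runtime
# Second call overlaps with the first, as the two lists above do.
make_install_dirs "$demo_base" ascend-volcano ascend-operator clusterd noded devicePlugin
ls "$demo_base"
```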

Download the software packages of the corresponding architecture as required. The following uses the AArch64 architecture as an example.

Run the following commands in sequence on the compute node to download the NodeD, Ascend Device Plugin, and Ascend Docker Runtime installation packages and decompress them:

```
cd /home/noded
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-noded_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-noded_7.3.0_linux-aarch64.zip

cd /home/devicePlugin
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-device-plugin_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-device-plugin_7.3.0_linux-aarch64.zip

cd /home/Ascend-docker-runtime
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-docker-runtime_7.3.0_linux-aarch64.run
```

Run the following commands in sequence on the management node to download the Volcano, Ascend Operator, and ClusterD installation packages and decompress them:

```
cd /home/ascend-volcano
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-volcano_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-volcano_7.3.0_linux-aarch64.zip

cd /home/ascend-operator
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-ascend-operator_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-ascend-operator_7.3.0_linux-aarch64.zip

cd /home/clusterd
wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-clusterd_7.3.0_linux-aarch64.zip
unzip Ascend-mindxdl-clusterd_7.3.0_linux-aarch64.zip
```
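
The ZIP package URLs above share one naming pattern, so they can be assembled from the component name, version, and architecture. The sketch below illustrates this; the function name mindxdl_package_url is hypothetical, and it is an assumption that packages for other architectures follow the same pattern. The Ascend Docker Runtime .run package uses a different file name and is not covered.

```shell
#!/bin/sh
# Print the release URL of one mindxdl ZIP package.
#   $1 = component name as it appears in the file name
#        (noded, device-plugin, volcano, ascend-operator, clusterd)
#   $2 = release version, for example 7.3.0
#   $3 = architecture string, for example aarch64
# (Hypothetical helper; the pattern is inferred from the URLs shown above.)
mindxdl_package_url() {
    printf 'https://gitcode.com/Ascend/mind-cluster/releases/download/v%s/Ascend-mindxdl-%s_%s_linux-%s.zip\n' \
        "$2" "$1" "$2" "$3"
}

mindxdl_package_url noded 7.3.0 aarch64
```

The printed URL can then be passed to wget, for example wget "$(mindxdl_package_url clusterd 7.3.0 aarch64)".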

Build component images.

Run the following command to pull the base image on the compute node:
```
docker pull ubuntu:22.04
```

Run the following commands in sequence to pull the base images on the management node:

```
docker pull arm64v8/alpine:latest
docker tag arm64v8/alpine:latest alpine:latest
docker pull ubuntu:22.04
```

Run the following commands in sequence to build component images on the compute node:

```
cd /home/noded
docker build --no-cache -t noded:v7.3.0 ./

cd /home/devicePlugin
docker build --no-cache -t ascend-k8sdeviceplugin:v7.3.0 ./
```

Run the following commands in sequence to build component images on the management node:

```
cd /home/ascend-volcano/volcano-v1.7.0
docker build --no-cache -t volcanosh/vc-scheduler:v1.7.0 ./ -f ./Dockerfile-scheduler
docker build --no-cache -t volcanosh/vc-controller-manager:v1.7.0 ./ -f ./Dockerfile-controller

cd /home/ascend-operator
docker build --no-cache -t ascend-operator:v7.3.0 ./

cd /home/clusterd
docker build --no-cache -t clusterd:v7.3.0 ./
```
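
After the builds finish, it can help to confirm that every expected image tag exists before continuing. The following is a minimal sketch under stated assumptions: the function name missing_images is hypothetical, and it reads a list of repository:tag lines on standard input, such as the output of docker images --format '{{.Repository}}:{{.Tag}}'. The demonstration uses sample input so that it runs without Docker.

```shell
#!/bin/sh
# Read "repository:tag" lines on stdin and print every expected image
# (given as arguments) that is not in that list.
# (Hypothetical helper written for this example.)
missing_images() {
    present=$(cat)
    for image in "$@"; do
        # -F: fixed string, -x: whole-line match, -q: no output
        printf '%s\n' "$present" | grep -Fxq -- "$image" || printf '%s\n' "$image"
    done
}

# Sample input standing in for the docker images output on a compute node.
printf '%s\n' 'noded:v7.3.0' 'ubuntu:22.04' |
    missing_images noded:v7.3.0 ascend-k8sdeviceplugin:v7.3.0
```

With the sample input, only ascend-k8sdeviceplugin:v7.3.0 is reported as missing.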

Create a node label.

Run the following command on the Kubernetes management node to query the node name:

```
kubectl get node
```

Command output:

```
NAME       STATUS   ROLES    AGE   VERSION
worker01   Ready    worker   23h   v1.17.3
```

Run the following commands in sequence to label the compute node (worker01 in this example):

```
kubectl label nodes worker01 node-role.kubernetes.io/worker=worker
kubectl label nodes worker01 workerselector=dls-worker-node
kubectl label nodes worker01 host-arch=huawei-arm
kubectl label nodes worker01 accelerator=huawei-Ascend910
kubectl label nodes worker01 accelerator-type=module-{xxx}b-8     # Replace {xxx} with the number that indicates the processor model.
kubectl label nodes worker01 nodeDEnable=on
```

Run the following command to label the management node (master01 in this example):
```
kubectl label nodes master01 masterselector=dls-master-node
```
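
The compute-node label commands differ only in the node name and the processor model number, so they can be generated for review before being run. The sketch below only prints the commands; the function name print_worker_labels is hypothetical, and the label keys and values are the ones listed above.

```shell
#!/bin/sh
# Print (not run) the label commands for one compute node.
#   $1 = node name, for example worker01
#   $2 = number from the chip model name, for example 910
# (Hypothetical helper written for this example.)
print_worker_labels() {
    node=$1
    model=$2
    for label in \
        node-role.kubernetes.io/worker=worker \
        workerselector=dls-worker-node \
        host-arch=huawei-arm \
        accelerator=huawei-Ascend910 \
        "accelerator-type=module-${model}b-8" \
        nodeDEnable=on
    do
        printf 'kubectl label nodes %s %s\n' "$node" "$label"
    done
}

print_worker_labels worker01 910
```

After the labels are applied, kubectl get node worker01 --show-labels lists them for verification.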

Create a user account.
1. Run the following commands in sequence to create a user account on the compute node:
```
useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
usermod -a -G HwHiAiUser hwMindX
```
2. Run the following command to create a user account on the management node:
```
useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
```

Create a log directory. Custom log directories are not supported.

Run the following commands in sequence to create log directories on the compute node:

```
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
mkdir -m 750 /var/log/mindx-dl/devicePlugin
chown root:root /var/log/mindx-dl/devicePlugin
mkdir -m 750 /var/log/mindx-dl/noded
chown hwMindX:hwMindX /var/log/mindx-dl/noded
```

Run the following commands in sequence to create log directories on the management node:

```
mkdir -m 755 /var/log/mindx-dl
chown root:root /var/log/mindx-dl
mkdir -m 750 /var/log/mindx-dl/volcano-controller
chown hwMindX:hwMindX /var/log/mindx-dl/volcano-controller
mkdir -m 750 /var/log/mindx-dl/volcano-scheduler
chown hwMindX:hwMindX /var/log/mindx-dl/volcano-scheduler
mkdir -m 750 /var/log/mindx-dl/ascend-operator
chown hwMindX:hwMindX /var/log/mindx-dl/ascend-operator
mkdir -m 750 /var/log/mindx-dl/clusterd
chown hwMindX:hwMindX /var/log/mindx-dl/clusterd
```
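
The directory modes set above (755 for the parent directory, 750 for each component directory) can be verified afterwards. The following is a minimal sketch, assuming a hypothetical helper named dir_has_perms that compares the permission string printed by ls -ld; mode 750 appears as drwxr-x--- and mode 755 as drwxr-xr-x. The demonstration uses a temporary directory so that it runs without root privileges.

```shell
#!/bin/sh
# Return 0 when $1 is a directory whose permission string (the first ten
# characters of "ls -ld") equals $2, for example drwxr-x--- for mode 750.
# (Hypothetical helper written for this example.)
dir_has_perms() {
    [ -d "$1" ] && [ "$(ls -ld "$1" | cut -c1-10)" = "$2" ]
}

# Demonstration in a throwaway location instead of /var/log/mindx-dl.
demo_dir=$(mktemp -d)
mkdir -m 750 "$demo_dir/noded"
if dir_has_perms "$demo_dir/noded" drwxr-x---; then
    echo "noded: mode 750 confirmed"
fi
```

Ownership is not checked in this sketch; ls -ld also shows the owner and group if they need to be inspected.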

Run the following command on any node to create a namespace:
```
kubectl create ns mindx-dl
```

Install components.

Run the following commands in sequence to install Ascend Docker Runtime on the host of the compute node:

```
cd /home/Ascend-docker-runtime
chmod u+x Ascend-docker-runtime_7.3.0_linux-aarch64.run
./Ascend-docker-runtime_7.3.0_linux-aarch64.run --install
systemctl daemon-reload && systemctl restart docker
```

Run the following commands in sequence to copy the component startup YAML files from the compute node to the corresponding component installation directories on the management node:

```
cd /home/noded
scp noded-v7.3.0.yaml root@{IP_address_of_the_management_node}:/home/noded

cd /home/devicePlugin
scp device-plugin-volcano-v7.3.0.yaml root@{IP_address_of_the_management_node}:/home/devicePlugin
```

Run the following commands in sequence on the management node to install components:

```
cd /home/ascend-operator
kubectl apply -f ascend-operator-v7.3.0.yaml

cd /home/ascend-volcano/volcano-v1.7.0  # Change v1.7.0 to v1.9.0 if Volcano 1.9.0 is required.
kubectl apply -f volcano-v1.7.0.yaml

cd /home/noded
kubectl apply -f noded-v7.3.0.yaml

cd /home/clusterd
kubectl apply -f clusterd-v7.3.0.yaml

cd /home/devicePlugin
kubectl apply -f device-plugin-volcano-v7.3.0.yaml
```

Take NodeD as an example. If the following information is displayed, the component is successfully installed.

```
serviceaccount/noded created
clusterrole.rbac.authorization.k8s.io/pods-noded-role created
clusterrolebinding.rbac.authorization.k8s.io/pods-noded-rolebinding created
daemonset.apps/noded created
```

Run the following command on the management node to check whether the component is started:

```
kubectl get pod -n mindx-dl
```

Take NodeD as an example. If Running is displayed in the command output, the component is started successfully.

```
NAME                              READY   STATUS    RESTARTS   AGE
...
noded-fd6t8                       1/1     Running   0          74s
...
```
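
With several components installed, scanning the STATUS column by eye is easy to get wrong. The sketch below filters the command output for pods that are not in the Running state. The function name pods_not_running is hypothetical, and the pod name clusterd-example in the sample input is made up for illustration.

```shell
#!/bin/sh
# Read "kubectl get pod -n <namespace>" output on stdin and print the name of
# every pod whose STATUS column (third field) is not "Running".
# The header row and short lines such as "..." are skipped.
# (Hypothetical helper written for this example.)
pods_not_running() {
    awk 'NR > 1 && NF >= 3 && $3 != "Running" { print $1 }'
}

# Sample input; clusterd-example is an invented pod name.
pods_not_running <<'EOF'
NAME                              READY   STATUS    RESTARTS   AGE
noded-fd6t8                       1/1     Running   0          74s
clusterd-example                  0/1     Pending   0          10s
EOF
```

On a live cluster the input would come from kubectl get pod -n mindx-dl piped into the function. Output produced with --all-namespaces has an extra leading NAMESPACE column, so the field numbers in the awk program would need to shift by one.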

Delivering a Training Job

Build an image.
Download the ascend-pytorch training image of the 24.0.X version from the Ascend image repository based on the system architecture (ARM/x86_64). Change the default user in the container to root based on the training base image. The image does not contain files such as the training script and code. During training, those files are directly mapped to the container.
Perform script adaptation.
1. Download ResNet50_ID4149_for_PyTorch from the master branch in the PyTorch code repository and use it as the training code.
2. Prepare a dataset corresponding to ResNet-50, and comply with corresponding specifications when using the dataset.
3. Upload the dataset to the storage node as an administrator. Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet.
```
root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# pwd
```
4. Decompress the training code downloaded in step 1 on the local host, and upload the ModelZoo-PyTorch/PyTorch/built-in/cv/classification/ResNet50_ID4149_for_PyTorch directory from the decompressed training code to a directory in the environment, for example, /data/atlas_dls/public/code/.
5. In the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch directory, comment out the line that sets MASTER_PORT in main.py, as marked in the following code:
```
def main():
    args = parser.parse_args()
    os.environ['MASTER_ADDR'] = args.addr
    #os.environ['MASTER_PORT'] = '29501'  # Comment out this line of code.
    if os.getenv('ALLOW_FP32', False) and os.getenv('ALLOW_HF32', False):
        raise RuntimeError('ALLOW_FP32 and ALLOW_HF32 cannot be set at the same time!')
    elif os.getenv('ALLOW_HF32', False):
        torch.npu.conv.allow_hf32 = True
    elif os.getenv('ALLOW_FP32', False):
        torch.npu.conv.allow_hf32 = False
        torch.npu.matmul.allow_hf32 = False
```
6. Go to the mindcluster-deploy repository, select a matched branch based on mindcluster-deploy Version Description, obtain the train_start.sh file from the samples/train/basic-training/without-ranktable/pytorch directory, and construct the following directory structure in the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts directory.
```
root@ubuntu:/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts#
scripts/
     ├── train_start.sh
```

Prepare the job YAML file.

Go to the mindcluster-deploy repository, select a matched branch based on mindcluster-deploy Version Description, and obtain the pytorch_standalone_acjob_{xxx}b.yaml file in the samples/train/basic-training/without-ranktable/pytorch directory. ({xxx} indicates the processor model.) A single-server single-processor job is presented in the example file by default.

Modify the example YAML file and upload it to any file path. For details about the parameters in the YAML file, see Table 1.

```
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
...
spec:
...
  replicaSpecs:
    Master:
...
        spec:
          nodeSelector:
            host-arch: huawei-arm
            accelerator-type: module-{xxx}b-8   # Change card-{xxx}b-2 to module-{xxx}b-8, where {xxx} indicates the processor model.
          containers:
          - name: ascend
            image: pytorch-test:latest     # Change the value to the image name obtained in Step 1.
...
            resources:
              limits:
                huawei.com/Ascend910: 1
              requests:
                huawei.com/Ascend910: 1
...
          volumes:
          - name: code
            nfs:      # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
              server: 127.0.0.1
              path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
          - name: data
            nfs:     # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
              server: 127.0.0.1
              path: "/data/atlas_dls/public/dataset/"
          - name: output
            nfs:     # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
              server: 127.0.0.1
              path: "/data/atlas_dls/output/"
...
```
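
The {xxx} placeholder in the YAML file must be replaced with the processor model number before the job is delivered. The following is a minimal sketch of that substitution, assuming a hypothetical helper named fill_processor_model that filters standard input; it is not part of the mindcluster-deploy samples.

```shell
#!/bin/sh
# Replace every literal "{xxx}" on stdin with the processor model number
# given as $1 and write the result to stdout.
# (Hypothetical helper written for this example.)
fill_processor_model() {
    sed "s/{xxx}/$1/g"
}

printf '%s\n' 'accelerator-type: module-{xxx}b-8' | fill_processor_model 910
```

For the sample line, the output is accelerator-type: module-910b-8. To process a whole file, redirect it into the function and write the result to a new file, then search the result for any remaining {xxx} with grep -F before running kubectl apply.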

Run the following command to deliver a single-server single-processor job:
```
kubectl apply -f pytorch_standalone_acjob_{xxx}b.yaml
```
Run the following command to check the pod running status:
```
kubectl get pod --all-namespaces -o wide
```
A sample command output is as follows. If "Running" is displayed, the job is running properly.
```
NAMESPACE        NAME                                       READY   STATUS    RESTARTS   AGE     IP                NODE       NOMINATED NODE   READINESS GATES
default          default-test-pytorch-master-0              1/1     Running   0          6s      192.168.244.xxx   worker01   <none>           <none>
```
If the training job remains in the Pending state after being delivered, see Training Job Is in the Pending State Because "nodes are unavailable" or A Job Is Pending Due to Insufficient Resources to rectify the fault.

View the training result.

Run the following command on any node to view the training result:

```
kubectl logs -n Namespace_name Pod_name
```

Example:

```
kubectl logs -n default default-test-pytorch-master-0
```

View the training logs. If the following information is displayed, the training is successful:

```
[20251218-20:31:57] [MindXDL Service Log]server id is: 0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=7 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=6 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=5 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=4 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=3 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=2 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=1 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
/usr/local/python3.10.5/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
  warn(
[2025-12-18 20:32:02] [WARNING] [470] profiler.py: Invalid parameter export_type: None, reset it to text.
/job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:201: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
  warnings.warn('You have chosen to seed training. '
/job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:208: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
  warnings.warn('You have chosen a specific GPU. This will completely '
Use GPU: 0 for training
=> creating model 'resnet50'
```