Quick Start

This section uses a single Atlas 800T A2 training server, acting as both the management node and the compute node, as an example. It describes how to install NodeD, Ascend Device Plugin, Ascend Docker Runtime, Volcano, ClusterD, and Ascend Operator, and then how to deliver a training job with the full-NPU scheduling feature.

Operation Description

Table 1 Key steps

| Procedure | Operation Description | Reference |
| --- | --- | --- |
| Installing Components | This step uses an Atlas 800T A2 training server as an example to describe how to quickly install cluster scheduling components on Ascend devices. | For more details, see Installation and Deployment. |
| Delivering a Training Job | This step uses a simple PyTorch training job as an example to describe how to deliver a training job. | For more details, see Basic Scheduling. |

Environment Setup

Before installing components, ensure that a cluster environment has been set up.

  • Kubernetes has been installed on all nodes. The supported versions are 1.17.x to 1.34.x. If Volcano is required, install Kubernetes 1.19.x or later. For details about the Kubernetes version, refer to Kubernetes compatibility on the Volcano official website. To obtain the software package, visit the Kubernetes community.
  • Docker has been installed on all nodes. The supported versions are 18.09.x to 28.5.1. To obtain the software package, visit the Docker community or the official website.
  • The firmware and drivers have been installed on all nodes.
  • Check whether npu-smi and hccn_tool can run properly on the host.
    • Check whether the firmware and driver versions match cluster scheduling components. For details, see Ascend Training Solution Version Mapping.
    • To query the NPU driver and firmware version, run the npu-smi info -t board -i NPU ID command. In the command output, the value of Software Version is the NPU driver version, and the value of Firmware Version is the NPU firmware version.
    • To find the number in the chip model name, run the npu-smi info command and check the Name field in the output. This number is the value of {xxx} used later in this section, for example, 910.
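
The prerequisites above can be checked with a short script. The following is a minimal sketch, not part of the product. The helper names version_in_range, major_minor, and check_prereqs are hypothetical, the script assumes GNU sort (for the -V option), and it assumes that docker, kubectl, npu-smi, and hccn_tool are on PATH wherever they are installed.

```shell
#!/usr/bin/env bash
# Sketch only: compare installed versions with the ranges listed in this section.

# version_in_range VERSION MIN MAX
# Succeeds when MIN <= VERSION <= MAX in GNU "sort -V" (version) order.
version_in_range() {
    local ver="${1#v}" min="$2" max="$3"   # drop a leading "v", as in v1.19.3
    [ "$(printf '%s\n' "$min" "$ver" | sort -V | head -n 1)" = "$min" ] &&
        [ "$(printf '%s\n' "$ver" "$max" | sort -V | tail -n 1)" = "$max" ]
}

# major_minor VERSION: prints MAJOR.MINOR, for example v1.19.3 -> 1.19
major_minor() {
    local ver="${1#v}"
    printf '%s\n' "$ver" | cut -d. -f1,2
}

# check_prereqs: reports tool presence and version ranges on the current node.
check_prereqs() {
    local tool ver name
    for tool in npu-smi hccn_tool; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing from PATH: $tool"
        fi
    done
    if command -v docker >/dev/null 2>&1; then
        ver="$(docker version --format '{{.Server.Version}}' 2>/dev/null)"
        if version_in_range "$ver" 18.09.0 28.5.1; then
            echo "docker $ver: within 18.09.x to 28.5.1"
        else
            echo "docker ${ver:-unknown}: outside 18.09.x to 28.5.1"
        fi
    fi
    if command -v kubectl >/dev/null 2>&1; then
        kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.status.nodeInfo.kubeletVersion}{"\n"}{end}' 2>/dev/null |
            while read -r name ver; do
                if version_in_range "$(major_minor "$ver")" 1.17 1.34; then
                    echo "$name: kubelet $ver within 1.17.x to 1.34.x"
                else
                    echo "$name: kubelet $ver outside 1.17.x to 1.34.x"
                fi
            done
    fi
}
```

Calling check_prereqs on a node prints one line per check. It only compares version strings and does not replace the version mapping referenced above.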

Installing Components

This part uses an Atlas 800T A2 training server as an example. For details about the installation procedure and parameter description of all components, see Installation.

  1. Log in to the compute or management node as the root user and create component installation directories.
    1. Run the following commands in sequence to create installation directories on the compute node. The following directories are only examples.
      mkdir /home/noded
      mkdir /home/devicePlugin
      mkdir /home/Ascend-docker-runtime
    2. Run the following commands in sequence to create installation directories on the management node. The following directories are only examples.
      mkdir /home/ascend-volcano
      mkdir /home/ascend-operator
      mkdir /home/clusterd
      mkdir /home/noded
      mkdir /home/devicePlugin
  2. Download the software packages of the corresponding architecture as required. The following uses the AArch64 architecture as an example.
    1. Run the following commands in sequence on the compute node to download the NodeD, Ascend Device Plugin, and Ascend Docker Runtime installation packages and decompress them:
      cd /home/noded
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-noded_7.3.0_linux-aarch64.zip
      unzip Ascend-mindxdl-noded_7.3.0_linux-aarch64.zip
      
      cd /home/devicePlugin
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-device-plugin_7.3.0_linux-aarch64.zip
      unzip Ascend-mindxdl-device-plugin_7.3.0_linux-aarch64.zip
      
      cd /home/Ascend-docker-runtime
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-docker-runtime_7.3.0_linux-aarch64.run
    2. Run the following commands in sequence on the management node to download Volcano, ClusterD, and Ascend Operator installation packages:
      cd /home/ascend-volcano
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-volcano_7.3.0_linux-aarch64.zip
      unzip Ascend-mindxdl-volcano_7.3.0_linux-aarch64.zip
      
      cd /home/ascend-operator
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-ascend-operator_7.3.0_linux-aarch64.zip
      unzip Ascend-mindxdl-ascend-operator_7.3.0_linux-aarch64.zip
      
      cd /home/clusterd
      wget https://gitcode.com/Ascend/mind-cluster/releases/download/v7.3.0/Ascend-mindxdl-clusterd_7.3.0_linux-aarch64.zip
      unzip Ascend-mindxdl-clusterd_7.3.0_linux-aarch64.zip
  3. Build component images.
    1. Run the following command to pull the base image on the compute node:
      docker pull ubuntu:22.04
    2. Run the following commands in sequence to pull the base image on the management node:
      docker pull arm64v8/alpine:latest
      docker tag arm64v8/alpine:latest alpine:latest
      docker pull ubuntu:22.04
    3. Run the following commands in sequence to build component images on the compute node:
      cd /home/noded
      docker build --no-cache -t noded:v7.3.0 ./
      
      cd /home/devicePlugin
      docker build --no-cache -t ascend-k8sdeviceplugin:v7.3.0 ./
    4. Run the following commands in sequence to build component images on the management node:
      cd /home/ascend-volcano/volcano-v1.7.0
      docker build --no-cache -t volcanosh/vc-scheduler:v1.7.0 ./ -f ./Dockerfile-scheduler
      docker build --no-cache -t volcanosh/vc-controller-manager:v1.7.0 ./ -f ./Dockerfile-controller
      
      cd /home/ascend-operator
      docker build --no-cache -t ascend-operator:v7.3.0 ./
      
      cd /home/clusterd
      docker build --no-cache -t clusterd:v7.3.0 ./
  4. Create a node label.
    1. Run the following command on the Kubernetes management node to query the node name:
      kubectl get node  
      Command output:
      NAME       STATUS   ROLES    AGE   VERSION
      worker01   Ready    worker   23h   v1.17.3
      
    2. Run the following commands in sequence to label the compute node (worker01 in this example):
      kubectl label nodes worker01 node-role.kubernetes.io/worker=worker
      kubectl label nodes worker01 workerselector=dls-worker-node
      kubectl label nodes worker01 host-arch=huawei-arm
      kubectl label nodes worker01 accelerator=huawei-Ascend910
      kubectl label nodes worker01 accelerator-type=module-{xxx}b-8     # Replace {xxx} with the number that indicates the processor model.
      kubectl label nodes worker01 nodeDEnable=on
    3. Run the following command to label the management node (master01 in this example):
      kubectl label nodes master01 masterselector=dls-master-node
  5. Create a user account.
    1. Run the following commands in sequence to create a user account on the compute node:
      useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
      usermod -a -G HwHiAiUser hwMindX
    2. Run the following command to create a user account on the management node:
      useradd -d /home/hwMindX -u 9000 -m -s /usr/sbin/nologin hwMindX
  6. Create a log directory. Custom log directories are not supported.
    1. Run the following commands in sequence to create log directories on the compute node:
      mkdir -m 755 /var/log/mindx-dl
      chown root:root /var/log/mindx-dl
      mkdir -m 750 /var/log/mindx-dl/devicePlugin
      chown root:root /var/log/mindx-dl/devicePlugin
      mkdir -m 750 /var/log/mindx-dl/noded
      chown hwMindX:hwMindX /var/log/mindx-dl/noded
    2. Run the following commands in sequence to create log directories on the management node:
      mkdir -m 755 /var/log/mindx-dl
      chown root:root /var/log/mindx-dl
      mkdir -m 750 /var/log/mindx-dl/volcano-controller
      chown hwMindX:hwMindX /var/log/mindx-dl/volcano-controller
      mkdir -m 750 /var/log/mindx-dl/volcano-scheduler
      chown hwMindX:hwMindX /var/log/mindx-dl/volcano-scheduler
      mkdir -m 750 /var/log/mindx-dl/ascend-operator
      chown hwMindX:hwMindX /var/log/mindx-dl/ascend-operator
      mkdir -m 750 /var/log/mindx-dl/clusterd
      chown hwMindX:hwMindX /var/log/mindx-dl/clusterd
  7. Run the following command on any node to create a namespace:
    kubectl create ns mindx-dl
  8. Install components.
    1. Run the following commands in sequence to install Ascend Docker Runtime on the host of the compute node:
      cd /home/Ascend-docker-runtime
      chmod u+x Ascend-docker-runtime_7.3.0_linux-aarch64.run
      ./Ascend-docker-runtime_7.3.0_linux-aarch64.run --install
      systemctl daemon-reload && systemctl restart docker
    2. Run the following commands in sequence to copy the component startup YAML files of the compute node to the installation directory of the corresponding component on the management node:
      cd /home/noded
      scp noded-v7.3.0.yaml root@{IP_address_of_the_management_node}:/home/noded
      
      cd /home/devicePlugin
      scp device-plugin-volcano-v7.3.0.yaml root@{IP_address_of_the_management_node}:/home/devicePlugin
    3. Run the following commands in sequence on the management node to install components:
      cd /home/ascend-operator
      kubectl apply -f ascend-operator-v7.3.0.yaml
      
      cd /home/ascend-volcano/volcano-v1.7.0  # Change v1.7.0 to v1.9.0 if Volcano 1.9.0 is required.
      kubectl apply -f volcano-v1.7.0.yaml
      
      cd /home/noded
      kubectl apply -f noded-v7.3.0.yaml
      
      cd /home/clusterd
      kubectl apply -f clusterd-v7.3.0.yaml
      
      cd /home/devicePlugin
      kubectl apply -f device-plugin-volcano-v7.3.0.yaml
      Take NodeD as an example. If the following information is displayed, the component is successfully installed.
      serviceaccount/noded created
      clusterrole.rbac.authorization.k8s.io/pods-noded-role created
      clusterrolebinding.rbac.authorization.k8s.io/pods-noded-rolebinding created
      daemonset.apps/noded created
      
    4. Run the following command on the management node to check whether the component is started:
      kubectl get pod -n mindx-dl

      Take NodeD as an example. If Running is displayed in the command output, the component is started successfully.

      NAME                              READY   STATUS    RESTARTS   AGE
      ...
      noded-fd6t8                       1/1     Running   0          74s
      ...
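
The status check in the last step can also be scripted. The following is a minimal sketch under the assumption that the components run in the mindx-dl namespace created earlier. The function names not_running_pods and check_components are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: list pods whose STATUS column is not "Running".

# not_running_pods: reads "kubectl get pod -n <namespace>" output (header line
# included) on stdin and prints "NAME STATUS" for every pod that is not Running.
not_running_pods() {
    awk 'NR > 1 && $3 != "Running" { print $1, $3 }'
}

# check_components [NAMESPACE]: runs the query and summarizes the result.
check_components() {
    local ns="${1:-mindx-dl}" output pending
    if ! output="$(kubectl get pod -n "$ns" 2>&1)"; then
        echo "kubectl failed: $output" >&2
        return 1
    fi
    pending="$(printf '%s\n' "$output" | not_running_pods)"
    if [ -n "$pending" ]; then
        echo "Pods not yet Running in $ns:"
        printf '%s\n' "$pending"
        return 1
    fi
    echo "All listed pods in $ns are Running."
}
```

Running check_components on the management node returns a non-zero status while any listed pod is in another state, so it can be called again after a short wait.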
      

Delivering a Training Job

  1. Build an image.

    Download the ascend-pytorch training image of the 24.0.X version for your system architecture (ARM or x86_64) from the Ascend image repository. Using it as the training base image, change the default user in the container to root. The image does not contain the training script or code. During training, those files are mapped directly into the container.

  2. Perform script adaptation.
    1. Download ResNet50_ID4149_for_PyTorch from the master branch in the PyTorch code repository and use it as the training code.
    2. Prepare a dataset for ResNet-50, and comply with the dataset's usage specifications.
    3. Upload the dataset to the storage node as an administrator. Go to the /data/atlas_dls/public directory and upload the dataset to any directory, for example, /data/atlas_dls/public/dataset/resnet50/imagenet.
      root@ubuntu:/data/atlas_dls/public/dataset/resnet50/imagenet# pwd
    4. Decompress the training code downloaded in substep 1 on the local host, and upload the ModelZoo-PyTorch/PyTorch/built-in/cv/classification/ResNet50_ID4149_for_PyTorch directory from the decompressed code to a directory in the environment, for example, /data/atlas_dls/public/code/.
    5. In the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch directory, comment out the following code in main.py:
      def main():
          args = parser.parse_args()
          os.environ['MASTER_ADDR'] = args.addr
          #os.environ['MASTER_PORT'] = '29501'  # Comment out this line of code.
          if os.getenv('ALLOW_FP32', False) and os.getenv('ALLOW_HF32', False):
              raise RuntimeError('ALLOW_FP32 and ALLOW_HF32 cannot be set at the same time!')
          elif os.getenv('ALLOW_HF32', False):
              torch.npu.conv.allow_hf32 = True
          elif os.getenv('ALLOW_FP32', False):
              torch.npu.conv.allow_hf32 = False
              torch.npu.matmul.allow_hf32 = False
    6. Go to the mindcluster-deploy repository, select a matched branch based on mindcluster-deploy Version Description, obtain the train_start.sh file from the samples/train/basic-training/without-ranktable/pytorch directory, and construct the following directory structure in the /data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts directory.
      root@ubuntu:/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/scripts#
      scripts/
           ├── train_start.sh
  3. Prepare the job YAML file.
    1. Go to the mindcluster-deploy repository, select a matched branch based on mindcluster-deploy Version Description, and obtain the pytorch_standalone_acjob_{xxx}b.yaml file in the samples/train/basic-training/without-ranktable/pytorch directory. ({xxx} indicates the processor model.) A single-server single-processor job is presented in the example file by default.
    2. Modify the example YAML file and upload it to any file path. For details about the parameters in the YAML file, see Table 1.
      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      ...
      spec:
      ...
        replicaSpecs:
          Master:
      ...
              spec:
                nodeSelector:
                  host-arch: huawei-arm
                  accelerator-type: module-{xxx}b-8   # Change card-{xxx}b-2 to module-{xxx}b-8, where {xxx} indicates the processor model.
                containers:
                - name: ascend 
                  image: pytorch-test:latest     # Change the value to the image name obtained in Step 1.
      ...
                  resources:
                    limits:
                      huawei.com/Ascend910: 1
                    requests:
                      huawei.com/Ascend910: 1
      ...
                volumes:
                - name: code
                  nfs:      # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
                    server: 127.0.0.1
                    path: "/data/atlas_dls/public/code/ResNet50_ID4149_for_PyTorch/"
                - name: data
                  nfs:     # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
                    server: 127.0.0.1
                    path: "/data/atlas_dls/public/dataset/"
                - name: output
                  nfs:     # If the NFS service is not installed, change nfs to hostPath and delete server: 127.0.0.1.
                    server: 127.0.0.1
                    path: "/data/atlas_dls/output/"
      ...
  4. Run the following command to deliver a single-server single-processor job:
    kubectl apply -f pytorch_standalone_acjob_{xxx}b.yaml
  5. Run the following command to check the pod running status:
    kubectl get pod --all-namespaces -o wide
    A sample command output is as follows. If "Running" is displayed, the job is running properly.
    NAMESPACE        NAME                                       READY   STATUS    RESTARTS   AGE     IP                NODE      NOMINATED NODE   READINESS GATES
    default          default-test-pytorch-master-0              1/1     Running   0          6s      192.168.244.xxx   worker01   <none>           <none>

    If the training job remains in the Pending state after being delivered, refer to Training Job Is in the Pending State Because "nodes are unavailable" or A Job Is Pending Due to Insufficient Resources to rectify the fault.

  6. View the training result.
    1. Run the following command on any node to view the training result:
      kubectl logs -n Namespace_name Pod_name

      Example:

      kubectl logs -n default default-test-pytorch-master-0
    2. View the training logs. If the following information is displayed, the training is successful:
      [20251218-20:31:57] [MindXDL Service Log]server id is: 0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=7 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=6 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=5 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=4 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=3 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=2 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/bin/python /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py --data=/job/data/resnet50/imagenet --amp --arch=resnet50 --seed=49 -j=128 --world-size=1 --lr=1.6 --dist-backend=hccl --multiprocessing-distributed --epochs=1 --batch-size=512 --gpu=1 --multiprocessing-distributed --addr=10.106.227.104 --world-size=1 --rank=0
      /usr/local/python3.10.5/lib/python3.10/site-packages/torchvision/io/image.py:13: UserWarning: Failed to load image Python extension: ''If you don't plan on using image functionality from `torchvision.io`, you can ignore this warning. Otherwise, there might be something wrong with your environment. Did you have `libjpeg` or `libpng` installed before building `torchvision` from source?
        warn(
      [2025-12-18 20:32:02] [WARNING] [470] profiler.py: Invalid parameter export_type: None, reset it to text.
      /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:201: UserWarning: You have chosen to seed training. This will turn on the CUDNN deterministic setting, which can slow down your training considerably! You may see unexpected behavior when restarting from checkpoints.
        warnings.warn('You have chosen to seed training. '
      /job/code/No_Rank_ResNet50_ID4149_for_PyTorch/main.py:208: UserWarning: You have chosen a specific GPU. This will completely disable data parallelism.
        warnings.warn('You have chosen a specific GPU. This will completely '
      Use GPU: 0 for training
      => creating model 'resnet50'
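
Steps 5 and 6 above can be combined into a small polling script. The following is a minimal sketch. The function names wait_for_running and training_started are hypothetical, and using the "=> creating model" line from the sample log as the marker is an assumption made for illustration. The default namespace and the pod name default-test-pytorch-master-0 are taken from the example above.

```shell
#!/usr/bin/env bash
# Sketch only: wait for the job pod, then look for a marker line in its log.

# wait_for_running NAMESPACE POD [TRIES] [INTERVAL_SECONDS]
# Polls the pod phase until it is "Running"; fails after TRIES checks.
wait_for_running() {
    local ns="$1" pod="$2" tries="${3:-30}" interval="${4:-10}" phase="" i=0
    while [ "$i" -lt "$tries" ]; do
        phase="$(kubectl get pod -n "$ns" "$pod" -o jsonpath='{.status.phase}' 2>/dev/null)"
        [ "$phase" = "Running" ] && return 0
        sleep "$interval"
        i=$((i + 1))
    done
    echo "pod $ns/$pod is still '${phase:-unknown}' after $tries checks" >&2
    return 1
}

# training_started: reads log text on stdin and succeeds if it contains the
# model-creation line that appears in the sample log (assumed marker).
training_started() {
    grep -q "=> creating model"
}

# Example invocation (not executed here):
#   wait_for_running default default-test-pytorch-master-0 &&
#       kubectl logs -n default default-test-pytorch-master-0 | training_started &&
#       echo "training has started"
```

If the wait times out, the pod is probably still Pending, and the troubleshooting topics referenced in step 5 apply.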