Starting the HCCL-Controller

The HCCL-Controller works together with Volcano and the Ascend Device Plugin (with the startup parameter volcanoType set to true). When an NPU-based training job is delivered using the YAML template obtained in "Creating a YAML File" for the relevant training framework in the training job section, the HCCL-Controller generates the Ascend AI Processor resource configuration file (ranktable file) in "Template 1" format by default. For details about the template format, see "Manual Porting and Training > Distributed Training > Resource Configuration Files for Ascend AI Processors > Template 1" in the CANN TensorFlow 1.15 Model Porting Guide.

Procedure

  1. Log in to the Kubernetes master node as the root user and run the following command to check whether the HCCL-Controller image and version are correct:
    docker images | grep hccl-controller
    Example:
    root@ubuntu:# docker images | grep hccl-controller
    hccl-controller                      v3.0.0              f78993dcf54f        About an hour ago         143MB
    • If the image and version are correct, go to step 2.
    • If they are not, create and distribute the image. For details, see Creating an Image.
  2. Copy the YAML files from the directory where the HCCL-Controller software package was decompressed (for example, /home/ascend-hccl-controller) to any directory on the Kubernetes master node (for example, /home/ascend-hccl-controller). Skip this step if the software package was decompressed on the Kubernetes master node itself.
    cd /home/ascend-hccl-controller
    scp root@{IP_address_of_the_node_where_the_software_package_is_decompressed}:/home/ascend-hccl-controller/hccl-controller-*.yaml ./
  3. (Optional) To change the component startup parameters, modify them in the corresponding startup YAML file. For details about the startup parameters, see Table 1, or run the ./hccl-controller -h command to view the parameter descriptions. Skip this step if you do not need to modify the startup parameters.
  4. Start the HCCL-Controller.
    • Run the following command if the KubeConfig certificate is imported:
      kubectl apply -f hccl-controller-without-token-{version}.yaml
    • Run the following command if the KubeConfig certificate is not imported:
      kubectl apply -f hccl-controller-{version}.yaml

    The following is a startup example:

    root@ubuntu:/home/ascend-hccl-controller# kubectl apply -f hccl-controller-without-token-v3.0.0.yaml 
    deployment.apps/hccl-controller created
    root@ubuntu:/home/ascend-hccl-controller# kubectl get pod -n mindx-dl
    NAME                               READY   STATUS    RESTARTS   AGE
    ...
    hccl-controller-5d484dcc68-wfvrr   1/1     Running   0          11s
    ...
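
The checks in the procedure above can be sketched as a few shell helpers. This is a minimal sketch, not part of the HCCL-Controller package: the function names has_image, manifest_name, and pod_running are hypothetical, and the helpers assume the column layout shown in the example outputs above (repository and tag in the first two columns of docker images; name, READY, and STATUS in the first three columns of kubectl get pod).

```shell
#!/bin/sh
# Hypothetical helpers for illustration only; not shipped with HCCL-Controller.

# has_image TAG
# Reads `docker images` output on stdin and succeeds if an hccl-controller
# row with the given tag is present (step 1).
has_image() {
    awk -v tag="$1" '$1 == "hccl-controller" && $2 == tag { found = 1 }
                     END { exit !found }'
}

# manifest_name VERSION IMPORTED
# Prints the startup YAML file name used in step 4. IMPORTED is "yes" if the
# KubeConfig certificate has been imported, anything else otherwise.
manifest_name() {
    if [ "$2" = "yes" ]; then
        printf 'hccl-controller-without-token-%s.yaml\n' "$1"
    else
        printf 'hccl-controller-%s.yaml\n' "$1"
    fi
}

# pod_running
# Reads `kubectl get pod -n mindx-dl` output on stdin and succeeds if an
# hccl-controller pod is listed as 1/1 Running.
pod_running() {
    awk '$1 ~ /^hccl-controller-/ && $2 == "1/1" && $3 == "Running" { ok = 1 }
         END { exit !ok }'
}

# Example use on the master node (left commented out so that sourcing this
# file does not touch the cluster):
#   docker images | has_image v3.0.0 \
#       && kubectl apply -f "$(manifest_name v3.0.0 yes)" \
#       && kubectl get pod -n mindx-dl | pod_running
```

The helpers read command output from standard input rather than calling docker or kubectl themselves, so they can be tried against saved output before being used on a live node.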

Parameters

Table 1 HCCL-Controller startup parameters

| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -jobParallelism | int | 1 | Number of concurrent jobs. The value range is [1, 32]. |
| -podParallelism | int | 1 | Number of concurrent pod jobs. The value range is [1, 32]. |
| -version | bool | false | HCCL-Controller binary version number. |
| -json | string | v2 | Template of the ranktable file generated by the HCCL-Controller. v1 indicates template 2; v2 indicates template 1. |
| -logLevel | int | 0 | Log level. -1: debug; 0: info; 1: warning; 2: error; 3: critical. |
| -maxAge | int | 7 | Log backup time limit. The value ranges from 7 to 700, in days. |
| -logFile | string | /var/log/mindx-dl/hccl-controller/hccl-controller.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. |
| -maxBackups | int | 30 | Maximum number of dumped log files that can be retained. The value range is (0, 30]. |
| -kubeConfig | string | /etc/mindx-dl/hccl-controller/.config/config6 | Path for storing the encrypted KubeConfig file by default. The KubeConfig file in a user-defined path is also supported. If the configuration file does not exist in the default path, InClusterConfig is enabled. NOTE: This file must be encrypted using the certificate import tool. A plaintext file is not supported. |
| -h | None | N/A | Help information. |
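
Before editing the startup YAML file, it can help to check a planned value against the ranges listed in Table 1. The following is a minimal sketch under stated assumptions: in_range and check_param are hypothetical helper names, not HCCL-Controller commands, and the sketch treats the listed -logLevel values as the integer range [-1, 3] and the -maxBackups range (0, 30] as the integers 1 to 30.

```shell
#!/bin/sh
# Hypothetical validation helpers; the ranges are copied from Table 1.

# in_range VALUE MIN MAX
# Succeeds if VALUE is an integer within [MIN, MAX].
in_range() {
    digits=${1#-}                      # drop one optional leading minus sign
    case "$digits" in
        ''|*[!0-9]*) return 1 ;;       # reject empty or non-numeric input
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# check_param NAME VALUE
# Succeeds if VALUE is acceptable for the startup parameter NAME.
check_param() {
    case "$1" in
        -jobParallelism|-podParallelism) in_range "$2" 1 32 ;;
        -logLevel)   in_range "$2" -1 3 ;;
        -maxAge)     in_range "$2" 7 700 ;;
        -maxBackups) in_range "$2" 1 30 ;;   # (0, 30] for integers
        -json)       [ "$2" = "v1" ] || [ "$2" = "v2" ] ;;
        *)           return 0 ;;             # no range documented in Table 1
    esac
}

# Example:
#   check_param -maxAge 30 && echo "maxAge value is within range"
```

The sketch only checks values against the documented ranges; the HCCL-Controller itself remains the authority on what it accepts at startup.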