Starting the HCCL-Controller
The HCCL-Controller is used together with Volcano and the Ascend Device Plugin (with the startup parameter volcanoType set to true). When an NPU-based training job is delivered using the training job YAML template obtained in "Creating a YAML File" for the relevant training framework in the training job section, the Ascend AI Processor resource configuration file (ranktable file) is generated in "Template 1" format by default. For details about the template format, see "Manual Porting and Training > Distributed Training > Resource Configuration Files for Ascend AI Processors > Template 1" in the CANN TensorFlow 1.15 Model Porting Guide.
Procedure
- Log in to the Kubernetes master node as the root user and run the following command to check whether the HCCL-Controller image and version are correct:
docker images | grep hccl-controller
Example:
root@ubuntu:# docker images | grep hccl-controller
hccl-controller   v3.0.0   f78993dcf54f   About an hour ago   143MB
- If yes, go to the next step.
- If no, create and distribute the image. For details, see Creating an Image.
- Copy the YAML file from the directory where the HCCL-Controller software package is decompressed (for example, /home/ascend-hccl-controller) to any directory on the Kubernetes master node (for example, /home/ascend-hccl-controller). If the software package was decompressed on the Kubernetes master node itself, you do not need to copy the YAML file.
cd /home/ascend-hccl-controller
scp root@{IP_address_of_the_node_where_the_software_package_is_decompressed}:/home/ascend-hccl-controller/hccl-controller-*.yaml ./
- Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the HCCL-Controller startup parameters in the corresponding startup YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./hccl-controller -h command to view the parameter description.
- Start the HCCL-Controller.
- Run the following command if the KubeConfig certificate is imported:
kubectl apply -f hccl-controller-without-token-{version}.yaml
- Run the following command if the KubeConfig certificate is not imported:
kubectl apply -f hccl-controller-{version}.yaml
The following is a startup example:
root@ubuntu:/home/ascend-hccl-controller# kubectl apply -f hccl-controller-without-token-v3.0.0.yaml
deployment.apps/hccl-controller created
root@ubuntu:/home/ascend-hccl-controller# kubectl get pod -n mindx-dl
NAME                               READY   STATUS    RESTARTS   AGE
...
hccl-controller-5d484dcc68-wfvrr   1/1     Running   0          11s
...
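
The image check and the choice of startup YAML file in the procedure above can be scripted. The following is a minimal sketch, not part of the HCCL-Controller package: the helper names image_present and manifest_for are hypothetical, and the "yes" argument is an assumed convention for "the KubeConfig certificate is imported". The helpers only process text, so the docker and kubectl calls are shown as a commented-out usage example.

```shell
#!/bin/sh
# Hypothetical helpers; the names are illustrative and not part of HCCL-Controller.

# Succeeds if the `docker images` listing on stdin contains a row whose
# REPOSITORY and TAG columns match $1 and $2 exactly.
image_present() {
    awk -v repo="$1" -v tag="$2" \
        '$1 == repo && $2 == tag { found = 1 } END { exit !found }'
}

# Prints the startup YAML file name for version $1.
# $2 is "yes" when the KubeConfig certificate has been imported
# (assumed convention); any other value selects the other file.
manifest_for() {
    if [ "$2" = "yes" ]; then
        echo "hccl-controller-without-token-$1.yaml"
    else
        echo "hccl-controller-$1.yaml"
    fi
}

# Example wiring (commented out because it needs docker and kubectl):
# docker images | image_present hccl-controller v3.0.0 \
#     && kubectl apply -f "$(manifest_for v3.0.0 yes)"
```

Matching on exact column values, rather than a plain grep, avoids accepting an image that has the right name but the wrong tag.
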
Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -jobParallelism | int | 1 | Number of concurrent jobs. The value range is [1, 32]. |
| -podParallelism | int | 1 | Number of concurrent pod jobs. The value range is [1, 32]. |
| -version | bool | false | HCCL-Controller binary version number. |
| -json | string | v2 | Template of the ranktable file generated by the HCCL-Controller. |
| -logLevel | int | 0 | Log level. |
| -maxAge | int | 7 | Log backup time limit. The value ranges from 7 to 700, in days. |
| -logFile | string | /var/log/mindx-dl/hccl-controller/hccl-controller.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. |
| -maxBackups | int | 30 | Maximum number of dumped log files that can be retained. The value range is (0, 30]. |
| -kubeConfig | string | /etc/mindx-dl/hccl-controller/.config/config6 | Path for storing the encrypted KubeConfig file by default. The KubeConfig file in a user-defined path is also supported. If the configuration file does not exist in the default path, InClusterConfig is enabled. NOTE: This file must be encrypted using the certificate import tool. A plaintext file is not supported. |
| -h | None | N/A | Help information. |
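
Before editing the startup YAML file, it can help to check a planned value against the ranges listed in the table. The sketch below is a hypothetical pre-flight check, not a feature of the HCCL-Controller: the function names in_range and check_arg are made up for illustration, and the -name=value argument form is an assumption about how the parameter is written in the YAML file. Parameters without a documented range are accepted as they are.

```shell
#!/bin/sh
# Hypothetical pre-flight check; not part of HCCL-Controller itself.

# Succeeds if $1 is a non-negative integer with $2 <= $1 <= $3.
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Validates one "-name=value" argument (assumed form) against the
# documented ranges. (0, 30] for an integer is checked as 1..30.
check_arg() {
    name=${1%%=*}
    value=${1#*=}
    case "$name" in
        -jobParallelism|-podParallelism) in_range "$value" 1 32 ;;
        -maxAge)                         in_range "$value" 7 700 ;;
        -maxBackups)                     in_range "$value" 1 30 ;;
        *)                               return 0 ;;   # no documented range
    esac
}

# Example: report any out-of-range value in a list of planned arguments.
# for arg in -jobParallelism=4 -maxAge=3; do
#     check_arg "$arg" || echo "out of range: $arg"
# done
```
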