Installing Resilience Controller
- Before enabling elastic training, you must install Resilience Controller. Resilience Controller can authenticate to Kubernetes using either ServiceAccount or KubeConfig. For details about the differences between the two modes, see Differences Between ServiceAccount and KubeConfig.
- If you do not need the elastic training function, skip this section.
Procedure
- Log in to the Kubernetes management node as the root user and check whether the Resilience Controller image and version number are correct.
docker images | grep resilience-controller
Command output:
resilience-controller v7.3.0 c532e9d0889c About an hour ago 142MB
- If they are correct, proceed to the next step.
- If they are not correct, create and distribute the image by referring to Preparing an Image.
- Copy the YAML file in the directory where the Resilience Controller package is decompressed to any directory on the Kubernetes management node.
- Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the Resilience Controller startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can also run the ./resilience-controller -h command to view the parameter description.
- To start Resilience Controller, run one of the following commands in the directory on the management node where the YAML file is stored.
- Run the following command if the KubeConfig certificate is not imported:
kubectl apply -f resilience-controller-v{version}.yaml
Startup example:
serviceaccount/resilience-controller created
clusterrole.rbac.authorization.k8s.io/pods-resilience-controller-role created
clusterrolebinding.rbac.authorization.k8s.io/resilience-controller-rolebinding created
deployment.apps/resilience-controller created
- Run the following command if the KubeConfig certificate is imported:
kubectl apply -f resilience-controller-without-token-v{version}.yaml
Startup example:
deployment.apps/resilience-controller created
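The choice between the two YAML files can be scripted. The following is a minimal sketch, not part of the product: the function name `select_yaml` and its `yes`/`no` argument are hypothetical, and it assumes that `{version}` in the file name is replaced by a version string such as 7.3.0. Check the actual file names in your decompressed package.

```shell
#!/bin/sh
# Hypothetical helper (not shipped with Resilience Controller):
# print the name of the YAML file to apply.
#   $1: version string as it appears in the file name (assumed form: 7.3.0)
#   $2: "yes" if the KubeConfig certificate has been imported, otherwise "no"
select_yaml() {
    version="$1"
    kubeconfig_imported="$2"
    case "$kubeconfig_imported" in
        yes) echo "resilience-controller-without-token-v${version}.yaml" ;;
        no)  echo "resilience-controller-v${version}.yaml" ;;
        *)   echo "usage: select_yaml <version> yes|no" >&2
             return 2 ;;
    esac
}

# Example, run in the directory that holds the YAML file:
#   kubectl apply -f "$(select_yaml 7.3.0 no)"
```

Keeping the selection in one function avoids applying the ServiceAccount variant by mistake on a cluster where the KubeConfig certificate has already been imported.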
- Run the following command to check whether the component is installed successfully:
kubectl get pod -n mindx-dl
The following is a startup example. If Running is displayed, the component is started successfully.
NAME READY STATUS RESTARTS AGE
...
resilience-controller-7667495b6b-hwmjw 1/1 Running 0 11s
...
- After the component is installed, if the pod status of the component is not Running, refer to Component pods Are Not in the Running State.
- After the component is installed, if the pod status of the component is ContainerCreating, refer to Cluster Scheduling Component Pods Are in the ContainerCreating State.
- If the component fails to be started, refer to Cluster Scheduling Components Fail to Start and "get sem errno =13" Is Displayed in Logs.
- If the component is started successfully, but the corresponding pod cannot be found, refer to YAML File for Starting a Component Is Successfully Executed, But the pod Corresponding to the Component Is Not Displayed.
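The status check in the last step can be automated. The sketch below is illustrative only: the function name `controller_running` is hypothetical, and it assumes the default column layout shown in the startup example (NAME, READY, STATUS, RESTARTS, AGE) and pod names that begin with `resilience-controller-`.

```shell
#!/bin/sh
# Hypothetical helper: read the output of `kubectl get pod -n mindx-dl` on
# stdin. Succeeds only if at least one resilience-controller pod is listed
# and every such pod has STATUS (third column) equal to Running.
controller_running() {
    awk '
        $1 ~ /^resilience-controller-/ {
            found = 1
            if ($3 != "Running") bad = 1
        }
        END {
            if (found && !bad) exit 0
            exit 1
        }
    '
}

# Example:
#   kubectl get pod -n mindx-dl | controller_running \
#       && echo "Resilience Controller is running" \
#       || echo "Resilience Controller is not running; see the troubleshooting topics"
```

A non-zero exit status covers both the case where the pod is in another state (for example, ContainerCreating) and the case where no matching pod is displayed at all.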
Parameters
| Parameter | Type | Default Value | Description |
|---|---|---|---|
| -version | Bool | false | Whether to query the Resilience Controller version number. |
| -logLevel | Integer | 0 | Log level: |
| -maxAge | Integer | 7 | Time limit for backing up logs. The value ranges from 7 to 700, in days. |
| -logFile | String | /var/log/mindx-dl/resilience-controller/run.log | Log file. NOTE: If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format of "run-dump triggering time.log", for example, run-2023-10-07T03-38-24.402.log. |
| -maxBackups | Integer | 30 | Maximum number of dumped log files that can be retained. The value ranges from 1 to 30. |
| -h or -help | None | None | Help information. |
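Before editing the startup parameters in the YAML file, the values planned for -maxAge and -maxBackups can be checked against the ranges in the table. This is a minimal sketch under stated assumptions: the function names `in_range` and `check_log_params` are hypothetical, and the script only validates numbers locally. It does not read or modify the YAML file.

```shell
#!/bin/sh
# Hypothetical helper: succeed if $1 is a non-negative integer in [$2, $3].
in_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # reject empty or non-numeric input
    esac
    [ "$1" -ge "$2" ] && [ "$1" -le "$3" ]
}

# Hypothetical helper: validate planned values against the documented ranges.
#   $1: value for -maxAge     (7 to 700, in days)
#   $2: value for -maxBackups (1 to 30)
check_log_params() {
    in_range "$1" 7 700 || { echo "maxAge must be 7 to 700 (days)" >&2; return 1; }
    in_range "$2" 1 30  || { echo "maxBackups must be 1 to 30" >&2; return 1; }
}

# Example:
#   check_log_params 7 30 && echo "values are within the documented ranges"
```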