Starting the Resilience-Controller

Procedure

  1. Log in to the Kubernetes master node as the root user and run the following command to check whether the Resilience-Controller image and version are correct:
    docker images | grep resilience-controller
    Example:
    root@ubuntu:# docker images | grep resilience-controller
    resilience-controller                      v3.0.0              c532e9d0889c        About an hour ago         142MB
    • If yes, go to 2.
    • If no, create an image and distribute it. For details, see Creating an Image.
  2. Copy the YAML file in the directory where the Resilience-Controller software package is decompressed (for example, /home/ascend-resilience-controller) to any directory on the Kubernetes master node (for example, /home/ascend-resilience-controller). If the Resilience-Controller software package is decompressed on the Kubernetes master node, you do not need to copy the YAML file.
    cd /home/ascend-resilience-controller
    scp root@{IP_address_of_the_node_where_the_software_package_is_decompressed}:/home/ascend-resilience-controller/resilience-controller-*.yaml ./
  3. Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the startup parameters of Resilience-Controller in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can run the ./resilience-controller -h command to view the parameter description.
  4. Start the Resilience-Controller.
    • Run the following command if the KubeConfig certificate is imported.
      kubectl apply -f resilience-controller-without-token-{version}.yaml
    • Run the following command if the KubeConfig certificate is not imported:
      kubectl apply -f resilience-controller-*.yaml

    The following is an example:

    root@ubuntu:/home/ascend-resilience-controller# kubectl apply -f resilience-controller-without-token-v3.0.0.yaml 
    deployment.apps/resilience-controller created
    root@ubuntu:/home/ascend-resilience-controller# kubectl get pod -n mindx-dl
    NAME                                         READY   STATUS    RESTARTS   AGE
    ...
    resilience-controller-7667495b6b-hwmjw   1/1    Running   0          11s
    ...

Parameters

Table 1 Resilience-Controller startup parameters

Parameter

Type

Default Value

Description

-version

bool

false

Resilience-Controller binary version number.

-logLevel

int

0

Log level.

  • -1: debug
  • 0: info
  • 1: warning
  • 2: error
  • 3: critical

-maxAge

int

7

Log backup time limit. The value ranges from 7 to 700, in days.

-logFile

string

/var/log/mindx-dl/resilience-controller/run.log

Log file.

NOTE:

If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed.

-maxBackups

int

30

Maximum number of dumped log files that can be retained. The value range is (0, 30].

-h

None

N/A

Help information.