Installing Resilience Controller

  • Before enabling elastic training, you must install Resilience Controller. When Resilience Controller connects to Kubernetes, you can use ServiceAccount or KubeConfig for authentication. For details about the differences between the two modes, see Differences Between ServiceAccount and KubeConfig.
  • If you do not need the elastic training function, skip this section.

Procedure

  1. Log in to the Kubernetes management node as the root user and check whether the Resilience Controller image and version number are correct.
    docker images | grep resilience-controller
    Command output:
    1
    resilience-controller                      v7.3.0              c532e9d0889c        About an hour ago         142MB
    
  2. Copy the YAML file in the directory where the Resilience Controller package is decompressed to any directory on the Kubernetes management node.
  3. Skip this step if you do not need to modify the component startup parameters. Otherwise, modify the Resilience Controller startup parameters in the YAML file based on your requirements. For details about the startup parameters, see Table 1. You can also run the ./resilience-controller -h command to view the parameter description.
  4. Run the following command in the directory where the YAML file of the management node is stored to start Resilience Controller.
    • Run the following command if the KubeConfig certificate is not imported:
      kubectl apply -f resilience-controller-v{version}.yaml

      Startup example:

      serviceaccount/resilience-controller created
      clusterrole.rbac.authorization.k8s.io/pods-resilience-controller-role created
      clusterrolebinding.rbac.authorization.k8s.io/resilience-controller-rolebinding created
      deployment.apps/resilience-controller created
    • Run the following command if the KubeConfig certificate is imported:
      kubectl apply -f resilience-controller-without-token-v{version}.yaml

      Startup example:

      deployment.apps/resilience-controller created
  5. Run the following command to check whether the component is installed successfully:
    kubectl get pod -n mindx-dl

    The following is a startup example. If Running is displayed, the component is started successfully.

    1
    2
    3
    4
    NAME                                            READY    STATUS      RESTARTS   AGE
    ...
    resilience-controller-7667495b6b-hwmjw   1/1     Running   0         11s
    ...
    

Parameters

Table 1 Resilience Controller startup parameters

Parameter

Type

Default Value

Description

-version

Bool

false

Whether to query the Resilience Controller version number.

  • true: queries the version.
  • false: does not query the version.

-logLevel

Integer

0

Log level:

  • -1: debug
  • 0: info
  • 1: warning
  • 2: error
  • 3: critical

-maxAge

Integer

7

Time limit for backing up logs. The value ranges from 7 to 700, in days.

-logFile

String

/var/log/mindx-dl/resilience-controller/run.log

Log file.

NOTE:

If the size of a log file exceeds 20 MB, automatic dump is triggered. The maximum size of a log file cannot be changed. Dumped files are named in the format of "run-dump triggering time.log", for example, run-2023-10-07T03-38-24.402.log.

-maxBackups

Integer

30

Maximum number of dumped log files that can be retained. The value ranges from 1 to 30.

-h or -help

None

None

Help information.