Usage Example

Restrictions

  • The master and standby Controller instances cannot be deployed on the same node.
  • This feature depends on a correctly deployed ETCD server. The server requires 3 replicas to ensure the reliability of the ETCD cluster.
  • The ETCD server must use version v3.6.
  • In a PD-disaggregated deployment with master/standby Controller enabled, if a master/standby switchover and a non-redundant fault scale-in recovery (shrinking P instances to preserve D after a D-instance hardware failure) happen at the same time, the non-redundant fault recovery becomes ineffective.

Generating ETCD Security Certificates

The Controller master/standby feature relies on ETCD's distributed lock, which involves communication between different Pods in the cluster. Mutual (two-way) authentication with CA certificates is recommended. Configure the certificates by following the steps below:

If CA-based mutual authentication and encrypted communication are not used, traffic between services is transmitted in plaintext, which poses a high network security risk.
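
For reference, the ETCD distributed lock that the Controller master/standby feature relies on can be exercised manually with etcdctl over mutual TLS. The following is only an illustrative sketch: the endpoint, the lock name controller-leader, and the relative certificate paths are assumptions for demonstration; the certificate files are the ones generated in the steps below.

  # Illustrative only: acquire a distributed lock over mutual TLS and run a command while holding it
  etcdctl --endpoints=https://etcd-0.etcd.default.svc.cluster.local:2379 \
    --cacert=ca.pem --cert=client.pem --key=client.key \
    lock controller-leader echo "lock acquired"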

  1. Prepare the prerequisite files for certificate generation. In this example, the files are placed in the /home/{username}/auto_gen_ms_cert directory.
    • server.cnf
      [req] # main request settings
      req_extensions = v3_req
      distinguished_name = req_distinguished_name
      
      [req_distinguished_name] # certificate subject information
      countryName = CN
      stateOrProvinceName = State
      localityName = City
      organizationName = Organization
      organizationalUnitName = Unit
      commonName = etcd-server
      
      [v3_req] # core extensions
      basicConstraints = CA:FALSE
      keyUsage = digitalSignature, keyEncipherment
      extendedKeyUsage = serverAuth, clientAuth
      subjectAltName = @alt_names
      
      [alt_names] # service identities (SANs)
      DNS.1 = etcd
      DNS.2 = etcd.default
      DNS.3 = etcd.default.svc
      DNS.4 = etcd.default.svc.cluster.local
      DNS.5 = etcd-0.etcd
      DNS.6 = etcd-0.etcd.default.svc.cluster.local
      DNS.7 = etcd-1.etcd
      DNS.8 = etcd-1.etcd.default.svc.cluster.local
      DNS.9 = etcd-2.etcd
      DNS.10 = etcd-2.etcd.default.svc.cluster.local
    • client.cnf
      [req] # main request settings
      req_extensions = v3_req
      distinguished_name = req_distinguished_name
      
      [req_distinguished_name] # certificate subject information
      countryName = CN
      stateOrProvinceName = State
      localityName = City
      organizationName = Organization
      organizationalUnitName = Unit
      commonName = etcd-client
      
      [v3_req] # core extensions
      basicConstraints = CA:FALSE
      keyUsage = digitalSignature, keyEncipherment
      extendedKeyUsage = clientAuth
      subjectAltName = @alt_names
      
      [alt_names] # service identities (SANs)
      DNS.1 = mindie-service-controller
    • crl.conf
      # OpenSSL configuration for CRL generation
      #
      ####################################################################
      [ ca ] # CA framework declaration: tells OpenSSL which predefined CA configuration block to use by default
      default_ca = CA_default # The default ca section
      ####################################################################
      [ CA_default ] # core CA settings: key paths, files, and default operations
      dir             = {dir}  # root directory of the CA files
      database        = $dir/index.txt
      crlnumber       = $dir/pulp_crl_number
      new_certs_dir   = $dir/newcerts
      certificate     = $dir/ca.pem
      private_key     = $dir/ca.key
      serial          = $dir/serial
      database = /home/{username}/auto_gen_ms_cert/etcdca4/index.txt
      crlnumber = /home/{username}/auto_gen_ms_cert/etcdca4/pulp_crl_number
      
      
      default_days = 365 # how long to certify for
      default_crl_days= 365 # how long before next CRL
      default_md = default # use public key default MD
      preserve = no # keep passed DN ordering
      
      ####################################################################
      [ crl_ext ] # CRL extension attributes
      # CRL extensions.
      # Only issuerAltName and authorityKeyIdentifier make any sense in a CRL.
      # issuerAltName=issuer:copy
      authorityKeyIdentifier=keyid:always,issuer:always

      The {dir} path in the file should be a shared directory that every node can access.

    • gen_etcd_controller_ca.sh
      ```bash
      #!/bin/bash
      # 1. Create the required files and directories
      mkdir -p /home/{username}/auto_gen_ms_cert/etcdca4/newcerts
      touch /home/{username}/auto_gen_ms_cert/etcdca4/index.txt
      echo 1000 > /home/{username}/auto_gen_ms_cert/etcdca4/pulp_crl_number
      echo "01" > /home/{username}/auto_gen_ms_cert/etcdca4/serial
      
      # 2. Set permissions
      chmod 700 /home/{username}/auto_gen_ms_cert/etcdca4/newcerts
      chmod 600 /home/{username}/auto_gen_ms_cert/etcdca4/{index.txt,pulp_crl_number,serial}
      
      # 3. Create the CA certificate
      openssl genrsa -out ca.key 2048
      openssl req -x509 -new -nodes -key ca.key \
      -subj "/CN=my-cluster-ca" \
      -days 3650 -out ca.pem
      
      # 4. Generate the server certificate
      openssl genrsa -out server.key 2048
      openssl req -new -key server.key -out server.csr \
      -subj "/CN=etcd-server" -config server.cnf
      openssl x509 -req -in server.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
      -out server.pem -days 3650 -extensions v3_req -extfile server.cnf
      
      # 5. Generate the certificate revocation list (CRL)
      openssl ca -passin pass:{password} -revoke server.pem -keyfile ca.key -cert ca.pem -config crl.conf
      openssl ca -passin pass:{password} -gencrl -keyfile ca.key -cert ca.pem -out server_crl.pem -config crl.conf
      
      # 6. Generate the client certificate
      openssl genrsa -out client.key 2048
      openssl req -new -key client.key -out client.csr \
      -subj "/CN=etcd-client" -config client.cnf
      openssl x509 -req -in client.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
      -out client.pem -days 3650 -extensions v3_req -extfile client.cnf
      
      # 7. Encrypt the client private key (using KMC)
      # The kmc_encrypt tool is used as an example; configure {password} yourself
      kmc_encrypt -in client.key -out client.key.enc -key_id {password}
      
      # 8. Set permissions
      chmod 0400 ./*.key
      chmod 0400 ./*.pem
      ```
  2. Create the files and directory required by crl.conf.
    touch index.txt
    touch pulp_crl_number
    mkdir newcerts
  3. Run gen_etcd_controller_ca.sh with the following command to generate the server certificate, client certificate, revocation list, and related files.
    bash gen_etcd_controller_ca.sh

    Output similar to the following indicates that the generation succeeded:

    Generating RSA private key, 2048 bit long modulus (2 primes)
    .................+++++
    ............................................................................................................................................................................................................................................................................................................................................................................................................+++++
    e is 65537 (0x010001)
    Generating RSA private key, 2048 bit long modulus (2 primes)
    ..............................................+++++
    ...................................+++++
    e is 65537 (0x010001)
    Signature ok
    subject=CN = etcd-server
    Getting CA Private Key
    Using configuration from crl.conf
    Adding Entry with serial number 3D619FEFB51EEA23F8707008E3C8FACFC8D45547 to DB for /CN=etcd-server
    Revoking Certificate 3D619FEFB51EEA23F8707008E3C8FACFC8D45547.
    Data Base Updated
    Using configuration from crl.conf
    Generating RSA private key, 2048 bit long modulus (2 primes)
    .........................................................+++++
    ...........+++++
    e is 65537 (0x010001)
    Signature ok
    subject=CN = etcd-client
    Getting CA Private Key

    After the script finishes, the following files and directories are generated in the current directory:

    ca.key
    ca.pem
    ca.srl
    client.cnf
    client.csr
    client.key.enc
    client.key
    client.pem
    crl.conf
    gen_etcd_controller_ca.sh
    index.txt
    index.txt.attr
    index.txt.attr.old
    index.txt.old
    key_pwd.txt
    newcerts
    pulp_crl_number
    pulp_crl_number.old
    serial
    server.cnf
    server_crl.pem
    server.csr
    server.key
    server.pem
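
    Optionally, you can sanity-check the generated certificates before deploying them. This is a minimal check run in the same directory:

    # Verify that the server and client certificates chain back to the generated CA
    openssl verify -CAfile ca.pem server.pem client.pem
    # Confirm that the server certificate carries the expected DNS subject alternative names
    openssl x509 -in server.pem -noout -text | grep -A1 "Subject Alternative Name"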

Deploying the ETCD Server

  1. Load the ETCD image with the following command.
    docker pull quay.io/coreos/etcd:v3.6.0-rc.4
    • If docker pull fails, you can download the ETCD image with podman, save it, and then import it with docker load (the save/load step is sketched after this list). The commands are as follows:
      apt install podman
      podman pull quay.io/coreos/etcd:v3.6.0-rc.4
    • The image must be loaded on all three nodes.
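
    The save/load step mentioned above can look as follows; the archive name etcd.tar is only an example.

      # On the node where podman pulled the image, export it to a tar archive
      podman save -o etcd.tar quay.io/coreos/etcd:v3.6.0-rc.4
      # Copy etcd.tar to each of the three nodes, then import it into Docker on every node
      docker load -i etcd.tar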
  2. Create the ETCD resources in the cluster.
    1. Create the local-pvs.yaml file with the following command.
      vim local-pvs.yaml

      Write the following content to the file:

      # local-pvs.yaml: create the PVs
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: etcd-data-0  # must match the StatefulSet's PVC naming rule
      spec:
        capacity:
          storage: 4096M
        volumeMode: Filesystem
        accessModes: [ReadWriteOnce]
        persistentVolumeReclaimPolicy: Retain
        storageClassName: local-storage  # must match the PVC's storageClassName
        local:
          path: /mnt/data/etcd-0  # actual path on the node
        nodeAffinity:
          required:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["ubuntu"]  # bind to a specific node
      
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: etcd-data-1
      spec:
        capacity:
          storage: 4096M
        accessModes: [ReadWriteOnce]
        persistentVolumeReclaimPolicy: Retain
        storageClassName: local-storage
        local:
          path: /mnt/data/etcd-1
        nodeAffinity:
          required:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["worker-80-39"]
      
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: etcd-data-2
      spec:
        capacity:
          storage: 4096M
        accessModes: [ReadWriteOnce]
        persistentVolumeReclaimPolicy: Retain
        storageClassName: local-storage
        local:
          path: /mnt/data/etcd-2
        nodeAffinity:
          required:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["worker-153"]

      The key parameters are as follows:

      • path: the path on the corresponding node; it must actually exist.
      • values: the name of the node to deploy on.
    2. Run the following command on the master node of the K8s cluster.
      kubectl apply -f  local-pvs.yaml

      The following output indicates that the creation succeeded:

      persistentvolume/etcd-data-0 created
      persistentvolume/etcd-data-1 created
      persistentvolume/etcd-data-2 created
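
      Optionally, you can confirm that the created PVs are in the Available state before continuing:

      kubectl get pv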
    3. Label the three nodes with app=etcd using the following command.
      kubectl label nodes <node-name> app=etcd
    4. Create the etcd.yaml file with the following command and configure the certificates on the ETCD Pod side.
      vim etcd.yaml

      Mount the path containing the certificate files generated in step 3 of Generating ETCD Security Certificates into the ETCD container, and configure ETCD to use encrypted communication with ca.pem, server.pem, and server.key. The key configuration items are the TLS-related startup arguments and certificate volume mounts shown below.

      # etcd.yaml: create a replicated ETCD database on the 3 nodes
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: etcd
        namespace: default
      spec:
        type: ClusterIP
        clusterIP: None # headless Service, used for StatefulSet DNS resolution
        selector:
          app: etcd  # select Pods labeled app=etcd
        publishNotReadyAddresses: true  # allow not-ready Pods to be discovered via DNS
        ports:
          - name: etcd-client
            port: 2379 # client communication port
          - name: etcd-server
            port: 2380 # peer (member-to-member) communication port
          - name: etcd-metrics
            port: 8080 # etcd cluster management and metrics port
      ---
      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        name: etcd
        namespace: default
      spec:
        serviceName: etcd # bind the headless Service
        replicas: 3 # an odd number of members preserves Raft quorum
        podManagementPolicy: OrderedReady # start Pods one at a time, in order
        updateStrategy:
          type: RollingUpdate # rolling update strategy
        selector:
          matchLabels:
            app: etcd # match the Pod label
        template:
          metadata:
            labels:
              app: etcd # Pod label
            annotations:
              serviceName: etcd
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                        - key: app
                          operator: In
                          values: [etcd]
                    topologyKey: "kubernetes.io/hostname" # spread across nodes
            containers:
              - name: etcd
                image: quay.io/coreos/etcd:v3.6.0-rc.4 # the ETCD v3.6 image loaded in step 1
                imagePullPolicy: IfNotPresent
                ports:
                  - name: etcd-client
                    containerPort: 2379
                  - name: etcd-server
                    containerPort: 2380
                  - name: etcd-metrics
                    containerPort: 8080
      #          readinessProbe:
      #            httpGet:
      #              path: /readyz
      #              port: 8080
      #            initialDelaySeconds: 10
      #            periodSeconds: 5
      #            timeoutSeconds: 5
      #            successThreshold: 1
      #            failureThreshold: 30
      #          livenessProbe:
      #            httpGet:
      #              path: /livez
      #              port: 8080
      #            initialDelaySeconds: 15
      #            periodSeconds: 10
      #            timeoutSeconds: 5
      #            failureThreshold: 3
                env:
                  - name: K8S_NAMESPACE
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.namespace
                  - name: HOSTNAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: SERVICE_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['serviceName']
                  - name: ETCDCTL_ENDPOINTS
                    value: "$(HOSTNAME).$(SERVICE_NAME):2379"
                  - name: URI_SCHEME
                    value: "https"
                command:
                  - /usr/local/bin/etcd
                args:
                  - --log-level=debug
                  - --name=$(HOSTNAME) # unique member name
                  - --data-dir=/data # data storage path
                  - --wal-dir=/data/wal
                  - --listen-peer-urls=https://0.0.0.0:2380 # listen address for peer communication
                  - --listen-client-urls=https://0.0.0.0:2379 # listen address for client requests
                  - --advertise-client-urls=https://$(HOSTNAME).$(SERVICE_NAME):2379  # advertised client address
                  - --initial-cluster-state=new # initialize a new cluster
                  - --initial-cluster-token=etcd-$(K8S_NAMESPACE) # unique cluster token
                  - --initial-cluster=etcd-0=https://etcd-0.etcd:2380,etcd-1=https://etcd-1.etcd:2380,etcd-2=https://etcd-2.etcd:2380 # initial member list
                  - --initial-advertise-peer-urls=https://$(HOSTNAME).$(SERVICE_NAME):2380 # advertised peer address
                  - --listen-metrics-urls=http://0.0.0.0:8080 # metrics and health endpoint
                  - --quota-backend-bytes=8589934592
                  - --auto-compaction-retention=5m
                  - --auto-compaction-mode=revision
                  - --client-cert-auth
                  - --cert-file=/etc/ssl/certs/etcdca/server.pem
                  - --key-file=/etc/ssl/certs/etcdca/server.key
                  - --trusted-ca-file=/etc/ssl/certs/etcdca/ca.pem
                  - --peer-client-cert-auth
                  - --peer-trusted-ca-file=/etc/ssl/certs/etcdca/ca.pem
                  - --peer-cert-file=/etc/ssl/certs/etcdca/server.pem
                  - --peer-key-file=/etc/ssl/certs/etcdca/server.key
                volumeMounts:
                  - name: etcd-data
                    mountPath: /data # mount persistent storage
                  - name: etcd-ca
                    mountPath: /etc/ssl/certs/etcdca # mount path inside the container for the host directory /home/{username}/auto_gen_ms_cert
            volumes:
              - name: crt
                hostPath:
                  path: /usr/local/Ascend/driver
              - name: etcd-ca
                hostPath:
                  path: /home/{username}/auto_gen_ms_cert # host path where the files were created and generated
                  type: Directory
        volumeClaimTemplates:
          - metadata:
              name: etcd-data
            spec:
              accessModes: [ "ReadWriteOnce" ] # single-node read-write
              storageClassName: local-storage
              resources:
                requests:
                  storage: 4096M # storage size
    5. Configure the certificates on the Controller side.

      Mount the path containing the certificate files generated in step 3 of Generating ETCD Security Certificates into the Controller container, that is, modify the deployment/controller_init.yaml file.

      apiVersion: mindxdl.gitee.com/v1
      kind: AscendJob
      metadata:
        name: mindie-ms-controller        # depends on the actual task; xxxx defaults to mindie-ms
        namespace: mindie            # based on MindIE; can be modified by the user
        labels:
          framework: pytorch
          app: mindie-ms-controller      # fixed
          jobID: xxxx               # name of the inference task, to be configured and appended by the user; xxxx defaults to mindie-ms
          ring-controller.atlas: ascend-910b
      spec:
        schedulerName: volcano   # work when enableGangScheduling is true
        runPolicy:
          schedulingPolicy:      # work when enableGangScheduling is true
            minAvailable: 1      # keep consistent with replicas
            queue: default
        successPolicy: AllWorkers
        replicaSpecs:
          Master:
            replicas: 1           # number of Controller replicas
            restartPolicy: Always
            template:
              metadata:
                labels:
                  #ring-controller.atlas: ascend-910b
                  app: mindie-ms-controller
                  jobID: xxxx      # name of the inference task, to be configured and appended by the user; xxxx defaults to mindie
              spec:                              # keep the default values
                nodeSelector:
                  masterselector: dls-master-node
                  #accelerator: huawei-Ascend910
                  # machine-id: "7"
                terminationGracePeriodSeconds: 0
                automountServiceAccountToken: false
                securityContext:
                  fsGroup: 1001
                containers:
                  - image: mindie:dev-2.0.RC1-B081-800I-A2-py311-ubuntu22.04-aarch64
                    imagePullPolicy: IfNotPresent
                    name: ascend
                    securityContext:
                      # allowPrivilegeEscalation: false
                      privileged: true
                      capabilities:
                        drop: [ "ALL" ]
                      seccompProfile:
                        type: "RuntimeDefault"
                    readinessProbe:
                      exec:
                        command:
                          - bash
                          - -c
                          - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh readiness"
                      periodSeconds: 5
                    livenessProbe:
                      exec:
                        command:
                          - bash
                          - -c
                          - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh liveness"
                      periodSeconds: 5
                    startupProbe:
                      exec:
                        command:
                          - bash
                          - -c
                          - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh startup"
                      periodSeconds: 5
                      failureThreshold: 100
                    env:
                      - name: POD_IP
                        valueFrom:
                          fieldRef:
                            fieldPath: status.podIP
                      - name: GLOBAL_RANK_TABLE_FILE_PATH
                        value: "/user/serverid/devindex/config/..data/global_ranktable.json"
                      - name: MIES_INSTALL_PATH
                        value: $(MINDIE_USER_HOME_PATH)/Ascend/mindie/latest/mindie-service
                      - name: CONFIG_FROM_CONFIGMAP_PATH
                        value: /mnt/configmap
                      - name: CONTROLLER_LOG_CONFIG_PATH
                        value: /root/mindie
                    envFrom:
                      - configMapRef:
                          name: common-env
                    command: [ "/bin/bash", "-c", "
                        /mnt/configmap/boot.sh; \n
                    " ]
                    resources:
                      requests:
                        memory: "2Gi"
                        cpu: "4"
                      limits:
                        memory: "4Gi"
                        cpu: "8"
                    volumeMounts:
                      - name: global-ranktable
                        mountPath: /user/serverid/devindex/config
                      - name: mindie-http-client-ctl-config
                        mountPath: /mnt/configmap/http_client_ctl.json
                        subPath: http_client_ctl.json
                      - name: python-script-get-group-id
                        mountPath: /mnt/configmap/get_group_id.py
                        subPath: get_group_id.py
                      - name: boot-bash-script
                        mountPath: /mnt/configmap/boot.sh
                        subPath: boot.sh
                      - name: mindie-ms-controller-config
                        mountPath: /mnt/configmap/ms_controller.json
                        subPath: ms_controller.json
                      - name: status-data
                        mountPath: /usr/local/Ascend/mindie/latest/mindie-service/logs
                      - name: localtime
                        mountPath: /etc/localtime
                      - name: mnt
                        mountPath: /mnt
                      - name: controller-ca
                        mountPath: /etc/ssl/certs/etcdca # mount path inside the container for the host directory /home/{username}/auto_gen_ms_cert
                volumes:
                  - name: localtime
                    hostPath:
                      path: /etc/localtime
                  - name: global-ranktable
                    configMap:
                      name: global-ranktable
                      defaultMode: 0640
                  - name: mindie-http-client-ctl-config
                    configMap:
                      name: mindie-http-client-ctl-config
                      defaultMode: 0640
                  - name: python-script-get-group-id
                    configMap:
                      name: python-script-get-group-id
                      defaultMode: 0640
                  - name: boot-bash-script
                    configMap:
                      name: boot-bash-script
                      defaultMode: 0550
                  - name: mindie-ms-controller-config
                    configMap:
                      name: mindie-ms-controller-config
                      defaultMode: 0640
                  - name: status-data
                    hostPath:
                      path: /data/mindie-ms/status
                      type: Directory
                  - name: mnt
                    hostPath:
                      path: /mnt
                  - name: controller-ca
                    hostPath:
                      path: /home/{username}/auto_gen_ms_cert # host path where the files were created and generated
                      type: Directory
    6. Configure the user_config.json configuration file.
      • Set "tls_enable" to "true" to enable the CA authentication process.
      • Configure the KMC encryption tool, that is, set the "kmc_ksf_master" and "kmc_ksf_standby" parameters.
      • Configure the certificates in "infer_tls_items" and "management_tls_items"; for details, see the feature description.
      • Set "etcd_server_tls_enable" to "true" and configure the generated client certificate in "etcd_server_tls_items".
      {
        "version": "v1.0",
        "deploy_config": {
          "p_instances_num": 1,
          "d_instances_num": 1,
          "single_p_instance_pod_num": 2,
          "single_d_instance_pod_num": 8,
          "p_pod_npu_num": 8,
          "d_pod_npu_num": 8,
          "prefill_distribute_enable": 0,
          "decode_distribute_enable": 1,
          "image_name": "mindie:dev-2.0.RC1.B091-800I-A2-py311-ubuntu22.04-aarch64",
          "job_id": "mindie",
          "hardware_type": "800I_A2",
          "mindie_env_path": "./conf/mindie_env.json",
          "mindie_host_log_path": "/root/log/ascend_log",
          "mindie_container_log_path": "/root/mindie",
          "weight_mount_path": "/mnt/mindie_data/deepseek_diff_level/deepseek_r1_w8a8_ep_level2",
          "controller_backup_cfg": {
            "function_sw": false
          },
      
          "tls_config": {
            "tls_enable": true,
            "kmc_ksf_master": "/etc/ssl/certs/etcdca/tools/pmt/master/ksfa",
            "kmc_ksf_standby": "/etc/ssl/certs/etcdca/tools/pmt/standby/ksfb",
            "infer_tls_items": {
              "ca_cert": "./security/infer/security/certs/ca.pem",
              "tls_cert": "./security/infer/security/certs/cert.pem",
              "tls_key": "./security/infer/security/keys/cert.key.pem",
              "tls_passwd": "./security/infer/security/pass/key_pwd.txt",
              "tls_crl": ""
            },
            "management_tls_items": {
              "ca_cert": "./security/management/security/certs/management_ca.pem",
              "tls_cert": "./security/management/security/certs/cert.pem",
              "tls_key": "./security/management/security/keys/cert.key.pem",
              "tls_passwd": "./security/management/security/pass/key_pwd.txt",
              "tls_crl": ""
            },
            "cluster_tls_enable": true,
            "cluster_tls_items": {
              "ca_cert": "./security/clusterd/security/certs/ca.pem",
              "tls_cert": "./security/clusterd/security/certs/cert.pem",
              "tls_key": "./security/clusterd/security/keys/cert.key.pem",
              "tls_passwd": "./security/clusterd/security/pass/key_pwd.txt",
              "tls_crl": ""
            },
            "etcd_server_tls_enable": true,
            "etcd_server_tls_items": {
              "ca_cert" : "/etc/ssl/certs/etcdca/ca.pem",
              "tls_cert": "/etc/ssl/certs/etcdca/client.pem",
              "tls_key": "/etc/ssl/certs/etcdca/client.key.enc",
              "tls_passwd": "/etc/ssl/certs/etcdca/key_pwd.txt",
              "kmc_ksf_master": "/etc/ssl/certs/etcdca/tools/pmt/master/ksfa",
              "kmc_ksf_standby": "/etc/ssl/certs/etcdca/tools/pmt/standby/ksfb",
              "tls_crl": "/etc/ssl/certs/etcdca/server_crl.pem"
            }
          }
        },
      ...
      }
    7. Run the following command on the master node of the K8s cluster.
      kubectl apply -f etcd.yaml

      The following output indicates that the creation succeeded:

      service/etcd created
      statefulset.apps/etcd created
    8. Query the Pods of the ETCD cluster with the following command.
      kubectl get pod -A

      The output is similar to the following (an optional TLS health check of the cluster is sketched after this step):

      NAMESPACE  NAME    READY  STATUS   RESTARTS  AGE  IP               NODE          NOMINATED NODE   READINESS GATES
      default    etcd-0  1/1    Running  0         44h  xxx.xxx.xxx.xxx  ubuntu        <none>           <none>
      default    etcd-1  1/1    Running  0         44h  xxx.xxx.xxx.xxx  worker-153    <none>           <none>
      default    etcd-2  1/1    Running  0         44h  xxx.xxx.xxx.xxx  worker-80-39  <none>           <none>

    If you need to modify the yaml files of the ETCD cluster and re-create the ETCD resources, delete the existing resources first with the following command:

    kubectl delete -f etcd.yaml && kubectl delete pvc --all && kubectl delete pv etcd-data-0 etcd-data-1 etcd-data-2
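
    Optionally, you can verify the health of the deployed ETCD cluster from inside one of its Pods. This is a minimal sketch; it assumes etcdctl is available in the ETCD image and reuses the certificates mounted at /etc/ssl/certs/etcdca as configured in etcd.yaml.

    # Query the health of all three members over mutual TLS from inside the etcd-0 Pod
    kubectl exec etcd-0 -n default -- etcdctl \
      --endpoints=https://etcd-0.etcd:2379,https://etcd-1.etcd:2379,https://etcd-2.etcd:2379 \
      --cacert=/etc/ssl/certs/etcdca/ca.pem \
      --cert=/etc/ssl/certs/etcdca/server.pem \
      --key=/etc/ssl/certs/etcdca/server.key \
      endpoint health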
  3. Configure anti-affinity scheduling for the Controller.

    Anti-affinity scheduling is configured in the deployment/controller_init.yaml file. It is not configured by default, in which case both Controller instances are scheduled onto the same node. To configure anti-affinity scheduling, paste the following content under the spec field of the file; the master and standby Controllers will then be scheduled onto different nodes.

    affinity:
      podAntiAffinity:
        requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
                - key: app
                  operator: In
                  values:
                  - mindie-ms-controller
            topologyKey: kubernetes.io/hostname

    The line containing the masterselector parameter in controller_init.yaml must be commented out so that the Controllers can be deployed on the target nodes; otherwise they will be deployed on the master node and the deployment will fail.

  4. Apply the same label to the two nodes where the Controllers are to be deployed, using the following command.
    kubectl label node worker-155 accelerator=huawei-Ascend310P
    • node: the resource type.
    • worker-155: the node name; replace it with the actual name.
    • accelerator=huawei-Ascend310P: the label key and value; replace them with the actual label value.
      • The label value can be queried with the following command; the value of accelerator in the output is the label value.
        kubectl get node -A --show-labels | grep accelerator
      • If the command above returns no label value, use the following command to label the node with accelerator=huawei-Ascend910 or accelerator=huawei-Ascend310P according to the NPU device type actually in use.
        kubectl label node {node-name} accelerator=huawei-Ascend910

    Then set the "accelerator" parameter under the nodeSelector field in the controller_init.yaml file to the node label above (for example, accelerator: huawei-Ascend310P).

  5. In the user_config.json configuration file, enable the parameter that allows two Controller instances to be deployed, as shown in the following configuration. After the deployment, you can confirm the scheduling result with the check sketched at the end of this section.
    ...
            "controller_backup_cfg": {
              "function_sw": true
            },
    ...
    • false: disabled.
    • true: enabled.
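
After the master and standby Controllers are deployed, you can confirm that they were scheduled onto different nodes. This is a minimal check, assuming the app=mindie-ms-controller label and the mindie namespace from controller_init.yaml:

  # List the Controller Pods together with the nodes they run on;
  # with anti-affinity configured, the two Pods should appear on different nodes
  kubectl get pods -n mindie -l app=mindie-ms-controller -o wide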