Usage Example

Restrictions

  • The active and standby Coordinator nodes cannot be deployed on the same node.
  • The ETCD server must be v3.6.
  • The feature takes effect only if the ETCD server is correctly deployed; the server needs 3 replicas to ensure the reliability of the ETCD cluster.
  • If Controller active/standby switchover is already deployed, the Coordinator can share that ETCD cluster: skip the steps for generating the ETCD security certificates and deploying the ETCD server, and go directly to Configuring and Starting MindIE to deploy Coordinator active/standby switchover.

Generating ETCD Security Certificates

Coordinator active/standby switchover relies on ETCD's distributed lock, which involves communication between different Pods in the cluster. Mutual authentication with CA certificates is recommended; configure the certificates as follows.

If CA-based mutual authentication is not used, traffic between the services is transmitted in plaintext, which may pose a significant network security risk.

  1. Prepare the prerequisite files for certificate generation. The examples below place them in /home/{username}/auto_gen_ms_cert.
    • server.cnf
      [req] # request settings
      req_extensions = v3_req
      distinguished_name = req_distinguished_name
      
      [req_distinguished_name] # certificate subject
      countryName = CN
      stateOrProvinceName = State
      localityName = City
      organizationName = Organization
      organizationalUnitName = Unit
      commonName = etcd-server
      
      [v3_req] # core extensions
      basicConstraints = CA:FALSE
      keyUsage = digitalSignature, keyEncipherment
      extendedKeyUsage = serverAuth, clientAuth
      subjectAltName = @alt_names
      
      [alt_names] # service identities (SANs)
      DNS.1 = etcd
      DNS.2 = etcd.default
      DNS.3 = etcd.default.svc
      DNS.4 = etcd.default.svc.cluster.local
      DNS.5 = etcd-0.etcd
      DNS.6 = etcd-0.etcd.default.svc.cluster.local
      DNS.7 = etcd-1.etcd
      DNS.8 = etcd-1.etcd.default.svc.cluster.local
      DNS.9 = etcd-2.etcd
      DNS.10 = etcd-2.etcd.default.svc.cluster.local
    • client.cnf
      [req] # request settings
      req_extensions = v3_req
      distinguished_name = req_distinguished_name
      
      [req_distinguished_name] # certificate subject
      countryName = CN
      stateOrProvinceName = State
      localityName = City
      organizationName = Organization
      organizationalUnitName = Unit
      commonName = etcd-client
      
      [v3_req] # core extensions
      basicConstraints = CA:FALSE
      keyUsage = digitalSignature, keyEncipherment
      extendedKeyUsage = clientAuth
      subjectAltName = @alt_names
      
      [alt_names] # service identities (SANs)
      DNS.1 = mindie-service-controller
      DNS.2 = mindie-service-coordinator
    • crl.conf
      # OpenSSL configuration for CRL generation
      #
      ####################################################################
      [ ca ] # CA framework declaration: tells OpenSSL which CA configuration block to use by default
      default_ca = CA_default # The default ca section
      ####################################################################
      [ CA_default ] # Core CA settings: all key paths, files and default operations
      dir             = {dir}  # root directory definition
      database        = $dir/index.txt
      crlnumber       = $dir/pulp_crl_number
      new_certs_dir   = $dir/newcerts
      certificate     = $dir/ca.pem
      private_key     = $dir/ca.key
      serial          = $dir/serial
      
      
      default_days = 365 # how long to certify for
      default_crl_days= 365 # how long before next CRL
      default_md = default # use public key default MD
      preserve = no # keep passed DN ordering
      
      ####################################################################
      [ crl_ext ] # CRL extensions
      # CRL extensions.
      # Only issuerAltName and authorityKeyIdentifier make any sense in a CRL.
      # issuerAltName=issuer:copy
      authorityKeyIdentifier=keyid:always,issuer:always

      The {dir} path in this file should be a shared directory accessible from every node, for example /home/{username}/auto_gen_ms_cert/etcdca4 (the directory created by gen_etcd_ca.sh below).

    • gen_etcd_ca.sh
      #!/bin/bash
      # 1. Create the required files and directories
      mkdir -p /home/{username}/auto_gen_ms_cert/etcdca4/newcerts
      touch /home/{username}/auto_gen_ms_cert/etcdca4/index.txt
      echo 1000 > /home/{username}/auto_gen_ms_cert/etcdca4/pulp_crl_number
      echo "01" > /home/{username}/auto_gen_ms_cert/etcdca4/serial
      
      # 2. Set permissions
      chmod 700 /home/{username}/auto_gen_ms_cert/etcdca4/newcerts
      chmod 600 /home/{username}/auto_gen_ms_cert/etcdca4/{index.txt,pulp_crl_number,serial}
      
      # 3. Create the CA certificate
      openssl genrsa -out ca.key 2048
      openssl req -x509 -new -nodes -key ca.key \
      -subj "/CN=my-cluster-ca" \
      -days 3650 -out ca.pem
      
      # 4. Generate the server certificate
      openssl genrsa -out server.key 2048
      openssl req -new -key server.key -out server.csr \
      -subj "/CN=etcd-server" -config server.cnf
      openssl x509 -req -in server.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
      -out server.pem -days 3650 -extensions v3_req -extfile server.cnf
      
      # 5. Generate the certificate revocation list (CRL)
      # Note: this example revokes server.pem itself to demonstrate CRL generation;
      # in production, revoke only certificates that actually need to be invalidated.
      openssl ca -passin pass:{password} -revoke server.pem -keyfile ca.key -cert ca.pem -config crl.conf
      openssl ca -passin pass:{password} -gencrl -keyfile ca.key -cert ca.pem -out server_crl.pem -config crl.conf
      
      # 6. Generate the client certificate
      openssl genrsa -out client.key 2048
      openssl req -new -key client.key -out client.csr \
      -subj "/CN=etcd-client" -config client.cnf
      openssl x509 -req -in client.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
      -out client.pem -days 3650 -extensions v3_req -extfile client.cnf
      
      # 7. Encrypt the client private key (using the kmc_encrypt tool as an example;
      # set {password} yourself)
      kmc_encrypt -in client.key -out client.key.enc -key_id {password}
      
      # 8. Set permissions
      chmod 0400 ./*.key
      chmod 0400 ./*.pem
  2. Create the files and directory required by crl.conf.
    touch index.txt
    touch pulp_crl_number
    mkdir newcerts
  3. Run gen_etcd_ca.sh with the following command to generate the server certificate, client certificate, revocation list and related files.
    bash gen_etcd_ca.sh

    Output similar to the following indicates that generation succeeded:

    Generating RSA private key, 2048 bit long modulus (2 primes)
    .................+++++
    ............................................................................................................................................................................................................................................................................................................................................................................................................+++++
    e is 65537 (0x010001)
    Generating RSA private key, 2048 bit long modulus (2 primes)
    ..............................................+++++
    ...................................+++++
    e is 65537 (0x010001)
    Signature ok
    subject=CN = etcd-server
    Getting CA Private Key
    Using configuration from crl.conf
    Adding Entry with serial number 3D619FEFB51EEA23F8707008E3C8FACFC8D45547 to DB for /CN=etcd-server
    Revoking Certificate 3D619FEFB51EEA23F8707008E3C8FACFC8D45547.
    Data Base Updated
    Using configuration from crl.conf
    Generating RSA private key, 2048 bit long modulus (2 primes)
    .........................................................+++++
    ...........+++++
    e is 65537 (0x010001)
    Signature ok
    subject=CN = etcd-client
    Getting CA Private Key

    After the script completes, the following files and directories exist in the current directory:

    ca.key
    ca.pem
    ca.srl
    client.cnf
    client.csr
    client_encrypted.key
    client.key
    client.pem
    crl.conf
    gen_etcd_ca.sh
    index.txt
    index.txt.attr
    index.txt.attr.old
    index.txt.old
    key_pwd.txt
    newcerts
    pulp_crl_number
    pulp_crl_number.old
    serial
    server.cnf
    server_crl.pem
    server.csr
    server.key
    server.pem
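
    As a quick check, you can verify the issued certificates against the CA and inspect the server certificate's SAN list (a minimal sketch, assuming all files are in the current directory):

    # Both commands should print "OK"
    openssl verify -CAfile ca.pem server.pem
    openssl verify -CAfile ca.pem client.pem
    # The SAN list should contain the etcd DNS names from server.cnf
    openssl x509 -in server.pem -noout -text | grep -A1 "Subject Alternative Name"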

Deploying the ETCD Server

  1. Load the ETCD image with the following command.
    docker pull quay.io/coreos/etcd:v3.6.0-rc.4
    • If docker pull fails, you can download the ETCD image with podman, save it, and import it with docker load:
      apt install podman
      podman pull quay.io/coreos/etcd:v3.6.0-rc.4
      podman save -o etcd-v3.6.0-rc.4.tar quay.io/coreos/etcd:v3.6.0-rc.4
      docker load -i etcd-v3.6.0-rc.4.tar
    • The image must be loaded on all three nodes.
  2. Create the ETCD resources in the cluster.
    1. Create the local-pvs.yaml file with the following command.
      vim local-pvs.yaml

      Write the following content into the file:

      # local-pvs.yaml: create the PVs
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: etcd-data-0  # must match the StatefulSet's PVC naming rule
      spec:
        capacity:
          storage: 4096M
        volumeMode: Filesystem
        accessModes: [ReadWriteOnce]
        persistentVolumeReclaimPolicy: Retain
        storageClassName: local-storage  # must match the PVC's storageClassName
        local:
          path: /mnt/data/etcd-0  # actual path on the node
        nodeAffinity:
          required:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["ubuntu"]  # 绑定到特定节点
      
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: etcd-data-1
      spec:
        capacity:
          storage: 4096M
        accessModes: [ReadWriteOnce]
        persistentVolumeReclaimPolicy: Retain
        storageClassName: local-storage
        local:
          path: /mnt/data/etcd-1
        nodeAffinity:
          required:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["worker-80-39"]
      
      ---
      apiVersion: v1
      kind: PersistentVolume
      metadata:
        name: etcd-data-2
      spec:
        capacity:
          storage: 4096M
        accessModes: [ReadWriteOnce]
        persistentVolumeReclaimPolicy: Retain
        storageClassName: local-storage
        local:
          path: /mnt/data/etcd-2
        nodeAffinity:
          required:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/hostname
                    operator: In
                    values: ["worker-153"]

      Key parameters:

      • path: the path on the corresponding node; it must actually exist (see the preparation sketch below).
      • values: the name of the node to deploy to.
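
      The local paths must exist on each node before the PVs can be bound. A minimal preparation sketch, assuming the example node names above (run the matching command on the corresponding node):

      mkdir -p /mnt/data/etcd-0   # on node "ubuntu"
      mkdir -p /mnt/data/etcd-1   # on node "worker-80-39"
      mkdir -p /mnt/data/etcd-2   # on node "worker-153"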
    2. Run the following command on the master node of the K8s cluster.
      kubectl apply -f local-pvs.yaml

      The following output indicates that creation succeeded:

      persistentvolume/etcd-data-0 created
      persistentvolume/etcd-data-1 created
      persistentvolume/etcd-data-2 created
    3. Label the 3 nodes with app=etcd using the following command.
      kubectl label nodes <node-name> app=etcd
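
      You can confirm the labels with the following command, which should list all three nodes:

      kubectl get nodes -l app=etcd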
    4. Create the etcd.yaml file with the following command and configure the ETCD Pod-side certificates.
      vim etcd.yaml

      Mount the directory containing the certificate files generated in Generating ETCD Security Certificates into the ETCD container, and configure ETCD for encrypted communication using ca.pem, server.pem and server.key; the key settings are the TLS-related fields below.

      # etcd.yaml: create a replicated ETCD database across the 3 nodes
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: etcd
        namespace: default
      spec:
        type: ClusterIP
        clusterIP: None # headless Service, used for the StatefulSet's DNS resolution
        selector:
          app: etcd  # select Pods labeled app=etcd
        publishNotReadyAddresses: true  # allow not-ready Pods to be discovered via DNS
        ports:
          - name: etcd-client
            port: 2379 # client port
          - name: etcd-server
            port: 2380 # peer (server-to-server) port
          - name: etcd-metrics
            port: 8080 # cluster metrics/health port
      ---
      apiVersion: apps/v1
      kind: StatefulSet
      metadata:
        name: etcd
        namespace: default
      spec:
        serviceName: etcd # bind the headless Service
        replicas: 3 # odd number of members to keep Raft quorum
        podManagementPolicy: OrderedReady # Pods start sequentially; use Parallel for concurrent startup
        updateStrategy:
          type: RollingUpdate # rolling update strategy
        selector:
          matchLabels:
            app: etcd # match the Pod label
        template:
          metadata:
            labels:
              app: etcd # Pod label
            annotations:
              serviceName: etcd
          spec:
            affinity:
              podAntiAffinity:
                requiredDuringSchedulingIgnoredDuringExecution:
                  - labelSelector:
                      matchExpressions:
                        - key: app
                          operator: In
                          values: [etcd]
                    topologyKey: "kubernetes.io/hostname" # spread across nodes
            containers:
              - name: etcd
                image: quay.io/coreos/etcd:v3.6.0-rc.4 # must match the image loaded in step 1
                imagePullPolicy: IfNotPresent
                ports:
                  - name: etcd-client
                    containerPort: 2379
                  - name: etcd-server
                    containerPort: 2380
                  - name: etcd-metrics
                    containerPort: 8080
      #          readinessProbe:
      #            httpGet:
      #              path: /readyz
      #              port: 8080
      #            initialDelaySeconds: 10
      #            periodSeconds: 5
      #            timeoutSeconds: 5
      #            successThreshold: 1
      #            failureThreshold: 30
      #          livenessProbe:
      #            httpGet:
      #              path: /livez
      #              port: 8080
      #            initialDelaySeconds: 15
      #            periodSeconds: 10
      #            timeoutSeconds: 5
      #            failureThreshold: 3
                env:
                  - name: MY_POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: K8S_NAMESPACE
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.namespace
                  - name: HOSTNAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.name
                  - name: SERVICE_NAME
                    valueFrom:
                      fieldRef:
                        fieldPath: metadata.annotations['serviceName']
                  - name: ETCDCTL_ENDPOINTS
                    value: "$(HOSTNAME).$(SERVICE_NAME):2379"
                  - name: URI_SCHEME
                    value: "https"
                command:
                  - /usr/local/bin/etcd
                args:
                  - --log-level=debug
                  - --name=$(HOSTNAME) # unique node name
                  - --data-dir=/data # data directory
                  - --wal-dir=/data/wal
                  - --listen-peer-urls=https://$(MY_POD_IP):2380 # peer (member-to-member) listener
                  - --listen-client-urls=https://$(MY_POD_IP):2379 # client listener
                  - --advertise-client-urls=https://$(HOSTNAME).$(SERVICE_NAME):2379  # advertised client address
                  - --initial-cluster-state=new # bootstrap a new cluster
                  - --initial-cluster-token=etcd-$(K8S_NAMESPACE) # unique cluster token
                  - --initial-cluster=etcd-0=https://etcd-0.etcd:2380,etcd-1=https://etcd-1.etcd:2380,etcd-2=https://etcd-2.etcd:2380 # initial member list
                  - --initial-advertise-peer-urls=https://$(HOSTNAME).$(SERVICE_NAME):2380 # advertised peer address
                  - --listen-metrics-urls=http://$(MY_POD_IP):8080 # metrics/health endpoint
                  - --quota-backend-bytes=8589934592
                  - --auto-compaction-retention=5m
                  - --auto-compaction-mode=revision
                  - --client-cert-auth
                  - --cert-file=/etc/ssl/certs/etcdca/server.pem
                  - --key-file=/etc/ssl/certs/etcdca/server.key
                  - --trusted-ca-file=/etc/ssl/certs/etcdca/ca.pem
                  - --peer-client-cert-auth
                  - --peer-trusted-ca-file=/etc/ssl/certs/etcdca/ca.pem
                  - --peer-cert-file=/etc/ssl/certs/etcdca/server.pem
                  - --peer-key-file=/etc/ssl/certs/etcdca/server.key
                volumeMounts:
                  - name: etcd-data
                    mountPath: /data # 挂载持久化存储
                  - name: etcd-ca
                    mountPath: /etc/ssl/certs/etcdca # mount path in the container for the host directory /home/{username}/auto_gen_ms_cert
            volumes:
              - name: crt
                hostPath:
                  path: /usr/local/Ascend/driver
              - name: etcd-ca
                hostPath:
                  path: /home/{username}/auto_gen_ms_cert # host path where the files were created and generated
                  type: Directory
        volumeClaimTemplates:
          - metadata:
              name: etcd-data
            spec:
              accessModes: [ "ReadWriteOnce" ] # 单节点读写
              storageClassName: local-storage
              resources:
                requests:
                  storage: 4096M # storage size
    5. Run the following command on the master node of the K8s cluster.
      kubectl apply -f etcd.yaml

      The following output indicates that creation succeeded:

      service/etcd created
      statefulset.apps/etcd created
    6. Query the Pods of the ETCD cluster with the following command.
      kubectl get pod -A

      The output looks like the following:

      NAMESPACE   NAME     READY   STATUS    RESTARTS   AGE   IP                NODE           NOMINATED NODE   READINESS GATES
      default     etcd-0   1/1     Running   0          44h   xxx.xxx.xxx.xxx   ubuntu         <none>           <none>
      default     etcd-1   1/1     Running   0          44h   xxx.xxx.xxx.xxx   worker-153     <none>           <none>
      default     etcd-2   1/1     Running   0          44h   xxx.xxx.xxx.xxx   worker-80-39   <none>           <none>

      To modify the yaml files of the ETCD cluster and re-create the ETCD resources, delete the existing ones first with the following command:

      kubectl delete -f etcd.yaml && kubectl delete pvc --all && kubectl delete pv etcd-data-0 etcd-data-1 etcd-data-2
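
      Once all three Pods are Running, you can check cluster health over TLS with etcdctl (a sketch; server.pem also carries clientAuth per server.cnf, so it can serve as the client certificate here):

      kubectl exec etcd-0 -- etcdctl \
          --endpoints=https://etcd-0.etcd:2379,https://etcd-1.etcd:2379,https://etcd-2.etcd:2379 \
          --cacert=/etc/ssl/certs/etcdca/ca.pem \
          --cert=/etc/ssl/certs/etcdca/server.pem \
          --key=/etc/ssl/certs/etcdca/server.key \
          endpoint health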

Configuring the K8s Management Plane

When a hardware fault occurs (such as a machine reboot), the K8s cluster cannot quickly detect the state of the Pods, and the inference service may not recover within the expected time. The following steps speed up service recovery.

If you have no requirement on how long a hardware fault may affect the service, these steps can be skipped.

  1. Query the node heartbeat timeout threshold (node-monitor-grace-period) of the K8s management plane with the following command; an empty result means the default value is in use.
    kubectl describe pod <kube-controller-manager-pod-name> -n kube-system | grep "node-monitor-grace-period"
  2. Open and modify the node heartbeat timeout threshold (node-monitor-grace-period) with the following command. The configuration file is usually /etc/kubernetes/manifests/kube-controller-manager.yaml on the control-plane node (the node running kube-controller-manager).
    vi /etc/kubernetes/manifests/kube-controller-manager.yaml
    Modify the file as follows:
    apiVersion: v1
    kind: Pod
    metadata:
      name: kube-controller-manager-<control-plane-node-name>  # e.g. kube-controller-manager-node-97-10
      namespace: kube-system
    spec:
      containers:
      - command:
        - kube-controller-manager
        # other existing parameters... (keep unchanged)
        - --kubeconfig=/etc/kubernetes/controller-manager.conf
        - --authentication-kubeconfig=/etc/kubernetes/controller-manager.conf
        # add/modify the node-monitor-grace-period parameter (set to 20s)
        - --node-monitor-grace-period=20s
        # other existing parameters...
  3. Press "Esc", type :wq!, and press "Enter" to save and exit.
  4. Restart the node that runs kube-controller-manager.
  5. Verify that the parameter has taken effect with the following command.
    kubectl describe pod <kube-controller-manager-pod-name> -n kube-system | grep "node-monitor-grace-period"

    The following output indicates that the parameter has taken effect:

    --node-monitor-grace-period=20s
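
    To observe the effect, you can stop kubelet on a worker node and watch how quickly the node is marked NotReady; with the setting above this should take roughly 20s instead of the default (typically 40s):

    kubectl get nodes -w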

Configuring and Starting MindIE

  1. Configure the Coordinator-side certificates.

    To enable CA certificate authentication, mount the directory containing the certificate files generated in Generating ETCD Security Certificates into the Coordinator container, i.e. modify the deployment/coordinator_init.yaml file.

    apiVersion: mindxdl.gitee.com/v1
    kind: AscendJob
    metadata:
      name: mindie-ms-coordinator        # per actual job; xxxx defaults to mindie-ms
      namespace: mindie            # MindIE by default; user-modifiable
      labels:
        framework: pytorch
        app: mindie-ms-coordinator      # fixed
        jobID: xxxx               # name of the inference job, user-configured (appended); xxxx defaults to mindie-ms
        ring-controller.atlas: ascend-910b
    spec:
      schedulerName: volcano   # work when enableGangScheduling is true
      runPolicy:
        schedulingPolicy:      # work when enableGangScheduling is true
          minAvailable: 1      # keep consistent with replicas
          queue: default
      successPolicy: AllWorkers
      replicaSpecs:
        Master:
          replicas: 1           # number of Coordinator replicas
          restartPolicy: Always
          template:
            metadata:
              labels:
                #ring-controller.atlas: ascend-910b
                app: mindie-ms-coordinator
                jobID: xxxx      # name of the inference job, user-configured (appended); xxxx defaults to mindie
            spec:                # keep consistent with the previous configuration
    #          nodeSelector:
                #accelerator: huawei-Ascend910
                # machine-id: "7"
              terminationGracePeriodSeconds: 0
              automountServiceAccountToken: false
              securityContext:
                fsGroup: 1001
              containers:
                - image: mindie:dev-2.0.RC1-B087-800I-A2-py311-ubuntu22.04-aarch64
                  imagePullPolicy: IfNotPresent
                  name: ascend
                  securityContext:
                    # allowPrivilegeEscalation: false
                    privileged: true
                    capabilities:
                      drop: [ "ALL" ]
                    seccompProfile:
                      type: "RuntimeDefault"
                  readinessProbe:
                    exec:
                      command:
                        - bash
                        - -c
                        - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh readiness"
                    periodSeconds: 5
                  livenessProbe:
                    exec:
                      command:
                        - bash
                        - -c
                        - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh liveness"
                    periodSeconds: 5
                    timeoutSeconds: 4
                  startupProbe:
                    exec:
                      command:
                        - bash
                        - -c
                        - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh startup"
                    periodSeconds: 5
                    failureThreshold: 100
                  env:
                    - name: POD_IP
                      valueFrom:
                        fieldRef:
                          fieldPath: status.podIP
                    - name: GLOBAL_RANK_TABLE_FILE_PATH
                      value: "/user/serverid/devindex/config/..data/global_ranktable.json"
                    - name: MIES_INSTALL_PATH
                      value: $(MINDIE_USER_HOME_PATH)/Ascend/mindie/latest/mindie-service
                    - name: CONFIG_FROM_CONFIGMAP_PATH
                      value: /mnt/configmap
                    - name: COORDINATOR_LOG_CONFIG_PATH
                      value: /root/mindie
                  envFrom:
                    - configMapRef:
                        name: common-env
                  command: [ "/bin/bash", "-c", "
                          /mnt/configmap/boot.sh; \n
                      " ]
                  resources:
                    requests:
                      memory: "4Gi"
                      cpu: "16"
                    limits:
                      memory: "8Gi"
                      cpu: "64"
                  volumeMounts:
                    - name: global-ranktable
                      mountPath: /user/serverid/devindex/config
                    - name: mindie-http-client-ctl-config
                      mountPath: /mnt/configmap/http_client_ctl.json
                      subPath: http_client_ctl.json
                    - name: python-script-get-group-id
                      mountPath: /mnt/configmap/get_group_id.py
                      subPath: get_group_id.py
                    - name: boot-bash-script
                      mountPath: /mnt/configmap/boot.sh
                      subPath: boot.sh
                    - name: mindie-ms-controller-config
                      mountPath: /mnt/configmap/ms_controller.json
                      subPath: ms_controller.json
                    - name: mindie-ms-coordinator-config
                      mountPath: /mnt/configmap/ms_coordinator.json
                      subPath: ms_coordinator.json
                    - name: localtime
                      mountPath: /etc/localtime
                    - name: mnt
                      mountPath: /mnt
                    - name: coredump
                      mountPath: /var/coredump
                    - name: coordinator-ca
                      mountPath: /etc/ssl/certs/etcdca # mount path in the container for the host directory /home/{username}/auto_gen_ms_cert
              volumes:
                - name: localtime
                  hostPath:
                    path: /etc/localtime
                - name: global-ranktable
                  configMap:
                    name: global-ranktable
                    defaultMode: 0640
                - name: mindie-http-client-ctl-config
                  configMap:
                    name: mindie-http-client-ctl-config
                    defaultMode: 0640
                - name: python-script-get-group-id
                  configMap:
                    name: python-script-get-group-id
                    defaultMode: 0640
                - name: boot-bash-script
                  configMap:
                    name: boot-bash-script
                    defaultMode: 0550
                - name: mindie-ms-controller-config
                  configMap:
                    name: mindie-ms-controller-config
                    defaultMode: 0640
                - name: mindie-ms-coordinator-config
                  configMap:
                    name: mindie-ms-coordinator-config
                    defaultMode: 0640
                - name: mnt
                  hostPath:
                    path: /mnt
                - name: coredump
                  hostPath:
                    path: /var/coredump
                    type: DirectoryOrCreate
                - name: coordinator-ca
                  hostPath:
                    path: /home/{username}/auto_gen_ms_cert # host path where the files were created and generated
                    type: Directory
    ---
    apiVersion: v1
    kind: Service
    metadata:
      labels:
        app: mindie-ms-coordinator
        jobID: xxxx                     # name of the inference job, user-configured (appended); xxxx defaults to mindie-ms
      name: mindie-ms-coordinator-infer
      namespace: mindie
    spec:
      ports:
        - nodePort: 31015
          port: 1025
          protocol: TCP
          targetPort: 1025
      selector:
        app: mindie-ms-coordinator
      sessionAffinity: None
      type: NodePort
    status:
      loadBalancer: {}
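
    After the Coordinator Pod starts, you can confirm that the certificate directory is mounted as expected (a sketch; substitute your actual Coordinator Pod name):

    kubectl -n mindie get pods -l app=mindie-ms-coordinator
    kubectl -n mindie exec <coordinator-pod-name> -- ls /etc/ssl/certs/etcdca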
  2. Configure the user_config.json file.
    • Set "tls_enable" to "true" so that the CA authentication flow takes effect;
    • Configure the kmc encryption tool, i.e. the "kmc_ksf_master" and "kmc_ksf_standby" parameters;
    • Configure the certificates for "infer_tls_items" and "management_tls_items"; for details, see Feature Introduction;
    • Set "etcd_server_tls_enable" to "true" and configure the generated client certificates in "etcd_server_tls_items".
    {
      "version": "v1.0",
      "deploy_config": {
        "p_instances_num": 1,
        "d_instances_num": 1,
        "single_p_instance_pod_num": 2,
        "single_d_instance_pod_num": 8,
        "p_pod_npu_num": 8,
        "d_pod_npu_num": 8,
        "prefill_distribute_enable": 0,
        "decode_distribute_enable": 1,
        "image_name": "mindie:dev-2.0.RC1.B091-800I-A2-py311-ubuntu22.04-aarch64",
        "job_id": "mindie",
        "hardware_type": "800I_A2",
        "mindie_env_path": "./conf/mindie_env.json",
        "mindie_host_log_path": "/root/log/ascend_log",
        "mindie_container_log_path": "/root/mindie",
        "weight_mount_path": "/mnt/mindie_data/deepseek_diff_level/deepseek_r1_w8a8_ep_level2",
        "coordinator_backup_cfg": {
          "function_enable": true
        },
        "controller_backup_cfg": {
          "function_sw": false
        },
    
        "tls_config": {
          "tls_enable": true,
          "kmc_ksf_master": "/etc/ssl/certs/etcdca/tools/pmt/master/ksfa",
          "kmc_ksf_standby": "/etc/ssl/certs/etcdca/tools/pmt/standby/ksfb",
          "infer_tls_items": {
            "ca_cert": "./security/infer/security/certs/ca.pem",
            "tls_cert": "./security/infer/security/certs/cert.pem",
            "tls_key": "./security/infer/security/keys/cert.key.pem",
            "tls_passwd": "./security/infer/security/pass/key_pwd.txt",
            "tls_crl": ""
          },
          "management_tls_items": {
            "ca_cert": "./security/management/security/certs/management_ca.pem",
            "tls_cert": "./security/management/security/certs/cert.pem",
            "tls_key": "./security/management/security/keys/cert.key.pem",
            "tls_passwd": "./security/management/security/pass/key_pwd.txt",
            "tls_crl": ""
          },
          "cluster_tls_enable": true,
          "cluster_tls_items": {
            "ca_cert": "./security/clusterd/security/certs/ca.pem",
            "tls_cert": "./security/clusterd/security/certs/cert.pem",
            "tls_key": "./security/clusterd/security/keys/cert.key.pem",
            "tls_passwd": "./security/clusterd/security/pass/key_pwd.txt",
            "tls_crl": ""
          },
          "etcd_server_tls_enable": true,
          "etcd_server_tls_items": {
            "ca_cert" : "/etc/ssl/certs/etcdca/ca.pem",
            "tls_cert": "/etc/ssl/certs/etcdca/client.pem",
            "tls_key": "/etc/ssl/certs/etcdca/client.key.enc",
            "tls_passwd": "/etc/ssl/certs/etcdca/key_pwd.txt",
            "kmc_ksf_master": "/etc/ssl/certs/etcdca/tools/pmt/master/ksfa",
            "kmc_ksf_standby": "/etc/ssl/certs/etcdca/tools/pmt/standby/ksfb",
            "tls_crl": "/etc/ssl/certs/etcdca/server_crl.pem"
          }
        }
      },
    ...
    }
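
    Before starting MindIE, you can check that the edited file is still valid JSON:

    python -m json.tool user_config.json > /dev/null && echo "user_config.json: OK"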
  3. In the user_config.json file, enable the parameter that allows deploying two Coordinator nodes, as shown below.
    ...
            "coordinator_backup_cfg": {
              "function_enable": true
            },
    ...
    • false: disabled;
    • true: enabled.
  4. Start MindIE with the following command.
    python deploy_ac_job.py

    You can determine which node is active from the node logs: if a node's log contains "[TryRenewLease]Renewed lease", that node has acquired the ETCD distributed lock and is the active node.
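
    For example, the following command (assuming the Pod name) checks a Coordinator's log; the Pod whose log contains the message is the active node:

    kubectl -n mindie logs <coordinator-pod-name> | grep "TryRenewLease"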

  5. Send a request to verify that the service started successfully.

    There are two ways to send the request:

    • The Coordinator's Pod IP and port: https://PodIP:1025. (Only Coordinators whose READY column shows 1 can serve inference requests.)
    • Any physical machine IP in the K8s cluster with port 31015. (The port must match the Service "nodePort" value in the coordinator yaml file.)
    This example uses the physical machine IP and port:
    #!/bin/bash
    url="https://{物理机IP地址}:31015/infer"
    data='{
        "inputs": "My name is Olivier and I",
        "stream": true,
        "parameters": {
            "max_new_tokens": 10
        }
    }'
    curl -i -L -H "Content-Type: application/json" -X POST --data "$data"  -w '%{http_code}\n'  \
            --cert   /home/ras/public/clusterD-ca/client.pem   \
            --key /home/ras/public/clusterD-ca/client.key.pem \
            --cacert /home/ras/public/clusterD-ca/ca.pem \
            --pass 1234qwer \
            $url

    The following response indicates that the service started successfully:

    HTTP/1.1 200 OK
    Server: MindIE-MS
    Content-Type: text/event-stream; charset=utf-8
    Transfer-Encoding: chunked
    data: {"prefill_time":470,"decode_time":null,"token":{"id":4571,"text":"'m"}}
    data: {"prefill_time":null,"decode_time":102,"token":{"id":260,"text":" a"}}
    data: {"prefill_time":null,"decode_time":46,"token":{"id":223,"text":" "}}
    data: {"prefill_time":null,"decode_time":23,"token":{"id":1737,"text":"25 years"}}
    data: {"prefill_time":null,"decode_time":23,"token":{"id":9916,"text":" old boy"}}
    data: {"prefill_time":null,"decode_time":23,"token":{"id":30609,"text":" from Switzerland"}}
    data: {"prefill_time":null,"decode_time":23,"generated_text":"'m a 25 years old boy from Switzerland.","details":null,"token":{"id":16,"text":null}}
    200
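
    For the Pod IP variant, only the target URL changes; a sketch using the same certificate paths as above:

    url="https://{PodIP}:1025/infer"
    curl -i -L -H "Content-Type: application/json" -X POST --data "$data" -w '%{http_code}\n' \
            --cert /home/ras/public/clusterD-ca/client.pem \
            --key /home/ras/public/clusterD-ca/client.key.pem \
            --cacert /home/ras/public/clusterD-ca/ca.pem \
            --pass 1234qwer \
            $url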