
Usage Example

Restrictions and Constraints

  • The active Coordinator node, the standby Coordinator node, and the HAProxy image must be deployed on three different nodes.
  • Currently, only the scenario where the two Coordinators are deployed on two general-purpose compute servers is supported. In other scenarios, the Coordinator Pods cannot work properly when the Coordinator active/standby switchover feature is enabled.

Preparing Certificates

For communication between different Pods in the cluster, mutual authentication with CA certificates is recommended. Configure the certificates by following the steps below.

If CA certificates are not used for mutually authenticated, encrypted communication, traffic between services is transmitted in plaintext, which may pose a high network security risk.

  1. Prepare the prerequisite files for certificate generation. The directory for these files uses /home/{username}/auto_gen_ms_cert as an example.
    • server.cnf
      [req] # main request settings
      req_extensions = v3_req
      distinguished_name = req_distinguished_name
      
      [req_distinguished_name] # certificate subject information
      countryName = CN
      stateOrProvinceName = State
      localityName = City
      organizationName = Organization
      organizationalUnitName = Unit
      commonName = coordinator-server
      
      [v3_req] # core extensions
      basicConstraints = CA:FALSE
      keyUsage = digitalSignature, keyEncipherment
      extendedKeyUsage = serverAuth, clientAuth
      subjectAltName = @alt_names
      
      [alt_names] # service identifiers
      DNS.1 = {custom domain name}
      IP.2 = {node IP}
    • gen_coordinator_ca.sh
      #!/bin/bash
      
      # 1. Create the CA certificate
      openssl genrsa -out ca.key 2048
      openssl req -x509 -new -nodes -key ca.key \
      -subj "/CN=my-cluster-ca" \
      -days 3650 -out ca.pem
      
      # 2. Generate the server certificate
      openssl genrsa -out server.key 2048
      openssl req -new -key server.key -out server.csr \
      -subj "/CN=coordinator-server" -config server.cnf
      openssl x509 -req -in server.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
      -out server.pem -days 3650 -extensions v3_req -extfile server.cnf
      
      # 3. Set permissions
      chmod 0400 ./*.key
      chmod 0400 ./*.pem
  2. Run gen_coordinator_ca.sh with the following command to generate the CA certificate, server certificate, and related files.
    bash gen_coordinator_ca.sh

    Output similar to the following indicates that generation succeeded:

    Certificate request self-signature ok
    subject=CN = coordinator-server

    After the script completes, the following files are generated in the current directory:

    ca.key
    ca.pem
    ca.srl
    server.cnf
    server_crl.pem
    server.csr
    server.key
    server.pem
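
    (Optional) You can verify that the generated server certificate chains back to the CA and that the SAN entries were applied. This is a quick check with standard openssl commands:

    openssl verify -CAfile ca.pem server.pem
    openssl x509 -in server.pem -noout -text | grep -A 1 "Subject Alternative Name"

    The first command should print "server.pem: OK".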

Procedure

  1. Prepare the HAProxy image.
    • HAProxy: required; used for IP forwarding. Versions later than 1.8 are supported; a stable 2.x release is recommended.
    • jq: required; used to process JSON strings when updating the Coordinator configuration.
    • curl: required; used to transfer data on the command line.
    1. Query the OS and architecture information of the environment.
      cat /etc/os-release
      uname -m
    2. Pull the image with the docker command.
      docker pull --platform <OS architecture> haproxy:<tag>
    3. Create a Dockerfile that installs jq and curl. The file content is as follows (using Ubuntu as an example).
      FROM haproxy:<tag>
      USER root
      RUN apt-get update && apt-get install -y curl jq
      The install command differs by distribution:
      • Ubuntu:
        apt-get update && apt-get install -y curl jq
      • openEuler:
        dnf update -y && dnf install -y curl jq
    4. In the directory containing the Dockerfile, run the following command to build the HAProxy image with the jq and curl tools. Replace <tag> with your custom HAProxy image version.
      docker build -t haproxy:<tag> .
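      (Optional) To confirm that the tools are present in the built image, you can run a one-off container. A minimal check, assuming the tag you just built:
      docker run --rm --entrypoint /bin/sh haproxy:<tag> -c "haproxy -v && jq --version && curl --version"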
    5. Use the kubectl label node {node_name} {label_key}={label_value} command to label the node where the HAProxy image was uploaded. The following example adds the label proxyType=haproxy to a node named worker1.
      kubectl label node worker1 proxyType=haproxy
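      You can confirm that the label was applied with the following command:
      kubectl get node worker1 --show-labels | grep proxyType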
  2. Prepare the HAProxy YAML files.

    The active/standby switchover feature requires three YAML configuration files: haproxy_init.yaml, which configures forwarding; haproxy_monitor.yaml, which performs the health checks; and client-ssl-certs.yaml, which configures SSL. All three files are located in the output/deployment folder. Configuration examples are shown below.

    • haproxy_init.yaml:
      ---
      # 1. RBAC permissions: allow reading Endpoints resources
      apiVersion: v1
      kind: ServiceAccount
      metadata:
        name: haproxy-monitor-sa
        namespace: mindie
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        name: endpoint-reader
        namespace: mindie
      rules:
        - apiGroups: [""]
          resources: ["endpoints"]
          verbs: ["get", "list", "watch"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: haproxy-endpoint-reader
        namespace: mindie
      subjects:
        - kind: ServiceAccount
          name: haproxy-monitor-sa
      roleRef:
        kind: Role
        name: endpoint-reader
        apiGroup: rbac.authorization.k8s.io
      ---
      # role.yaml
      apiVersion: rbac.authorization.k8s.io/v1
      kind: Role
      metadata:
        namespace: mindie
        name: configmap-patch-role
      rules:
        - apiGroups: [""]
          resources: ["configmaps"]
          verbs: ["get", "list", "patch", "update"]
      ---
      apiVersion: rbac.authorization.k8s.io/v1
      kind: RoleBinding
      metadata:
        name: haproxy-configmap-patch-binding
        namespace: mindie
      subjects:
        - kind: ServiceAccount
          name: haproxy-monitor-sa
          namespace: mindie
      roleRef:
        kind: Role
        name: configmap-patch-role
        apiGroup: rbac.authorization.k8s.io
      ---
      # 2. ConfigMap: HAProxy base configuration template and control script
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: haproxy-config-template
        namespace: mindie
      data:
        haproxy.cfg: |
          global
            log 127.0.0.1 local2
            pidfile /var/run/haproxy.pid
            daemon
            maxconn 12000
            stats socket /var/run/haproxy.sock mode 660 level admin
          defaults
            mode    tcp
            log     global
            option tcplog
            option dontlognull
            timeout queue 3600s
            timeout connect 3s
            timeout client  3600s
            timeout server  3600s
          frontend main
            mode tcp
            bind *:443
            default_backend k8s-worker
          backend k8s-worker
            mode tcp
            server k8s_worker_1 server_target1:1025
      ---
      # 3. Deployment: HAProxy
      apiVersion: apps/v1
      kind: Deployment
      metadata:
        name: haproxy
        namespace: mindie
      spec:
        replicas: 1
        selector:
          matchLabels:
            app: haproxy
        template:
          metadata:
            labels:
              app: haproxy
          spec:
            serviceAccountName: haproxy-monitor-sa
            nodeSelector:
                proxyType: haproxy
            containers:
              - name: haproxy
                image: haproxy:<tag>
                imagePullPolicy: IfNotPresent
                ports:
                  - containerPort: 443
                env:
                  - name: POD_IP
                    valueFrom:
                      fieldRef:
                        fieldPath: status.podIP
                  - name: TARGET_SERVICE
                    value: mindie-ms-coordinator-infer
                command: ["/bin/bash", "-c"]
                args:
                  - "bash /scripts/haproxy_monitor.sh"
                volumeMounts:
                  - name: client-pem
                    mountPath: /etc/ssl/certs
                    readOnly: true
                  - name: host-scripts
                    mountPath: /scripts/
                    readOnly: true
                  - name: haproxy-config
                    mountPath: /usr/local/etc/haproxy/
                  - name: haproxy-sock
                    mountPath: /var/run/
            volumes:
              - name: client-pem
                configMap:
                  name: ssl-certs
                  items:
                    - key: client.key.pem
                      path: client.key.pem
                    - key: client.pem
                      path: client.pem
                    - key: ca.pem
                      path: ca.pem
              - name: host-scripts
                configMap:
                  name: haproxy-monitor
                  items:
                    - key: haproxy_monitor.sh
                      path: haproxy_monitor.sh
              - name: haproxy-config
                configMap:
                  name: haproxy-config-template
                  items:
                    - key: haproxy.cfg
                      path: haproxy.cfg
              - name: haproxy-sock
                emptyDir: {}
      ---
      apiVersion: v1
      kind: Service
      metadata:
        name: haproxy-service
        namespace: mindie
      spec:
        type: NodePort
        selector:
          app: haproxy
        ports:
          - name: main
            protocol: TCP
            port: 443
            targetPort: 443
            nodePort: 31443

      The key parameters in the file are described below:

      • "namespace": keep consistent with the value of the "namespace" parameter in the coordinator_init.yaml file of MindIE MS Coordinator.
      • "bind *", "containerPort", "port", "targetPort": encrypted-communication parameters; set them to 443.
      • "nodePort": encrypted-communication parameter; set it to 31443.
      • "image": replace <tag> with the custom version built in step 1.4.
      • "nodeSelector": set it to the label applied in step 1.5.
    • haproxy_monitor.yaml:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        name: haproxy-monitor
        namespace: mindie
      data:
        haproxy_monitor.sh: |
          #!/bin/bash
          APISERVER="https://kubernetes.default.svc"
          CA_CERT="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
          TOKEN_PATH="/var/run/secrets/kubernetes.io/serviceaccount/token"
          NAMESPACE="mindie"
          SERVICE_NAME="mindie-ms-coordinator-infer"
          CONFIGMAP_NAME="haproxy-config-template"
          LAST_0_IP="server_target1"
          current_ip="server_target1"
          mkdir -p /etc/haproxy
          ENDPOINT_0_IP="127.0.0.1"
          ENDPOINT_1_IP="127.0.0.1"
          check_url() {
            local url=$1
            local fail_count=0
            for ((i=1; i<=3; i++)); do
              http_code=$(
                curl -s -o /dev/null \
                -w "%{http_code}" \
                --max-time 3 \
                --cert /etc/ssl/certs/client.pem \
                --key /etc/ssl/certs/client.key.pem \
                --cacert /etc/ssl/certs/ca.pem \
                --pass 1234qwer \
                "$url"
              )
              if [[ $http_code != 200 ]]; then
                ((fail_count++))
              else
                return 0
              fi
            done
            return 1
          }
          has_active_bak() {
            local LOCAL_ENDPOINT_0_IP=$ENDPOINT_0_IP
            local LOCAL_ENDPOINT_1_IP=$ENDPOINT_1_IP
            if check_url "https://${LOCAL_ENDPOINT_0_IP}:1026/v1/health"; then
              echo "replace server.${LOCAL_ENDPOINT_0_IP}"
              current_ip="${LOCAL_ENDPOINT_0_IP}"
              return 0
            else
              if check_url "https://${LOCAL_ENDPOINT_1_IP}:1026/v1/health"; then
                echo "replace server.${LOCAL_ENDPOINT_1_IP}"
                current_ip="${LOCAL_ENDPOINT_1_IP}"
                return 0
              else
                echo "no server active.wating..."
                return 1
              fi
            fi
          }
          get_coordinator_ip() {
            # 1. Fetch the Endpoints data
            TOKEN=$(cat ${TOKEN_PATH})
            ENDPOINTS_JSON=$(curl -sS --cacert $CA_CERT -H "Authorization: Bearer $TOKEN" \
              ${APISERVER}/api/v1/namespaces/${NAMESPACE}/endpoints/${SERVICE_NAME})
            echo $ENDPOINTS_JSON
            # 2. Parse the Endpoints data to get the IPs
            IP_APPLY=$(echo $ENDPOINTS_JSON | jq -r '.subsets[].addresses[].ip')
            ENDPOINT_0_IP=$(echo $ENDPOINTS_JSON | jq -r '.subsets[].addresses[0].ip // "127.0.0.1"')
            echo $ENDPOINT_0_IP
            ENDPOINT_1_IP=$(echo $ENDPOINTS_JSON | jq -r '.subsets[].addresses[1].ip // "127.0.0.1"')
            echo $ENDPOINT_1_IP
          }
          while true; do
            if [[ $current_ip == "server_target1" ]]; then
              echo "init..."
              get_coordinator_ip
              if ! has_active_bak; then
                sleep 2
                continue
              fi
            else
              if check_url "https://${current_ip}:1026/v1/health"; then
                echo "current server.${current_ip} is active"
                sleep 2
                continue
              else
                echo "current server.${current_ip} is not active"
                get_coordinator_ip
                if ! has_active_bak; then
                  sleep 2
                  continue
                fi
              fi
            fi
            cp /usr/local/etc/haproxy/haproxy.cfg /etc/haproxy/temp
            if [[ ! -f /etc/haproxy/haproxy.cfg ]]; then
              cp /etc/haproxy/temp /etc/haproxy/haproxy.cfg
            fi
            sed -i "s/${LAST_0_IP}/${current_ip}/g" /etc/haproxy/temp
            if ! diff -q /etc/haproxy/temp /etc/haproxy/haproxy.cfg &>/dev/null; then
              echo "temp:"
              cat /etc/haproxy/temp
              echo "-----------"
              echo "haproxy.cfg:"
              cat /etc/haproxy/haproxy.cfg
              echo "-----------"
              echo "config has changed, updating haproxy..."
              cp /etc/haproxy/temp /etc/haproxy/haproxy.cfg
              haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
              echo "haproxy updated success"
            else
              echo "config is not changed,skipping update."
            fi
            rm -f /etc/haproxy/temp
            sleep 2
          done

      The key parameters in the file are described below:

      • "namespace", "NAMESPACE": keep consistent with the value of the "namespace" parameter in the coordinator_init.yaml file of MindIE MS Coordinator.
      • "SERVICE_NAME": keep consistent with the value of the "name" parameter in the coordinator_init.yaml file of MindIE MS Coordinator.
    • client-ssl-certs.yaml
      apiVersion: v1
      kind: ConfigMap
      metadata:
        namespace: mindie
        name: ssl-certs
      data:
        client.key.pem: |
          -----BEGIN PRIVATE KEY-----
          <server.key>
          -----END PRIVATE KEY-----
        client.pem: |
          -----BEGIN CERTIFICATE-----
          <server.pem>
          -----END CERTIFICATE-----
        ca.pem: |
          -----BEGIN CERTIFICATE-----
          <ca.pem>
          -----END CERTIFICATE-----
  3. Configure the certificates.
    1. Copy the server certificate content generated in Preparing Certificates into the deployment/client-ssl-certs.yaml file.
      The following uses DNS.1 = haproxy-service.ascend.com as an example:
      apiVersion: v1
      kind: ConfigMap
      metadata:
        namespace: mindie
        name: ssl-certs
      data:
        client.key.pem: |
          -----BEGIN PRIVATE KEY-----
          <server.key>
          -----END PRIVATE KEY-----
        client.pem: |
          -----BEGIN CERTIFICATE-----
          <server.pem>
          -----END CERTIFICATE-----
        ca.pem: |
          -----BEGIN CERTIFICATE-----
          <ca.pem>
          -----END CERTIFICATE-----
      • client.key.pem: replace <server.key> with the content of the server.key file.
      • client.pem: replace <server.pem> with the content of the server.pem file.
      • ca.pem: replace <ca.pem> with the content of the ca.pem file.
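      Instead of pasting the PEM contents by hand (where indentation mistakes can break the YAML), you can generate an equivalent manifest directly from the certificate files. A sketch, assuming the files from Preparing Certificates are in the current directory:
      kubectl create configmap ssl-certs -n mindie \
        --from-file=client.key.pem=server.key \
        --from-file=client.pem=server.pem \
        --from-file=ca.pem=ca.pem \
        --dry-run=client -o yaml > client-ssl-certs.yaml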
    2. Modify the /etc/hosts configuration file.
      Map the access domain name to the server node. The SAN must contain the domain name or node IP used to access HAProxy. Using DNS.1 = haproxy-service.ascend.com as an example, append the following to the hosts file.
      x.x.x.x haproxy-service.ascend.com

      x.x.x.x is the node IP configured in the "IP.2" parameter of the server.cnf file in step 1 of Preparing Certificates.
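      You can verify that the domain name now resolves to the node IP:
      getent hosts haproxy-service.ascend.com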

  4. In the user_config.json configuration file, enable the parameter that allows deploying two Coordinator nodes, as shown below.
    ...
            "coordinator_backup_cfg": {
              "function_enable": true
            },
    ...
    • false: disabled.
    • true: enabled.
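    A quick way to confirm the switch is set (assuming user_config.json is in the current directory):
    grep -A 2 '"coordinator_backup_cfg"' user_config.json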
  5. Apply the HAProxy YAML files in the following order.
    1. Apply client-ssl-certs.yaml.
      kubectl apply -f client-ssl-certs.yaml
    2. Apply haproxy_monitor.yaml.
      kubectl apply -f haproxy_monitor.yaml
    3. Apply haproxy_init.yaml.
      kubectl apply -f haproxy_init.yaml
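    After applying the three files, you can check that the resources were created and that the monitor script is running:
    kubectl -n mindie get configmaps ssl-certs haproxy-monitor haproxy-config-template
    kubectl -n mindie get pods -l app=haproxy
    kubectl -n mindie logs -f deploy/haproxy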
  6. Run the following command to start MindIE.
    python deploy_ac_job.py
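    Once MindIE is up, the Endpoints object watched by the monitor script should list the two Coordinator Pod IPs (the names below assume the defaults used in this sample):
    kubectl -n mindie get endpoints mindie-ms-coordinator-infer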
  7. Send a request to verify that the service started successfully.

    The request can be sent in three ways:

    • HAProxy virtual IP and port: https://PodIP:Port.
    • Physical machine IP: {node IP}:31443 (the physical machine IP is the node IP configured in the "IP.2" parameter of the server.cnf file in step 1 of Preparing Certificates).
    • Domain name: https://haproxy-service.ascend.com:31443 (haproxy-service.ascend.com is the custom name defined in DNS.1).
    This sample uses the HAProxy virtual IP and port:
    #!/bin/bash
    url="https://{服务IP地址}:{端口号}/infer"
    data='{
        "inputs": "My name is Olivier and I",
        "stream": true,
        "parameters": {
            "max_new_tokens": 10
        }
    }'
    curl -i -L -H "Content-Type: application/json" -X POST --data "$data"  -w '%{http_code}\n'  \
            --cert   /home/ras/public/clusterD-ca/client.pem   \
            --key /home/ras/public/clusterD-ca/client.key.pem \
            --cacert /home/ras/public/clusterD-ca/ca.pem \
            --pass 1234qwer \
            $url

    The following response indicates that the service started successfully:

    HTTP/1.1 200 OK
    Server: MindIE-MS
    Content-Type: text/event-stream; charset=utf-8
    Transfer-Encoding: chunked
    data: {"prefill_time":470,"decode_time":null,"token":{"id":4571,"text":"'m"}}
    data: {"prefill_time":null,"decode_time":102,"token":{"id":260,"text":" a"}}
    data: {"prefill_time":null,"decode_time":46,"token":{"id":223,"text":" "}}
    data: {"prefill_time":null,"decode_time":23,"token":{"id":1737,"text":"25 years"}}
    data: {"prefill_time":null,"decode_time":23,"token":{"id":9916,"text":" old boy"}}
    data: {"prefill_time":null,"decode_time":23,"token":{"id":30609,"text":" from Switzerland"}}
    data: {"prefill_time":null,"decode_time":23,"generated_text":"'m a 25 years old boy from Switzerland.","details":null,"token":{"id":16,"text":null}}
    200
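
    To exercise the switchover itself, you can delete the currently active Coordinator Pod and re-send the request. The monitor script detects the failed health check and rewrites the HAProxy backend within a few seconds; the log line "config has changed, updating haproxy..." confirms the switch. A sketch (the Pod name is hypothetical; list the Pods first to find the real one):
    kubectl -n mindie get pods -o wide
    kubectl -n mindie delete pod <active-coordinator-pod>
    kubectl -n mindie logs deploy/haproxy --tail=20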