Usage Example
Restrictions and Constraints
- The active Coordinator node, the standby Coordinator node, and the HAProxy image must be deployed on three different nodes.
- Currently, only the scenario where two Coordinators are deployed on two general-purpose compute servers is supported. In other scenarios, enabling the Coordinator active/standby switchover function leaves the Coordinator Pods unusable.
Preparing Certificates
For communication between different Pods in the cluster, mutual authentication with CA certificates is recommended. Configure the certificates by following the steps below.

If CA certificates are not used for mutual authentication and encrypted communication, traffic between services is transmitted in plaintext, which poses a high network security risk.
- Prepare the prerequisite files for certificate generation. The directory /home/{username}/auto_gen_ms_cert is used as an example for where to place them.
- server.cnf
```ini
# Main request settings
[req]
req_extensions = v3_req
distinguished_name = req_distinguished_name

# Certificate subject information
[req_distinguished_name]
countryName = CN
stateOrProvinceName = State
localityName = City
organizationName = Organization
organizationalUnitName = Unit
commonName = coordinator-server

# Core attributes
[v3_req]
basicConstraints = CA:FALSE
keyUsage = digitalSignature, keyEncipherment
extendedKeyUsage = serverAuth, clientAuth
subjectAltName = @alt_names

# Service identifiers
[alt_names]
DNS.1 = {custom domain name}
IP.2 = {node IP}
```
- gen_coordinator_ca.sh
```bash
#!/bin/bash
# 1. Create the CA certificate
openssl genrsa -out ca.key 2048
openssl req -x509 -new -nodes -key ca.key \
    -subj "/CN=my-cluster-ca" \
    -days 3650 -out ca.pem

# 2. Generate the server certificate
openssl genrsa -out server.key 2048
openssl req -new -key server.key -out server.csr \
    -subj "/CN=coordinator-server" -config server.cnf
openssl x509 -req -in server.csr -CA ca.pem -CAkey ca.key -CAcreateserial \
    -out server.pem -days 3650 -extensions v3_req -extfile server.cnf

# 3. Set permissions
chmod 0400 ./*.key
chmod 0400 ./*.pem
```
- Run the following command to execute gen_coordinator_ca.sh and generate the CA certificate, server certificate, and related files.
bash gen_coordinator_ca.sh
Output similar to the following indicates that generation succeeded:
Certificate request self-signature ok
subject=CN = coordinator-server
After the script completes, the following files exist in the current directory:
ca.key ca.pem ca.srl server.cnf server_crl.pem server.csr server.key server.pem
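Before distributing the files, you can optionally sanity-check what was generated. A minimal check, assuming the files above are in the current directory:

```bash
# Verify that the server certificate chains to the generated CA
openssl verify -CAfile ca.pem server.pem

# Confirm the SAN contains the domain name and node IP you will use to reach HAProxy
openssl x509 -in server.pem -noout -text | grep -A1 "Subject Alternative Name"
```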
Procedure
- Prepare the HAProxy image.
- HAProxy: required. Used for IP forwarding. Versions later than 1.8 are supported; a stable 2.x release is recommended.
- jq: required. Used to process JSON strings when updating the Coordinator configuration.
- curl: required. Used to transfer data on the command line.
- Query the OS and architecture information of the environment.
cat /etc/os-release
uname -m
- Pull the image with the docker command.
docker pull --platform <OS architecture> haproxy:<tag>
- Create a Dockerfile that installs jq and curl, with content as shown below (using an Ubuntu-based image as an example).
```dockerfile
FROM haproxy:<tag>
USER root
RUN apt-get update && apt-get install -y curl jq
```
The package installation command in the RUN line depends on the distribution of the base image:
- Ubuntu:
apt-get update && apt-get install -y curl jq
- openEuler:
dnf update -y && dnf install -y curl jq
- In the directory containing the Dockerfile, run the following command to build an HAProxy image that includes the jq and curl tools.
docker build -t haproxy:<tag> .
<tag>: a custom HAProxy version tag.
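To confirm that the tools were baked into the image, a quick check (overriding the image entrypoint; the tag is the one used in the build above):

```bash
docker run --rm --entrypoint "" haproxy:<tag> sh -c "jq --version && curl --version | head -n 1"
```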
- Label the node where the HAProxy image was uploaded by running kubectl label node {node_name} {label_key}={label_value}. The following example applies the label proxyType=haproxy to the node named worker1.
kubectl label node worker1 proxyType=haproxy
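To confirm that the label took effect:

```bash
kubectl get node worker1 -L proxyType
```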
- Prepare the HAProxy YAML files.
The active/standby switchover function requires three YAML configuration files: haproxy_init.yaml, which configures forwarding; haproxy_monitor.yaml, which performs health checks; and client-ssl-certs.yaml, which configures SSL. All three files are located in the output/deployment directory. Configuration examples are shown below, followed by a client-side validation check.
- haproxy_init.yaml:
```yaml
---
# 1. RBAC permissions: allow reading Endpoints resources
apiVersion: v1
kind: ServiceAccount
metadata:
  name: haproxy-monitor-sa
  namespace: mindie
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: endpoint-reader
  namespace: mindie
rules:
- apiGroups: [""]
  resources: ["endpoints"]
  verbs: ["get", "list", "watch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: haproxy-endpoint-reader
  namespace: mindie
subjects:
- kind: ServiceAccount
  name: haproxy-monitor-sa
roleRef:
  kind: Role
  name: endpoint-reader
  apiGroup: rbac.authorization.k8s.io
---
# role.yaml
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: mindie
  name: configmap-patch-role
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: haproxy-configmap-patch-binding
  namespace: mindie
subjects:
- kind: ServiceAccount
  name: haproxy-monitor-sa
  namespace: mindie
roleRef:
  kind: Role
  name: configmap-patch-role
  apiGroup: rbac.authorization.k8s.io
---
# 2. ConfigMap: HAProxy base configuration template and control script
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-config-template
  namespace: mindie
data:
  haproxy.cfg: |
    global
        log 127.0.0.1 local2
        pidfile /var/run/haproxy.pid
        daemon
        maxconn 12000
        stats socket /var/run/haproxy.sock mode 660 level admin
    defaults
        mode tcp
        log global
        option tcplog
        option dontlognull
        timeout queue 3600s
        timeout connect 3s
        timeout client 3600s
        timeout server 3600s
    frontend main
        mode tcp
        bind *:443
        default_backend k8s-worker
    backend k8s-worker
        mode tcp
        server k8s_worker_1 server_target1:1025
---
# 3. Deployment: HAProxy
apiVersion: apps/v1
kind: Deployment
metadata:
  name: haproxy
  namespace: mindie
spec:
  replicas: 1
  selector:
    matchLabels:
      app: haproxy
  template:
    metadata:
      labels:
        app: haproxy
    spec:
      serviceAccountName: haproxy-monitor-sa
      nodeSelector:
        proxyType: haproxy
      containers:
      - name: haproxy
        image: haproxy:<tag>
        imagePullPolicy: IfNotPresent
        ports:
        - containerPort: 443
        env:
        - name: POD_IP
          valueFrom:
            fieldRef:
              fieldPath: status.podIP
        - name: TARGET_SERVICE
          value: mindie-ms-coordinator-infer
        command: ["/bin/bash", "-c"]
        args:
        - "bash /scripts/haproxy_monitor.sh"
        volumeMounts:
        - name: client-pem
          mountPath: /etc/ssl/certs
          readOnly: true
        - name: host-scripts
          mountPath: /scripts/
          readOnly: true
        - name: haproxy-config
          mountPath: /usr/local/etc/haproxy/
        - name: haproxy-sock
          mountPath: /var/run/
      volumes:
      - name: client-pem
        configMap:
          name: ssl-certs
          items:
          - key: client.key.pem
            path: client.key.pem
          - key: client.pem
            path: client.pem
          - key: ca.pem
            path: ca.pem
      - name: host-scripts
        configMap:
          name: haproxy-monitor
          items:
          - key: haproxy_monitor.sh
            path: haproxy_monitor.sh
      - name: haproxy-config
        configMap:
          name: haproxy-config-template
          items:
          - key: haproxy.cfg
            path: haproxy.cfg
      - name: haproxy-sock
        emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: haproxy-service
  namespace: mindie
spec:
  type: NodePort
  selector:
    app: haproxy
  ports:
  - name: main
    protocol: TCP
    port: 443
    targetPort: 443
    nodePort: 31443
```
Key parameters in the file include "image" (the haproxy:<tag> image built above), the "proxyType: haproxy" nodeSelector (which must match the label applied to the HAProxy node), "TARGET_SERVICE" (the Coordinator service name; see haproxy_monitor.yaml below), and "nodePort: 31443" (the external port used later to access the service).
- haproxy_monitor.yaml:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: haproxy-monitor
  namespace: mindie
data:
  haproxy_monitor.sh: |
    #!/bin/bash
    APISERVER="https://kubernetes.default.svc"
    CA_CERT="/var/run/secrets/kubernetes.io/serviceaccount/ca.crt"
    TOKEN_PATH="/var/run/secrets/kubernetes.io/serviceaccount/token"
    NAMESPACE="mindie"
    SERVICE_NAME="mindie-ms-coordinator-infer"
    CONFIGMAP_NAME="haproxy-config-template"
    LAST_0_IP="server_target1"
    current_ip="server_target1"
    mkdir -p /etc/haproxy
    ENDPOINT_0_IP="127.0.0.1"
    ENDPOINT_1_IP="127.0.0.1"

    # Probe a URL up to 3 times with mutual TLS; succeed on the first HTTP 200.
    check_url() {
        local url=$1
        local fail_count=0
        for ((i=1; i<=3; i++)); do
            http_code=$(curl -s -o /dev/null \
                -w "%{http_code}" \
                --max-time 3 \
                --cert /etc/ssl/certs/client.pem \
                --key /etc/ssl/certs/client.key.pem \
                --cacert /etc/ssl/certs/ca.pem \
                --pass 1234qwer \
                "$url")
            if [[ $http_code != 200 ]]; then
                ((fail_count++))
            else
                return 0
            fi
        done
        return 1
    }

    # Select the first healthy Coordinator endpoint and record it in current_ip.
    has_active_bak() {
        local LOCAL_ENDPOINT_0_IP=$ENDPOINT_0_IP
        local LOCAL_ENDPOINT_1_IP=$ENDPOINT_1_IP
        if check_url "https://${LOCAL_ENDPOINT_0_IP}:1026/v1/health"; then
            echo "replace server.${LOCAL_ENDPOINT_0_IP}"
            current_ip="${LOCAL_ENDPOINT_0_IP}"
            return 0
        else
            if check_url "https://${LOCAL_ENDPOINT_1_IP}:1026/v1/health"; then
                echo "replace server.${LOCAL_ENDPOINT_1_IP}"
                current_ip="${LOCAL_ENDPOINT_1_IP}"
                return 0
            else
                echo "no server active. waiting..."
                return 1
            fi
        fi
    }

    get_coordinator_ip() {
        # 1. Fetch the Endpoints data from the API server
        TOKEN=$(cat ${TOKEN_PATH})
        ENDPOINTS_JSON=$(curl -sS --cacert $CA_CERT -H "Authorization: Bearer $TOKEN" \
            ${APISERVER}/api/v1/namespaces/${NAMESPACE}/endpoints/${SERVICE_NAME})
        echo $ENDPOINTS_JSON
        # 2. Parse the Endpoints data to obtain the Pod IPs
        IP_APPLY=$(echo $ENDPOINTS_JSON | jq -r '.subsets[].addresses[].ip')
        ENDPOINT_0_IP=$(echo $ENDPOINTS_JSON | jq -r '.subsets[].addresses[0].ip // "127.0.0.1"')
        echo $ENDPOINT_0_IP
        ENDPOINT_1_IP=$(echo $ENDPOINTS_JSON | jq -r '.subsets[].addresses[1].ip // "127.0.0.1"')
        echo $ENDPOINT_1_IP
    }

    # Main loop: poll the active Coordinator every 2s; on failure, re-resolve the
    # endpoints, rewrite the backend IP in the HAProxy config, and hot-reload
    # HAProxy (-sf lets the old process finish existing connections).
    while true; do
        if [[ $current_ip == "server_target1" ]]; then
            echo "init..."
            get_coordinator_ip
            if ! has_active_bak; then
                sleep 2
                continue
            fi
        else
            if check_url "https://${current_ip}:1026/v1/health"; then
                echo "current server.${current_ip} is active"
                sleep 2
                continue
            else
                echo "current server.${current_ip} is not active"
                get_coordinator_ip
                if ! has_active_bak; then
                    sleep 2
                    continue
                fi
            fi
        fi
        cp /usr/local/etc/haproxy/haproxy.cfg /etc/haproxy/temp
        if [[ ! -f /etc/haproxy/haproxy.cfg ]]; then
            cp /etc/haproxy/temp /etc/haproxy/haproxy.cfg
        fi
        sed -i "s/${LAST_0_IP}/${current_ip}/g" /etc/haproxy/temp
        if ! diff -q /etc/haproxy/temp /etc/haproxy/haproxy.cfg &>/dev/null; then
            echo "temp:"
            cat /etc/haproxy/temp
            echo "-----------"
            echo "haproxy.cfg:"
            cat /etc/haproxy/haproxy.cfg
            echo "-----------"
            echo "config has changed, updating haproxy..."
            cp /etc/haproxy/temp /etc/haproxy/haproxy.cfg
            haproxy -D -f /etc/haproxy/haproxy.cfg -p /var/run/haproxy.pid -sf $(cat /var/run/haproxy.pid)
            echo "haproxy updated success"
        else
            echo "config is not changed, skipping update."
        fi
        rm -f /etc/haproxy/temp
        sleep 2
    done
```
Key parameters in the file are as follows:
- "namespace" and "NAMESPACE": must match the value of the "namespace" parameter in the coordinator_init.yaml file of MindIE MS Coordinator.
- "SERVICE_NAME": must match the value of the "name" parameter in the coordinator_init.yaml file of MindIE MS Coordinator.
- client-ssl-certs.yaml
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: mindie
  name: ssl-certs
data:
  client.key.pem: |
    -----BEGIN PRIVATE KEY-----
    <server.key>
    -----END PRIVATE KEY-----
  client.pem: |
    -----BEGIN CERTIFICATE-----
    <server.pem>
    -----END CERTIFICATE-----
  ca.pem: |
    -----BEGIN CERTIFICATE-----
    <ca.pem>
    -----END CERTIFICATE-----
```
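Before moving on, the three manifests can be validated without touching the cluster. A minimal client-side check, assuming the files are in output/deployment:

```bash
cd output/deployment
kubectl apply --dry-run=client -f client-ssl-certs.yaml -f haproxy_monitor.yaml -f haproxy_init.yaml
```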
- Configure the certificates.
- Copy the server certificate content generated in Preparing Certificates into the deployment/client-ssl-certs.yaml file (alternatively, see the kubectl sketch after the replacement notes below). The following uses DNS.1 = haproxy-service.ascend.com as an example:
```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: mindie
  name: ssl-certs
data:
  client.key.pem: |
    -----BEGIN PRIVATE KEY-----
    <server.key>
    -----END PRIVATE KEY-----
  client.pem: |
    -----BEGIN CERTIFICATE-----
    <server.pem>
    -----END CERTIFICATE-----
  ca.pem: |
    -----BEGIN CERTIFICATE-----
    <ca.pem>
    -----END CERTIFICATE-----
```
- client.key.pem: replace <server.key> with the content of the server.key file.
- client.pem: replace <server.pem> with the content of the server.pem file.
- ca.pem: replace <ca.pem> with the content of the ca.pem file.
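Instead of pasting the PEM blocks by hand, an equivalent ConfigMap manifest can be generated with kubectl. A sketch, assuming the files from Preparing Certificates are in the current directory:

```bash
kubectl create configmap ssl-certs -n mindie \
    --from-file=client.key.pem=server.key \
    --from-file=client.pem=server.pem \
    --from-file=ca.pem=ca.pem \
    --dry-run=client -o yaml > client-ssl-certs.yaml
```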
- Modify the /etc/hosts configuration file to map the access domain name to the server node. Ensure that the SAN includes the domain name or node IP used to access HAProxy. Using DNS.1 = haproxy-service.ascend.com as an example, append the following to the hosts file:
x.x.x.x haproxy-service.ascend.com
x.x.x.x: the node IP configured for the "IP.2" parameter in the server.cnf file in Preparing Certificates.
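To confirm that the mapping resolves on the client host:

```bash
getent hosts haproxy-service.ascend.com
```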
- In the user_config.json configuration file, enable the parameter that allows two Coordinator nodes to be deployed, as shown below.
... "coordinator_backup_cfg": { "function_enable": true }, ...
- false: the function is disabled.
- true: the function is enabled.
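Since jq is already a dependency of this setup, the flag can also be flipped from the command line. A sketch, assuming user_config.json is in the current directory:

```bash
jq '.coordinator_backup_cfg.function_enable = true' user_config.json > user_config.json.tmp \
    && mv user_config.json.tmp user_config.json
```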
- Start the HAProxy image by applying the YAML files in the following order.
- Apply client-ssl-certs.yaml:
kubectl apply -f client-ssl-certs.yaml
- Apply haproxy_monitor.yaml:
kubectl apply -f haproxy_monitor.yaml
- Apply haproxy_init.yaml:
kubectl apply -f haproxy_init.yaml
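After the three files are applied, you can check that the HAProxy Pod is running and watch the monitor script's output:

```bash
kubectl get pods -n mindie -l app=haproxy
kubectl logs -n mindie -l app=haproxy --tail=20
```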
- Run the following command to start MindIE.
python deploy_ac_job.py
- Send a request to verify that the service started successfully.
Requests can be sent in the following three ways:
- HAProxy virtual IP and port: https://PodIP:Port.
- Physical host IP: https://x.x.x.x:31443. (The host IP is the node IP configured for the "IP.2" parameter in the server.cnf file in Preparing Certificates.)
- Domain name: https://haproxy-service.ascend.com:31443. (haproxy-service.ascend.com is the custom name configured for DNS.1.)
This example uses the HAProxy virtual IP and port:

```bash
#!/bin/bash
url="https://{service IP address}:{port}/infer"
data='{
    "inputs": "My name is Olivier and I",
    "stream": true,
    "parameters": {
        "max_new_tokens": 10
    }
}'
curl -i -L -H "Content-Type: application/json" -X POST --data "$data" -w '%{http_code}\n' \
    --cert /home/ras/public/clusterD-ca/client.pem \
    --key /home/ras/public/clusterD-ca/client.key.pem \
    --cacert /home/ras/public/clusterD-ca/ca.pem \
    --pass 1234qwer \
    $url
```
Output similar to the following indicates that the service started successfully:
```
HTTP/1.1 200 OK
Server: MindIE-MS
Content-Type: text/event-stream; charset=utf-8
Transfer-Encoding: chunked

data: {"prefill_time":470,"decode_time":null,"token":{"id":4571,"text":"'m"}}
data: {"prefill_time":null,"decode_time":102,"token":{"id":260,"text":" a"}}
data: {"prefill_time":null,"decode_time":46,"token":{"id":223,"text":" "}}
data: {"prefill_time":null,"decode_time":23,"token":{"id":1737,"text":"25 years"}}
data: {"prefill_time":null,"decode_time":23,"token":{"id":9916,"text":" old boy"}}
data: {"prefill_time":null,"decode_time":23,"token":{"id":30609,"text":" from Switzerland"}}
data: {"prefill_time":null,"decode_time":23,"generated_text":"'m a 25 years old boy from Switzerland.","details":null,"token":{"id":16,"text":null}}
200
```
Parent topic: Coordinator Active/Standby Switchover