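Usage Example
Limitations and Constraints
Before you begin, configure the IP address and port of the CCAE management platform in the MindIE MS Controller configuration file ms_controller.json. To use the log reporting feature, you must also configure certificate information.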
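Generating the Certificate for OM Adapter to Connect to CCAE
Configuring a certificate for the OM Adapter connection to CCAE is optional. To configure one, generate it with the following steps. To skip it, log in to the CCAE management platform as described in step 1, disable the three configuration items whose names start with MindIE on the certificate authentication configuration page, and then go straight to "Configuring OM Adapter to Connect to the CCAE Management Platform".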
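1. Turn on the CCAE certificate configuration switch.
2. Obtain the CA certificate file. Its path is the value required by the "ca_cert" parameter under the "ccae_tls_items" field in the MindIE MS Controller configuration file ms_controller.json.
   a. From the main menu of the CCAE service plane, go to the CA certificate management page, as shown in Figure 3.
   b. Click "Download CA Certificate" to download the CA certificate. The resulting .pem file is the CA certificate without the trust chain.
      Figure 4 Downloading the CA certificate
   c. From the main menu of the CCAE service plane, go to the "Service List" page to obtain the CA certificate trust chain, as shown in Figure 5.
   d. Select "CCAESouthBoundService" to open the "CCAESouthBoundService Certificate List" page.
      Figure 6 CCAESouthBoundService
   e. Click "Export Trust Chain" in the upper right corner. The resulting .pem file is the CA certificate trust chain file.
      Figure 7 Exporting the trust chain
   f. Copy the content of the CA certificate trust chain file and paste it at the end of the CA certificate file. The merged .pem file is the complete CA certificate file.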
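3. Obtain the TLS certificate file and the TLS private key. Their paths are the values required by the "tls_cert" and "tls_key" parameters under the "ccae_tls_items" field in the MindIE MS Controller configuration file ms_controller.json.
   a. From the main menu of the CCAE service plane, go to the "CA Service" page, as shown in Figure 8.
   b. In the left navigation pane, open the "Certificate Application" page. On the "Apply for Certificate with Basic Information" tab, enter the following information and submit the application, as shown in Figure 9.
      - Associated CA: the name of the CA downloaded in step 2.b.
      - Certificate template: select "CCAGENT_ENTITY_60YEARS".
      - Common name (CN): a user-defined name.
   c. In the left navigation pane, open the "Certificate Management" page, as shown in Figure 10. Select the certificate to download, click "Download" next to it, fill in the following information in the "Download Certificate" dialog box, and click "Submit" to download the TLS certificate.
      - File type: select "Certificate".
      - File name: user-defined, for example "tls_cert".
      - File format: select "PEM(.pem)".
   d. In the "Download Certificate" dialog box from step 3.c, fill in the following information and click "Submit" to download the TLS private key, as shown in Figure 11.
      - File type: select "Private Key".
      - File name: user-defined, for example "tls_key".
      - File format: select "PKCS#8(.pem)".
      - File password: user-defined.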
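4. Obtain the ciphertext. The path of the file that stores it is the value required by the "tls_passwd" parameter under the "ccae_tls_items" field in the MindIE MS Controller configuration file ms_controller.json. Perform this operation inside the MindIE container; the MindIE image includes the tool that generates the ciphertext. For details, see encrypt.
   a. Go to the MindIE Service installation directory.
cd /usr/local/Ascend/mindie/latest/mindie-service
   b. Generate the ciphertext.
./bin/seceasy_encrypt --encrypt 1 2
   c. When prompted, enter the password that you set when downloading the TLS private key file in step 3.d. The ciphertext is printed after "encrypted:", as shown below.
root@mindie-controller-master-0:/usr/local/Ascend/mindie/latest/mindie-service# ./bin/seceasy_encrypt --encrypt 1 2
please input the password to encrypt
please input the password to encrypt again
encrypted: AAAAAgAAAAAAAAABAAAQAAAl+zJLrsq6Bduk6QPIWUNIsc+DyyOnVmy2xtrh2AAAAAEAAAAEAAAAAAAAAGXqKpZ+ZKbuFdCHBZH9ZYCYOTBTbxyIlRQ=
   d. Create a .txt file and copy the ciphertext printed after "encrypted:" into it.
5. After step 4 completes, a /tools folder is generated in the current directory. The files /tools/pmt/master/ksfa and /tools/pmt/standby/ksfb in that folder are the paths required by the "kmc_ksf_master" and "kmc_ksf_standby" parameters under the "ccae_tls_items" field in the MindIE MS Controller configuration file ms_controller.json.
6. Place the files and directories obtained in steps 2 to 5 in the same directory on the physical machine. This document uses /mnt as an example. If the directory does not exist, create it with the following command.
mkdir /mnt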
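The two manual file-preparation actions in this procedure (appending the trust chain to the CA certificate, and saving the text printed after "encrypted:" to a .txt file) can also be scripted. The following is a minimal Python sketch, not part of MindIE or CCAE; the function names `merge_pem` and `extract_ciphertext` and all file names are hypothetical, and the sketch assumes the ciphertext contains no whitespace, as in the sample output shown earlier.

```python
from pathlib import Path

PEM_HEADER = "-----BEGIN CERTIFICATE-----"


def merge_pem(ca_cert: Path, trust_chain: Path, output: Path) -> None:
    """Write the CA certificate followed by its trust chain into one PEM file."""
    parts = []
    for src in (ca_cert, trust_chain):
        text = src.read_text(encoding="utf-8").strip()
        if PEM_HEADER not in text:
            raise ValueError(f"{src} does not look like a PEM certificate")
        parts.append(text)
    # Each PEM block must start on its own line, so join with a newline.
    output.write_text("\n".join(parts) + "\n", encoding="utf-8")


def extract_ciphertext(tool_output: str) -> str:
    """Return the first token that follows 'encrypted:' in the tool output."""
    marker = "encrypted:"
    idx = tool_output.find(marker)
    if idx < 0:
        raise ValueError("no 'encrypted:' marker found in the output")
    tokens = tool_output[idx + len(marker):].split()
    if not tokens:
        raise ValueError("nothing follows the 'encrypted:' marker")
    return tokens[0]
```

For example, `merge_pem(Path("ca.pem"), Path("chain.pem"), Path("/mnt/certs/ca_full.pem"))` would produce the complete CA certificate file, and the string returned by `extract_ciphertext` can be written to the .txt file referenced by "tls_passwd". The paths here are placeholders only.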
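Configuring OM Adapter to Connect to the CCAE Management Platform
1. In the MindIE MS Controller configuration file ms_controller.json, configure the IP address and port of the CCAE management platform. The CCAE-related parameters are the "ccae" object in the following example.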
{
  "allow_all_zero_ip_listening": false,
  "deploy_mode": "pd_separate",
  "initial_dist_server_port": 10000,
  "cluster_port": 8899,
  "process_manager": {
    "to_file": true,
    "file_path": "./logs/controller_process_status.json"
  },
  "ccae": {
    "ip": "xxx.xxx.xxx.xxx",
    "port": 31948
  },
  "cluster_status": {
    "to_file": true,
    "file_path": "./logs/cluster_status_output.json"
  },
  ...
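2. (Optional) Enable certificate verification. The certificate path must be mounted to a path that is visible inside the container. To generate the certificates, see "Generating the Certificate for OM Adapter to Connect to CCAE".
   a. Mount the certificate path to a container directory. Configure the mount in the deployment/controller_init.yaml file; the relevant entries are the "mnt" items under volumeMounts and volumes in the following configuration.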
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-ms-controller        # depends on the specific job; xxxx defaults to mindie-ms
  namespace: mindie                 # follows the MindIE default; user-modifiable
  labels:
    framework: pytorch
    app: mindie-ms-controller       # fixed
    jobID: xxxx                     # name of the inference job; must be configured by the user; appended; xxxx defaults to mindie-ms
    ring-controller.atlas: ascend-910b
spec:
  schedulerName: volcano            # work when enableGangScheduling is true
  runPolicy:
    schedulingPolicy:               # work when enableGangScheduling is true
      minAvailable: 1               # keep consistent with replicas
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:
      replicas: 1                   # number of controller replicas
      restartPolicy: Always
      template:
        metadata:
          labels:
            #ring-controller.atlas: ascend-910b
            app: mindie-ms-controller
            jobID: xxxx             # name of the inference job; must be configured by the user; appended; xxxx defaults to mindie
        spec:                       # keep the default values
          affinity:
            podAntiAffinity:
              requiredDuringSchedulingIgnoredDuringExecution:
                - labelSelector:
                    matchExpressions:
                      - key: app
                        operator: In
                        values:
                          - mindie-ms-controller
                  topologyKey: kubernetes.io/hostname
          nodeSelector:
            # masterselector: dls-master-node
            accelerator: huawei-Ascend310P
            # machine-id: "7"
          terminationGracePeriodSeconds: 0
          automountServiceAccountToken: false
          securityContext:
            fsGroup: 1001
          containers:
            - image: mindie:dev-2.0.RC1-B087-800I-A2-py311-ubuntu22.04-aarch64
              imagePullPolicy: IfNotPresent
              name: ascend
              securityContext:
                # allowPrivilegeEscalation: false
                privileged: true
                capabilities:
                  drop: [ "ALL" ]
                seccompProfile:
                  type: "RuntimeDefault"
              readinessProbe:
                exec:
                  command:
                    - bash
                    - -c
                    - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh readiness"
                periodSeconds: 5
              livenessProbe:
                exec:
                  command:
                    - bash
                    - -c
                    - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh liveness"
                periodSeconds: 5
              startupProbe:
                exec:
                  command:
                    - bash
                    - -c
                    - "$MIES_INSTALL_PATH/scripts/http_client_ctl/probe.sh startup"
                periodSeconds: 5
                failureThreshold: 100
              env:
                - name: POD_IP
                  valueFrom:
                    fieldRef:
                      fieldPath: status.podIP
                - name: GLOBAL_RANK_TABLE_FILE_PATH
                  value: "/user/serverid/devindex/config/..data/global_ranktable.json"
                - name: MIES_INSTALL_PATH
                  value: $(MINDIE_USER_HOME_PATH)/Ascend/mindie/latest/mindie-service
                - name: CONFIG_FROM_CONFIGMAP_PATH
                  value: /mnt/configmap
                - name: CONTROLLER_LOG_CONFIG_PATH
                  value: /root/mindie
              envFrom:
                - configMapRef:
                    name: common-env
              command: [ "/bin/bash", "-c", " /mnt/configmap/boot.sh; \n " ]
              resources:
                requests:
                  memory: "2Gi"
                  cpu: "4"
                limits:
                  memory: "4Gi"
                  cpu: "8"
              volumeMounts:
                - name: global-ranktable
                  mountPath: /user/serverid/devindex/config
                - name: mindie-http-client-ctl-config
                  mountPath: /mnt/configmap/http_client_ctl.json
                  subPath: http_client_ctl.json
                - name: python-script-get-group-id
                  mountPath: /mnt/configmap/get_group_id.py
                  subPath: get_group_id.py
                - name: boot-bash-script
                  mountPath: /mnt/configmap/boot.sh
                  subPath: boot.sh
                - name: mindie-ms-controller-config
                  mountPath: /mnt/configmap/ms_controller.json
                  subPath: ms_controller.json
                - name: status-data
                  mountPath: /usr/local/Ascend/mindie/latest/mindie-service/logs
                - name: localtime
                  mountPath: /etc/localtime
                - name: mnt
                  mountPath: /mnt
          volumes:
            - name: localtime
              hostPath:
                path: /etc/localtime
            - name: global-ranktable
              configMap:
                name: global-ranktable
                defaultMode: 0640
            - name: mindie-http-client-ctl-config
              configMap:
                name: mindie-http-client-ctl-config
                defaultMode: 0640
            - name: python-script-get-group-id
              configMap:
                name: python-script-get-group-id
                defaultMode: 0640
            - name: boot-bash-script
              configMap:
                name: boot-bash-script
                defaultMode: 0550
            - name: mindie-ms-controller-config
              configMap:
                name: mindie-ms-controller-config
                defaultMode: 0640
            - name: status-data
              hostPath:
                path: /data/mindie-ms/status
                type: Directory
            - name: mnt
              hostPath:
                path: /mnt
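   b. Configure certificate verification in the MindIE MS Controller configuration file ms_controller.json. The relevant parameters are "ccae_tls_enable" and "ccae_tls_items" in the following example.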
...
"tls_config": {
  "request_coordinator_tls_enable": false,
  "request_coordinator_tls_items": {
    "ca_cert": "./security/request_coordinator/security/certs/ca.pem",
    "tls_cert": "./security/request_coordinator/security/certs/cert.pem",
    "tls_key": "./security/request_coordinator/security/keys/cert.key.pem",
    "tls_passwd": "./security/request_coordinator/security/pass/key_pwd.txt",
    "kmc_ksf_master": "./security/request_coordinator/tools/pmt/master/ksfa",
    "kmc_ksf_standby": "./security/request_coordinator/tools/pmt/standby/ksfb",
    "tls_crl": ""
  },
  "request_server_tls_enable": false,
  "request_server_tls_items": {
    "ca_cert": "./security/request_server/security/certs/ca.pem",
    "tls_cert": "./security/request_server/security/certs/cert.pem",
    "tls_key": "./security/request_server/security/keys/cert.key.pem",
    "tls_passwd": "./security/request_server/security/pass/key_pwd.txt",
    "kmc_ksf_master": "./security/request_server/tools/pmt/master/ksfa",
    "kmc_ksf_standby": "./security/request_server/tools/pmt/standby/ksfb",
    "tls_crl": ""
  },
  "http_server_tls_enable": false,
  "http_server_tls_items": {
    "ca_cert": "./security/http_server/security/certs/ca.pem",
    "tls_cert": "./security/http_server/security/certs/cert.pem",
    "tls_key": "./security/http_server/security/keys/cert.key.pem",
    "tls_passwd": "./security/http_server/security/pass/key_pwd.txt",
    "kmc_ksf_master": "./security/http_server/tools/pmt/master/ksfa",
    "kmc_ksf_standby": "./security/http_server/tools/pmt/standby/ksfb",
    "tls_crl": ""
  },
  "cluster_tls_enable": false,
  "cluster_tls_items": {
    "ca_cert": "./security/cluster/security/certs/ca.pem",
    "tls_cert": "./security/cluster/security/certs/cert.pem",
    "tls_key": "./security/cluster/security/keys/cert.key.pem",
    "tls_passwd": "./security/cluster/security/pass/key_pwd.txt",
    "kmc_ksf_master": "./security/cluster/tools/pmt/master/ksfa",
    "kmc_ksf_standby": "./security/cluster/tools/pmt/standby/ksfb",
    "tls_crl": ""
  },
  "ccae_tls_enable": true,
  "ccae_tls_items": {
    "ca_cert": "/mnt/certs/InternalCCAEAgentCA.pem",
    "tls_cert": "/mnt/certs/kafka_cert.pem",
    "tls_key": "/mnt/certs/kafka_key.pem",
    "tls_passwd": "/mnt/certs/psw.txt",
    "kmc_ksf_master": "./tools_ms/pmt/master/ksfa",
    "kmc_ksf_standby": "./tools_ms/pmt/standby/ksfb",
    "tls_crl": ""
  }
},
...
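The certificates must be placed on the physical node where the Controller container is scheduled so that the container can access the certificate path. The Coordinator is scheduled randomly by default; either change the Coordinator scheduling label or place the certificates on every node so that the Coordinator container can access the certificate path.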
3. (Optional) Set the following environment variable to enable performance management and control.
export MIES_SERVICE_MONITOR_MODE=1
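Before starting the service, make sure that the host time is aligned with local time. Otherwise, the timestamps of the performance data observed on the CCAE platform will be inaccurate.
4. Go to the /{MindIE installation directory}/latest/mindie-service directory and start the service with the following command.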
./bin/mindieservice_daemon
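5. After the service starts, OM Adapter reports heartbeat, alarm, log, and inventory information to the CCAE management platform. For details about the alarms, see the Alarm Reference chapter.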
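Before starting the service, the CCAE settings described in this section can be sanity-checked with a short script. The following Python sketch is an illustration only and is not a MindIE tool: the function names `check_ccae_config` and `load_and_check` are hypothetical, and resolving relative paths against a `base_dir` argument is an assumption, because this document does not state which directory relative paths such as ./tools_ms/pmt/master/ksfa are resolved against.

```python
import json
from pathlib import Path

# Path-valued keys under "ccae_tls_items" ("tls_crl" may be empty, so it is skipped).
TLS_PATH_KEYS = ("ca_cert", "tls_cert", "tls_key", "tls_passwd",
                 "kmc_ksf_master", "kmc_ksf_standby")


def check_ccae_config(config: dict, base_dir: Path = Path(".")) -> list:
    """Return a list of problems found in the CCAE-related settings."""
    problems = []
    ccae = config.get("ccae", {})
    if not ccae.get("ip"):
        problems.append("ccae.ip is missing")
    port = ccae.get("port")
    if isinstance(port, bool) or not isinstance(port, int) or not 0 < port < 65536:
        problems.append("ccae.port must be an integer in 1..65535")

    tls = config.get("tls_config", {})
    if tls.get("ccae_tls_enable"):
        items = tls.get("ccae_tls_items", {})
        for key in TLS_PATH_KEYS:
            value = items.get(key, "")
            if not value:
                problems.append(f"ccae_tls_items.{key} is empty")
            # Assumption: relative paths are resolved against base_dir;
            # absolute paths are used unchanged.
            elif not (base_dir / value).exists():
                problems.append(f"ccae_tls_items.{key}: {value} not found")
    return problems


def load_and_check(config_path: Path, base_dir: Path = Path(".")) -> list:
    """Load a complete ms_controller.json file and check its CCAE settings."""
    config = json.loads(config_path.read_text(encoding="utf-8"))
    return check_ccae_config(config, base_dir)
```

An empty list from `load_and_check(Path("ms_controller.json"))` means that the "ccae" address is present and, when "ccae_tls_enable" is true, that every configured certificate file exists. Run it inside the container so that the mounted paths are the ones being checked.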