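In Atlas 800I A2 inference server scenarios, MindCluster lets users deploy MindIE Service in containers by submitting acjob inference tasks. This section covers only the YAML configuration and the generation of the RankTable. For the full MindIE Service deployment procedure, see the MindIE Service Developer Guide.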
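For chips with on-chip memory, Ascend Device Plugin reports the chip memory status when it starts (see the node-label description). It also reports whole-card information by writing the physical IDs of the chips to device-info-cm. For whole-card scheduling, it reports the following to the node: the total number of schedulable chips (allocatable), the number of chips in use (allocated), and the basic chip information (device ip and super_device_ip).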
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-ms-test-controller
  namespace: mindie
  labels:
    framework: pytorch
    app: mindie-ms-controller      # do not modify
    jobID: mindie-ms-test          # uid of infer job, modify it according to your job
    ring-controller.atlas: ascend-910b
spec:
  schedulerName: volcano   # work when enableGangScheduling is true
  runPolicy:
    schedulingPolicy:      # work when enableGangScheduling is true
      minAvailable: 1      # equal to Master.replicas
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:
      replicas: 1
      restartPolicy: Always
      template:
        metadata:
...
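The comments in the job manifests of this section state that minAvailable equals Master.replicas for the controller job and Master.replicas + Worker.replicas for the server job. This relationship can be checked before a job is submitted. The following is a minimal sketch, assuming the manifest has already been loaded into a Python dict (for example by a YAML parser). The helper names `expected_min_available` and `check_min_available` are hypothetical and not part of MindCluster. Summing the replicas of every role under replicaSpecs is a generalization of the two stated rules, and treating a missing replicas field as 0 is an assumption.

```python
def expected_min_available(job: dict) -> int:
    """Sum the replicas of every role under spec.replicaSpecs.

    Assumption: a role without a 'replicas' field contributes 0.
    """
    specs = job.get("spec", {}).get("replicaSpecs", {})
    return sum(int(spec.get("replicas", 0)) for spec in specs.values())


def check_min_available(job: dict) -> bool:
    """Return True when schedulingPolicy.minAvailable matches the replica total."""
    policy = job.get("spec", {}).get("runPolicy", {}).get("schedulingPolicy", {})
    return policy.get("minAvailable") == expected_min_available(job)


# Fields taken from the controller job in this section:
# one Master replica and minAvailable set to 1.
controller_job = {
    "spec": {
        "runPolicy": {"schedulingPolicy": {"minAvailable": 1, "queue": "default"}},
        "replicaSpecs": {"Master": {"replicas": 1}},
    }
}

print(check_min_available(controller_job))
```

A check like this could run in a pre-submit script, so that a mismatch is caught before the job reaches the scheduler.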
apiVersion: v1
kind: ConfigMap
metadata:
  name: rings-config-mindie-server-0     # The name must be the same as the name attribute of the following AscendJob. The prefix rings-config- cannot be modified.
  namespace: mindie
  labels:
    jobID: mindie-ms-test
    ring-controller.atlas: ascend-910b
    mx-consumer-cim: "true"
data:
  hccl.json: |
    {
        "status":"initializing"
    }
---
apiVersion: mindxdl.gitee.com/v1
kind: AscendJob
metadata:
  name: mindie-server-0
  namespace: mindie
  labels:
    framework: pytorch
    app: mindie-ms-server          # do not modify
    jobID: mindie-ms-test          # uid of infer job, modify it according to your job
    ring-controller.atlas: ascend-910b
    fault-scheduling: force
spec:
  schedulerName: volcano   # work when enableGangScheduling is true
  runPolicy:
    schedulingPolicy:      # work when enableGangScheduling is true
      minAvailable: 2      # should equal to Master.replicas + Worker.replicas
      queue: default
  successPolicy: AllWorkers
  replicaSpecs:
    Master:
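The ConfigMap comment in this section says that its name is the fixed prefix rings-config- followed by the name of the AscendJob. The following is a minimal sketch of that naming rule, assuming both manifests are already available as Python dicts. The helper names `ranktable_configmap_name` and `configmap_matches_job` are hypothetical. The namespace comparison is an assumption based on the example, where both objects use the mindie namespace.

```python
PREFIX = "rings-config-"  # fixed prefix stated in the ConfigMap comment


def ranktable_configmap_name(job_name: str) -> str:
    """Build the ConfigMap name expected for a given AscendJob name."""
    return PREFIX + job_name


def configmap_matches_job(configmap: dict, job: dict) -> bool:
    """Check that a ConfigMap is named after the job and shares its namespace.

    Assumption: both objects must live in the same namespace, as in the example.
    """
    cm_meta = configmap.get("metadata", {})
    job_meta = job.get("metadata", {})
    same_name = cm_meta.get("name") == ranktable_configmap_name(job_meta.get("name", ""))
    same_namespace = cm_meta.get("namespace") == job_meta.get("namespace")
    return same_name and same_namespace


# Metadata taken from the ConfigMap and the AscendJob in this section.
configmap = {"metadata": {"name": "rings-config-mindie-server-0", "namespace": "mindie"}}
server_job = {"metadata": {"name": "mindie-server-0", "namespace": "mindie"}}

print(ranktable_configmap_name(server_job["metadata"]["name"]))
print(configmap_matches_job(configmap, server_job))
```

Deriving the ConfigMap name from the job name, rather than typing both by hand, keeps the two manifests consistent when a job is renamed.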