After installation, check the Pod status of each component to verify that all components were installed successfully.
kubectl get pod --all-namespaces
Example output:
NAMESPACE        NAME                                     READY   STATUS    RESTARTS   AGE
...
kube-system      ascend-device-plugin-daemonset-c2jw2     1/1     Running   0          10d
...
mindx-dl         hccl-controller-855dcbd6b8-2gz2s         1/1     Running   0          10d
mindx-dl         noded-6knvm                              1/1     Running   0          10d
mindx-dl         resilience-controller-7667495b6b-hwmjw   1/1     Running   0          5m
...
npu-exporter     npu-exporter-sxpxj                       1/1     Running   0          8d
volcano-system   volcano-controllers-7785fd66cc-8vfhn     1/1     Running   0          9d
volcano-system   volcano-scheduler-74cc875758-2mntk       1/1     Running   0          9d
...
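On larger clusters it can be easier to filter the Pod list than to read it by eye. The following is a minimal sketch, not part of the product: the function name `unhealthy_pods` is hypothetical, and it assumes the default column order shown above (NAMESPACE, NAME, READY, STATUS, ...) with the header row suppressed by the standard `--no-headers` flag.

```shell
#!/bin/sh
# Hypothetical helper: print "namespace/name" for every pod that is not
# fully ready. Reads `kubectl get pod --all-namespaces --no-headers`
# output on stdin (columns: NAMESPACE NAME READY STATUS RESTARTS AGE).
unhealthy_pods() {
    awk '{
        split($3, ready, "/")    # READY looks like "1/1"
        if ($4 != "Running" || ready[1] != ready[2])
            print $1 "/" $2
    }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get pod --all-namespaces --no-headers | unhealthy_pods
```

An empty result means every listed Pod is Running with all containers ready. Pods of finished Jobs report Completed and would also be listed; the sketch does not special-case them.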
To view the details of a node, run the following command.
kubectl describe node <node-name>
Example:
kubectl describe node ubuntu
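For a node that contains Atlas training series products, the node details include the following content. An Atlas 800 training server is used as the example.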
Name:               ubuntu
Roles:              worker
...
                    node-role.kubernetes.io/worker=worker
                    workerselector=dls-worker-node
...
Capacity:
  cpu:                   72
  ephemeral-storage:     163760Mi
  huawei.com/Ascend910:  8
...
Allocatable:
  cpu:                   72
  ephemeral-storage:     154543324929
  huawei.com/Ascend910:  8
...
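For a node that contains Ascend 310 AI processors, the node details include the following content. A server with Atlas 300I inference cards installed is used as the example.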
Name:               ubuntu
Roles:              worker
...
                    node-role.kubernetes.io/worker=worker
                    workerselector=dls-worker-node
...
Capacity:
  cpu:                   72
  ephemeral-storage:     163760Mi
  huawei.com/Ascend310:  4
...
Allocatable:
  cpu:                   72
  ephemeral-storage:     154543324929
  huawei.com/Ascend310:  4
...
For a node that reports the huawei.com/Ascend310P resource, the node details include the following content.
Name:               ubuntu
Roles:              worker
...
                    node-role.kubernetes.io/worker=worker
                    workerselector=dls-worker-node
...
Capacity:
  cpu:                    96
  ephemeral-storage:      95596964Ki
  huawei.com/Ascend310P:  3
...
Allocatable:
  cpu:                    96
  ephemeral-storage:      88102161877
  huawei.com/Ascend310P:  3
...
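The point of the node details above is that the `huawei.com/Ascend*` resource appears under both Capacity and Allocatable with the expected count. As a rough sketch, the following helper extracts only those lines from the `kubectl describe node` output. The function name `ascend_resources` is hypothetical, and the parsing assumes the usual layout in which top-level sections start in the first column and their entries are indented.

```shell
#!/bin/sh
# Hypothetical helper: print the Capacity and Allocatable values of every
# huawei.com/Ascend* resource. Reads `kubectl describe node` output on stdin.
ascend_resources() {
    awk '
        # A non-indented line starts a new top-level section.
        /^[^ \t]/ { section = $1; sub(/:$/, "", section) }
        (section == "Capacity" || section == "Allocatable") &&
        $1 ~ /^huawei\.com\/Ascend/ {
            name = $1; sub(/:$/, "", name)
            print section, name, $2
        }
    '
}

# Usage against a live cluster (requires kubectl access):
#   kubectl describe node ubuntu | ascend_resources
```

For the Atlas 800 training server example, both printed lines should report 8. A missing line or a lower Allocatable value suggests that the device plugin has not reported all chips.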
To view the device information ConfigMap of a node, run the following command.
kubectl describe cm mindx-dl-deviceinfo-<node-name> -n kube-system
Example:
kubectl describe cm mindx-dl-deviceinfo-ubuntu -n kube-system
For a node that contains Ascend 910 chips, the output includes the following content. An Atlas 800 training server is used as the example.
Name:         mindx-dl-deviceinfo-ubuntu
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
DeviceInfoCfg:
----
{"DeviceInfo":{"DeviceList":{"huawei.com/Ascend910":"Ascend910-0,Ascend910-1,Ascend910-2,Ascend910-3,Ascend910-4,Ascend910-5,Ascend910-6,Ascend910-7","huawei.com/Ascend910-NetworkUnhealthy":"","huawei.com/Ascend910-Unhealthy":""},"UpdateTime":1661427309},"CheckCode":"6e3cb635bf6763db79d4fc73fca593d29374c312850889b0660604ab64cfeb2f"}
Events:  <none>
This data is for illustration only and may differ from what you see in your environment.
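The DeviceInfoCfg value is a single JSON line, which makes the healthy and unhealthy device lists hard to read. The sketch below prints one list per line. The function name `device_lists` is hypothetical, and the pattern match assumes the compact JSON form shown in the example (no whitespace around the colons); a JSON-aware tool would be more robust where one is available.

```shell
#!/bin/sh
# Hypothetical helper: print each huawei.com/* entry of DeviceList as
# "key: value", or "key: (none)" when the list is empty.
# Reads the raw DeviceInfoCfg JSON on stdin; assumes compact JSON.
device_lists() {
    grep -o '"huawei\.com/[^"]*":"[^"]*"' |
        tr -d '"' |
        awk -F: '{ print $1 ": " ($2 == "" ? "(none)" : $2) }'
}

# Usage against a live cluster (requires kubectl access):
#   kubectl get cm mindx-dl-deviceinfo-ubuntu -n kube-system \
#       -o jsonpath='{.data.DeviceInfoCfg}' | device_lists
```

In the example above, the `-Unhealthy` and `-NetworkUnhealthy` entries would print `(none)`, and the `huawei.com/Ascend910` entry would list all eight devices.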