开发者
资源

检查升级结果

通过MindCluster Ascend Deployer工具升级完成后,可以通过执行如下操作检查各组件的版本及能否正常工作。

<target>可选范围可通过执行bash install.sh --help查看或参考bash install.sh参数说明
bash install.sh --test=<target>

全量检查

检查已安装的全量组件及版本信息。
bash install.sh   --test=all                                //测试已安装的全量组件及版本信息。

回显信息如下所示。

task info:do dl test on master
[===================================================]

{'xx.xx.xx.43': {'firmware': ['OK', '7.3.0.1.231'], 'nnae': ['not installed', ''], 'nnrt': ['OK', '8.0.RC2'], 'tfplugin': ['OK', '8.0.RC2'], 'driver': ['OK', '24.1.RC2'], 'toolkit': ['not installed', ''], 'pytorch': ['not installed', ''], 'tensorflow': ['ERROR', ''], 'toolbox': ['OK', '6.0.RC2'], 'fault-diag': ['OK', '6.0.RC2'], 'mindspore': ['not installed', '']}}
{'xx.xx.xx.11': {'firmware': ['OK', '7.3.0.1.231'], 'nnae': ['OK', '8.0.RC2'], 'nnrt': ['not installed', ''], 'tfplugin': ['not installed', ''], 'driver': ['OK', '24.1.RC2'], 'toolkit': ['OK', '8.0.RC2'], 'pytorch': ['OK', '1.11.0.post14'], 'tensorflow': ['ERROR', ''], 'toolbox': ['OK', '6.0.RC2'], 'fault-diag': ['not installed', ''], 'mindspore': ['not installed', '']}}

worker-11: {'npu-exporter': ['ERROR', ''], 'noded': ['ERROR', ''], 'ascend-docker-runtime': ['OK', 'v6.0.RC2'], 'ascend-device-plugin': ['ERROR', '']}
worker-43: {'ascend-operator': ['OK', 'v6.0.RC2'], 'resilience-controller': ['OK', 'v6.0.RC2'], 'ascend-device-plugin': ['OK', 'v6.0.RC2'], 'noded': ['OK', 'v6.0.RC2'], 'clusterd': ['OK', 'v6.0.RC2'], 'hccl-controller': ['OK', 'v6.0.RC2'], 'ascend-docker-runtime': ['OK', 'v6.0.RC2'], 'volcano': ['OK', 'v1.7.0'], 'npu-exporter': ['OK', 'v6.0.RC2']}


npu-clusters
------------+--------------+----------------
npu         | driver       | firmware
------------+--------------+----------------
xx.xx.xx.11 | OK, 24.1.RC2 | OK, 7.3.0.1.231
xx.xx.xx.43 | OK, 24.1.RC2 | OK, 7.3.0.1.231
------------+--------------+----------------

cann-clusters
------------+-------------+---------------+---------------+---------------+--------------
cann        | toolbox     | tfplugin      | nnae          | nnrt          | toolkit
------------+-------------+---------------+---------------+---------------+--------------
xx.xx.xx.11 | OK, 6.0.RC2 | not installed | OK, 8.0.RC2   | not installed | OK, 8.0.RC2
xx.xx.xx.43 | OK, 6.0.RC2 | OK, 8.0.RC2   | not installed | OK, 8.0.RC2   | not installed
------------+-------------+---------------+---------------+---------------+--------------

pypkg-clusters
------------+---------------+------------+-------------------+--------------
pypkg       | mindspore     | tensorflow | pytorch           | fault-diag
------------+---------------+------------+-------------------+--------------
xx.xx.xx.11 | not installed | ERROR      | OK, 1.11.0.post14 | not installed
xx.xx.xx.43 | not installed | ERROR      | not installed     | OK, 6.0.RC2
------------+---------------+------------+-------------------+--------------


dl-clusters(master-node)
------------+-----------------+--------------+-----------------+------------+----------------------
master-node | ascend-operator | clusterd     | hccl-controller | volcano    | resilience-controller
------------+-----------------+--------------+-----------------+------------+----------------------
worker-43   | OK, v6.0.RC2    | OK, v6.0.RC2 | OK, v6.0.RC2    | OK, v1.7.0 | OK, v6.0.RC2
------------+-----------------+--------------+-----------------+------------+----------------------

dl-clusters(worker-node)
------------+----------------------+-----------------------+--------------+-------------
worker-node | ascend-device-plugin | ascend-docker-runtime | noded        | npu-exporter
------------+----------------------+-----------------------+--------------+-------------
worker-11   | ERROR                | OK, v6.0.RC2          | ERROR        | ERROR
worker-43   | OK, v6.0.RC2         | OK, v6.0.RC2          | OK, v6.0.RC2 | OK, v6.0.RC2
------------+----------------------+-----------------------+--------------+-------------
 => localhost
ascend deployer processed success
run cmd: --test=all successfully

检查NPU固件与驱动、CANN软件、MindCluster组件(性能测试、故障诊断)

可执行如下命令查询已安装软件及版本信息。

bash install.sh --test=toolbox                                     //测试toolbox是否正常

检查MindCluster集群调度

方式一:执行如下命令检查已安装的MindCluster集群调度及版本信息。示例如下。

bash install.sh   --test=clusterd 

方式二:安装MindCluster集群调度后,可进入/ascend_deployer/report目录查看报告文件“report.json”“report.csv”查看安装结果及节点状态。