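Checking the Upgrade Result
After the upgrade performed with the MindCluster Ascend Deployer tool is complete, run the following checks to confirm the version of each component and whether it works properly.
To see the valid values of <target>, run bash install.sh --help or refer to the install parameter description.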
bash install.sh --test=<target>
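When several targets need checking in one pass, the command above can be wrapped in a loop. The following is a minimal sketch, not part of the tool: run_checks and the INSTALLER variable are hypothetical names, and the sketch assumes that install.sh exits with a non-zero status when a check fails, which this page does not confirm.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: run `install.sh --test=<target>` for each target given
# as an argument and print one PASS/FAIL line per target.
# ASSUMPTION: install.sh returns a non-zero exit status when a check fails.
run_checks() {
    # INSTALLER is an illustrative override; by default install.sh in the
    # current directory is used, as in the commands on this page.
    local installer="${INSTALLER:-install.sh}"
    local failed=0
    local target
    for target in "$@"; do
        if bash "$installer" "--test=${target}"; then
            echo "PASS ${target}"
        else
            echo "FAIL ${target}"
            failed=1
        fi
    done
    # Return 1 if any target failed, 0 otherwise.
    return "$failed"
}
```

For example, run_checks toolbox clusterd mcu would run the three individual targets used on this page in turn.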
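Full check
Check all installed components and their versions.
bash install.sh --test=all  # Test all installed components and their versions.
The command output is similar to the following.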
task info:do dl test on master
[===================================================]
{'51.38.68.75': {'driver': ['OK', '24.1.0.1'], 'firmware': ['OK', '7.5.0.5.220'], 'mcu': ['npu_id_2', '24.5.7'], 'toolbox': ['OK', '7.0.RC1'], 'mindspore': ['OK', '2.5.0'], 'pytorch': ['OK', '2.1.0.post8'], 'tensorflow': ['OK', '2.6.5'], 'tfplugin': ['not installed', ''], 'nnae': ['OK', '8.0.0'], 'nnrt': ['OK', '8.0.0'], 'toolkit': ['OK', '8.0.0'], 'fault-diag': ['OK', '7.0.RC1'], 'mindie_image': ['OK', '1.0.0']}}
ubuntu: {'ascend-operator': ['OK', 'v7.0.RC1'], 'clusterd': ['OK', 'v7.0.RC1'], 'resilience-controller': ['OK', 'v7.0.RC1'], 'hccl-controller': ['not installed', ''], 'volcano': ['OK', 'v1.7.0'], 'ascend-device-plugin': ['OK', 'v7.0.RC1'], 'noded': ['OK', 'v7.0.RC1'], 'npu-exporter': ['OK', 'v7.0.RC1'], 'ascend-docker-runtime': ['OK', 'v7.0.RC1']}
npu-clusters
------------+--------------+----------------
npu | driver | firmware
------------+--------------+----------------
51.38.68.75 | OK, 24.1.0.1 | OK, 7.5.0.5.220
------------+--------------+----------------
mcu-version
------------+--------------
ip | npu_id_2
------------+--------------
51.38.68.75 | 24.5.7
------------+--------------
cann-clusters
------------+-------------+---------------+---------------+---------------+--------------+---------------
cann | toolbox | tfplugin | nnae | nnrt | toolkit | mindie_image
------------+-------------+---------------+---------------+---------------+--------------+---------------
51.38.68.75 | OK, 7.0.RC1 | not installed | OK, 8.0.0 | OK, 8.0.0 | OK, 8.0.0 | OK, 1.0.0
------------+-------------+---------------+---------------+---------------+--------------+---------------
pypkg-clusters
------------+---------------+------------+-------------------+--------------
pypkg | mindspore | tensorflow | pytorch | fault-diag
------------+---------------+------------+-------------------+--------------
51.38.68.75 | OK, 2.5.0 | OK, 2.6.5 | OK, 2.1.0.post8 | OK, 7.0.RC1
------------+---------------+------------+-------------------+--------------
dl-clusters(master-node)
------------+-----------------+--------------+-----------------+------------+----------------------
master-node | ascend-operator | clusterd | hccl-controller | volcano | resilience-controller
------------+-----------------+--------------+-----------------+------------+----------------------
ubuntu | OK, v7.0.RC1 | OK, v7.0.RC1 | not installed | OK, v1.7.0 | OK, v7.0.RC1
------------+-----------------+--------------+-----------------+------------+----------------------
dl-clusters(worker-node)
------------+----------------------+-----------------------+--------------+-------------
worker-node | ascend-device-plugin | ascend-docker-runtime | noded | npu-exporter
------------+----------------------+-----------------------+--------------+-------------
ubuntu | OK, v7.0.RC1 | OK, v7.0.RC1 | OK, v7.0.RC1 | OK, v7.0.RC1
------------+----------------------+-----------------------+--------------+-------------
=> localhost
ascend deployer processed success
run cmd: --test=all successfully
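In the output above, components that are absent are reported as 'not installed' (tfplugin and hccl-controller in this example). On a large cluster it can help to pull those names out of a saved copy of the output. The following is a minimal sketch; list_not_installed is a hypothetical helper, and it assumes the dictionary-style lines keep the 'name': ['status', 'version'] layout shown above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read saved `--test=all` output on stdin and print the
# names of components whose status is 'not installed', one per line.
# ASSUMPTION: entries look like 'name': ['status', 'version'] as shown above.
list_not_installed() {
    # 1. Extract every "'name': ['not installed'" fragment.
    # 2. Keep only the component name.
    # 3. Sort and de-duplicate names reported by several nodes.
    grep -oE "'[A-Za-z0-9_-]+': \['not installed'" \
        | sed -E "s/^'([^']+)'.*/\1/" \
        | sort -u
}
```

If the output has been saved to a file, for instance test_all.log (an illustrative name), the helper can be called as list_not_installed < test_all.log.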
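Checking the NPU firmware and driver, CANN software, and MindCluster components (performance testing and fault diagnosis)
Run the following command to query the installed software and its version.
bash install.sh --test=toolbox  # Test whether toolbox works properly.
Checking MindCluster cluster scheduling
Method 1: Run the following command to check the installed MindCluster cluster scheduling components and their versions. For example: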
bash install.sh --test=clusterd
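Method 2: After MindCluster cluster scheduling is installed, go to the /ascend_deployer/report directory and open the report file "report.json" or "report.csv" to view the installation result and node status.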
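Before opening a report, it can be useful to confirm that one was actually generated. The following is a minimal sketch; check_reports is a hypothetical helper, the directory defaults to the path given above, and the contents of the report files are not inspected because their layout is not described on this page.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report which of report.json / report.csv exist and are
# non-empty in the report directory (first argument, defaulting to the path
# named in Method 2). Succeeds if at least one report file is found.
check_reports() {
    local report_dir="${1:-/ascend_deployer/report}"
    local status=1   # stays 1 (failure) until a report file is found
    local name
    for name in report.json report.csv; do
        if [ -s "${report_dir}/${name}" ]; then
            echo "found ${name}"
            status=0
        fi
    done
    return "$status"
}
```

Calling check_reports with no argument checks the default directory; a different directory can be passed as the first argument.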
Checking the MCU firmware
Run the following command to query the MCU firmware information.
bash install.sh --test=mcu