Log Description

The Ascend DMI tool records logs whenever it runs a command-line operation. The logs are stored in the following paths:

  • root user: /var/log/ascend-dmi
  • Non-root user: ~/var/log/ascend-dmi

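When a log file exceeds 10 MB, it is rotated to <log file name>.XX.gz (XX is a natural number that increments from 1). At most 10 rotated files are kept; when this limit is exceeded, the rotated log with the earliest rotation date is deleted to stay within the maximum number of log files.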

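  1. If the Ascend DMI tool fails to obtain the device type, logs are rotated in the default path described above.
  2. When the device type is Atlas 500 A2 AI edge station, a log file is rotated to <log file name>.XX.gz (XX is a natural number that increments from 1) once it exceeds 1 MB. At most 10 rotated files are kept; when this limit is exceeded, the rotated log with the earliest rotation date is deleted to stay within the maximum number of log files. The rotated logs are stored in /home/log/ascend-dmi.
  3. Debug logs are rotated only to the /var/log/ascend-dmi directory, and rotation is triggered when the file reaches 10 MB. On the Atlas 500 A2 AI edge station, save debug logs promptly so that they are not lost on restart.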
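
The paths and limits above can be collected into a small helper for scripts that need to locate rotated logs. This is only a sketch: the function names, the `is_edge_station` and `debug` parameters, and the file-name pattern used to recognise `<log file name>.XX.gz` are illustrative assumptions and are not part of the Ascend DMI tool.

```python
import re
from pathlib import Path

# Values quoted from the text above; all names below are illustrative.
MAX_ROTATED_FILES = 10
EDGE_STATION_ROTATED_DIR = Path("/home/log/ascend-dmi")

# Assumed pattern for "<log file name>.XX.gz", where XX is a natural number.
_ROTATED = re.compile(r"\.(\d+)\.gz$")


def log_dir(is_root: bool, home: Path) -> Path:
    """Return the documented default log directory for the given user."""
    return Path("/var/log/ascend-dmi") if is_root else home / "var/log/ascend-dmi"


def rotation_threshold_mb(is_edge_station: bool, debug: bool = False) -> int:
    """Return the documented size (in MB) at which a log file is rotated."""
    if debug:
        return 10  # debug logs rotate at 10 MB on every device type
    return 1 if is_edge_station else 10


def rotated_logs(directory: Path) -> list[Path]:
    """List rotated logs in `directory`, sorted by their numeric suffix XX."""
    found = []
    for path in directory.iterdir():
        match = _ROTATED.search(path.name)
        if match:
            found.append((int(match.group(1)), path.name, path))
    return [path for _, _, path in sorted(found)]
```

For the Atlas 500 A2 AI edge station, `rotated_logs(EDGE_STATION_ROTATED_DIR)` would list the rotated files in the path given in note 2. The sketch only sorts by suffix; it makes no claim about which suffix is the newest file.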

Log Backup

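When the device type is Atlas 500 A2 AI edge station, the toolbox log files in the original log path are cleared after the driver restarts. The driver, however, saves them in the compressed file reboot_back_up_XX.tar.gz under "/home/log/kbox_last_logs/". Extract this file to view the logs from before the restart.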
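
One possible way to unpack such a backup with the Python standard library is sketched below. The directory and the file-name pattern come from the text above; the function names and the destination directory are hypothetical, and the layout of files inside the archive is not described in this document.

```python
import tarfile
from pathlib import Path

BACKUP_DIR = Path("/home/log/kbox_last_logs")  # path quoted from the text above


def find_backups(backup_dir: Path = BACKUP_DIR) -> list[Path]:
    """Return the reboot_back_up_*.tar.gz archives in backup_dir, sorted by name."""
    return sorted(backup_dir.glob("reboot_back_up_*.tar.gz"))


def extract_backup(archive: Path, dest: Path) -> list[str]:
    """Extract one backup archive into dest and return its member names."""
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

A call such as `extract_backup(find_backups()[-1], Path("/tmp/dmi_logs"))` would unpack the archive that sorts last by name into a scratch directory; `/tmp/dmi_logs` is only an example destination.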

Saving Data to Disk

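When a bandwidth, compute-power, or NIC diagnostic is run with JSON as the output format, the results are saved to disk, recording the bandwidth or compute-power value for each device. The data file is stored in the following paths:

  • root user: /var/log/ascend_check/result.txt
  • Non-root user: ~/var/log/ascend_check/result.txt

For example, each of the following commands generates the data file: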

ascend-dmi -dg -i bandwidth -fmt json
ascend-dmi -dg -i aiflops -fmt json
ascend-dmi -dg -fmt json
ascend-dmi -dg -i nic -fmt json
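
The commands and result paths above could be driven from a script roughly as follows. This is a sketch under assumptions: the helper names are hypothetical, and it assumes that result.txt contains the same JSON object that the command prints, which this document does not state explicitly.

```python
import json
import subprocess
from pathlib import Path
from typing import Optional


def build_command(item: Optional[str] = None) -> list[str]:
    """Build one of the command lines listed above; item=None omits -i."""
    cmd = ["ascend-dmi", "-dg"]
    if item is not None:
        cmd += ["-i", item]
    return cmd + ["-fmt", "json"]


def result_path(is_root: bool, home: Path) -> Path:
    """Return the documented location of result.txt for the given user."""
    base = Path("/var/log") if is_root else home / "var/log"
    return base / "ascend_check" / "result.txt"


def run_and_load(item: Optional[str], is_root: bool, home: Path) -> dict:
    """Run the diagnostic, then read the saved results."""
    subprocess.run(build_command(item), check=True)
    # Assumption: the file holds a single JSON object like the echoed output.
    text = result_path(is_root, home).read_text(encoding="utf-8")
    return json.loads(text)
```

For instance, `run_and_load("bandwidth", is_root=True, home=Path.home())` would correspond to the first command in the list.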
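  • On the Atlas 200T A2 Box16 heterogeneous subrack in VM scenarios, the bandwidth diagnostic does not run the P2P test between the two 8P groups because of the nature of the data transmission channel.
  • On Atlas A2 training series products and Atlas 800I A2 inference products, the bandwidth and compute-power diagnostics produce output like the following: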
    {
        "device_0": {
            "aiflops": "287.95",
            "d2d bandwidth": "743.41",
            "d2d write bandwidth": "740.86",
            "d2h bandwidth": "28.07",
            "h2d bandwidth": "25.12",
            "p2p bidirectional bandwidth": "X",
            "p2p bidirectional write bandwidth": "X",
            "p2p unidirectional bandwidth": "X",
            "p2p unidirectional write bandwidth": "X"
        }
    }
  • On the Atlas 900 A2 PoD cluster basic unit, the NIC diagnostic produces output like the following:
    {
        "device_0": {
            "nic roce read bandwidth": "device_7: 22.716700, device_1: 22.716524, device_6: 22.716612",
            "nic roce send bandwidth": "device_6: 22.739834, device_1: 22.739473, device_7: 22.739336",
            "nic roce write bandwidth": "device_1: 22.717470, device_7: 22.717920, device_6: 22.716806"
        },
        "device_1": {
            "nic roce read bandwidth": "device_0: 22.716396, device_6: 22.716591, device_7: 22.716866",
            "nic roce send bandwidth": "device_0: 22.739386, device_7: 22.740028, device_6: 22.739374",
            "nic roce write bandwidth": "device_0: 22.716515, device_6: 22.716797, device_7: 22.716660"
        },
        "device_2": {
            "nic roce read bandwidth": "device_4: 22.716534, device_5: 22.716562, device_3: 22.716787",
            "nic roce send bandwidth": "device_4: 22.739746, device_5: 22.739492, device_3: 22.739464",
            "nic roce write bandwidth": "device_3: 22.718027, device_4: 22.716728, device_5: 22.716581"
        },
        "device_3": {
            "nic roce read bandwidth": "device_2: 22.716768, device_5: 22.716759, device_4: 22.716738",
            "nic roce send bandwidth": "device_2: 22.739170, device_5: 22.739248, device_4: 22.739483",
            "nic roce write bandwidth": "device_2: 22.716377, device_5: 22.716700, device_4: 22.717323"
        },
        "device_4": {
            "nic roce read bandwidth": "device_2: 22.716816, device_3: 22.716747, device_5: 22.716280",
            "nic roce send bandwidth": "device_2: 22.739374, device_3: 22.739355, device_5: 22.739552",
            "nic roce write bandwidth": "device_2: 22.716934, device_5: 22.716484, device_3: 22.717091"
        },
        "device_5": {
            "nic roce read bandwidth": "device_4: 22.717598, device_3: 22.717157, device_2: 22.717579",
            "nic roce send bandwidth": "device_4: 22.739483, device_3: 22.739492, device_2: 22.739336",
            "nic roce write bandwidth": "device_4: 22.716825, device_2: 22.713037, device_3: 22.716856"
        },
        "device_6": {
            "nic roce read bandwidth": "device_0: 22.716681, device_7: 22.716719, device_1: 22.716681",
            "nic roce send bandwidth": "device_0: 22.739630, device_1: 22.739414, device_7: 22.739374",
            "nic roce write bandwidth": "device_0: 22.716446, device_7: 22.718134, device_1: 22.717091"
        },
        "device_7": {
            "nic roce read bandwidth": "device_6: 22.717169, device_1: 22.716700, device_0: 22.717842",
            "nic roce send bandwidth": "device_6: 22.739590, device_1: 22.739199, device_0: 22.739590",
            "nic roce write bandwidth": "device_6: 22.716631, device_1: 22.716846, device_0: 22.716806"
        }
    }
  • On Atlas A3 training series products, the bandwidth diagnostic produces output like the following:
    {
        "device_all": {
            "d2h bandwidth": "356.64",
            "h2d bandwidth": "297.58"
        },
        "device_0": {
            "d2d bandwidth": "1516.66",
            "d2d write bandwidth": "1484.89",
            "p2p bidirectional bandwidth": "X, 366.68, 270.14, 269.81, 270.05, 269.96, 269.74, 269.78, 270.09, 270.03, 269.88, 269.82, 269.93, 269.97, 269.80, 269.80",
            "p2p bidirectional write bandwidth": "X, 343.01, 250.02, 247.81, 248.13, 245.36, 246.59, 247.84, 246.04, 246.09, 248.19, 246.14, 245.33, 246.54, 248.46, 246.87",
            "p2p unidirectional bandwidth": "X, 202.95, 164.73, 164.72, 164.78, 164.74, 164.73, 164.72, 164.77, 164.74, 164.75, 164.71, 164.75, 164.77, 164.75, 164.72",
            "p2p unidirectional write bandwidth": "X, 191.71, 137.20, 137.41, 137.26, 137.21, 137.46, 137.49, 136.72, 137.18, 137.45, 137.40, 137.00, 137.11, 137.46, 137.34"
        },
        "device_1": {
            "d2d bandwidth": "1528.46",
            "d2d write bandwidth": "1470.62",
            "p2p bidirectional bandwidth": "368.80, X, 269.88, 269.87, 269.99, 270.09, 269.81, 269.74, 270.06, 270.02, 269.96, 270.03, 270.13, 269.98, 269.94, 269.92",
            "p2p bidirectional write bandwidth": "340.45, X, 246.08, 247.17, 246.87, 245.46, 245.34, 248.06, 244.03, 243.79, 247.23, 243.96, 243.78, 246.07, 247.84, 247.25",
            "p2p unidirectional bandwidth": "202.96, X, 164.73, 164.74, 164.74, 164.76, 164.73, 164.74, 164.77, 164.77, 164.73, 164.77, 164.76, 164.78, 164.76, 164.73",
            "p2p unidirectional write bandwidth": "191.68, X, 137.08, 137.58, 137.36, 136.93, 137.32, 137.66, 136.75, 136.97, 137.38, 137.24, 136.73, 137.26, 137.54, 137.35"
        }
    }
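Table 1 Description of the displayed results

Parameter       Description
Numeric value   Bandwidth or compute-power value for the specific device. Bandwidth diagnostics are reported in GB/s; compute-power diagnostics are reported in TFLOPS.
X/NA            This value is not supported and cannot be displayed.
FAIL            The diagnostic failed.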
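
The sample outputs above use three string layouts for a field: a single value, a comma-separated list, and a comma-separated list of `device_N: value` pairs. A minimal parsing sketch follows, mapping the numeric, X/NA, and FAIL cases onto Python values. The function names are hypothetical, and the assumption that FAIL appears in place of a value is not confirmed by the samples.

```python
from typing import Dict, List, Optional, Union

Value = Union[float, str, None]

UNSUPPORTED = {"X", "NA"}  # markers for values that are not supported


def parse_value(token: str) -> Value:
    """Map one token to a float, None (X/NA), or the string "FAIL"."""
    token = token.strip()
    if token in UNSUPPORTED:
        return None
    if token == "FAIL":
        return "FAIL"
    return float(token)  # GB/s for bandwidth, TFLOPS for compute power


def parse_field(text: str) -> Union[Value, List[Value], Dict[str, Value]]:
    """Parse one field string from the diagnostic output.

    "287.95"                       -> 287.95
    "X, 366.68, 270.14"            -> [None, 366.68, 270.14]
    "device_7: 22.7, device_1: 22" -> {"device_7": 22.7, "device_1": 22.0}
    """
    parts = [part.strip() for part in text.split(",")]
    if all(":" in part for part in parts):
        pairs = (part.split(":", 1) for part in parts)
        return {name.strip(): parse_value(value) for name, value in pairs}
    if len(parts) == 1:
        return parse_value(parts[0])
    return [parse_value(part) for part in parts]
```

In the Atlas A3 sample, the X entry sits at index 0 for device_0 and at index 1 for device_1, which suggests that the list positions correspond to peer device indexes. That reading is an inference from the sample, so the sketch keeps the list order unchanged rather than attaching device names.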