Log Description
Ascend DMI records logs when executing commands. The logs are stored in the following paths:
- root user: /var/log/ascend-dmi
- Non-root user: ~/var/log/ascend-dmi
When the size of the log file exceeds 10 MB, the file is saved as .XX.gz (XX is a natural number starting from 1). The total number of files to be saved cannot exceed 10. When the number of log files exceeds the maximum value, the earliest log files are deleted.
- When Ascend DMI fails to obtain a device type, logs are dumped to the preceding default path.
- When the device type queried by Ascend DMI is the Atlas 500 A2 edge station, a log file that exceeds 1 MB is dumped to .XX.gz (XX is a natural number starting from 1). The total number of dumped files cannot exceed 10. When the number exceeds the maximum, the earliest log files are deleted. Logs are saved in /home/log/ascend-dmi.
- Debug logs whose size exceeds 10 MB are dumped only to the /var/log/ascend-dmi directory. On the Atlas 500 A2 edge station, save debug logs promptly to prevent log loss upon restart.
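The default path rule and the .XX.gz naming above can be sketched in Python. This is a minimal illustration, not part of Ascend DMI: the function names and the sample file names are hypothetical, and the source does not say whether a higher number means an older archive, so the sketch only orders archives by number.

```python
import re
from pathlib import Path

# Rotated archives end in ".<number>.gz", as described above.
ROTATED = re.compile(r"\.(\d+)\.gz$")


def default_log_dir(is_root: bool, home: Path) -> Path:
    """Return the default log directory for root or non-root users."""
    if is_root:
        return Path("/var/log/ascend-dmi")
    return home / "var/log/ascend-dmi"


def rotated_archives(names):
    """Return (number, name) pairs for rotated archives, sorted by number."""
    found = []
    for name in names:
        match = ROTATED.search(name)
        if match:
            found.append((int(match.group(1)), name))
    return sorted(found)
```

For example, `rotated_archives(p.name for p in default_log_dir(False, Path.home()).iterdir())` would list the archives of a non-root user, provided the directory exists.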
Log Backup
For the Atlas 500 A2 edge station, ToolBox logs in the original storage path are deleted after a driver restart. However, the driver saves the logs in reboot_back_up_XX.tar.gz in the /home/log/kbox_last_logs/ directory. After decompressing the package, you can view the logs before the restart.
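Decompressing a backup package can be done with any tar tool. As one possible approach, the sketch below uses Python's standard tarfile module. The function name and the destination directory are hypothetical, and the internal layout of the package is not described in this document, so the sketch simply unpacks everything and returns the member names.

```python
import tarfile
from pathlib import Path


def extract_backup(archive: Path, dest: Path) -> list:
    """Unpack one reboot_back_up_XX.tar.gz package into dest.

    Returns the names of the archive members so the caller can
    locate the log files afterwards.
    """
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        names = tar.getnames()
        tar.extractall(dest)
    return names
```

Only unpack packages taken from the device itself, because extractall writes whatever paths the archive contains.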
Data Flushing
If the output of the bandwidth, computing power, or NIC diagnosis is in JSON format, the results are written to disk, and the bandwidth or computing power of each tested device is recorded. The data flushing files are stored in the following paths:
- root user: /var/log/ascend_check/result.txt
- Non-root user: ~/var/log/ascend_check/result.txt
For example, data flushing files are generated when the following commands are executed:
ascend-dmi -dg -i bandwidth -fmt json
ascend-dmi -dg -i aiflops -fmt json
ascend-dmi -dg -fmt json
ascend-dmi -dg -i nic -fmt json
- For the Atlas 200T A2 Box16/Atlas 200I A2 Box16 heterogeneous subrack in the virtual machine scenario, due to the particularity of data transmission channels, the bandwidth test is not performed between two 8-NPU groups.
- Output of bandwidth and computing power diagnosis on the Atlas A2 training product and Atlas 800I A2 inference product
{
  "device_0": {
    "aiflops": "287.95",
    "d2d bandwidth": "743.41",
    "d2d write bandwidth": "740.86",
    "d2h bandwidth": "28.07",
    "h2d bandwidth": "25.12",
    "p2p bidirectional bandwidth": "X",
    "p2p bidirectional write bandwidth": "X",
    "p2p unidirectional bandwidth": "X",
    "p2p unidirectional write bandwidth": "X"
  }
}
- Output of NIC diagnosis on the Atlas 900 A2 PoD cluster basic unit
{
  "device_0": {
    "nic roce read bandwidth": "device_7: 22.716700, device_1: 22.716524, device_6: 22.716612",
    "nic roce send bandwidth": "device_6: 22.739834, device_1: 22.739473, device_7: 22.739336",
    "nic roce write bandwidth": "device_1: 22.717470, device_7: 22.717920, device_6: 22.716806"
  },
  "device_1": {
    "nic roce read bandwidth": "device_0: 22.716396, device_6: 22.716591, device_7: 22.716866",
    "nic roce send bandwidth": "device_0: 22.739386, device_7: 22.740028, device_6: 22.739374",
    "nic roce write bandwidth": "device_0: 22.716515, device_6: 22.716797, device_7: 22.716660"
  },
  "device_2": {
    "nic roce read bandwidth": "device_4: 22.716534, device_5: 22.716562, device_3: 22.716787",
    "nic roce send bandwidth": "device_4: 22.739746, device_5: 22.739492, device_3: 22.739464",
    "nic roce write bandwidth": "device_3: 22.718027, device_4: 22.716728, device_5: 22.716581"
  },
  "device_3": {
    "nic roce read bandwidth": "device_2: 22.716768, device_5: 22.716759, device_4: 22.716738",
    "nic roce send bandwidth": "device_2: 22.739170, device_5: 22.739248, device_4: 22.739483",
    "nic roce write bandwidth": "device_2: 22.716377, device_5: 22.716700, device_4: 22.717323"
  },
  "device_4": {
    "nic roce read bandwidth": "device_2: 22.716816, device_3: 22.716747, device_5: 22.716280",
    "nic roce send bandwidth": "device_2: 22.739374, device_3: 22.739355, device_5: 22.739552",
    "nic roce write bandwidth": "device_2: 22.716934, device_5: 22.716484, device_3: 22.717091"
  },
  "device_5": {
    "nic roce read bandwidth": "device_4: 22.717598, device_3: 22.717157, device_2: 22.717579",
    "nic roce send bandwidth": "device_4: 22.739483, device_3: 22.739492, device_2: 22.739336",
    "nic roce write bandwidth": "device_4: 22.716825, device_2: 22.713037, device_3: 22.716856"
  },
  "device_6": {
    "nic roce read bandwidth": "device_0: 22.716681, device_7: 22.716719, device_1: 22.716681",
    "nic roce send bandwidth": "device_0: 22.739630, device_1: 22.739414, device_7: 22.739374",
    "nic roce write bandwidth": "device_0: 22.716446, device_7: 22.718134, device_1: 22.717091"
  },
  "device_7": {
    "nic roce read bandwidth": "device_6: 22.717169, device_1: 22.716700, device_0: 22.717842",
    "nic roce send bandwidth": "device_6: 22.739590, device_1: 22.739199, device_0: 22.739590",
    "nic roce write bandwidth": "device_6: 22.716631, device_1: 22.716846, device_0: 22.716806"
  }
}
- Output of bandwidth diagnosis on the Atlas A3 training product
{
  "device_all": {
    "d2h bandwidth": "356.64",
    "h2d bandwidth": "297.58"
  },
  "device_0": {
    "d2d bandwidth": "1516.66",
    "d2d write bandwidth": "1484.89",
    "p2p bidirectional bandwidth": "X, 366.68, 270.14, 269.81, 270.05, 269.96, 269.74, 269.78, 270.09, 270.03, 269.88, 269.82, 269.93, 269.97, 269.80, 269.80",
    "p2p bidirectional write bandwidth": "X, 343.01, 250.02, 247.81, 248.13, 245.36, 246.59, 247.84, 246.04, 246.09, 248.19, 246.14, 245.33, 246.54, 248.46, 246.87",
    "p2p unidirectional bandwidth": "X, 202.95, 164.73, 164.72, 164.78, 164.74, 164.73, 164.72, 164.77, 164.74, 164.75, 164.71, 164.75, 164.77, 164.75, 164.72",
    "p2p unidirectional write bandwidth": "X, 191.71, 137.20, 137.41, 137.26, 137.21, 137.46, 137.49, 136.72, 137.18, 137.45, 137.40, 137.00, 137.11, 137.46, 137.34"
  },
  "device_1": {
    "d2d bandwidth": "1528.46",
    "d2d write bandwidth": "1470.62",
    "p2p bidirectional bandwidth": "368.80, X, 269.88, 269.87, 269.99, 270.09, 269.81, 269.74, 270.06, 270.02, 269.96, 270.03, 270.13, 269.98, 269.94, 269.92",
    "p2p bidirectional write bandwidth": "340.45, X, 246.08, 247.17, 246.87, 245.46, 245.34, 248.06, 244.03, 243.79, 247.23, 243.96, 243.78, 246.07, 247.84, 247.25",
    "p2p unidirectional bandwidth": "202.96, X, 164.73, 164.74, 164.74, 164.76, 164.73, 164.74, 164.77, 164.77, 164.73, 164.77, 164.76, 164.78, 164.76, 164.73",
    "p2p unidirectional write bandwidth": "191.68, X, 137.08, 137.58, 137.36, 136.93, 137.32, 137.66, 136.75, 136.97, 137.38, 137.24, 136.73, 137.26, 137.54, 137.35"
  }
}
| Parameter | Description |
|---|---|
| Numerical value | Bandwidth or computing power of a specific device. The bandwidth unit is GB/s, and the computing power unit is TFLOPS. |
| X/NA | The value cannot be displayed. |
| FAIL | Execution failed. |
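To show how the values in the table above might be handled by a script, here is a minimal Python sketch that reads a result in the shape of the sample outputs and labels each value as a number, X/NA, or FAIL. It assumes that the data flushing file holds a single JSON document shaped like those samples, which this document does not state explicitly. The function names and the labels "value", "unavailable", "failed", and "unknown" are hypothetical.

```python
import json


def classify(value: str):
    """Map one cell to (kind, number or None) following the table above."""
    text = value.strip()
    if text in ("X", "NA"):
        return ("unavailable", None)   # the value cannot be displayed
    if text == "FAIL":
        return ("failed", None)        # execution failed
    try:
        return ("value", float(text))  # GB/s or TFLOPS
    except ValueError:
        return ("unknown", None)       # anything the table does not cover


def parse_field(raw: str):
    """Split a field into cells.

    P2P fields are comma-separated lists such as "X, 366.68"; NIC fields
    carry a peer label per cell, such as "device_7: 22.716700".
    Each cell becomes (peer label or None, kind, number or None).
    """
    cells = []
    for part in raw.split(","):
        label, sep, rest = part.partition(":")
        if sep:
            cells.append((label.strip(), *classify(rest)))
        else:
            cells.append((None, *classify(part)))
    return cells


def summarize(result_text: str):
    """Parse the whole result into {device: {field: [cells]}}."""
    data = json.loads(result_text)
    return {
        device: {name: parse_field(raw) for name, raw in fields.items()}
        for device, fields in data.items()
    }
```

A caller could read the file from the path listed earlier for the current user and pass its text to `summarize`, then filter for cells whose kind is not "value" to find missing or failed measurements.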