Host Logs

File Description

Log Name

Naming Constraint

Storage Path

Host OS logs

messages-*?

Collection directory

Host kernel message logs

dmesg

Host system monitoring logs

sysmonitor.log

Host kernel message logs when the system breaks down

vmcore-dmesg.txt

Collecting Host OS Logs

  1. Go to the log storage directory and open the messages file.
    cd /var/log && vi messages
  2. Copy the log entries between the start time and end time of the training or inference job, then create a messages file in the collection directory and paste the entries into it.
    cd Collection_directory/ && vi messages

    Paste the copied log entries. A log example is as follows:

    Aug 13 03:19:24 # A training or inference job starts.
    ...
    Aug 13 04:14:39 # A training or inference job ends.

    Run the :wq command to save the file and exit. The log content varies according to the actual file.
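
The manual copy in steps 1 and 2 can also be done non-interactively. The following is a minimal sketch, not part of the original procedure: extract_window is a hypothetical helper, and it assumes that every entry starts with a "Mon DD HH:MM:SS" timestamp and that the job window stays within one calendar year.

```shell
# Hypothetical helper: print the lines of file $1 whose syslog-style
# timestamp lies between $2 and $3 (both inclusive).
# Assumption: timestamps look like "Aug 13 03:19:24" and share one year.
extract_window() {
    awk -v start="$2" -v end="$3" '
        # Build a fixed-width sortable key: month number, day, time of day.
        function key(mon, day, hms) {
            return sprintf("%02d%02d%s", (index("JanFebMarAprMayJunJulAugSepOctNovDec", mon) + 2) / 3, day, hms)
        }
        BEGIN {
            split(start, s, " "); lo = key(s[1], s[2], s[3])
            split(end, e, " ");   hi = key(e[1], e[2], e[3])
        }
        { k = key($1, $2, $3) }
        k >= lo && k <= hi
    ' "$1"
}
```

For the example window above, the call could look like this:

    extract_window /var/log/messages 'Aug 13 03:19:24' 'Aug 13 04:14:39' > Collection_directory/messages

Lines without a leading timestamp (for example, continuation lines) are skipped by this sketch, so check the result against the original file if complete entries matter.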

Collecting Host Kernel Message Logs

Run the following command to collect the latest dmesg log and save it in the collection directory. The command keeps at most the last 100,000 lines.
dmesg -T | tail -n 100000 > Collection_directory/dmesg

A log example is as follows:

[Fri Aug 30 16:42:49 2024] Log printing
…
[Fri Aug 30 16:42:49 2024] Log printing
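
If the collection directory might not exist yet, the redirection in the dmesg command fails before anything is written. The sketch below wraps the same tail step in a hypothetical helper, save_tail, which creates the directory first. The helper name and its argument order are assumptions made for illustration.

```shell
# Hypothetical helper: keep the last $2 lines of standard input in file $1,
# creating the parent directory of $1 if it does not exist yet.
save_tail() {
    mkdir -p "$(dirname "$1")" && tail -n "$2" > "$1"
}
```

It would be used in place of the plain redirection:

    dmesg -T | save_tail Collection_directory/dmesg 100000

On hosts where the kernel.dmesg_restrict setting is enabled, dmesg itself may need to be run as root.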

Collecting Host System Monitoring Logs

Copy the sysmonitor.log file to the collection directory.
cp -r /var/log/sysmonitor.log Collection_directory/

A log example is as follows:

2024-08-27T19:54:48.242959+00:00|info|sysmonitor[xxxxx]: Log printing
…
2024-08-27T19:54:48.343493+00:00|info|sysmonitor[xxxxx]: Log printing
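
A plain cp stops with an error if sysmonitor.log is absent on the host. As a hedged sketch, the hypothetical helper copy_log below checks that the source is readable first and prints a warning otherwise, so a collection script can continue with the remaining logs.

```shell
# Hypothetical helper: copy log file $1 into collection directory $2.
# Returns 1 with a warning on stderr when the source cannot be read.
copy_log() {
    src=$1
    dest_dir=$2
    if [ ! -r "$src" ]; then
        echo "warning: $src not found or not readable" >&2
        return 1
    fi
    # -p keeps the original modification time of the log file.
    mkdir -p "$dest_dir" && cp -p "$src" "$dest_dir/"
}
```

A possible call:

    copy_log /var/log/sysmonitor.log Collection_directory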

Collecting Host Kernel Message Logs when the System Breaks Down

These logs are the kernel messages that the host saves when the system breaks down. To collect them:

Copy the vmcore-dmesg.txt file to the collection directory.
cp -r /var/crash/ Collection_directory/

A log example is as follows:

[292.448078] Log printing
…
[292.448080] Log printing
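
Before copying, it can help to see which vmcore-dmesg.txt files actually exist. The sketch below assumes that crash dumps are stored in subdirectories of the crash directory, which is a common kdump layout but not something this document states. list_crash_dmesg is a hypothetical helper name.

```shell
# Hypothetical helper: list every vmcore-dmesg.txt below crash directory $1.
# Prints nothing if the directory is missing or holds no such file.
list_crash_dmesg() {
    find "$1" -type f -name 'vmcore-dmesg.txt' 2>/dev/null | sort
}
```

A possible call:

    list_crash_dmesg /var/crash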

Collecting dmidecode Logs on the Host

The host-side dmidecode logs contain DMI hardware information.

Run the following command to collect them:
dmidecode > dmidecode.txt
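
After the logs have been collected, a quick check for missing or empty files can catch a failed step early. This is a minimal sketch under assumptions: missing_files is a hypothetical helper, and the file names passed to it are taken from the commands in this section, so adjust them to the files you actually expect.

```shell
# Hypothetical helper: given directory $1 and a list of file names,
# print each name that is missing or empty in that directory.
missing_files() {
    dir=$1
    shift
    for name in "$@"; do
        [ -s "$dir/$name" ] || echo "$name"
    done
}
```

A possible call:

    missing_files Collection_directory messages dmesg sysmonitor.log

No output means that every listed file exists and is not empty.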