How Do I Determine the Start and End Time of the Host Process Based on the Device Log?

When locating a fault, you need to preliminarily locate the time when the fault occurs, and then locate the fault based on the detailed logs generated at the time when the fault occurs. In this case, you can search for the ringbuffer keyword in the device log to determine the start and end time of the service process and locate the time when the problem occurs.

Ringbuffer is a piece of shared memory for reporting error information. When a task is started, the create ringbuffer operation is performed. When the task is complete, the delete ringbuffer operation is performed. Therefore, you can check the start and end time of the task based on the ringbuffer log. The detailed operations are as follows:

Ascend EP Standard Form

  1. Log in to the host server.
  2. Use the msnpureport tool to export the device log.
    Run the msnpureport tool in a directory on which you have the read, write, and execute permissions (for example, /var/log/npu/report). The following is an example of the msnpureport tool command. /usr/local/Ascend is the default installation path of the driver package. Replace it with the actual path.
    /usr/local/Ascend/driver/tools/msnpureport -f
  3. In the root directory of the exported system logs on the non-control CPU of the device, query the process creation and destruction logs.

    For example, run the following command to query the log information about the process startup and end on device 0 in the /var/log/npu/report/*/slog/device-os-0/[run|debug]/device-0 directory. The asterisk (*) indicates the timestamp when the log file is created.

    grep -rn ringbuffer

    A log snippet is as follows:

    device-0_20231201062601524.log:23102:[EVENT] TSCH(-1,null):2023-12-02-14:29:25.780.720 17641 (dieid:0,cpuid:0) ringbuffer.c:177 delete_ringbuffer: pid=3299758, vf_id=0
    device-0_20231201062601524.log:23103:[EVENT] TSCH(-1,null):2023-12-02-14:29:25.780.756 17642 (dieid:0,cpuid:0) ringbuffer.c:134 create_ringbuffer: pid=3299758, vf_id=0, buffer_len=10, runtime ver=2, offset=276096, ringbuffer_addr=0x1f7dc00000.
    device-0_20231201062601524.log:23104:[EVENT] TSCH(-1,null):2023-12-02-14:29:25.780.774 17643 (dieid:0,cpuid:0) ringbuffer.c:163 create_ringbuffer: last_Id=21, vf_id=0, pid=3299758
    device-0_20231204114303693.log:23503:[EVENT] TSCH(-1,null):2023-12-04-11:44:15.752.017 47266 (dieid:0,cpuid:0) ringbuffer.c:177 delete_ringbuffer: pid=3299758, vf_id=0

    As shown in the preceding figure:

    • The keyword create_ringbuffer in the log indicates that the process is started. The creation time of the process whose ID is 3299758 on device-0 in the preceding figure is 2023-12-02-14:29:25.

      To avoid repeated creation, a ringbuffer is deleted each time before another ringbuffer is created. Therefore, the generated logs include delete_ringbuffer and create_ringbuffer.

      In the log file, if there are multiple create_ringbuffer records for the same process, the log time that contains ringbuffer_addr indicates the time when the process is started.

    • The keyword delete_ringbuffer in the log indicates that the process is ended. The destroy time of the process whose ID is 3299758 on device-0 in the preceding figure is 2023-12-04-11:44:15.

      In the log, if there are multiple delete_ringbuffer records for the same process, the time of the last delete_ringbuffer record indicates the end time of the process.

Ascend RC Form

  1. Log in to the board environment.
  2. Check device logs.

    For example, run the following command to query the log information about the process startup and end on device 0 in the /var/log/npu/slog/[run|debug]/device-0 directory:

    grep -rn ringbuffer

    A log example is as follows:

    device-0_20231201062601524.log:23102:[EVENT] TSCH(-1,null):2023-12-02-14:29:25.780.720 17641 (dieid:0,cpuid:0) ringbuffer.c:177 delete_ringbuffer: pid=3299758, vf_id=0
    device-0_20231201062601524.log:23103:[EVENT] TSCH(-1,null):2023-12-02-14:29:25.780.756 17642 (dieid:0,cpuid:0) ringbuffer.c:134 create_ringbuffer: pid=3299758, vf_id=0, buffer_len=10, runtime ver=2, offset=276096, ringbuffer_addr=0x1f7dc00000.
    device-0_20231201062601524.log:23104:[EVENT] TSCH(-1,null):2023-12-02-14:29:25.780.774 17643 (dieid:0,cpuid:0) ringbuffer.c:163 create_ringbuffer: last_Id=21, vf_id=0, pid=3299758
    device-0_20231204114303693.log:23503:[EVENT] TSCH(-1,null):2023-12-04-11:44:15.752.017 47266 (dieid:0,cpuid:0) ringbuffer.c:177 delete_ringbuffer: pid=3299758, vf_id=0

    As shown in the preceding figure:

    • The keyword create_ringbuffer in the log indicates the process startup. Therefore, the creation time of the process whose ID is 3299758 on device-0 in the preceding example is 2023-12-02-14:29:25.

      To avoid repeated creation, the existing ring buffer is deleted each time before another ring buffer is created. Therefore, the creation logs include delete_ringbuffer and create_ringbuffer.

      In the log, if multiple create_ringbuffer records are found for the same process, the time of the log record containing ringbuffer_addr is the time when the process was initially started.

    • The keyword delete_ringbuffer in the log indicates the process end. Therefore, the destruction time of the process whose ID is 3299758 on device-0 in the preceding example is 2023-12-04-11:44:15.

      In the log, if multiple delete_ringbuffer records are found for the same process, the time of the last delete_ringbuffer record is the end time of the process.