Reported error in D2D copy during operator execution.
Symptom
Collect the log files by referring to Collect Information About Process Interruption. The following uses ${HOME}/err_log_info/ as the example directory in which the collected logs are stored.
The host application log file (${HOME}/err_log_info/log/[run|debug]/plog/plog-pid_*.log) contains the error information about the execution of the SDMA task (D2D copy task). A log example is as follows:
[ERROR] RUNTIME(33549,python3):2024-01-16-06:49:00.516.893 [device_error_proc.cc:1226]122568 ProcessStarsSdmaErrorInfo:[FINAL][FINAL]The error from device(chipId:7, dieId:0), serial number is 1. there is a fftsplus sdma error, sdma channel is 6, sdmaState=0x6, sdmaTslotid=0x5, sdmaCxtid=0x1, sdmaThreadid=0x0, irqStatus=0x420000, cqeStatus=0x150000.
Fault Root Causes
From the log, take cqeStatus=0x150000 and shift the value right by 17 bits (cqeStatus >> 17) to obtain the actual error code, 000Ah. Its meaning is as follows.
| Error Code | Description | Possible Cause |
|---|---|---|
| 000Ah | SDMAA error: COMPDATAERR in SDMAA transfer | The data is abnormal. An error is returned when the HBM is accessed. Generally, there is a high probability that the HBM bit ECC is faulty. |
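
The decoding step above can be sketched in Python. This is a minimal illustration, not part of any vendor tool: the function name `decode_cqe_status` and the constant `CQE_STATUS_SHIFT` are hypothetical, and the 17-bit shift width is inferred from the example values in this section (0x150000 >> 17 equals 0xA).

```python
import re

# Tail of the example plog line shown in the Symptom section.
SAMPLE_LOG = (
    "there is a fftsplus sdma error, sdma channel is 6, sdmaState=0x6, "
    "sdmaTslotid=0x5, sdmaCxtid=0x1, sdmaThreadid=0x0, "
    "irqStatus=0x420000, cqeStatus=0x150000."
)

# Assumption: shift width inferred from the example (0x150000 >> 17 == 0xA).
CQE_STATUS_SHIFT = 17


def decode_cqe_status(log_line):
    """Return the error code encoded in cqeStatus, or None if the field is absent."""
    match = re.search(r"cqeStatus=0x([0-9a-fA-F]+)", log_line)
    if match is None:
        return None
    return int(match.group(1), 16) >> CQE_STATUS_SHIFT


code = decode_cqe_status(SAMPLE_LOG)
print(f"{code:04X}h")  # prints 000Ah
```

The printed value can then be matched against the Error Code column of the table above.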
Solution
Search for the keyword event_id in the device slog file (report/*/slog/dev-os-id/[run|debug]/device-os/device-os_*.log) and obtain its parameter value (the error code). Then look up that code in the Health Management Fault Definition document for the corresponding product and follow the solution given there.
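
The keyword search can be scripted, for example as in the sketch below. The helper `find_keyword_lines` is hypothetical, and the root directory you pass in is an assumption: use whichever directory holds the collected report/ tree. Because the exact layout of an event_id line is not shown here, the sketch returns every matching line unparsed.

```python
from pathlib import Path


def find_keyword_lines(log_root, keyword="event_id", pattern="**/device-os_*.log"):
    """Collect (file, line number, text) for each line that contains the keyword.

    The glob pattern is an assumption: it matches any device-os_*.log file
    below log_root, regardless of the intermediate directory names.
    """
    hits = []
    for path in sorted(Path(log_root).glob(pattern)):
        # errors="replace" keeps the scan going if a log holds undecodable bytes.
        with open(path, errors="replace") as handle:
            for line_no, line in enumerate(handle, start=1):
                if keyword in line:
                    hits.append((str(path), line_no, line.rstrip("\n")))
    return hits
```

For instance, `find_keyword_lines("report")` scans all device-os logs below a local report/ directory. Each returned tuple identifies the file and line to inspect for the error code.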