EJ0001 Error Message

Symptom

In the plog, the first error reported by TDT is "[DeviceMsgProcess][tid:1241254] [TsdClient] DeviceMsgProc errcode[EJ0001]."

[ERROR] TDT(685010,all_reduce_test):2023-11-29-11:55:41.334.702 [process_mode_manager.cpp:587][DeviceMsgProcess][tid:685010] [TsdClient] DeviceMsgProc  errcode[EJ0001]
[ERROR] TDT(685010,all_reduce_test):2023-11-29-11:55:41.334.873 [process_mode_manager.cpp:269][WaitRsp][tid:685010] tsd client wait response fail, device response code[1]. unknown device error.
[ERROR] TDT(685010,all_reduce_test):2023-11-29-11:55:41.334.893 [process_mode_manager.cpp:123][OpenProcess][tid:685010] Wait open response from device failed.
[ERROR] TDT(685010,all_reduce_test):2023-11-29-11:55:41.334.897 [tsd_client.cpp:31][TsdOpen][tid:685010] TsdOpen failed, deviceId[4].

The EJ0001 error indicates only that the HCCP process on the device failed to start. To identify the specific cause, examine the device-side errors: in the debug directory, run the /usr/local/Ascend/driver/tools/msnpureport -f command to export the device logs, then run the grep -rn ERROR * | grep HCCP command to find the first HCCP error. The device error usually falls into one of the following two scenarios:

  • The device reports the error message "Create Server failed, ret (61)".
    When EJ0001 is reported in this scenario, the HCCP log generally contains the error message "ra_hdc_server_init(1425) : Create Server failed, ret(61)".
    [ERROR] HCCP(13722,hccp_service.bin):2023-11-29-11:55:41.806.183 [ra_adp.c:1425]tid:13722,ra_hdc_server_init(1425) : Create Server failed, ret(61)
    [ERROR] HCCP(13722,hccp_service.bin):2023-11-29-11:55:41.806.200 [ra_adp.c:1546]tid:13722,hccp_init(1546) : chip_id[0] ra_hdc_server_init failed, ret[-22]
    [ERROR] HCCP(13722,hccp_service.bin):2023-11-29-11:55:41.806.265 [main.c:224]tid:13722,main(224) : hccp init error[-22]
    
  • The device reports the error message "certificate is not yet valid".
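
The export-and-grep step described above can be wrapped in a small helper. The following is a minimal bash sketch, not part of the Ascend tooling: the function name first_hccp_error is hypothetical, and it assumes that every HCCP log line carries a timestamp in the YYYY-MM-DD-HH:MM:SS.mmm.uuu form shown in the samples, so that a plain text sort orders the lines chronologically.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the earliest HCCP ERROR line found under a
# directory of exported device logs.
# Assumption: each line contains a timestamp such as
# 2023-11-29-11:55:41.806.183, which sorts chronologically as plain text.
first_hccp_error() {
    local log_dir="${1:-.}"
    grep -rh 'ERROR' "$log_dir" 2>/dev/null \
        | grep 'HCCP' \
        | awk '{
              # Prefix each line with its timestamp so it can be used as the sort key.
              if (match($0, /[0-9][0-9][0-9][0-9]-[0-9][0-9]-[0-9][0-9]-[0-9:.]+/))
                  print substr($0, RSTART, RLENGTH) "\t" $0
          }' \
        | LC_ALL=C sort \
        | head -n 1 \
        | cut -f2-
}
```

After exporting the device logs with msnpureport -f, calling first_hccp_error . in the export directory prints the first HCCP error, which can then be matched against the two scenarios listed above.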

Possible Cause and Solution (Error Message: "Create Server failed, ret(61)")

If the device reports the error message "Create Server failed, ret(61)", the possible causes are as follows:

  • Operation problem (occurs occasionally): A new training job is started before the previous training job has completely exited.

    Possible Cause:

    Run the following hccn_tool command on the host to check whether an HCCP process still exists on any device:
    for i in {0..7}; do hccn_tool -i $i -process -g ; done

    If an HCCP process is still present, the previous training job has not finished exiting.

    Solution:

    After the training job script is stopped, resource destruction and release begin and take some time to complete. Wait until the HCCP process has exited completely, and then start the new training job.

  • Service script problem (occurs every time): Each time the training script is started, two or more HCCP processes are started on a device.

    Possible Cause:

    Run the following commands to search the logs, and check whether two or more HCCP process startup entries appear within a short interval (usually at the millisecond level):

    cd /root/ascend/log/run/
    grep -rn "hccp init" *

    If such entries appear, as shown in the following figure, the problem occurs because the service script starts two or more HCCP processes on a device.

    Solution:

    Check the service script. The following are some common cases.
    1. An error is reported when hccl_test is executed on a single server with 16 devices. The -p and -n options do not match, so the process is started multiple times.

    2. The PyTorch nproc_per_node parameter must be set to the number of training processes to be started on the node. For example, if only two devices are available but the parameter is set to 8, the training processes are started four times, causing the error. As shown in the following log, the process is started four times.
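
The check for repeated HCCP startups described above can be partly automated. The following bash sketch is illustrative only: the function names are hypothetical, and it assumes that HCCP startup entries use the same "HCCP(<pid>,...):<timestamp>" prefix as the error samples shown earlier. It reports pairs of "hccp init" entries that come from different process IDs within the same second. Point it at the logs of a single device, because each device legitimately runs its own HCCP process.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for spotting repeated HCCP startups in run logs.
# Assumption: matching lines look like
#   [LEVEL] HCCP(<pid>,hccp_service.bin):<YYYY-MM-DD-HH:MM:SS.mmm.uuu> ...

# List "<timestamp> <pid>" for every "hccp init" log line, oldest first.
hccp_init_events() {
    local log_dir="${1:-.}"
    grep -rh 'hccp init' "$log_dir" 2>/dev/null \
        | sed -n 's/.*HCCP(\([0-9][0-9]*\),[^)]*):\([0-9][0-9:.-]*\).*/\2 \1/p' \
        | LC_ALL=C sort
}

# Print consecutive events that share the same second but have different PIDs.
# Events that straddle a second boundary are not caught by this simple check.
hccp_duplicate_starts() {
    hccp_init_events "$1" | awk '
        {
            sec = substr($1, 1, 19)      # YYYY-MM-DD-HH:MM:SS
            if (sec == prev_sec && $2 != prev_pid)
                print prev_line " <-> " $0
            prev_sec = sec; prev_pid = $2; prev_line = $0
        }'
}
```

Any output from hccp_duplicate_starts suggests that the service script launched more than one HCCP process on the device at nearly the same time, which points to the -p/-n or nproc_per_node mismatches listed above.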

Possible Cause and Solution (Error Message: "certificate is not yet valid")

  • Possible Cause:

    If the device reports the error message "certificate is not yet valid", the possible cause is as follows:

    The host clock is incorrect, so the system time is earlier than the start of the TLS certificate's validity period. When the TLS function is enabled, certificate verification fails and the training job cannot be started.

    In the following example, the start time of the TLS certificate validity period is 2023/09/26.

    When the HCCP process is started, it reports that the certificate is invalid and verification fails, so the HCCP process fails to start. The system time at that point is 2019/09/26, which is earlier than the certificate start date.

  • Solution:

    Synchronize the host clock to the correct time. After the system time falls within the certificate validity period, the error no longer occurs.
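
The clock-versus-certificate comparison described above can be checked from the host. The following bash sketch is an illustration under assumptions: the function names are hypothetical, the certificate is assumed to be available as a PEM file whose path the caller supplies, and the wrapper relies on the standard openssl x509 command and GNU date.

```shell
#!/usr/bin/env bash
# Succeeds (exit 0) when the clock reading is earlier than the certificate's
# notBefore time, that is, the "certificate is not yet valid" condition.
# Both arguments are seconds since the Unix epoch.
clock_before_cert_start() {
    local now_epoch="$1" not_before_epoch="$2"
    [ "$now_epoch" -lt "$not_before_epoch" ]
}

# Read notBefore from a PEM certificate and compare it with the system clock.
# The certificate path is an assumption supplied by the caller.
check_cert_start() {
    local cert="$1" not_before
    not_before=$(openssl x509 -in "$cert" -noout -startdate) || return 2
    not_before=${not_before#notBefore=}
    if clock_before_cert_start "$(date +%s)" "$(date -d "$not_before" +%s)"; then
        echo "system time is earlier than the certificate start: $not_before"
        return 0
    fi
    echo "system time is not earlier than the certificate start: $not_before"
    return 1
}
```

With the dates from the example above, a system time of 2019/09/26 compared against a certificate start of 2023/09/26 makes clock_before_cert_start succeed, which corresponds to the reported error. After the clock is synchronized, the same check fails, indicating that this cause has been cleared.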