No Context Is Available Due to an Error in Invoking the SetDevice Interface

Symptom

Collect the log file by referring to Collect Information About Process Interruption. The following uses ${HOME}/err_log_info/ as an example of directory for storing collected logs.

The host application log file (${HOME}/err_log_info/log/[run|debug]/plog/plog-pid_*.log) contains the keywords ctx is NULL and context pointer null. A log example is as follows:

104069:[ERROR]RUNTIME(2977549,test_incre):2024-02-21-10:24:27.965.879[api_impl.cc:5544]2978321 CtxGetSysParamOpt:report error module_type=3,
module_name=EE8888
104070:[ERROR]RUNTIME(2977549,test_incre):2024-02-21-10:24:27.965.886[api_impl.cc:5544]2978321 CtxGetSysParamOpt:ctx is null!
104072:[ERROR]RUNTIME(2977549,test_incre):2024-02-21-10:24:27.966.091[api_c.cc:5200]2978321 rtCtxGetSysParamOpt:ErrCode=107002, desc=[context pointer null],InnerCode=0x7070001
104074:[ERROR]RUNTIME(2977549,test_incre):2024-02-21-10:24:27.966.116[error_message_manage.cc:48]2978321 FuncErrorReason:rtCtxGetSysParamOpt execute
failed, reason=[context pointer null]
104080:[ERROR]RUNTIME(2977549,test_incre):2024-02-21-10:24:27.966.770[api_impl.cc:5553]2978321 CtxGetOverflowAddr:report error module_type=3,
module_name=EE8888

Fault Root Causes

An exception occurs when the interface related to the SetDevice operation is invoked. As a result, no context is available.

Corresponding interface of AscendCL: aclrtSetDevice

Corresponding interface of runtime: rtSetDevice/rtSetDeviceEx

Solution

Run the export ASCEND_GLOBAL_LOG_LEVEL=1 command to set the log level to info, execute the service again, collect logs, search for the keyword SetDevice in the plog file, check whether the following incorrect SetDevice invoking mode exists, and optimize the code logic based on the check scenario.
  • If SetDevice is found, check the following code logic:
    • Check whether the device in the operating environment is abnormal. If yes, the context cannot be created.

      Error Scenario Example: The SetDevice operation is performed, but no device exists in the operating environment.

      [root@localhost plog]# grep -rn "SetDevice"
      plog-27441_20240221060831113.log:2210:[INFO] ASCENDCL(28980,python3):2024-02-21-06:08:31.254.432 [device.cpp:148]28980 aclrtSetDevice: start to execute aclrtSetDevice, deviceId = 0
      plog-27441_20240221060831113.log:2231:[INFO] RUNTIME(28980,python3):2024-02-21-06:08:31.254.750 [api_c.cc:1798] 28980 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:2235:[WARNING] ASCENDCL(28980,python3):2024-02-21-06:08:31.254.818 [device.cpp:157]28980 aclrtSetDevice: update platform info with device failed, deviceId = 0
      plog-27441_20240221060831113.log:2236:[INFO] ASCENDCL(28980,python3):2024-02-21-06:08:31.254.828 [device.cpp:160]28980 aclrtSetDevice: successfully execute aclrtSetDevice, deviceId = 0
      plog-27441_20240221060831113.log:6861:[INFO] RUNTIME(28980,python3):2024-02-21-06:08:31.529.819 [api_c.cc:1798] 29157 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:11191:[INFO] RUNTIME(28980,python3):2024-02-21-06:08:40.390.084 [api_c.cc:1798] 29157 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:36695:[INFO] RUNTIME(28980,python3):2024-02-21-06:08:58.721.291 [api_c.cc:1798] 28980 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:36738:[INFO] RUNTIME(28980,python3):2024-02-21-06:08:58.729.533 [api_c.cc:1798] 28980 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:36781:[INFO] RUNTIME(28980,python3):2024-02-21-06:08:58.736.987 [api_c.cc:1798] 28980 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:36796:[INFO] RUNTIME(28980,python3):2024-02-21-06:08:58.754.905 [api_c.cc:1798] 28980 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:561:[INFO] RUNTIME(30104,host_cpu_executor):2024-02-21-06:08:41.913.634 [api_c.cc:1798] 30104 rtSetDevice: There is no devId, do nothing.
      plog-27441_20240221060831113.log:9522:[INFO] HCCL(30104,host_cpu_executor):2024-02-21-06:08:55.769.345 [hccl_impl_base.cc:2644] [30104][SetDevice] entry
      plog-27441_20240221060831113.log:9526:[INFO] HCCL(30104,host_cpu_executor):2024-02-21-06:08:55.769.494 [hccl_impl_base.cc:2622] [31143][SetDeviceThread]ctx[(nil)]
      plog-27441_20240221060831113.log:9527:[INFO] HCCL(30104,host_cpu_executor):2024-02-21-06:08:55.769.497 [hccl_impl_base.cc:2637] [31143][SetDeviceThread]exit
      plog-27441_20240221060831113.log:9528:[INFO] HCCL(30104,host_cpu_executor):2024-02-21-06:08:55.769.527 [hccl_impl_base.cc:2657] [30104][SetDevice]exit
      [root@localhost plog]# grep -rn "ERROR"
      plog-27441_20240221060831113.log:36899:[ERROR] RUNTIME(28980,python3:2024-02-21-06:08:59.095.527) [api_impl.cc:1038]28980 GetMaxStreamAndTask:[FINAL][FINAL]report error module_type=3, module_name=EE8888
      plog-27441_20240221060831113.log:36900:[ERROR] RUNTIME(28980,python3:2024-02-21-06:08:59.095.531) [api_impl.cc:1038]28980 GetMaxStreamAndTask:[FINAL][FINAL]ctx is NULL!
      plog-27441_20240221060831113.log:36902:[ERROR] RUNTIME(28980,python3:2024-02-21-06:08:59.095.625) [logger.cc:417]28980 GetMaxStreamAndTask:[FINAL][FINAL]GetMax stream and task failed, streamType=0.
      plog-27441_20240221060831113.log:36903:[ERROR] RUNTIME(28980,python3:2024-02-21-06:08:59.095.668) [api_c.cc:850]28980 rtGetMaxStreamAndTask:[FINAL][FINAL]ErrCode=107002, desc[contextpointer null], InnerCode=0x7070001
      plog-27441_20240221060831113.log:36904:[ERROR] RUNTIME(28980,python3:2024-02-21-06:08:59.095.675) [error_message_manage.cc:48]28980 FuncErrorReason:[FINAL][FINAL]report error module_name=EE1001
      plog-27441_20240221060831113.log:36905:[ERROR] RUNTIME(28980,python3:2024-02-21-06:08:59.095.683) [error_message_manage.cc:48]28980 FuncErrorReason:[FINAL][FINAL]rtGetMaxStreanAndTask execute failed, reason=[context pointer null]
    • In the multi-thread scenario, check whether the SetDevice operation thread is the same as the thread that reports the error. Check whether the SetDevice operation is performed only for the main thread (thread ID: 2977549) and not for the sub-thread (thread ID: 2978321). If yes, no context is available in the subthread.

      Error Scenario Example: SetDevice is performed only for the main thread and not for other thread. As a result, an error is reported.

      [root@localhost plog]# grep -rn "ERROR"
      104069:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.965.879  [api_impl.cc:5544]2978321
      CtxGetSysParamOpt:report error module_type=3, module_name=EE8888
      104070:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.965.886 [api_impl.cc:5544]2978321 CtxGetSysParamOpt:ctx is NULL!
      104072:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.091 [api_c.cc:5200]2978321 rtCtxGetSysParamOpt:ErrCode=107002, desc[context pointer null], InnerCode=0x7070001
      104073:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.104 [error_message_manage.cc:48]2978321 FuncErrorReason:report error module_name=EE1001
      104074:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.116 [error_message_manage.cc:48]2978321 FuncErrorReason:rtCtxGetSysParamOpt execute failed, reason=[context pointer null]
      104080:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.770  [api_impl.cc:5553]2978321
      CtxGetOverflowAddr:report error module_type=3, module_name=EE8888
      104081:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.776  [api_impl.cc:5553]2978321 CtxGetOverflowAddr:ctx is NULL!
      104083:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.832 [api_c.cc:5210]2978321 rtCtxGetOverflowAddr:ErrCode=107002, desc[context pointer null], InnerCode=0x7070001
      104084:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.840 [error_message_manage.cc:48]2978321 FuncErrorReason:report error module_name=EE1001
      104085:[ERROR] RUNTIME(2977549, test_incre):2024-03-21-10:24:27.966.847 [error_message_manage.cc:48]2978321 FuncErrorReason:rtCtxGetOverflowAddr execute failed, reason=[context pointer null]
      104087:[ERROR] OP(2977549, test_incre):2024-03-21-10:24:27.966.882 [nnopbase_executor.cpp:73][NNOP][NnopbaseSetOverFlowAddr][2978321] errno[361001] Assert ((rtCtxGetOverflowAddr(&addr)) == 0) failed
      104089:[ERROR] OP(2977549, test_incre):2024-03-21-10:24:27.968.636 [nnopbase_executor.cpp:216][NNOP][NnopbaseSetExecutorSetGlobalConfig][2978321]
      errno[361001] Check NnopbaseSetOverFlowAddr(g_nnopbaseSysCfgParams.overflowAddr) failed
      104091:[ERROR] OP(2977549, test_incre):2024-03-21-10:24:27.968.800 [nnopbase_api.cpp:44][NNOP][NnopbaseInit][2978321] errno[361001] Check
      NnopbaseExecutorSetGlobalConfig() failed
      104093:[ERROR] OP(2977549, test_incre):2024-03-21-10:24:27.968.822 [nnopbase_api.cpp:53][NNOP][NnopbaseCreateExecutorSpace][2978321]  errno[361001] Assert ((NnopbaseInit()) == 0) failed
      104095:[ERROR] OP(2977549, test_incre):2024-03-21-10:24:27.968.841 [nnopbase_api.cpp:19][NNOP][NnopbaseOpLogE][2978321] errno[361001] Check
      NnopbaseCreateExecutorSpace(&executorSpace) failed
    • Check whether SetDevice is invoked after an error is reported. If yes, no context is available.

      Error Scenario Example: An error is reported because the SetDevice interface is not invoked before memory application.

      [ERROR] RUNTIME(44525,python) :2024-01-22-04:26:23.625.738 [api_impl.cc:1401]45179 DevMalloc: [DUMP] [DEFAULT] report error module_type=3, module_name=EE8888
      [ERROR] RUNTIME(44525,python) :2024-01-22-04:26:23.625.743 [api_impl.cc:1401]45179 DevMalloc: [DUMP] [DEFAULT] ctx is NULL!
      [INFO] GE(44525,python) :2024-01-22-04:26:23.625.761 [error_manager.cc:296]45179 ReportInterErrMessage:report error message, error_code:EE8888, work_stream_id:4452545179
      [ERROR] RUNTIME(44525,python) :2024-01-22-04:26:23.625.784 [logger.cc:581]45179 DevMalloc:[DUMP][DEFAULT]Device malloc failed, size=9(Byte), type=2.
      [ERROR] RUNTIME(44525,python) :2024-01-22-04:26:23.625.805 [api_c.cc:1173]45179 rtMalloc:[DUMP][DEFAULT]ErrCode=107002, desc=[context pointer null], InnerCode=0x7070001
      [ERROR] RUNTIME(44525,python) :2024-01-22-04:26:23.625.811 [error_message_manage.cc:48]45179 FuncErrorReason:[DUMP][DEFAULT] report error module_name=EE1001
      [ERROR]RUNTIME(44525,python) :2024-01-22-04:26:23.625.816 [error_message_manage.cc:48]45179 FuncErrorReason:[DUMP][DEFAULT] rtMallocexecute failed, reason=[context pointer null]
  • If SetDevice is not found, the SetDevice operation is not performed. As a result, no context is available.