How Do I Address Network Inference Failure Due to AI Core Operator Execution Timeout?

Symptom

During model inference, a message is displayed indicating that model execution fails, and the following ERROR-level log is generated on the device:

[ERROR] TSCH(-1,null):2020-06-04-10:51:09.520.395 28 (cpuid:0) ai_core_dispatcher.c:1012 bs_done_exception_proc_timeout: slot_id=1,TS_ctrl=0x4,exception_core_list=0x0,current core usage=0x1,AI_CORES_COUNT=2, fault_task=0

Cause Analysis

When an operator execution task on the AI Core times out, the Task Scheduler (TSCH) returns a timeout failure.

The timeout interval for the Atlas 200/300/500 Inference Product is 55s.

The timeout interval for the Atlas Training Series Product is 68s.

Troubleshooting

Perform the following steps to locate the timeout operator:

  1. Check the logs on the device to locate the ID of the failed TSCH component task.

    For details about the device-side log directories and viewing methods, see Log Reference.

    Search the device-side logs for the keyword bs_done_exception_proc_timeout to find the TSCH log message and obtain the fault_task ID (for example, fault_task=0).

    For details about the log message, see Symptom.

  2. Check the logs on the host to locate the name of the faulty operator.

    For details about the host-side log directories and viewing methods, see Log Reference.

    Search the host-side log file for the keywords TaskLaunched and task_id=0, where the task ID (0 in this example) is the one obtained in 1. A matching log message looks like the following.

    [EVENT] RUNTIME(15568,acl_caffe_interp):2020-06-04-10:50:14.522.076 [runtime/feature/src/logger.cc:1014]15570 TaskLaunched:device_id=0, stream_id=514, sq_id=514, task_id=0, kernel_name=test_case/2_16_144_417_248_408_float16/0_Interp_1_0_2_16_144_417_0_0_2_16_248_408.om/Interp_tvmbin, devfunc_name = te_interp_1ead9f4957880f1e_0__kernel0 task_type=AiCoreKernel, task_launched_num=2

    In the preceding log message, the Interp segment of kernel_name identifies the operator that timed out.

    You can perform operator tuning by following the instructions described in Performance Optimization in TIK Mode.
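The two search steps above can be sketched as shell commands. This is a minimal illustration, not the documented procedure: the log file paths (/tmp/device.log, /tmp/host.log) are stand-ins for the actual device-side and host-side log locations described in Log Reference, and the sample log lines are copied from this article.

```shell
# Create stand-in log files containing the sample messages from this article.
cat > /tmp/device.log <<'EOF'
[ERROR] TSCH(-1,null):2020-06-04-10:51:09.520.395 28 (cpuid:0) ai_core_dispatcher.c:1012 bs_done_exception_proc_timeout: slot_id=1,TS_ctrl=0x4,exception_core_list=0x0,current core usage=0x1,AI_CORES_COUNT=2, fault_task=0
EOF
cat > /tmp/host.log <<'EOF'
[EVENT] RUNTIME(15568,acl_caffe_interp):2020-06-04-10:50:14.522.076 [runtime/feature/src/logger.cc:1014]15570 TaskLaunched:device_id=0, stream_id=514, sq_id=514, task_id=0, kernel_name=test_case/2_16_144_417_248_408_float16/0_Interp_1_0_2_16_144_417_0_0_2_16_248_408.om/Interp_tvmbin, devfunc_name = te_interp_1ead9f4957880f1e_0__kernel0 task_type=AiCoreKernel, task_launched_num=2
EOF

# Step 1: find the timed-out task ID in the device-side TSCH logs.
task_id=$(grep 'bs_done_exception_proc_timeout' /tmp/device.log \
          | grep -o 'fault_task=[0-9]*' | cut -d= -f2)
echo "fault_task=${task_id}"

# Step 2: match that task ID against TaskLaunched records in the host-side
# logs and extract the kernel_name field, which contains the operator name.
grep 'TaskLaunched' /tmp/host.log \
  | grep "task_id=${task_id}," \
  | grep -o 'kernel_name=[^,]*'
```

On real systems the logs are typically split across many files, so the single-file grep here would be replaced by a recursive search (grep -r) over the log directory.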