您可以按如下步骤定位问题,若无法解决问题,再联系技术支持。您可以获取日志后单击Link联系技术支持。
准备阶段,需收集CANN日志文件。如何CANN日志文件,请参见收集进程卡住问题信息。收集的日志所存放的目录,下文以${HOME}/err_log_info/为例。
# 先执行gdb命令,pid表示卡住的用户进程ID,请根据实际情况替换 gdb -p pid # 再查看调用栈 (gdb)bt
CANN内部函数来源于CANN软件安装目录下的库文件,CANN软件默认安装目录为“usr/local/Ascend”。
[INFO][pid:2241703][awatchdog_monitor.c:145](tid:2245407) watchdog timeout, dogId 65584, timeout : 3s
若由于算子编译时长问题导致进程超时退出,或由上层业务框架终止进程,这时可以通过trace日志确认是否存在算子编译时间太长导致进程卡住:
# 在trace日志目录下,查找记录算子编译信息的日志文件 find <trace日志目录> -name "schedule_tracer_FE_Global_Trace.txt" find <trace日志目录> -name "schedule_tracer_FE_Statistics_Trace.txt" find <trace日志目录> -name "schedule_tracer_FE_CompileTd_*.txt"
trace日志中的关键字说明:
该日志中存在“Compile process status.”关键字表示算子在编译中,存在“Finish subgraph compile”关键字表示算子编译结束。若某个算子停留在“Compile process status.”状态,还需要结合schedule_tracer_FE_CompileTd_*.txt日志查看该算子具体处理哪个处理阶段。
日志片段示例如下:
atrace/trace_160081_160353_20240614161350257811/schedule_event_160353_20240614161504656402/schedule_tracer_FE_Global_Trace.txt 2024-06-14 16:13:53.548.098 Begin to initialize TeFusion. 2024-06-14 16:13:53.549.392 Dfx manager has been initialized. 2024-06-14 16:13:58.854.521 CannKb has been initialized. 2024-06-14 16:13:58.890.273 Multi process has been checked. 2024-06-14 16:13:58.892.068 Cache manager has been initialized. 2024-06-14 16:13:59.785.505 TbeCompiler has been initialized. 2024-06-14 16:14:00.399.061 Parallel Compilation has been initialized. 2024-06-14 16:14:00.399.368 PythonApiCall has been initialized. 2024-06-14 16:14:00.400.672 TeFusion has been initialized successfully. 2024-06-14 16:14:38.536.555 Compile process status. ThreadId:139857490806528|Total task cout:1|Finished task cout:1|Waiting task cout:0 2024-06-14 16:14:38.536.718 ThreadId:139857490806528|SingleOp:Assign_2|Task wait second:12 2024-06-14 16:14:38.536.753 Finish subgraph compile. ThreadId:139857490806528|Total task cout:1 2024-06-14 16:15:19.548.089 Compile process status. ThreadId:139857456973568|Total task cout:6|Finished task cout:2|Waiting task cout:4 2024-06-14 16:15:19.548.229 ThreadId:139857456973568|SingleOp:Sub_18|Task wait second:14 2024-06-14 16:15:40.628.425 Compile process status. ThreadId:139857456973568|Total task cout:6|Finished task cout:4|Waiting task cout:2 2024-06-14 16:15:40.628.583 ThreadId:139857456973568|SingleOp:GatherV2_10|Task wait second:6 2024-06-14 16:16:01.959.390 Compile process status. ThreadId:139857456973568|Total task cout:6|Finished task cout:6|Waiting task cout:0 2024-06-14 16:16:01.959.541 ThreadId:139857456973568|SingleOp:Pow_16|Task wait second:7 2024-06-14 16:16:01.959.579 Finish subgraph compile. ThreadId:139857456973568|Total task cout:6 2024-06-14 16:16:32.408.784 Compile process status. ThreadId:139857456973568|Total task cout:2|Finished task cout:2|Waiting task cout:0 2024-06-14 16:16:32.408.943 ThreadId:139857456973568|SingleOp:Pow_2|Task wait second:12 2024-06-14 16:16:32.409.027 Finish subgraph compile. ThreadId:139857456973568|Total task cout:2 2024-06-14 16:16:43.550.284 PreBuildManager has been finalized. 2024-06-14 16:16:43.550.286 Begin to finalize TeFusion. 2024-06-14 16:16:43.559.052 ParallelCompilation has been finalized. 2024-06-14 16:16:43.591.007 Files under kernel meta dir and kernel meta temp dir has been removed. 2024-06-14 16:16:43.808.544 CannKb has been finalized. 2024-06-14 16:16:43.808.737 Python object has been reset. 2024-06-14 16:16:43.823.564 TeFusion has been finalized successfully.
一般来说,若算子的状态停留在“init disk cache”状态,表示算子编译卡住,若进入“reuset task cache”、“dispatch online compile”、“reuse disk cache”、“reuse binary”状态,表明算子编译正常进行中。
日志片段示例:
cat atrace/trace_160081_160353_20240614161350257811/schedule_event_160353_20240614161504666001/schedule_tracer_FE_CompileTd_8.txt 2024-06-14 16:14:35.340.781 Compile process detail:Thread Id:139857456973568|Op Id:10|Op Type:GatherV2|get mutex lock 2024-06-14 16:14:35.340.846 Compile process detail:Thread Id:139857456973568|Op Id:10|Op Type:GatherV2|get gil lock 2024-06-14 16:14:35.340.876 Compile process detail:Thread Id:139857456973568|Op Id:10|Op Type:GatherV2|init disk cache 2024-06-14 16:14:35.346.571 Compile process detail:Thread Id:139857456973568|Op Id:10|Op Type:GatherV2|dispatch online compile ...... 2024-06-14 16:15:52.991.561 Compile process detail:Thread Id:139857456973568|Op Id:4|Op Type:Xdivy|get mutex lock 2024-06-14 16:15:52.991.674 Compile process detail:Thread Id:139857456973568|Op Id:4|Op Type:Xdivy|get gil lock 2024-06-14 16:15:52.991.706 Compile process detail:Thread Id:139857456973568|Op Id:4|Op Type:Xdivy|init disk cache 2024-06-14 16:15:52.996.765 Compile process detail:Thread Id:139857456973568|Op Id:4|Op Type:Xdivy|reuse binary
算子编译各阶段的状态一般包括:
atrace/trace_160081_160353_20240614161350257811/schedule_event_160353_20240614161504656977/schedule_tracer_FE_Statistics_Trace.txt 2024-06-14 16:16:43.550.032 Disk cache statistics:match times:9|reuse fail times:9|copy times:8|copy success times:7|copy fail times:1 2024-06-14 16:16:43.550.043 Disk cache statistics:cache not existed times:9 2024-06-14 16:16:43.550.044 Online compile statistics:compile task submit times:7 2024-06-14 16:16:43.550.044 Binary reuse statistics:match times:9|reused times:3|reuse fail times:1|reuse check fail times:7 2024-06-14 16:16:43.550.044 Binary reuse statistics:simple key mismatch times:1
栈信息片段示例如下:
栈信息片段示例如下:
trace日志(schedule_tracer_*.txt文件)报错示例如下:
[ERROR] TSCH(-1,null):2024-02-08-01:41:14.977.406 3986 (dieid:0,cpuid:0) stars_timewheel.c:90 stars_wait_sqe_is_timeout: Timeout: sq_pid=80029, sq_id=3, is_need_process=1, exe_time=16, time_out=10, stream_id=3, task_id=47, cur_head=47, wait_sqe_head=47. [ERROR] TSCH(-1,null):2024-02-08-01:41:14.977.431 3987 (dieid:0,cpuid:0) stars_interrupt.c:1221 stars_proc_wait_timeout: wait task timeout, rtsq_id=3, type=5, sq_pid=80029
日志示例及解析如下:
################################################################ stream_id=3, rtsq_id=3, sq_fsm=0x9, cur_wait_time=16, head=47, tail=49, is_model=0, model_id=65535, exe_times=65535, sq_en=1, swap_in=0, cq_head=10, cq_tail=10, pid=80029, vf_id=0, is_mc2_sq=0 stream_id=3, overflow_en=0, ge_op_tag=4294967295, has_overflow=0 swapbuff[0] 0x 59080000 0000001f 000d001f 00030201 00000000 0001389d 010007ff 00000000 swapbuff[8] 0x 3e5c8000 00000007 000107ff 00000000 00000000 0000d555 00000052 00000000 sqe[0] 0x 00004003 00278003 0000ffff 00fe0000 00000000 00000000 00000000 00000000 sqe[8] 0x 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 pos=39: stream_id=32771, task_id=39, sqe_type=3, post_p=0 sqe[0] 0x 00010000 00280003 00001000 00fe0000 40400000 00001241 00000000 00000004 sqe[8] 0x c00197c0 00001240 8000a000 00001241 00000000 00000000 00000000 00000000 pos=40: stream_id=3, task_id=40, sqe_type=0, FFTS sqe[0] 0x 00000004 00290003 00000008 00fe0000 00000000 00000000 00000000 00000000 sqe[8] 0x 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 pos=41: stream_id=3, task_id=41, sqe_type=4, Event Record, event_id=8, isSync=0 sqe[0] 0x 00010000 002a0003 00001000 00fe0000 40400000 00001241 00000000 00000004 sqe[8] 0x c00197c0 00001240 8000a800 00001241 00000000 00000000 00000000 00000000 pos=42: stream_id=3, task_id=42, sqe_type=0, FFTS sqe[0] 0x 00000005 002b0003 00000009 00ff0000 00000000 0000000a 00000000 00000000 sqe[8] 0x 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 pos=43: stream_id=3, task_id=43, sqe_type=5, Event Wait, event_id=9 sqe[0] 0x 00010000 002c0003 00001000 00fe0000 40400000 00001241 00000000 00000004 sqe[8] 0x c00197c0 00001240 8000b000 00001241 00000000 00000000 00000000 00000000 pos=44: stream_id=3, task_id=44, sqe_type=0, FFTS sqe[0] 0x 00010000 002d0003 00001000 00fe0000 40400000 00001241 00000000 00000004 sqe[8] 0x c00197c0 00001240 8000b400 00001241 00000000 00000000 00000000 00000000 pos=45: stream_id=3, task_id=45, sqe_type=0, FFTS sqe[0] 0x 00010000 002e0003 00001000 00fe0000 40400000 00001241 00000000 00000004 sqe[8] 0x c00197c0 00001240 8000b800 00001241 00000000 00000000 00000000 00000000 pos=46: stream_id=3, task_id=46, sqe_type=0, FFTS sqe[0] 0x 00000005 002f0003 0000000b 00ff0000 00000000 0000000a 00000000 00000000 sqe[8] 0x 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 pos=47: stream_id=3, task_id=47, sqe_type=5, Event Wait, event_id=1 sqe[0] 0x 00004003 00308003 0000ffff 00fe0000 00000000 00000000 00000000 00000000 sqe[8] 0x 00000000 00000000 00000000 00000000 00000000 00000000 00000000 00000000 pos=48: stream_id=32771, task_id=48, sqe_type=3, post_p=0 ################################################################
在trace日志(schedule_tracer_*.txt文件)中通过event_id,例如执行grep -rn "event_id=1"命令搜索关键字,找到下发Event Record任务的流,该流的record任务没执行,导致另一条流的event wait任务卡住。定位到任务卡住的原因后,需联系技术支持进一步排查软件问题或用户调用逻辑问题。
日志示例及解析如下:通过event_id=1,找到对应的event_record任务所在的stream 2,通过日志中的信息发现stream 2上的任务执行停留在任务0(即pos[0]任务),导致任务卡住,这是因为该流的event_record任务(即pos[1])没有执行,进而导致其它流卡住。