asys Tool Usage Guide (EP Mode)
Prerequisite
You have installed the toolkit package in the CANN operating environment. For details, see CANN Software Installation Guide.
Before using the asys tool, log in to the environment as the installation user and run the source ${install_path}/latest/bin/setenv.bash command to set environment variables. You can then enter the asys command directly, without the full path of the asys tool (that is, without python3 ${install_path}/latest/toolkit/tools/ascend_system_advisor/asys/asys.py). ${install_path} indicates the installation directory of the software package, for example, /usr/local/Ascend/ascend-toolkit.
Fault Information Collection
- Command:
asys collect --task_dir=path1 --tar="True" --output=path2
- Parameters:
- task_dir (optional): It specifies the directory for collecting operator compilation files (including .o and .json files) and dump files (including GE dump graphs, TF Adapter dump graphs, and exception dump files).
If task_dir is specified, the tool preferentially searches for L0 exception dump files. If the files exist, the tool collects the L0 exception dump files and operator compilation files. If the files do not exist, the tool collects operator compilation files and dump files from the path specified by task_dir.
If task_dir is not specified, the tool searches for L0 exception dump files. If the files exist, the tool collects the L0 exception dump files and operator compilation files.
Ensure that the ${NPU_COLLECT_PATH} environment variable is not configured. Otherwise, the tool only collects L1 exception dump information but not L0 exception dump information.
- tar (optional): It controls whether to compress the result output directory of the asys tool into a *.tar.gz file. If this parameter is set to T or True, the result output directory is compressed into a *.tar.gz file and the original directory is not retained. If this parameter is set to F or False, the directory is not compressed into a *.tar.gz file. By default, the directory is not compressed. The parameter value is case insensitive.
- output (optional): Its value is used as the prefix of the result output directory of the asys tool. That is, the final output directory is {output}/asys_output_timestamp. If the command does not contain the output parameter, the output is stored in the command execution directory. If the value of output is empty or invalid, the specified directory does not have the write permission, or the directory fails to be created, the asys tool exits and reports an error.
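When the collection is driven from a script, the command line can be assembled from the three optional parameters described above. The following is a minimal Python sketch. The helper name build_collect_command and the example paths are hypothetical and are not part of the asys tool; the helper only builds the argument list and does not run anything.

```python
from typing import List, Optional


def build_collect_command(task_dir: Optional[str] = None,
                          tar: bool = False,
                          output: Optional[str] = None) -> List[str]:
    """Build the argument list for `asys collect` (hypothetical helper)."""
    cmd = ["asys", "collect"]
    if task_dir is not None:
        cmd.append(f"--task_dir={task_dir}")
    if tar:
        # The guide accepts T or True (case insensitive); "True" is used here.
        cmd.append("--tar=True")
    if output is not None:
        if not output:
            # The guide states that an empty output value makes asys exit with an error.
            raise ValueError("output must not be empty")
        cmd.append(f"--output={output}")
    return cmd


# Example paths are placeholders chosen for illustration only.
print(build_collect_command(task_dir="/tmp/task", tar=True, output="/tmp/result"))
```

The returned list can be passed to subprocess.run on a host where the asys command is available in the PATH.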
After the command is executed, the fault information files in {output}/asys_output_timestamp are as follows:
├── asys_output_timestamp
    ├── software_info.txt  // Installation package version, environment variables, dependent software, and system information
    ├── hardware_info.txt  // Collecting host and device hardware information. The host information includes the kernel version, CPU model, memory usage, and disk usage. The device information includes the number of devices and the number of AI CPUs.
    ├── status_info.txt  // Collecting device information, including the chip model, CPU usage, and AI Core usage.
    ├── health_result.txt  // Collecting device health information, including fault codes and fault information.
    └── dfx
        ├── bbox  // Black box information on the device side
        ├── data-dump  // Generating L0 exception dump files when an AI Core error occurs in L0.
        ├── graph  // Indicating the dump graph information, including the dump graphs of the GE and TF Adapter. L0 exception dump does not collect this information.
        ├── ops  // Operator compilation information, including the operator compilation *.o and *.json files and custom operator configuration information and more.
        ├── stackcore  // Information about the core file generated when a core dump is triggered by an error.
        ├── atrace  // Indicating the trace flush information, including the plaintext file parsed from the trace binary file.
        └── log
            ├── device
            │   ├── dev-os-{id}
            │   ├── firmware  // Logs generated by the firmware
            │   ├── slogd  // Maintenance and test logs of log-related processes
            │   ├── application  // The non-event application logs generated by the service process.
            │   └── system  // Logs generated by resident process
            └── host
                ├── message  // message/syslog
                ├── install  // Logs about the history installation of the package
                ├── cann  // Host-side application logs
                └── driver  // Driver logs on the Host
Users can customize information about versions of third-party dependencies collected in the software_info.txt file as required.
In the ascend_system_advisor/conf/dependent_package.csv file in the asys tool directory, add or delete configuration items. Each line corresponds to a configuration item. Use commas (,) to separate dependency item names and query commands. There should be no space after each comma. The following is a sample code snippet:
make,make --version
cmake,cmake --version
unzip,unzip -v
zlib1g,dpkg -l zlib1g| grep zlib1g| grep ii
zlib1g-dev,dpkg -l zlib1g-dev| grep zlib1g-dev| grep ii
libsqlite3-dev,dpkg -l libsqlite3-dev| grep libsqlite3-dev| grep ii
openssl,dpkg -l openssl| grep openssl| grep ii
libssl-dev,dpkg -l libssl-dev| grep libssl-dev| grep ii
libffi-dev,dpkg -l libffi-dev| grep libffi-dev| grep ii
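The configuration format described above (one item per line, a dependency name and a query command separated by a comma, no space after the comma) can be checked before the file is edited in place. The following Python sketch shows one way to do so. It is not how the asys tool itself reads the file; the function name parse_dependency_config is hypothetical, and splitting on the first comma only is an assumption made so that commands containing commas stay intact.

```python
from typing import Dict


def parse_dependency_config(text: str) -> Dict[str, str]:
    """Map each dependency name to its query command (hypothetical helper)."""
    items = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Assumption: only the first comma separates the name from the command.
        name, sep, command = line.partition(",")
        if not sep or not name or not command:
            raise ValueError(f"malformed configuration item: {line!r}")
        if command.startswith(" "):
            # The guide requires that no space follows the comma.
            raise ValueError(f"space after comma in: {line!r}")
        items[name] = command
    return items


sample = """make,make --version
cmake,cmake --version
zlib1g,dpkg -l zlib1g| grep zlib1g| grep ii
"""
for name, command in parse_dependency_config(sample).items():
    print(f"{name}: {command}")
```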
Service Re-run and Fault Information Collection
By default, the function of collecting operator compilation files, GE dump graphs, and TF Adapter dump graphs is enabled during service re-run.
- When you run the asys launch command, the following environment variables are automatically enabled to temporarily store the collected information. After the asys launch command is executed, the environment variables are automatically disabled. If you manually set these environment variables before running the asys launch command, the settings of the environment variables will be overwritten and do not take effect. If the script of the re-run task includes these environment variables, the environment variables set by the asys tool may be overwritten and do not take effect. As a result, the corresponding information cannot be collected.
- NPU_COLLECT_PATH: It specifies the path for storing dump graphs and operator compilation .o files. When this variable is set, only L1 exception dump information is collected. For details about the description and restrictions, see NPU_COLLECT_PATH in Environment Variables.
- ASCEND_PROCESS_LOG_PATH: It specifies the path for storing host application logs. For details about the description and restrictions, see ASCEND_PROCESS_LOG_PATH in Environment Variables.
- ASCEND_HOST_LOG_FILE_NUM: It specifies the number of log files of each process in the host application log directory. The value is 1000. For details about the description and restrictions, see ASCEND_HOST_LOG_FILE_NUM in Environment Variables.
- ASCEND_WORK_PATH: It specifies the path for storing trace logs. For details about the description and restrictions, see ASCEND_WORK_PATH in Environment Variables.
- The asys launch command starts a subprocess to execute the service command. If you stop the launch command, the service subprocess may not exit. In this case, you need to stop the service subprocess manually.
- (Optional) Modify the configuration options related to service re-run.
The default configurations are as follows. You can modify specific parameters in the ascend_system_advisor/conf/asys.ini file in the asys tool directory to enable or disable the functions of collecting operator compilation files and dump graphs.
[launch]
graph = TRUE  // Specifying whether to collect graph information. TRUE: collect; FALSE: do not collect. If this parameter is set to FALSE, the settings of dump_ge_graph and dump_graph_level do not take effect.
ops = TRUE  // Specifying whether to collect operator compilation information. TRUE: collect; FALSE: do not collect.
dump_ge_graph = 2  // Specifying the content of the dump graph. The value is 2, which produces a basic dump that does not contain data such as weights. The corresponding environment variable is DUMP_GE_GRAPH.
dump_graph_level = 3  // Specifying the number of dump graphs. The value is 3, which dumps the graphs generated in the last phase. The corresponding environment variable is DUMP_GRAPH_LEVEL.
log_level = INFO  // Indicating the global log level for application logs and the log level for each module. The corresponding environment variable is ASCEND_GLOBAL_LOG_LEVEL.
log_event_enable = TRUE  // Specifying whether to enable event logs for application logs. TRUE: enabled; FALSE: disabled. The corresponding environment variable is ASCEND_GLOBAL_EVENT_ENABLE.
log_print_to_stdout = FALSE  // Specifying whether to print logs to the screen. TRUE: enabled; FALSE: disabled. The corresponding environment variable is ASCEND_SLOG_PRINT_TO_STDOUT.
By default, the environment variables upon asys startup use the values configured in this file. If these environment variables are set to other values in the script of the re-run task, the environment variables will be overwritten, and the values set in the re-run task script will be used. As a result, the collected maintenance and test information may not meet the fault locating requirements.
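Because a re-run task script that sets any of these environment variables overrides the values set by the asys tool, it can help to scan the script before launching it. The following Python sketch performs such a scan. The variable list is taken from this guide; the function name find_overridden_vars and the sample script (including train.py) are hypothetical, and the check is a simple line-based heuristic rather than a full shell parser.

```python
import re
from typing import List

# Environment variables that, according to this guide, asys launch sets.
ASYS_LAUNCH_ENV_VARS = (
    "NPU_COLLECT_PATH",
    "ASCEND_PROCESS_LOG_PATH",
    "ASCEND_HOST_LOG_FILE_NUM",
    "ASCEND_WORK_PATH",
    "DUMP_GE_GRAPH",
    "DUMP_GRAPH_LEVEL",
    "ASCEND_GLOBAL_LOG_LEVEL",
    "ASCEND_GLOBAL_EVENT_ENABLE",
    "ASCEND_SLOG_PRINT_TO_STDOUT",
)


def find_overridden_vars(script_text: str) -> List[str]:
    """Return the listed variables that the script assigns or unsets."""
    found = []
    for name in ASYS_LAUNCH_ENV_VARS:
        # Matches `NAME=...`, `export NAME=...`, and `unset NAME` at the start of a line.
        pattern = rf"^\s*(?:export\s+)?{name}=|^\s*unset\s+{name}\b"
        if re.search(pattern, script_text, flags=re.MULTILINE):
            found.append(name)
    return found


# Hypothetical task script used only for illustration.
script = """#!/bin/sh
export ASCEND_GLOBAL_LOG_LEVEL=3
python3 train.py
"""
print(find_overridden_vars(script))
```

Any variable reported by the scan is a candidate for removal from the task script before running asys launch.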
- Command:
asys launch --task="sh ../app_run.sh" --tar="True" --output=path
- Parameters:
- task (required): It indicates the execution command of a re-run task, which must be a complete command, for example, sh ../app_run.sh. sh is the command for executing the task, and ../app_run.sh is the task script to be executed.
The task script must not run its workload in the background. For example, if the task is started with the sh cmd.sh command but cmd.sh itself contains python3 test.py & to run the workload in the background, the task is not supported because the asys tool cannot detect when it ends.
- tar (optional): It controls whether to compress the result output directory of the asys tool into a *.tar.gz file. If this parameter is set to T or True, the result output directory is compressed into a *.tar.gz file and the original directory is not retained. If this parameter is set to F or False, the directory is not compressed into a *.tar.gz file. By default, the directory is not compressed. The parameter value is case insensitive.
- output (optional): Its value is used as the path prefix of the result output directory of the asys tool. That is, the final output directory is {output}/asys_output_timestamp. If the command does not contain the output parameter, the output is stored in the command execution directory. If the value of output is empty or invalid, the specified directory does not have the write permission, or the directory fails to be created, the asys tool exits and reports an error.
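The restriction on the task parameter (the task script must not start its workload in the background) can be checked roughly before the re-run. The following Python sketch flags script lines that end with a single &. The function name find_background_lines is hypothetical. The check is only a heuristic: it strips comments crudely at the first # character and does not detect a background job started in the middle of a line.

```python
import re
from typing import List


def find_background_lines(script_text: str) -> List[str]:
    """Return script lines that end with a single `&` (background execution)."""
    hits = []
    for line in script_text.splitlines():
        # Crude comment stripping: everything after the first `#` is dropped.
        code = line.split("#", 1)[0].rstrip()
        # A trailing `&` that is not part of `&&` starts a background job.
        if re.search(r"(?<!&)&$", code):
            hits.append(line.strip())
    return hits


# Sample content modeled on the cmd.sh example in this guide.
cmd_sh = """#!/bin/sh
python3 test.py &
"""
print(find_background_lines(cmd_sh))
```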
After the command is executed, the fault information files in {output}/asys_output_timestamp are as follows:
├── asys_output_timestamp
    ├── software_info.txt  // Installation package version, environment variables, dependent software, and system information
    ├── hardware_info.txt  // Collecting host and device hardware information. The host information includes the kernel version, CPU model, memory usage, and disk usage. The device information includes the number of devices and the number of AI CPUs.
    ├── status_info.txt  // Collecting device information, including the chip model, CPU usage, and AI Core usage.
    ├── health_result.txt  // Collecting device health information, including fault codes and fault information.
    └── dfx
        ├── bbox  // Black box information on the device side
        ├── data-dump  // Dump file generated when an AI Core error occurs.
        ├── graph  // The dump graph information, including the dump graphs of the GE and TF Adapter.
        ├── ops  // Operator compilation information, including operator compilation *.o and *.json files, operator compilation process information, and custom operator configuration information
        ├── stackcore  // Information about the core file generated when a core dump is triggered by an error.
        ├── atrace  // Indicating the trace flush information, including the plaintext file parsed from the trace binary file.
        └── log
            ├── device
            │   ├── dev-os-{id}
            │   ├── firmware  // Logs generated by the firmware
            │   ├── slogd  // Maintenance and test logs of log-related processes
            │   ├── application  // The non-event application logs generated by the service process.
            │   └── system  // Logs generated by resident process
            └── host
                ├── message  // message/syslog
                ├── install  // Logs about the history installation of the package
                ├── cann  // Host-side application logs
                ├── driver  // Driver logs on the Host
                ├── screen.txt  // Printed logs (If the content is empty, redirection may be set in the application task.)
                └── user_cmd  // Command used by a user to execute a task
Users can customize information about versions of third-party dependencies collected in the software_info.txt file as required.
In the ascend_system_advisor/conf/dependent_package.csv file in the asys tool directory, add or delete configuration items. Each line corresponds to a configuration item. Use commas (,) to separate dependency item names and query commands. There should be no space after each comma. The following is a sample code snippet:
make,make --version
cmake,cmake --version
unzip,unzip -v
zlib1g,dpkg -l zlib1g| grep zlib1g| grep ii
zlib1g-dev,dpkg -l zlib1g-dev| grep zlib1g-dev| grep ii
libsqlite3-dev,dpkg -l libsqlite3-dev| grep libsqlite3-dev| grep ii
openssl,dpkg -l openssl| grep openssl| grep ii
libssl-dev,dpkg -l libssl-dev| grep libssl-dev| grep ii
libffi-dev,dpkg -l libffi-dev| grep libffi-dev| grep ii
Displaying Software, Hardware, and Device Status Information
- Command:
asys info -r="status" -d=deviceId
- Parameters:
- r (required): It specifies the type of information to be displayed. The value range is as follows:
- status: It displays device information, including the chip model, temperature, health status, CPU, and AI Core information.
- software: It displays host software information, including the system and kernel versions, and CANN package versions.
- hardware: It displays the hardware information of the host and device, including the CPU model and the number of cores, memory capacity, and disk capacity of the host, the number of NPUs and model of the device, and the number of AI CPUs, AI Cores, and AI Vectors.
- d (optional): It specifies the ID of the device whose information is to be displayed. If this parameter is not specified, the information of device 0 is displayed by default. This parameter is valid only when -r=status.
Checking System Health
- Command:
asys health -d=deviceId
- Parameters:
- d (optional): It specifies the ID of the device whose health status is to be displayed. If no device is specified, the health status of all devices is displayed. If a device is specified and the device is abnormal, the error codes and error information are displayed on the terminal screen. Only the first five groups of faults are displayed. All fault codes and fault information are written into the health_result.txt file in Fault Information Collection and Service Re-run and Fault Information Collection.
- Effect display:
- No device is specified, and all devices are normal. The following uses dual cards as an example:
asys health
+------------------------+------------------------------+
| Group of 2 Device      | Overall Health: Healthy      |
+========================+==============================+
| Device ID: 0           | Healthy                      |
+------------------------+------------------------------+
| Device ID: 1           | Healthy                      |
+------------------------+------------------------------+
- A device is specified, and the device is normal. The following uses device 0 as an example:
asys health -d=0
+-------------------+------------------------------+
| Device ID: 0      | Overall Health: Healthy      |
|                   | ErrorCode Num: 0             |
+===================+==============================+
- The specified device is abnormal. The following uses device 0 as an example:
asys health -d=0
+-------------------+------------------------------+
| Device ID: 0      | Overall Health: Warning      |
|                   | ErrorCode Num: 1             |
+===================+==============================+
| 0xa419321c        | lp pmbus error               |
+-------------------+------------------------------+
For a detailed description of the error codes, see Black Box Error Code Information List and Health Management Fault Definition of the corresponding product.
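When the health check is run from a script, the single-device table can be converted into structured data. The following Python sketch parses a table laid out like the abnormal-device example in this section. The function name parse_device_health is hypothetical, and the parser assumes the two-column layout of that example; the output of other asys versions may differ.

```python
from typing import Dict


def parse_device_health(table: str) -> Dict[str, object]:
    """Extract health status, error count and fault rows from a two-column table."""
    result = {"health": None, "error_count": None, "errors": []}
    in_body = False
    for line in table.splitlines():
        line = line.strip()
        if line.startswith("+="):
            in_body = True  # rows after the `+===` rule list the fault codes
            continue
        if not line.startswith("|"):
            continue
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 2:
            continue
        left, right = cells
        if in_body:
            result["errors"].append((left, right))
        elif right.startswith("Overall Health:"):
            result["health"] = right.split(":", 1)[1].strip()
        elif right.startswith("ErrorCode Num:"):
            result["error_count"] = int(right.split(":", 1)[1])
    return result


sample = "\n".join([
    "+-------------------+------------------------------+",
    "| Device ID: 0      | Overall Health: Warning      |",
    "|                   | ErrorCode Num: 1             |",
    "+===================+==============================+",
    "| 0xa419321c        | lp pmbus error               |",
    "+-------------------+------------------------------+",
])
print(parse_device_health(sample))
```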
Comprehensive Check
You must run the commands related to comprehensive check as the root user on a physical machine.
- Command:
# AI Core stress test, which may take a long time
asys diagnose -r=stress_detect -d=deviceId --output=path
# HBM detection
asys diagnose -r=hbm_detect -d=deviceId --timeout=num --output=path
# CPU detection
asys diagnose -r=cpu_detect -d=deviceId --timeout=num --output=path
- Parameters:
- r (required): It indicates the detection mode. The values are as follows:
- stress_detect: AI Core stress test
Executing this function involves operator execution. Therefore, you need to install the operator binary file package (Ascend-cann-kernels-*_linux.run) in the environment in advance.
AI Core stress test involves voltage adjustment on the device side. When the stress test is complete, the voltage can be automatically restored. However, when the stress test exits abnormally, the voltage cannot be automatically restored. In this case, you can manually restore the voltage based on the asys environment configuration. You are advised to obtain the voltage before and after the AI Core stress test to check whether the voltage is abnormal and whether the voltage needs to be restored. For details about how to obtain and restore the voltage, see Environment Configuration.
Display of the detection result:
- If no device is specified but there is only one device, only the status of this device is displayed.
- If the status of all devices is Pass or Warn, Pass - All or Warn - All is displayed.
- If the status of devices is inconsistent, the status of each device is displayed in sequence. For example, if there are four devices, Pass, Warn, Warn, and Warn are displayed.
- If the detection result is Warn, the detection fails. You can view the plog on the host, locate and rectify the fault based on the error code in the log file. Error codes starting with 1 indicate that the test case fails to be executed or the task fails to be delivered. Error codes starting with 2 indicate that the accuracy comparison fails. Error codes starting with 3 indicate hardware problems.
- If the detection result is Pass, the detection is successful.
- hbm_detect: HBM detection
Display of the detection result:
- If no device is specified but there is only one device, only the status of this device is displayed.
- If the status of all devices is Pass or Warn, Pass - All or Warn - All is displayed.
- If the status of devices is inconsistent, the status of each device is displayed in sequence. For example, if there are four devices, Pass, Warn, Warn, and Warn are displayed.
- If the detection result is Warn, the detection fails. You can view the plog on the host, locate and rectify the fault based on the error code in the log file. Error codes starting with 1 indicate that the test case fails to be executed or the task fails to be delivered. Error codes starting with 4 indicate hardware problems.
- If the detection result is Pass, the detection is successful. For HBM detection, if the returned value is greater than 0, the value indicates the number of new ECC errors after the detection. This value is used to trigger the reporting and isolation of risk addresses in advance, ensuring normal running of subsequent services.
- cpu_detect: CPU detection
Display of the detection result:
- If no device is specified but there is only one device, only the status of this device is displayed.
- If the status of all devices is Pass, Warn, or Fail, Pass - All, Warn - All, or Fail - All is displayed.
- If the status of devices is inconsistent, the status of each device is displayed in sequence. For example, if there are four devices, Pass, Warn, Warn, and Fail are displayed.
- If the detection result is Fail, a hardware fault has occurred. In this case, contact technical support.
- If the detection result is Warn, task scheduling problems occur during the detection. You can view the detailed information in the plog on the host to locate the problems.
- If the detection result is Pass, the detection is successful.
- d (optional): It specifies the ID of the device to be detected. If this parameter is not specified, the detection results of all devices are displayed by default. Pass indicates that the result is normal, and Warn indicates that the result is abnormal.
- timeout (optional): It specifies the hardware detection time, in seconds. If this parameter is not specified, the detection time is 600s by default. This parameter is valid only for HBM detection and CPU detection. For HBM detection, the value range is [0, 604800] (0 indicates that only one round of HBM detection is performed); for CPU detection, the value range is [1, 604800].
- output (optional): It specifies the directory for storing the detection result file diagnose_result_{time_stamp}.txt. If the command does not contain the output parameter, the command output is only printed on the screen and is not saved to a file. If the value of output is empty or invalid, the specified directory does not have the write permission, or the directory fails to be created, the asys tool exits and reports an error.
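The rules for the timeout parameter (default 600s, valid only for HBM detection and CPU detection, with different lower bounds) can be summarized in a small validation helper. The following Python sketch is one way to express them. The function name resolve_timeout is hypothetical, and rejecting a timeout for stress_detect is a design choice of this sketch; the guide does not state whether the asys tool rejects or ignores the value in that case.

```python
from typing import Optional

DEFAULT_TIMEOUT_S = 600
MAX_TIMEOUT_S = 604800  # upper bound given in this guide (7 days in seconds)
MIN_TIMEOUT_S = {"hbm_detect": 0, "cpu_detect": 1}


def resolve_timeout(mode: str, timeout: Optional[int] = None) -> Optional[int]:
    """Return the effective timeout in seconds, or None for stress_detect."""
    if mode == "stress_detect":
        if timeout is not None:
            # Assumption of this sketch: treat an unsupported timeout as an error.
            raise ValueError("timeout is valid only for hbm_detect and cpu_detect")
        return None
    if mode not in MIN_TIMEOUT_S:
        raise ValueError(f"unknown detection mode: {mode}")
    if timeout is None:
        return DEFAULT_TIMEOUT_S
    if not MIN_TIMEOUT_S[mode] <= timeout <= MAX_TIMEOUT_S:
        raise ValueError(
            f"timeout for {mode} must be in [{MIN_TIMEOUT_S[mode]}, {MAX_TIMEOUT_S}]"
        )
    return timeout


print(resolve_timeout("hbm_detect"))        # default value
print(resolve_timeout("cpu_detect", 3000))  # explicit value within range
```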
- Effect display:
- No device is specified, and all devices are normal. The following uses four devices as an example:
asys diagnose -r=stress_detect
+------------------------+------------------------+
| Group of 4 Device      | Diagnostic Result      |
+========================+========================+
+--- Performance --------+------------------------+
| Stress Detect          | Pass - All             |
+------------------------+------------------------+
asys diagnose -r=hbm_detect --timeout=3000
+------------------------+------------------------+
| Group of 4 Device      | Diagnostic Result      |
+========================+========================+
+--- Hardware -----------+------------------------+
| HBM                    | Pass - All             |
|                        | (0, 9, 0, 0)           |
+------------------------+------------------------+
asys diagnose -r=cpu_detect --timeout=3000
+------------------------+------------------------+
| Group of 4 Device      | Diagnostic Result      |
+========================+========================+
+--- Hardware -----------+------------------------+
| CPU Detect             | Pass - All             |
+------------------------+------------------------+
- No device is specified, and some devices are normal. The following uses four devices as an example:
asys diagnose -r=stress_detect
+------------------------+------------------------+
| Group of 4 Device      | Diagnostic Result      |
+========================+========================+
+--- Performance --------+------------------------+
| Stress Detect          | Pass, Warn, Pass, Warn |
+------------------------+------------------------+
asys diagnose -r=hbm_detect
+------------------------+------------------------+
| Group of 4 Device      | Diagnostic Result      |
+========================+========================+
+--- Hardware -----------+------------------------+
| HBM                    | Pass, Warn, Pass, Warn |
|                        | (9, 0, 5, 0)           |
+------------------------+------------------------+
asys diagnose -r=cpu_detect
+------------------------+------------------------+
| Group of 4 Device      | Diagnostic Result      |
+========================+========================+
+--- Hardware -----------+------------------------+
| CPU Detect             | Pass, Warn, Pass, Fail |
+------------------------+------------------------+
- A device is specified. The following uses device 0 as an example:
asys diagnose -d=0 -r=stress_detect
+--------------------+------------------------+
| Device ID: 0       | Diagnostic Result      |
+====================+========================+
+--- Performance ----+------------------------+
| Stress Detect      | Pass                   |
+--------------------+------------------------+
asys diagnose -d=0 -r=hbm_detect
+------------------------+------------------------+
| Device ID: 0           | Diagnostic Result      |
+========================+========================+
+--- Hardware -----------+------------------------+
| HBM                    | Pass(9)                |
+------------------------+------------------------+
asys diagnose -d=0 -r=cpu_detect
+------------------------+------------------------+
| Device ID: 0           | Diagnostic Result      |
+========================+========================+
+--- Hardware -----------+------------------------+
| CPU Detect             | Pass                   |
+------------------------+------------------------+
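The result cell in the multi-device examples is either a summary such as Pass - All or a comma-separated list with one status per device. The following Python sketch expands both forms into one status per device and parses the numeric row shown for HBM detection. The function names expand_result and parse_ecc_counts are hypothetical, and reading the parenthesized HBM values as one count per device, in device order, is an assumption based on the examples above.

```python
import re
from typing import List


def expand_result(cell: str, device_count: int) -> List[str]:
    """Turn 'Pass - All' or 'Pass, Warn, Pass, Warn' into one status per device."""
    cell = cell.strip()
    match = re.fullmatch(r"(Pass|Warn|Fail) - All", cell)
    if match:
        return [match.group(1)] * device_count
    statuses = [part.strip() for part in cell.split(",")]
    if len(statuses) != device_count:
        raise ValueError(f"expected {device_count} statuses, got {len(statuses)}")
    return statuses


def parse_ecc_counts(cell: str) -> List[int]:
    """Parse the HBM count row, for example '(9, 0, 5, 0)'."""
    return [int(n) for n in cell.strip().strip("()").split(",")]


statuses = expand_result("Pass, Warn, Pass, Warn", 4)
counts = parse_ecc_counts("(9, 0, 5, 0)")
for device_id, (status, count) in enumerate(zip(statuses, counts)):
    print(f"device {device_id}: {status}, new ECC errors: {count}")
```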
Trace File Parsing
- Command:
# Parse the trace file.
asys analyze -r=trace --file=filename --output=path
- Parameters:
- r: It specifies the parsing mode. The value trace indicates that trace log files (*.bin files) are parsed into .txt files. The version of the environment where the asys tool is used must be the same as that of the environment where trace logs are generated.
The following is an example of the parsed .txt file:
2024-05-09 19:08:12.408.800 demo0: tid0[0], count0[0], tag0[struct0 tag], streamId0[0], deviceIdArray0[0, 1], hostIdArray0[1, 2, 3, 4]
2024-05-09 19:08:12.408.804 demo1: tag1[struct1 tag], streamId1[0], deviceIdArray1[0, 1], hostIdArray1[1, 2, 3, 4]
- file: It specifies a single file to be parsed. Set this parameter to the file name with the path. This parameter is required in trace mode.
- output (optional): It specifies the prefix of the result output directory of the asys tool. That is, the final output directory is {output}/asys_output_timestamp. If the command line does not contain the output parameter, the output is stored in the command execution directory. If the value of output is empty or an invalid character string, users do not have the write permission on the specified directory, or the directory fails to be created, the asys tool exits and reports an error.
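Each line of the parsed trace .txt file in the example above consists of a timestamp, an entry name, and a list of name[value] fields. The following Python sketch splits one such line into these parts. The function name parse_trace_line is hypothetical, and the regular expressions are derived only from the two sample lines in this guide; real trace files may contain other layouts.

```python
import re
from typing import Dict, Tuple

# Timestamp, entry name, then the remaining field list (layout taken from the sample lines).
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{3}\.\d{3}) (\w+): (.*)$")
# A field is written as name[value]; values may contain commas and spaces.
FIELD_RE = re.compile(r"(\w+)\[([^\]]*)\]")


def parse_trace_line(line: str) -> Tuple[str, str, Dict[str, str]]:
    """Split a parsed trace line into timestamp, entry name and fields."""
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized trace line: {line!r}")
    timestamp, name, rest = match.groups()
    fields = dict(FIELD_RE.findall(rest))
    return timestamp, name, fields


line = ("2024-05-09 19:08:12.408.804 demo1: tag1[struct1 tag], streamId1[0], "
        "deviceIdArray1[0, 1], hostIdArray1[1, 2, 3, 4]")
timestamp, name, fields = parse_trace_line(line)
print(timestamp, name)
for key, value in fields.items():
    print(f"  {key} = {value}")
```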
Core Dump File Parsing
- Command:
# Parse the core dump file.
asys analyze -r=coredump --exe_file=filename --core_file=filename --reg=reglevel --symbol=value --output=path
- Parameters:
- r: It specifies the parsing mode. Set it to coredump. If the process is interrupted and exits during task execution, and an error such as Segmentation fault is reported when the software exits, you can use this function to parse the core file generated by the core dump and obtain the stack file (*.txt file) in stackcore format for subsequent fault locating.
- exe_file: It specifies the executable file when a core dump occurs. Set this parameter to a file name with the path. This parameter is required in coredump mode. Ensure that the value of this parameter matches that of core_file. Otherwise, the parsing result is incorrect.
- core_file: It specifies the core file when a core dump occurs. Set this parameter to a file name with the path. This parameter is required in coredump mode. Ensure that the value of this parameter matches that of exe_file. Otherwise, the parsing result is incorrect.
- reg: It specifies the mode for adding register data for the core dump function. The value can only be 0, 1, or 2. The default value is 0. The parameter is optional in coredump mode.
- 0: Do not add register data.
- 1: Add one piece of register data for each thread.
- 2: Add register data to all thread stacks. (This operation will occupy a large number of host resources and is time-consuming.)
- symbol: It specifies the parsing mode of the core dump function. The value can only be 0 or 1. The default value is 0. The parameter is optional in coredump mode. If the address does not exist or stack overflow occurs, the asys tool may fail to parse the core dump function.
- 0: All lines with addresses are parsed into a stackcore file. Other lines are not parsed and Ignore is displayed.
The following is an example of the parsed stackcore file when symbol is set to 0 (all address lines are parsed):
[process]
crash reason: SIGABRT
crash pid: 37246
crash tid: 37246
[stack]
Thread 1 (37246)
#00 0x00007fbad83792bf 0x00007fbad830b000 /usr/local/python3.7.5/lib/libpython3.7m.so.1.0
#01 Ignore
#02 0x00007fbad83d8c22 0x00007fbad830b000 /usr/local/python3.7.5/lib/libpython3.7m.so.1.0
#03 0x00007fbad83e9648 0x00007fbad830b000 /usr/local/python3.7.5/lib/libpython3.7m.so.1.0
......
[maps]
Start Addr      End Addr        Size    Offset  objfile
0x562677ed1000  0x562677ed2000  0x1000  0x0     /usr/local/python3.7.5/bin/python3.7
0x562677ed2000  0x562677ed3000  0x1000  0x1000  /usr/local/python3.7.5/bin/python3.7
0x562677ed3000  0x562677ed4000  0x1000  0x2000  /usr/local/python3.7.5/bin/python3.7
0x562677ed4000  0x562677ed5000  0x1000  0x2000  /usr/local/python3.7.5/bin/python3.7
......
- 1: Only in ?? () lines are parsed, and the raw GDB stack data in other lines is retained.
The following is an example of the parsed stackcore file when symbol is set to 1 (only in ?? () lines are parsed):
[process]
crash reason: SIGABRT
crash pid: 37246
crash tid: 37246
[stack]
Thread 1 (37246)
#00 0x00007fbad83792bf in lookdict_unicode (value_addr=0x7ffea1e917e8, hash=<optimized out>, key=<optimized out>, mp=0x7fba98907fa0) at Objects/dictobject.c:811
#01 lookdict_unicode (mp=0x7fba98907fa0, key=<optimized out>, hash=<optimized out>, value_addr=0x7ffea1e917e8) at Objects/dictobject.c:783
#02 0x00007fbad83d8c22 in PyDict_GetItem (op=op@entry=0x7fba98907fa0, key=key@entry=0x7fba9a15b570) at Objects/dictobject.c:1327
#03 0x00007fbad83e9648 in _PyObject_GenericGetAttrWithDict (obj=obj@entry=0x7fba989083b0, name=name@entry=0x7fba9a15b570, dict=0x7fba98907fa0, dict@entry=0x0, suppress=suppress@entry=0) at Objects/object.c:1268
......
[maps]
Start Addr      End Addr        Size    Offset  objfile
0x562677ed1000  0x562677ed2000  0x1000  0x0     /usr/local/python3.7.5/bin/python3.7
0x562677ed2000  0x562677ed3000  0x1000  0x1000  /usr/local/python3.7.5/bin/python3.7
0x562677ed3000  0x562677ed4000  0x1000  0x2000  /usr/local/python3.7.5/bin/python3.7
0x562677ed4000  0x562677ed5000  0x1000  0x2000  /usr/local/python3.7.5/bin/python3.7
......
- output (optional): It specifies the prefix of the result output directory of the asys tool. That is, the final output directory is {output}/asys_output_timestamp. If the command line does not contain the output parameter, the output is stored in the command execution directory. If the value of output is empty or an invalid character string, users do not have the write permission on the specified directory, or the directory fails to be created, the asys tool exits and reports an error.
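The stack file produced in coredump mode is organized into [process], [stack], and [maps] sections. The following Python sketch splits such a file into its sections and counts the stack frames that were not parsed. The function name split_sections is hypothetical, and the sketch assumes that each section header sits on its own line; the sample text is shortened from the example in this guide.

```python
from typing import Dict, List

SECTION_HEADERS = ("[process]", "[stack]", "[maps]")


def split_sections(text: str) -> Dict[str, List[str]]:
    """Group the non-empty lines of a stack file by section name."""
    sections: Dict[str, List[str]] = {}
    current = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped in SECTION_HEADERS:
            current = stripped[1:-1]  # drop the surrounding brackets
            sections[current] = []
        elif current is not None and stripped:
            sections[current].append(stripped)
    return sections


sample = """[process]
crash reason: SIGABRT
crash pid: 37246
[stack]
Thread 1 (37246)
#00 0x00007fbad83792bf 0x00007fbad830b000 /usr/local/python3.7.5/lib/libpython3.7m.so.1.0
#01 Ignore
[maps]
Start Addr End Addr Size Offset objfile
"""
sections = split_sections(sample)
frames = [line for line in sections["stack"] if line.startswith("#")]
ignored = [line for line in frames if line.endswith("Ignore")]
print(f"{len(frames)} frames, {len(ignored)} not parsed")
```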
Stackcore File Parsing
The stackcore parsing function uses the readelf tool to obtain file information and the addr2line tool to parse stack function names and line numbers. Both of the tools are built-in tools of the Linux system. Ensure that the readelf and addr2line tools are installed, and that the user has the permission to execute scripts.
- Command:
# Parse the stackcore file.
asys analyze -r=stackcore --file=filename --symbol_path=path1,path2 --output=path3
- Parameters:
- r: It specifies the parsing mode. Setting this parameter to stackcore parses stackcore files (*.txt files) for subsequent fault locating.
The following is an example of the parsed .txt file. In the file, the thread information starts with Thread Number (Thread ID, Thread name). If the thread name fails to be obtained, unknown is displayed.
[process]
crash reason:6
crash pid:37246
crash tid:37246
crash stack base:0x00007ffea1e96000
crash stack top:0x00007ffea1e91770
[stack]
Thread 1 (37246, python3.7)
#00 0x00007fbad83792bf lookdict_unicode in dictobject.c:811 from libpython3.7m.so.1.0
#01 lookdict_unicode in dictobject.c:783 from libpython3.7m.so.1.0
#02 0x00007fbad83d8c22 PyDict_GetItem in dictobject.c:1328 from libpython3.7m.so.1.0
#03 0x00007fbad83e9648 _PyObject_GenericGetAttrWithDict in object.c:1269 from libpython3.7m.so.1.0
#04 0x00007fbad83e6729 module_getattro in moduleobject.c:704 from libpython3.7m.so.1.0
#05 0x00007fbad83e937b _PyObject_GetMethod in object.c:1137 from libpython3.7m.so.1.0
......
[maps]
e0000380000-e0000381000 rw-p 00000000 00:00 0
e00003c0000-e00003c1000 rw-p 00000000 00:00 0
562677ed1000-562677ed2000 r--p 00000000 fd:00 13113992 /usr/local/python3.7.5/bin/python3.7
562677ed2000-562677ed3000 r-xp 00001000 fd:00 13113992 /usr/local/python3.7.5/bin/python3.7
562677ed3000-562677ed4000 r--p 00002000 fd:00 13113992 /usr/local/python3.7.5/bin/python3.7
......
- file: It specifies a single file to be parsed. Set this parameter to the file name, including its path. In stackcore mode, either the file parameter or the path parameter must be specified.
You can also use the path parameter to specify a directory. In this case, the files in the directory and its subdirectories are parsed. The path parameter and the file parameter are mutually exclusive.
- symbol_path (optional): It specifies the directories that contain the dynamic libraries required for parsing in stackcore mode. Multiple directories can be passed, separated by commas (,). The directories are scanned in the order given (path1 first, then path2). Only the dynamic libraries directly in each directory are scanned; subdirectories are not scanned. To prevent incorrect parsing, you are advised to place related dynamic libraries in the same path. If this parameter is not specified, the required dynamic library paths are obtained from the stackcore file. To ensure that the dynamic library files can be found, you are advised to use this parameter only in the environment where the core dump error occurs.
- output (optional): It specifies the prefix of the result output directory of the asys tool. That is, the final output directory is {output}/asys_output_timestamp. If the command line does not contain the output parameter, the output is stored in the command execution directory. If the value of output is empty or an invalid character string, users do not have the write permission on the specified directory, or the directory fails to be created, the asys tool exits and reports an error.
- If the parsed .txt file contains question marks (?), the possible causes are as follows:
- Compile option: The -g option is not used during compilation of the dynamic library file to retain debugging information in the file.
- Link parameter not added: -rdynamic is not used to instruct the linker to add all symbols to the dynamic symbol table.
- Dynamic library not found: No matching dynamic library is found.
- When the stackcore parsing function resolves function names and line numbers, the line numbers resolved from some dynamic libraries may differ slightly from the actual source lines. The possible reasons are as follows:
- Compile option: Different compile options, especially those related to debugging information, may affect the result.
- Optimization level: A higher optimization level may reorder and optimize code, so that the reported line numbers deviate from the original source code.
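The way an address in the [stack] section relates to an entry in the [maps] section can be sketched as follows. This is only an illustration of the general technique, not the asys implementation. The helper names (parse_maps_line, locate, addr2line_command) are hypothetical, and the sketch assumes that the file offset of the mapped segment equals its virtual address in the ELF file. When that does not hold, the segment headers reported by readelf are needed to correct the value.

```python
def parse_maps_line(line):
    """Parse one line of the [maps] section; return None for non-mapping lines."""
    parts = line.split(None, 5)
    if len(parts) < 5:
        return None
    try:
        start_text, end_text = parts[0].split("-")
        start, end = int(start_text, 16), int(end_text, 16)
        offset = int(parts[2], 16)
    except ValueError:
        return None
    # Anonymous mappings have no path column.
    path = parts[5].strip() if len(parts) == 6 else ""
    return {"start": start, "end": end, "offset": offset, "path": path}


def locate(addr, maps_lines):
    """Return (object file, offset in that file) for a runtime address."""
    for line in maps_lines:
        mapping = parse_maps_line(line)
        if mapping and mapping["path"] and mapping["start"] <= addr < mapping["end"]:
            return mapping["path"], addr - mapping["start"] + mapping["offset"]
    return None


def addr2line_command(path, file_offset):
    """Build an addr2line call: -f prints function names, -C demangles C++ names."""
    # Assumption: the file offset can be passed directly as the address.
    return ["addr2line", "-f", "-C", "-e", path, hex(file_offset)]


MAPS = [
    "e0000380000-e0000381000 rw-p 00000000 00:00 0",
    "562677ed2000-562677ed3000 r-xp 00001000 fd:00 13113992 /usr/local/python3.7.5/bin/python3.7",
    "......",
]

# 0x562677ed2abc is a made-up address inside the executable mapping above.
print(locate(0x562677ed2abc, MAPS))
```

If no mapping with a file path contains the address, locate returns None, which corresponds to the "Dynamic library not found" case described above.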
Real-time Stack Export
This function is used to export stack information to locate faults when service processes are suspended. When the service is not suspended, real-time stack export may fail due to signal sending failure, Bin file generation timeout, or Bin file parsing failure. In addition, the stack information of the same suspended process cannot be exported concurrently. Otherwise, the command may fail to be executed.
- Command:
# Export the stack.
asys collect -r=stacktrace --remote=pid --all --quiet
- Parameters:
- r: When this parameter is set to stacktrace, stack information is exported in real time for subsequent fault locating. If this parameter is not set, fault information is collected. In this case, the --remote, --all and --quiet parameters cannot be used.
After the command is executed successfully, obtain the exported Bin file as prompted.
- remote: It specifies the ID of the process that is suspended. This parameter is required when -r is set to stacktrace. The ID must be greater than or equal to 2. If the input process ID does not exist, the asys command reports an error and exits.
- all: If this parameter is set, the stack information of all threads in the suspended process is exported. This parameter is required when -r is set to stacktrace.
- quiet (optional): If this parameter is set, user interaction is disabled during stack information export. If this parameter is not set, user interaction is enabled by default, and you need to confirm whether the signal set for trace processing is enabled on the current server (whether ASCEND_COREDUMP_SIGNAL is set to a value other than none or is not set). This parameter can be used when -r is set to stacktrace.
When real-time stack information is exported, signal 35 needs to be sent to the specified process. If the signal set for trace processing is disabled, the suspended process is stopped and stack information cannot be exported.
For details about the ASCEND_COREDUMP_SIGNAL environment variable and the signal set for trace processing, see Environment Variables.
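The parameter rules above can be checked before the command is run. The following is a minimal sketch: the function names are hypothetical, only the options described in this section are used, and the comparison with the literal string none is an assumption about how the environment variable is evaluated.

```python
def signal_capture_enabled(env):
    """True if ASCEND_COREDUMP_SIGNAL is unset or set to a value other than 'none'.

    env is a mapping of environment variables, ideally those of the service process.
    Assumption: the value is compared with the exact string 'none'.
    """
    return env.get("ASCEND_COREDUMP_SIGNAL") != "none"


def build_stacktrace_command(pid, quiet=False):
    """Build the argument list for a real-time stack export of one process."""
    if not isinstance(pid, int) or pid < 2:
        raise ValueError("the process ID must be an integer greater than or equal to 2")
    # --remote and --all are both required when -r is set to stacktrace.
    command = ["asys", "collect", "-r=stacktrace", f"--remote={pid}", "--all"]
    if quiet:
        # --quiet disables the interactive confirmation.
        command.append("--quiet")
    return command


print(" ".join(build_stacktrace_command(37246, quiet=True)))
```

When --quiet is used, the interactive confirmation is skipped, so a check such as signal_capture_enabled is worth doing beforehand: if the signal set for trace processing is disabled, sending signal 35 stops the process.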
Environment Configuration
You must run the commands related to environment configuration as the root user on a physical machine.
- Command:
# Query the stress test configuration.
asys config -d=deviceId --get --stress_detect
# Restore the stress test configuration.
asys config -d=deviceId --restore --stress_detect
- Parameters:
- d (optional): It specifies the ID of the device to be operated. If this parameter is not specified, the configuration of device 0 is obtained or restored by default.
- get: It is used to obtain the specified configuration.
- restore: It is used to restore the specified configuration.
- stress_detect: It indicates the stress test configuration. Use it together with get to obtain the stress test configuration, or together with restore to restore the stress test configuration.
- Effect display:
# Obtain the stress test configuration.
asys config -d=0 --get --stress_detect
+--------------------------------+---------------------------------+
| Device ID: 0                   | CURRENT CONFIGURATION           |
+================================+=================================+
| AI Core Voltage (MV)           | 850                             |
| Bus Voltage (MV)               | 850                             |
+--------------------------------+---------------------------------+
# Restore the stress test configuration.
asys config -d=0 --restore --stress_detect
[ASYS] [INFO]: Configuration successfully restore, on device 0.
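If the queried configuration needs to be compared or recorded by a script, the table can be turned into key-value pairs. The following is a minimal sketch, assuming the table is printed one row per line as in a terminal and that every data row has exactly two cells. The function name parse_config_table is hypothetical and is not part of the asys tool.

```python
SAMPLE = """\
+--------------------------------+---------------------------------+
| Device ID: 0                   | CURRENT CONFIGURATION           |
+================================+=================================+
| AI Core Voltage (MV)           | 850                             |
| Bus Voltage (MV)               | 850                             |
+--------------------------------+---------------------------------+
"""


def parse_config_table(text):
    """Return {setting name: value string} for the data rows of the table."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip border lines and any non-table output
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        if len(cells) != 2 or cells[0].startswith("Device ID"):
            continue  # skip the header row
        settings[cells[0]] = cells[1]
    return settings


print(parse_config_table(SAMPLE))
```

The values are kept as strings because the sketch makes no assumption about the value format of settings other than those shown in the example.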
FAQs About Service Re-run Errors
- Symptom
After you press Ctrl+Z to stop a service re-run task and then launch the task again, the log shows that an error occurs in the service re-run task. Figure 1 shows an example.
- Possible Cause
If the task is stopped abnormally, for example by pressing Ctrl+Z, residual task processes remain and may still be performing operations such as redirected file writes. These processes conflict with the newly launched asys re-run task. As a result, the re-run task runs abnormally.
- Solution
Before launching the asys re-run, check whether any inference or training processes are still running. If so, manually kill these processes and then launch the asys re-run.
Timeout Occurs When the asys Tool Is Used to Export Real-Time Stack Information
- Symptom
When the asys tool is used to export real-time stack information, an error is reported indicating that the export times out in some scenarios. The error information is as follows:
[ASYS] [ERROR]: Generating the stackcore bin file timeout. For details, see the related description in the document.
- Possible causes and solutions
- The real-time stack export function has not been initialized.
In this case, wait until the initialization is complete. The initialization is complete when the keyword "attr init success" appears in the plog ($HOME/ascend/log/run|debug/plog/plog-pid_*.log by default). Then, try to export the real-time stack information again.
- The ASCEND_COREDUMP_SIGNAL environment variable is set to none, and some signal sets are disabled. As a result, the real-time stack export function is unavailable.
The signal sets are disabled if the keyword "close the signal capture function" appears in the plog ($HOME/ascend/log/run|debug/plog/plog-pid_*.log by default). In this case, set the ASCEND_COREDUMP_SIGNAL environment variable to enable the signal sets. For details about the ASCEND_COREDUMP_SIGNAL environment variable and the signal sets for trace processing, see Environment Variables.
- The service execution is complete, and the resources related to the real-time stack export function have been freed.
The resources have been freed if the keyword "unregister all signo sigaction, can not capture signal" appears in the plog ($HOME/ascend/log/run|debug/plog/plog-pid_*.log by default). In this case, execute the user service again and then export the real-time stack information.
- The real-time stack export function is abnormal.
In this case, you can search for the keyword ERROR in the plog ($HOME/ascend/log/run|debug/plog/plog-pid_*.log by default) to view the error information and contact technical support.
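The keyword checks above can be combined into one pass over the plog content. The following is a minimal sketch: the function name diagnose_plog and the result labels are hypothetical, and the sketch assumes that each keyword appears in the log exactly as quoted in this section.

```python
def diagnose_plog(text):
    """Map the plog keywords described above to possible timeout causes."""
    findings = []
    if "attr init success" not in text:
        # Initialization has not completed yet: wait and retry.
        findings.append("not_initialized")
    if "close the signal capture function" in text:
        # ASCEND_COREDUMP_SIGNAL disabled the signal sets.
        findings.append("signal_capture_disabled")
    if "unregister all signo sigaction, can not capture signal" in text:
        # The service has finished and freed the export resources.
        findings.append("resources_freed")
    if "ERROR" in text:
        # The export function itself may be abnormal: inspect the error lines.
        findings.append("error_logged")
    return findings


example = "attr init success\nclose the signal capture function\n"
print(diagnose_plog(example))
```

To check a real log, read the content of the plog file for the service process and pass it to diagnose_plog. An empty result means that none of the listed causes was detected in that file.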
