Memory Check
Memory check detects memory access exceptions while a program is running. During operator execution, it detects and reports exceptions such as out-of-bounds access to external memory (global memory) or internal memory (local memory) and unaligned memory access.
msSanitizer does not support memory check for the operator repository of the Ascend Transformer Boost (ATB).
Supported Memory Exception Types
| Exception | Description | Location | Address Space |
|---|---|---|---|
| Illegal read/write | This exception occurs when unallocated memory is accessed. | kernel, host | GM, UB, L0{A,B,C}, L1 |
| Multi-core corruption | This exception occurs when multiple AI Cores access overlapping memory. | kernel | GM |
| Unaligned access | This exception occurs when the address transferred by DMA (responsible for transferring data between the global memory and local memory) is not aligned with the minimum memory access granularity. | kernel | GM, UB, L0{A,B,C}, L1 |
| Illegal release | This exception occurs when an address that was never allocated or has already been released is released. | host | GM |
| Memory leak | This exception occurs when allocated memory is not released after use, so memory usage keeps increasing while the program runs. | host | GM |
| Allocated memory unused | This exception occurs when memory is allocated but never used. | kernel, host | GM |
Enabling Memory Check
When msSanitizer runs, the memory check tool (memcheck) is enabled by default. In the following commands, application indicates the user program.
- Run the following command to explicitly select memory check. By default, detection of illegal read/write, multi-core corruption, unaligned access, and illegal release is enabled.
mssanitizer --tool=memcheck application
- Run the following command to manually enable memory leak detection.
mssanitizer --tool=memcheck --leak-check=yes application
- Run the following command to manually enable detection for unused memory that has been allocated.
mssanitizer --tool=memcheck --check-unused-memory=yes application
- After the user program is complete, an exception report is displayed on the GUI. For details about the exceptions, see Memory Exception Report Parsing.
- When you use a framework such as PyTorch to call an operator, the framework may manage the GM through a memory pool. A memory pool usually allocates a large block of GM at a time and reuses it during running. If the tool records only the GM allocations and releases made by the pool, the check results for an individual operator may be inaccurate, because the pool's allocations do not match the ranges that the operator actually uses. Therefore, the detection tool provides APIs for manually reporting GM allocation information, so that you can report the GM range used by an operator when the operator is called. For details about the APIs, see SanitizerReportMalloc and SanitizerReportFree.
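The effect of a memory pool on detection accuracy can be illustrated with a small simulation. This is a minimal sketch only: the `Tracker` class and its `report_malloc` and `report_free` methods are hypothetical stand-ins for the tool's bookkeeping and do not reflect the real signatures of SanitizerReportMalloc and SanitizerReportFree. The addresses and sizes are made up for illustration.

```python
class Tracker:
    """Toy model of the GM allocation records kept by a memory checker."""

    def __init__(self):
        self.ranges = {}  # start address -> size in bytes

    def report_malloc(self, addr, size):
        # Hypothetical stand-in for reporting an allocated GM range.
        self.ranges[addr] = size

    def report_free(self, addr):
        # Hypothetical stand-in for reporting that a GM range was released.
        self.ranges.pop(addr, None)

    def is_legal(self, addr, size):
        """An access is legal if it lies entirely inside one recorded range."""
        return any(start <= addr and addr + size <= start + length
                   for start, length in self.ranges.items())


POOL_BASE, POOL_SIZE = 0x1000, 4096    # one large block reserved by the pool
TENSOR_OFFSET, TENSOR_SIZE = 256, 512  # slice handed to one operator call

# Case 1: only the pool-level allocation is recorded.
pool_view = Tracker()
pool_view.report_malloc(POOL_BASE, POOL_SIZE)

# Case 2: the range actually used by the operator is reported manually.
manual_view = Tracker()
manual_view.report_malloc(POOL_BASE + TENSOR_OFFSET, TENSOR_SIZE)

# The operator overruns its tensor by 64 bytes but stays inside the pool.
bad_addr, bad_size = POOL_BASE + TENSOR_OFFSET, TENSOR_SIZE + 64

print(pool_view.is_legal(bad_addr, bad_size))    # True: the overrun is hidden
print(manual_view.is_legal(bad_addr, bad_size))  # False: the overrun is caught
```

With only the pool-level record, the overrun looks like a valid access because it still falls inside the pool. Reporting the operator's own range narrows the legal region, so the same access is flagged.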
Memory Exception Report Parsing
The memory check exception report contains multiple types of exception information. The following provides some examples about simple exception information to help you understand the information in the exception report.
- Illegal read/write
This exception is generated when the operator program reads or writes unallocated memory. It usually occurs in the GM or on-chip memory. A GM exception occurs because the size allocated on the GM is inconsistent with the range actually accessed by the operator program. An on-chip memory exception occurs because the access range of the operator program exceeds the upper limit of the hardware capacity.
    ====== ERROR: illegal read of size 224 // Basic error information, including invalid read/write type (invalid read or invalid write) and invalid access bytes.
    ====== at 0x12c0c0015000 on GM // Memory location where the exception occurs, including the address space and memory address. The memory address refers to the start address in a memory access.
    ====== in block aiv(0) // Block index of the vector core corresponding to the exception code.
    ====== code in pc current 0x77c (serialNo:10) // PC pointer where the exception occurs and sequence number of the API call behavior.
    ====== #0 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/impl/dav_c220/kernel_operator_data_copy_impl.h:58:9 // The following is the code call stack where the exception occurs, including the file name, line number, and column number.
    ====== #1 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/inner_interface/inner_kernel_operator_data_copy_intf.cppm:58:9
    ====== #2 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/inner_interface/inner_kernel_operator_data_copy_intf.cppm:443:5
    ====== #3 illegal_read_and_write/add_custom.cpp:18:5
In the preceding example, the 0x12c0c0015000 address on GM is illegally read, and the instruction that causes the exception corresponds to line 18 in the operator implementation file add_custom.cpp.
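One common way for the allocated size and the accessed range to diverge is copying a buffer in fixed-size tiles when its length is not a multiple of the tile size. The following sketch computes the resulting overrun. The function name and the element counts are hypothetical and chosen only for illustration.

```python
def tile_overrun_bytes(total_elems, tile_elems, elem_size):
    """Bytes accessed past the end of a buffer when it is copied in whole tiles."""
    allocated = total_elems * elem_size
    num_tiles = -(-total_elems // tile_elems)  # ceiling division
    accessed = num_tiles * tile_elems * elem_size
    return accessed - allocated


# 2000 two-byte elements copied in tiles of 128 elements: the last tile
# extends past the allocation.
print(tile_overrun_bytes(2000, 128, 2))  # 96

# 2048 elements fill 16 tiles exactly, so nothing is accessed out of range.
print(tile_overrun_bytes(2048, 128, 2))  # 0
```

A nonzero result means the final tile touches memory that was never allocated, which is the situation an illegal read/write report points to.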
- Multi-core corruption
The AI Core is the computing core of the Ascend AI Processor. The processor has multiple AI Cores, and operators run on these AI Cores. AI Cores move data into and out of the GM during computation. If the GM ranges accessed by different cores overlap, at least one of the cores writes to the overlapping addresses, and no inter-core synchronization is explicitly performed, a multi-core corruption exception occurs. To avoid this exception, ensure that a core exclusively owns a block of memory while writing to it. If other cores access that memory, the out of bounds exception is reported.
    ====== WARNING: out of bounds of size 256 // Basic error information, including the number of corrupted bytes.
    ====== at 0x12c0c00150fc on GM when writing data // Memory location where the exception occurs, including the address space and memory address. The memory address refers to the start address in a memory access.
    ====== in block aiv(9) // Block index of the vector core corresponding to the exception code.
    ====== code in pc current 0x7b8 (serialNo:22) // PC pointer where the exception occurs and sequence number of the API call behavior.
    ====== #0 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/impl/dav_c220/kernel_operator_data_copy_impl.h:103:9 // The following is the code call stack where the exception occurs, including the file name, line number, and column number.
    ====== #1 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/inner_interface/inner_kernel_operator_data_copy_intf.cppm:155:9
    ====== #2 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/inner_interface/inner_kernel_operator_data_copy_intf.cppm:461:5
    ====== #3 out_of_bound/add_custom.cpp:21:5
In the preceding example, a total of 256 bytes are corrupted, and multi-core corruption occurs when the 0x12c0c00150fc address in the GM is accessed. In addition, the instruction that causes the exception corresponds to line 21 in the operator implementation file add_custom.cpp.
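The condition described above (overlapping GM ranges on different cores with at least one write) can be expressed as a simple pairwise check. This is a minimal sketch of the idea, not the tool's actual algorithm. The `find_conflicts` function, the access tuples, and the addresses are hypothetical, and the sketch assumes that no inter-core synchronization takes place.

```python
from itertools import combinations


def find_conflicts(accesses):
    """Return (core_a, core_b, start, size) for each conflicting pair.

    accesses: list of (core, start, size, is_write) tuples.
    """
    conflicts = []
    for a, b in combinations(accesses, 2):
        core_a, start_a, size_a, write_a = a
        core_b, start_b, size_b, write_b = b
        if core_a == core_b:
            continue  # accesses by the same core never conflict here
        lo = max(start_a, start_b)
        hi = min(start_a + size_a, start_b + size_b)
        # A conflict needs a non-empty overlap and at least one writer.
        if lo < hi and (write_a or write_b):
            conflicts.append((core_a, core_b, lo, hi - lo))
    return conflicts


accesses = [
    (8, 0x1000, 256, True),    # core 8 writes [0x1000, 0x1100)
    (9, 0x10FC, 256, True),    # core 9 writes [0x10fc, 0x11fc)
    (10, 0x2000, 128, False),  # cores 10 and 11 only read the same range
    (11, 0x2000, 128, False),
]

for core_a, core_b, start, size in find_conflicts(accesses):
    print(f"cores {core_a} and {core_b} overlap on {size} byte(s) at {start:#x}")
# prints: cores 8 and 9 overlap on 4 byte(s) at 0x10fc
```

The two readers are not reported because concurrent reads of the same range do not corrupt data.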
- Unaligned access
The Ascend processor supports multiple types of memory. When the memory is accessed by the DMA, different types of memory have different minimum access granularities on different processors. When the accessed memory address is not aligned with the minimum access granularity, problems such as data exceptions or AI Core exceptions occur. Access alignment detection can output alignment exception information when such a problem occurs.
    ====== ERROR: misaligned access of size 13 // Basic error information, including the number of bytes pertaining to the alignment exception.
    ====== at 0x6 on UB // Memory location where the exception occurs, including the address space and memory address.
    ====== in block aiv(0) // Block index of the vector core corresponding to the exception code.
    ====== code in pc current 0x780 (serialNo:33) // PC pointer where the exception occurs and sequence number of the API call behavior.
    ====== #0 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/impl/dav_c220/kernel_operator_data_copy_impl.h:103:9 // The following is the code call stack where the exception occurs, including the file name, line number, and column number.
    ====== #1 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/inner_interface/inner_kernel_operator_data_copy_intf.cppm:155:9
    ====== #2 ${ASCEND_HOME_PATH}/compiler/tikcpp/tikcfw/inner_interface/inner_kernel_operator_data_copy_intf.cppm:461:5
    ====== #3 illegal_align/add_custom.cpp:18:5
In the preceding example, the alignment exception occurs on 13 bytes when the 0x6 address on the UB is accessed. The instruction that causes this exception corresponds to line 18 in the operator implementation file add_custom.cpp.
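An alignment check reduces to a modulo test against the minimum access granularity. The sketch below assumes a granularity of 32 bytes purely for illustration. The real value depends on the memory type and the processor, as noted above, and `GRANULARITY`, `is_aligned`, and `align_up` are hypothetical names.

```python
GRANULARITY = 32  # assumed minimum access granularity in bytes (illustrative)


def is_aligned(addr, granularity=GRANULARITY):
    """True if addr is a multiple of the access granularity."""
    return addr % granularity == 0


def align_up(addr, granularity=GRANULARITY):
    """Smallest aligned address that is greater than or equal to addr."""
    return (addr + granularity - 1) // granularity * granularity


print(is_aligned(0x6))     # False: the address from the report above
print(hex(align_up(0x6)))  # 0x20: next aligned address under this assumption
print(is_aligned(0x40))    # True
```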
- Memory leak
Memory check can detect memory leaks on the device. This problem is usually caused by a failure to correctly free memory allocated through the AscendCL interface. Memory is not allocated in internal storage (local memory), so memory leaks occur only in the GM. You can specify the --leak-check=yes parameter to enable memory leak detection.
    ====== ERROR: LeakCheck: detected memory leaks // Memory leak occurs.
    ====== Direct leak of 100 byte(s) // Information about the leaked memory
    ====== at 0x124080013000 on GM allocated in add_custom.cpp:14 (serialNo:37)
    ====== Direct leak of 1000 byte(s)
    ====== at 0x124080014000 on GM allocated in add_custom.cpp:15 (serialNo:55)
    ====== SUMMARY: 1100 byte(s) leaked in 2 allocation(s) // Summary of memory leaks, including the number of memory leaks and the number of leaked bytes.
In the preceding example, the first piece of memory leak information includes the address space, memory address, memory size, and code location. The code location points to the file name and line number of the call that allocates the memory.
- Illegal release
Illegal release refers to releasing an address that was never allocated or has already been released. It usually occurs on the GM.
    ====== ERROR: illegal free() // Basic error information, indicating that an illegal release exception occurs.
    ====== at 0x124080013000 on GM // Memory location where the exception occurs, including the address space and memory address.
    ====== code in add_custom.cpp:84 (serialNo:63) // Code location when the exception occurs, including the file name, line number, and sequence number of the API call behavior.
In the preceding example, the 0x124080013000 address in the GM is illegally released, and the instruction that causes the exception corresponds to line 84 in the operator implementation file add_custom.cpp.
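Both memory leaks and illegal releases follow from tracking allocation and release events on the host. The following sketch shows the bookkeeping in its simplest form. The `AllocationLog` class is hypothetical and is not how msSanitizer is implemented. The first two addresses and sizes mirror the leak report above, and the third allocation is invented to demonstrate a double release.

```python
class AllocationLog:
    """Toy tracker for GM allocation and release events."""

    def __init__(self):
        self.live = {}           # address -> size of blocks not yet released
        self.illegal_frees = []  # addresses released without a live block

    def malloc(self, addr, size):
        self.live[addr] = size

    def free(self, addr):
        if addr in self.live:
            del self.live[addr]
        else:
            # Never allocated, or already released earlier.
            self.illegal_frees.append(addr)

    def leaks(self):
        """Blocks still live at program exit, sorted by address."""
        return sorted(self.live.items())


log = AllocationLog()
log.malloc(0x124080013000, 100)   # never released: leak
log.malloc(0x124080014000, 1000)  # never released: leak
log.malloc(0x124080015000, 64)    # hypothetical third block
log.free(0x124080015000)          # valid release
log.free(0x124080015000)          # second release of the same address: illegal

leaked = log.leaks()
print(f"{sum(size for _, size in leaked)} byte(s) leaked "
      f"in {len(leaked)} allocation(s)")
# prints: 1100 byte(s) leaked in 2 allocation(s)
print([hex(addr) for addr in log.illegal_frees])
# prints: ['0x124080015000']
```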
- Allocated memory unused
This exception indicates that memory has been allocated but is never used by the time the operator finishes running. The possible cause is that the operator uses incorrect memory or the logic of the operator is incorrect. This problem usually occurs on the GM.
    ====== WARNING: Unused memory of 1000 byte(s) // Basic error information, indicating that the allocated memory is not used.
    ====== at 1240c0016000 on GM // Memory location where the exception occurs, including the address space and memory address.
    ====== code in add_custom.cpp:2 (serialNo:69) // Code location when the exception occurs, including the file name, line number, and sequence number of the API call behavior.
    ====== SUMMARY: 1100byte(s) unused memory in 2 allocation(s) // Exception summary, including the number of unused memory blocks and bytes.
The exception reports use two severity levels:
- WARNING: Exceptions at this level are potential risks whose impact depends on the actual situation. For example, multi-core corruption involves operations on the same memory block by multiple cores. Experienced users can avoid this risk through inter-core synchronization, but for beginners it is a high risk.
- ERROR: Exceptions at this level have the highest severity and indicate deterministic memory operation errors, such as illegal read/write, memory leak, and unaligned access. It is strongly recommended that you check all exceptions at this level.
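When a report contains many entries, it can help to group them by severity before reading the call stacks. The following sketch parses report text in the layout of the examples above. It is an illustration under assumptions: the regular expressions are derived only from these sample reports, the real output may differ in spacing or wording, and `parse_report` is a hypothetical helper rather than part of msSanitizer.

```python
import re

# Patterns inferred from the sample reports in this section (assumption).
HEADER = re.compile(r"^=+\s*(ERROR|WARNING):\s*(.+)$")
LOCATION = re.compile(r"^=+\s*at\s+(\S+)\s+on\s+(\w+)")


def parse_report(text):
    """Split report text into records with level, message, address, and space."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        header = HEADER.match(line)
        if header:
            records.append({"level": header.group(1),
                            "message": header.group(2),
                            "address": None,
                            "space": None})
            continue
        location = LOCATION.match(line)
        # Attach only the first location line that follows each header.
        if location and records and records[-1]["address"] is None:
            records[-1]["address"] = location.group(1)
            records[-1]["space"] = location.group(2)
    return records


SAMPLE = """\
====== ERROR: illegal read of size 224
====== at 0x12c0c0015000 on GM
====== in block aiv(0)
====== WARNING: out of bounds of size 256
====== at 0x12c0c00150fc on GM when writing data
====== in block aiv(9)
"""

records = parse_report(SAMPLE)
errors = [r for r in records if r["level"] == "ERROR"]
warnings = [r for r in records if r["level"] == "WARNING"]
print(len(errors), len(warnings))  # 1 1
print(errors[0]["message"], "at", errors[0]["address"], "on", errors[0]["space"])
# prints: illegal read of size 224 at 0x12c0c0015000 on GM
```

Handling ERROR records first matches the recommendation above, since they indicate deterministic errors, while WARNING records may be acceptable if inter-core synchronization is already in place.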