Memory Comparison
Even when the training or inference parameters are unchanged, a mismatch between the CANN version and the version of the Ascend Extension for PyTorch or the MindSpore framework can cause memory usage to differ between two steps of a training or inference job, leading to excessive memory usage or even out-of-memory (OOM) errors. The msLeaks tool can compare memory usage between steps and analyze the differences to help locate the problem.
Tool Usage
Before using this function, collect memory data for two different steps.
- Use an environment variable to disable the optimization of the task_queue operator dispatch queue.
export TASK_QUEUE_ENABLE=0
- Add the mstx instrumentation code to the training and inference code. For details, see Memory Leak Analysis.
- Run the following command to collect memory data for a specified step with the msLeaks tool. Data for two different steps is required, and you are advised to collect only one step per run. Once both steps have been collected, the data can be used for memory comparison and analysis between steps.
msleaks [options] ${Application} --steps=<Required Step> --level=kernel
- options: CLI parameters. For details, see Table 1.
- Application: the user program.
- --steps: step IDs whose memory needs to be collected.
- Run the following command to compare the memory usage between the two steps:
msleaks --compare --input=path1,path2 --level=kernel
The --compare and --input parameters must be used together; a command that specifies only one of them is invalid. The two file paths passed to --input must be separated by a comma (full-width or half-width). The --level parameter can also be set to op.
- The result directory generated after the comparison between steps is as follows:
|- leaksDumpResults
   |- compare
      |- memory_compare_{timestamp}.csv
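The collection and comparison commands above can be sketched as a small Python helper. This is only an illustration: it assembles the command lines shown in this section and prints them, without running msLeaks. The function names collect_cmd and compare_cmd, the program "python train.py", the step IDs 3 and 4, and the paths path1 and path2 are placeholders, not part of the tool. The [options] from Table 1 are left out because they are not listed here.

```python
import os
import shlex


def collect_cmd(application, step, level="kernel"):
    """Build the collection command for one step (one step per run, as advised)."""
    # Mirrors: msleaks ${Application} --steps=<Required Step> --level=kernel
    return ["msleaks", *shlex.split(application), f"--steps={step}", f"--level={level}"]


def compare_cmd(path1, path2, level="kernel"):
    """Build the comparison command; --compare and --input always go together."""
    if not path1 or not path2:
        raise ValueError("--input needs two paths, one per collected step")
    return ["msleaks", "--compare", f"--input={path1},{path2}", f"--level={level}"]


# Environment with the task_queue dispatch optimization disabled, as required
# before collection. To actually run a command, one could pass this to
# subprocess.run(cmd, env=env, check=True).
env = dict(os.environ, TASK_QUEUE_ENABLE="0")

# Placeholder program and step IDs; replace with your own.
for step in (3, 4):
    print(shlex.join(collect_cmd("python train.py", step)))

# path1 and path2 stand for the data collected for each step.
print(shlex.join(compare_cmd("path1", "path2")))
```

Building the commands as argument lists keeps the two paths for --input in a single comma-separated argument and avoids shell quoting problems if a path contains spaces.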
Result Description
You can query and locate the memory problems between steps based on the output file. For details about the output file, see Output Description.
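As a starting point for inspecting the result, the following Python sketch finds the most recently written memory_compare_{timestamp}.csv under leaksDumpResults/compare and previews its header and first rows. It assumes the result directory is in the current working directory and that the file is UTF-8 encoded. It makes no assumption about the column names, which are covered in Output Description. The function names latest_compare_csv and preview are illustrative only.

```python
import csv
from itertools import islice
from pathlib import Path


def latest_compare_csv(result_root="leaksDumpResults"):
    """Return the newest memory_compare_*.csv under <result_root>/compare, or None."""
    compare_dir = Path(result_root) / "compare"
    if not compare_dir.is_dir():
        return None
    # Pick by modification time, since the timestamp format in the name is not assumed.
    candidates = sorted(compare_dir.glob("memory_compare_*.csv"),
                        key=lambda p: p.stat().st_mtime)
    return candidates[-1] if candidates else None


def preview(csv_path, max_rows=5):
    """Read the header line and up to max_rows data rows from the comparison file."""
    # utf-8-sig is an assumption; it also tolerates a leading byte-order mark.
    with open(csv_path, newline="", encoding="utf-8-sig") as f:
        reader = csv.reader(f)
        header = next(reader, [])
        rows = list(islice(reader, max_rows))
    return header, rows


if __name__ == "__main__":
    path = latest_compare_csv()
    if path is None:
        print("No comparison result found under leaksDumpResults/compare")
    else:
        header, rows = preview(path)
        print(path)
        print(header)
        for row in rows:
            print(row)
```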