Collection with Command Line Interfaces

Run either of the following commands to start msLeaks and collect memory data.
  • Method 1 (recommended): user.sh is the user script.
    msleaks [options] bash user.sh
  • Method 2
    msleaks [options] -- <prog_name> [prog_options]
  • You can configure the environment variable TASK_QUEUE_ENABLE as required. For details, see "TASK_QUEUE_ENABLE" in Ascend Extension for PyTorch Environment Variable Reference.

    When TASK_QUEUE_ENABLE is set to 2, level 2 optimization of the task_queue operator dispatch queue is enabled. In this case, workspace memory is collected.

  • msLeaks can collect memory data in open-form environments. It collects memory allocation and deallocation events of the HAL type. For details about how to install msLeaks in the open-form environment, see CANN Software Installation Guide (Open-Form).
  • When you run msLeaks as the root user, the tool prints a message and skips file permission verification, which poses security risks. You are advised to run msLeaks as a non-root user.
Table 1 CLI parameter description

Parameter       Description
options         CLI parameters. For details, see Table 2.
prog_name       User script name. Ensure the security of the custom script. This parameter is not required when the memory comparison function is enabled.
prog_options    User script parameter. Ensure the security of the custom script parameter. This parameter is not required when the memory comparison function is enabled.
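
As an illustration, the two launch forms can be assembled programmatically. The following is a minimal Python sketch and is not part of msLeaks: the helper names method1_command and method2_command are hypothetical, as are the script train.py and its arguments. It only builds and prints the command lines; it does not run msLeaks.

```python
import shlex


def method1_command(options, script):
    """Method 1: msleaks [options] bash user.sh"""
    return ["msleaks", *options, "bash", script]


def method2_command(options, prog_name, prog_options=()):
    """Method 2: msleaks [options] -- <prog_name> [prog_options]"""
    return ["msleaks", *options, "--", prog_name, *prog_options]


# Option values are taken from the examples in Table 2.
method1 = method1_command(["--level=0", "--events=alloc,free"], "user.sh")

# train.py and its arguments are placeholders for a real user program.
method2 = method2_command(["--device=npu:2,npu:7"], "python3", ["train.py", "--epochs", "1"])

# shlex.join renders each argument list as a single shell-safe command line.
print(shlex.join(method1))
print(shlex.join(method2))
```

Keeping the command as a list of arguments avoids shell quoting problems if the list is later passed to a process launcher; shlex.join is used here only to display the result.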

Table 2 Parameter description

Category

Parameter

Description

Required (Yes/No)

Common options

--help, -h

Output the msLeaks help information.

No

--version, -v

Output the msLeaks version information.

No

--steps

Step IDs for which memory information is collected. Each value must be an integer within the actual step range. You can specify up to 5 step IDs, separated by a full-width or half-width comma (,). If this parameter is not set, memory information of all steps is collected.

Example: --steps=1,2,3.

No

--device

Devices from which memory information is collected. The options are npu and npu:{id}, with npu by default. The value cannot be empty. You can specify multiple values, separated by a full-width or half-width comma (,). Example: --device=npu.

If the value contains both npu and npu:{id}, memory information of all NPUs is collected and npu:{id} does not take effect.

  • npu: collects the memory information of all NPUs.
  • npu:{id}: collects memory information of the NPU with the specified ID. The value range of id is [0, 31]. To collect from multiple NPUs, separate the values with a full-width or half-width comma (,). Example: --device=npu:2,npu:7.

No

--level

Collect operator information. The options are 0 and 1, with 0 by default. Example: --level=0.

  • 0: The value can also be op, which collects information about op operators.
  • 1: The value can also be kernel, which collects information about kernel operators.

No

--events

Collect events. The options are alloc, free, launch, and access, with alloc, free, and launch by default. The values are separated by a full-width or half-width comma (,). Example: --events=alloc,free,launch.

  • alloc: collects memory allocation events.
  • free: collects memory deallocation events.
  • launch: collects operator/kernel dispatch events.
  • access: collects memory access events. Currently, only memory access events in the ATB and Ascend Extension for PyTorch operator scenarios can be collected.

Note that alloc and free are always collected together. Setting --events=alloc or --events=free collects both alloc and free. Setting --events=access also adds alloc and free, so the actual collection items are access, alloc, and free.

No

--call-stack

Collect call stacks. The options are python and c. You can select both, separated by a full-width or half-width comma (,). To set the collection depth, append a colon (:) and a number to the option. The depth range is [0, 1000], with 50 by default. Examples: --call-stack=python and --call-stack=c:20,python:10.

  • python: collects the Python call stack.
  • c: collects the C call stack.

No

--collect-mode

Memory collection mode. The options are immediate and deferred, with immediate by default. Only one value can be selected. Example: --collect-mode=immediate.

  • immediate: collects memory information immediately when the user script starts to run, and stops collecting when the user script stops running. You can also use the Python custom collection interface to control the collection scope.
  • deferred: starts collecting data only after msleaks.start() is executed in the script. This mode requires the Python custom collection interface.

    If --collect-mode is set to deferred but the Python custom collection interface is not used, no data (except for a small amount of system data) is collected.

No

--analysis

Enable the related memory analysis function. The default value is leaks. If the value of --analysis is empty, no analysis function is enabled. You can select multiple values and separate them with a full-width or half-width comma (,). Example: --analysis=leaks,decompose.

  • leaks: identifies memory leak events.
  • inefficient: identifies inefficient memory. Identification is supported only in ATB LLM and Ascend Extension for PyTorch single-operator scenarios, and only when --events=access,alloc,free,launch is set.

    Inefficient memory can also be identified offline. You can customize the identification using application programming interfaces (APIs). For details, see API Reference.

  • decompose: enables the memory decomposition function.

Note that when --analysis is set to leaks or decompose, the alloc and free events are enabled by default, which is equivalent to --events=alloc,free.

No

--data-format

Output file formats. The options are db and csv. Select a format as required. The value cannot be empty, with csv by default. Example: --data-format=db.

If the output file is in db format, you can use the MindStudio Insight tool to display the file. For details, see "Memory Tuning" in MindStudio Insight User Guide.

  • db: .db files. Ensure that the SQLite dependency has been installed.

    Run sqlite3 --version to check whether the SQLite dependency is installed. If not, on Debian- or Ubuntu-based systems, run sudo apt-get install sqlite3 libsqlite3-dev to install it.

  • csv: .csv files.

No

--watch

Monitor memory blocks. The options are start, out{id}, end, and full-content. end is mandatory. You can select multiple options, which are separated by a full-width or half-width comma (,). The parameter setting format is --watch=start:out{id},end,full-content. Example: --watch=op0,op1,full-content.

  • start: optional. The value is a string, indicating an operator. The format varies depending on the framework. start is mandatory when out{id} needs to be set.
  • out{id}: optional. It indicates the output ID of the operator. When the tensor is a list, you can specify the tensor that needs to be dumped to a given path. The value is the subscript number of the tensor in the list.
  • end: mandatory. The value is a string, indicating an operator. The format varies depending on the framework.
  • full-content: optional. If this value is selected, the complete tensor data is dumped to the specified path. If this value is not selected, the hash value of the tensor is dumped to the specified path.

No

--output

Specify the dump path of the output. The maximum length of the path is 4,096 characters. The default dump directory is leaksDumpResults.

Example: --output=/home/projects/output.

No

--log-level

Specify the level of the output log. The value can be info, warn, or error, with warn by default.

No

Memory comparison parameters

--compare

Enable the memory data comparison function between steps.

No

--input

Absolute directory of the comparison files. You need to enter the directories of the baseline file and comparison file and separate them with a full-width or half-width comma (,). This parameter is valid only when the compare function is enabled. The maximum length of the path is 4,096 characters.

Example: --input=/home/projects/input1,/home/projects/input2.

No
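
The rules that link --events and --analysis can be sketched in a few lines of Python. This is an illustrative model of the behavior described in Table 2, not msLeaks code: the function name effective_events is hypothetical, and treating the --analysis rule as additive to an explicit --events value is an assumption, because the table does not state how the two interact.

```python
VALID_EVENTS = {"alloc", "free", "launch", "access"}
DEFAULT_EVENTS = {"alloc", "free", "launch"}


def effective_events(events, analysis):
    """Return the sorted events that end up being collected.

    events:   the --events value as a string, or None if the option is not set.
    analysis: an iterable of --analysis values, for example ["leaks"].
    """
    if events is None:
        selected = set(DEFAULT_EVENTS)
    else:
        # Accept both the half-width (,) and the full-width (U+FF0C) comma.
        parts = events.replace("\uff0c", ",").split(",")
        selected = {p.strip() for p in parts if p.strip()}

    unknown = selected - VALID_EVENTS
    if unknown:
        raise ValueError("unknown event(s): " + ",".join(sorted(unknown)))

    # alloc, free, or access each imply both alloc and free.
    if selected & {"alloc", "free", "access"}:
        selected |= {"alloc", "free"}

    # Assumption: leaks or decompose adds alloc and free to whatever was chosen.
    if set(analysis) & {"leaks", "decompose"}:
        selected |= {"alloc", "free"}

    return sorted(selected)


print(effective_events("access", []))
print(effective_events("launch", ["leaks"]))
```

A check of this kind can be run before launching a long job to confirm which events a given option combination is expected to collect.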

  • Collecting ATen operator dispatch and access events with --events=launch requires Ascend Extension for PyTorch with PyTorch 2.3.1 or later.
  • When the value of --analysis contains decompose, the parameter Attr in the file leaks_dump_{timestamp}.csv contains the GPU memory type and component name.
  • When the value of --analysis contains decompose, the memory decomposition function is enabled. Currently, the memory pools of the Ascend Extension for PyTorch, MindSpore, and ATB operator frameworks can be classified. Fine-grained classification is supported only for Ascend Extension for PyTorch, with the categories aten, weight, gradient, and optimizer_state; it is not yet supported for the memory pools of the MindSpore framework and ATB operator framework. weight, gradient, and optimizer_state apply only to PyTorch training scenarios (that is, scenarios where the optimizer.step() interface is called). aten is memory allocated by aten operators and requires PyTorch 2.3.1 or later. --level must contain 0, and --events must contain alloc, free, and access.
  • When --level=1 is specified and the Hugging Face tokenizers library is used, the warning "The current process just got forked. Parallelism is disabled." may be reported. This warning does not affect functionality and can be ignored. To suppress it, run export TOKENIZERS_PARALLELISM=false to disable tokenizer parallelism.
  • When --collect-mode is set to deferred and the Python custom collection interface is used to collect data, the memory analysis function in a step is unavailable. The functions of memory block monitoring, decomposition, and inefficient memory identification are available only for the data within the collection scope.
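
The option dependencies stated above for inefficient memory identification and memory decomposition can be checked before a run. The following Python sketch is hypothetical and not part of msLeaks. The function name missing_requirements is invented for illustration, and applying the --level and --events requirements to every decompose run (rather than only to aten classification) is an assumption based on one reading of the note above.

```python
def missing_requirements(analysis, events, levels):
    """Return a list of problems found; an empty list means none were found.

    analysis: --analysis values, for example {"leaks", "decompose"}
    events:   --events values, for example {"alloc", "free"}
    levels:   --level values as strings, for example {"0"} or {"op"}
    """
    analysis, events, levels = set(analysis), set(events), set(levels)
    problems = []

    # inefficient requires --events=access,alloc,free,launch.
    if "inefficient" in analysis:
        missing = {"access", "alloc", "free", "launch"} - events
        if missing:
            problems.append("inefficient: --events is missing " + ",".join(sorted(missing)))

    # Assumed reading: decompose (aten) needs level 0 plus alloc, free, and access.
    if "decompose" in analysis:
        missing = {"alloc", "free", "access"} - events
        if missing:
            problems.append("decompose (aten): --events is missing " + ",".join(sorted(missing)))
        if not levels & {"0", "op"}:  # op is the alias of level 0
            problems.append("decompose (aten): --level must contain 0 (op)")

    return problems


for problem in missing_requirements({"decompose"}, {"alloc", "free"}, {"1"}):
    print(problem)
```

Returning a list of messages instead of raising on the first problem lets a wrapper script report every missing option in one pass.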