NPU_COLLECT_PATH
Description
Sets the path for storing fault information, including dump graphs, abnormal data of AI Core operators, and operator compilation information. The path can be absolute or relative (relative to the location of the executable program or command), and the user must have read, write, and execute permissions on it. If the path does not exist, the system creates the directory automatically.
Pay attention to the following points when using this environment variable:
- If this environment variable is set, dump data collection for abnormal operators is enabled automatically.
- If this environment variable is set during model conversion, debugging information is added to the OM model during compilation, which increases the size of the OM model file. If buffer planning is a concern or resources are limited, delete this environment variable after debugging is complete.
- If this environment variable is set, only L1 exception dump information is collected. Model dump information, single-operator dump information, overflow operator dump information, and L0 exception dump information are not collected.
The priority of the directory for storing L1 exception dump information is as follows: NPU_COLLECT_PATH -> ASCEND_WORK_PATH -> default path (extra-info directory of the current path of the executed program).
L1 exception dump is the common exception dump, while L0 exception dump is the lite exception dump. Both export information such as operator input data, operator output data, and workspace data, but L1 exception dump provides more detail than L0 exception dump. When L1 exception dump is enabled, the dtype of each tensor, the operator name, and the kernel associated with the operator are also printed in the host application log file (plog).
- If this environment variable is set, operators are compiled online during model compilation, and the prebuilt operator binary files are no longer used.
The directory for storing the operator information (.o and .json files) compiled online has the following priority: NPU_COLLECT_PATH -> ASCEND_CACHE_PATH -> default path (${HOME}/atc_data).
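The two lookup chains above can be sketched with shell default-value expansion. This is only an illustration of the documented priority order, not the toolkit's actual resolution code; the variable names are the ones already defined in this document.

```shell
# Sketch of the documented directory-priority chains using shell
# default-value expansion (illustration only, not CANN's own logic).

# L1 exception dump directory:
#   NPU_COLLECT_PATH -> ASCEND_WORK_PATH -> <current dir>/extra-info
dump_dir="${NPU_COLLECT_PATH:-${ASCEND_WORK_PATH:-$(pwd)/extra-info}}"

# Online-compiled operator files (.o and .json):
#   NPU_COLLECT_PATH -> ASCEND_CACHE_PATH -> ${HOME}/atc_data
kernel_dir="${NPU_COLLECT_PATH:-${ASCEND_CACHE_PATH:-${HOME}/atc_data}}"

echo "L1 exception dumps: ${dump_dir}"
echo "compiled kernels:   ${kernel_dir}"
```

With none of the variables set, both expansions fall through to the documented default paths; setting NPU_COLLECT_PATH overrides both chains at once.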
Example
export NPU_COLLECT_PATH=$HOME/demo/
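Because the path must carry read, write, and execute permissions, it can help to create and verify it before launching the program. The snippet below is a hypothetical pre-flight check around the example above; the $HOME/demo path is just the sample path.

```shell
# Create the collection directory up front (it would otherwise be
# created automatically) and confirm rwx permissions on it.
mkdir -p "$HOME/demo"
export NPU_COLLECT_PATH="$HOME/demo/"

if [ -r "$NPU_COLLECT_PATH" ] && [ -w "$NPU_COLLECT_PATH" ] && [ -x "$NPU_COLLECT_PATH" ]; then
    echo "NPU_COLLECT_PATH is usable: $NPU_COLLECT_PATH"
else
    echo "missing permissions on $NPU_COLLECT_PATH" >&2
fi
```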
Restrictions
- When a single-operator API (for example, an aclnn API) is called, the prebuilt operator binary file is used; online operator compilation is not involved.
- If the environment variable NPU_COLLECT_PATH is set, the operator compilation files (including .o and .json files) for the following operators in graph mode cannot be generated in the path specified by this environment variable:
MatMulAllReduce
MatMulAllReduceAddRmsNorm
AllGatherMatMul
MatMulReduceScatter
AlltoAllAllGatherBatchMatMul
BatchMatMulReduceScatterAlltoAll
- If the environment variable NPU_COLLECT_PATH is configured, the check for invalid Global Memory access cannot be enabled at the same time; otherwise, an error is reported when the compiled model file or operator kernel package is used. To enable the Global Memory access check, use one of the following methods:
  - When using ATC for model conversion, configure oom in the configuration file specified by the --op_debug_config option. For details, see ATC Instructions.
- When using op_compiler, configure oom in the configuration file specified by the --op_debug_config option. For details, see Operator Compilation Tool User Guide.
- When using Ascend Graph for graph construction, configure op_debug_config or OP_DEBUG_CONFIG to oom. For details, see Ascend Graph Developer Guide.
  - When migrating a training script developed with TensorFlow Python APIs to the Ascend AI Processor for training, configure op_debug_config to oom. For details, see TensorFlow 1.15 Model Porting Guide and TensorFlow 2.6.5 Model Porting Guide.
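As an illustration of the ATC route above, the check could be enabled as sketched below. Only the --op_debug_config option and the oom value come from this document; the exact configuration-file layout and the omitted model-conversion options are assumptions.

```shell
# The Global Memory access check conflicts with NPU_COLLECT_PATH,
# so clear the variable before enabling it.
unset NPU_COLLECT_PATH

# Assumed layout of the configuration file passed to --op_debug_config.
cat > op_debug.cfg <<'EOF'
op_debug_config = oom
EOF

# Model conversion with the check enabled (other ATC options omitted):
# atc --op_debug_config=op_debug.cfg ...
```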