Tuning Workflow

Typical memory issues are categorized as shown in Table 1.

Table 1 Memory issue categories

Category: Memory corruption

Symptom: Accuracy anomalies or NaN values occur, typically on the device side.

Scenario: Training, inference, and operator development

Category: Excessive memory usage

Symptom: This usually falls into one of the following two cases:

  • Memory leak or out of memory (OOM)
    • Continuous growth of memory usage on the host side, potentially leading to OOM
    • Continuous growth of memory usage on the device side, potentially leading to OOM
  • Significant deviation from the expected or baseline values

    The collected memory usage significantly exceeds the expected or baseline value, often by gigabytes. This typically occurs on the device side.

Scenario: Training and inference
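
The two cases of excessive memory usage can be told apart from a series of memory samples. The following is a minimal sketch, not part of MindStudio Insight: the function name, the thresholds (`growth_ratio`, `deviation_bytes`), and the per-step sample format are all assumptions chosen for illustration.

```python
from statistics import mean

GIB = 1024 ** 3  # bytes in one GiB


def classify_memory_usage(samples, baseline, growth_ratio=0.9, deviation_bytes=GIB):
    """Classify per-step memory usage samples (in bytes).

    Hypothetical helper: thresholds are illustrative, not tool defaults.
    Returns "continuous growth", "deviation from baseline", or "normal".
    """
    if len(samples) >= 2:
        # Fraction of consecutive steps in which usage increased.
        increases = sum(1 for a, b in zip(samples, samples[1:]) if b > a)
        if increases / (len(samples) - 1) >= growth_ratio:
            return "continuous growth"
    # No steady growth: check whether average usage is far above the baseline.
    if samples and mean(samples) - baseline >= deviation_bytes:
        return "deviation from baseline"
    return "normal"


leaking = [GIB * i for i in range(1, 11)]   # grows every step
too_high = [5 * GIB] * 10                   # flat, but 2 GiB above baseline
print(classify_memory_usage(leaking, baseline=3 * GIB))
print(classify_memory_usage(too_high, baseline=3 * GIB))
```

Steady growth is checked first because a leak eventually also exceeds any baseline; a series is reported as a baseline deviation only when it is not growing steadily.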

Troubleshooting Process

The following workflow shows the process of analyzing excessive memory usage or OOM on the device side.

  1. Use the performance tuning tool to collect profile data and import the data into MindStudio Insight.
  2. On the Memory page, use the Memory Analysis area to examine the memory curves and the allocation and release details of each operator or component. This preliminarily locates the fault and narrows it down to an exception scope, step, or operator.
  3. Use the memory leak detection tool (msLeaks) to collect memory details and memory disassembly data for the identified exception scope, and import the data into MindStudio Insight.
  4. On the Leaks page, analyze memory usage based on the Function Stack Flame Graph, Memory Request/Release Line Graph & Memory Block Graph, and Memory Details Table.
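
The idea behind analyzing memory requests and releases can be sketched in a few lines: pair each allocation with its release and report what is still held. This is an illustration only. The event tuple layout `(kind, address, size, owner)` is hypothetical and is not the msLeaks output format, and `find_unreleased_blocks` is not a MindStudio API.

```python
from collections import defaultdict


def find_unreleased_blocks(events):
    """Return {owner: bytes still allocated} after replaying the events.

    `events` is a time-ordered iterable of (kind, address, size, owner)
    tuples, where kind is "malloc" or "free". This layout is an assumption
    made for the example, not a real tool format.
    """
    live = {}  # address -> (size, owner) for blocks not yet released
    for kind, address, size, owner in events:
        if kind == "malloc":
            live[address] = (size, owner)
        elif kind == "free":
            # Ignore releases of addresses that were never recorded.
            live.pop(address, None)

    unreleased = defaultdict(int)
    for size, owner in live.values():
        unreleased[owner] += size
    return dict(unreleased)


events = [
    ("malloc", 0x1000, 512, "MatMul"),
    ("malloc", 0x2000, 256, "Add"),
    ("free", 0x1000, 0, "MatMul"),
    ("malloc", 0x3000, 128, "Add"),
]
print(find_unreleased_blocks(events))
```

Grouping the remaining blocks by owner (for example, an operator or a call-stack frame) mirrors how an exception scope is narrowed to a specific operator in the steps above.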