How Do I Collect Minimum Bloat Data in the PyTorch Scenario?
Symptom
When Profiling is enabled to collect profile data during model running, performance bloat occurs: each step takes longer than it does when Profiling is disabled. The time difference is called the bloat time.
Possible Cause
The degree of bloat depends on the Profiling settings. The following factors have the largest impact on model performance:
- with_stack: If with_stack is enabled, Python call stack information is collected for the model. (This greatly affects performance; with_modules, which records only the module hierarchy, has far less impact.)
- profiler_level: A higher profiler level means more data is collected and therefore greater performance bloat.
- activities: The set of activities to collect; collecting both CPU and NPU activities costs more than collecting NPU activities only.
- Other collection switches, such as l2_cache and record_shapes.
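As an illustration of the first two switches, a low-overhead setup keeps with_stack off and records only the module hierarchy via with_modules. This is a sketch based on the standard torch.profiler keyword arguments, which torch_npu.profiler.profile mirrors; verify the arguments against your torch_npu version, and note that `steps` and `train_one_step` are placeholders for your own training loop:

```python
import torch_npu

# Low-overhead profiling sketch: no Python call stacks (with_stack=False),
# module hierarchy only (with_modules=True), lowest profiler level.
with torch_npu.profiler.profile(
        activities=[torch_npu.profiler.ProfilerActivity.NPU],
        with_stack=False,      # full call stacks are the costliest option
        with_modules=True,     # nn.Module names only: far cheaper
        experimental_config=torch_npu.profiler._ExperimentalConfig(
            profiler_level=torch_npu.profiler.ProfilerLevel.Level0)
) as prof:
    for step in range(steps):      # placeholder training loop
        train_one_step(step)       # placeholder for your training step
        prof.step()
```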
Troubleshooting
- Recommended configuration for collecting performance data: You can use the following configuration to obtain the host and device data required for general performance analysis.
```python
experimental_config = torch_npu.profiler._ExperimentalConfig(
    export_type=torch_npu.profiler.ExportType.Text,
    aic_metrics=torch_npu.profiler.AiCMetrics.PipeUtilization,
    profiler_level=torch_npu.profiler.ProfilerLevel.Level1,
    l2_cache=False,
    data_simplification=False
)
with torch_npu.profiler.profile(
        activities=[
            torch_npu.profiler.ProfilerActivity.CPU,
            torch_npu.profiler.ProfilerActivity.NPU
        ],
        schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=2, repeat=2, skip_first=10),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./result"),
        experimental_config=experimental_config) as prof:
    for step in range(steps):
        train_one_step(step, steps, train_loader, model, optimizer, criterion)
        prof.step()
```
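Because the schedule argument decides how many steps are actually profiled (and therefore how much bloat accumulates), it helps to know exactly which steps get recorded. The sketch below re-implements the documented torch.profiler.schedule semantics in plain Python; `schedule_action` is a hypothetical helper written for illustration, not part of the torch_npu API, and the assumption is that torch_npu follows the standard torch.profiler rules:

```python
# Standalone re-implementation of torch.profiler.schedule semantics
# (assumption: torch_npu.profiler.schedule follows the same rules),
# useful for checking which training steps a given schedule records.
def schedule_action(step, *, wait, warmup, active, repeat, skip_first=0):
    if step < skip_first:
        return "skip"              # initial steps are ignored entirely
    step -= skip_first
    cycle = wait + warmup + active
    if repeat and step >= cycle * repeat:
        return "skip"              # all requested cycles are finished
    pos = step % cycle
    if pos < wait:
        return "skip"              # idle phase of the cycle
    if pos < wait + warmup:
        return "warmup"            # profiler warms up, data is discarded
    return "record"                # data is actually collected

# With schedule(wait=1, warmup=1, active=2, repeat=2, skip_first=10),
# only steps 12, 13, 16 and 17 are recorded:
recorded = [s for s in range(20)
            if schedule_action(s, wait=1, warmup=1, active=2, repeat=2,
                               skip_first=10) == "record"]
print(recorded)  # [12, 13, 16, 17]
```

Shrinking active and repeat (as the minimum-bloat configuration below does) directly shrinks the amount of collected data.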
- Collect the minimum bloat data. The minimum-bloat collection setting is intended for comparison with GPU performance data.
```python
with torch_npu.profiler.profile(
        activities=[torch_npu.profiler.ProfilerActivity.NPU],
        schedule=torch_npu.profiler.schedule(wait=1, warmup=1, active=1, repeat=1, skip_first=20),
        on_trace_ready=torch_npu.profiler.tensorboard_trace_handler("./npu-profling-least-inflation")
) as prof:
    for step, x in enumerate(train_dataloader):
        # train_one_step ...
        prof.step()
```
Parent topic: FAQs