Preparation

Preparing Software

Preparing Data

The data collected in this example is memory leak data.

  1. Use the msLeaks tool to run the following command. In each training step, test.py allocates a 40 MB tensor (10 x 1024 x 1024 float32 elements, 4 bytes each) and appends it to the global list leak_mem_list, so the memory is not released when train_one_step returns. Python trace data is collected for three steps.
    msleaks --level=0,1 --events=alloc,free,access,launch --analysis=decompose --data-format=db python test.py

    The sample code of test.py is as follows:

    import torch
    import torch_npu
    from torchvision.models import resnet50
    import msleaks
    import msleaks.describe as describe
    leak_mem_list = []
    def train_one_step(model, optimizer, loss_fn, device):
        # Mark the code block. Every memory allocation event inside it gets the owner attribute "leaks_mem".
        with describe.describer(owner="leaks_mem"):
            # Memory leak code segment: the tensor stays referenced by the global list
            leak_mem_list.append(torch.randn(1024 * 1024 * 10, dtype=torch.float32).to(device))
        # Single training step code block
        inputs = torch.randn(1, 3, 224, 224).to(device)
        labels = torch.rand(1, 10).to(device)
        pred = model(inputs)
        loss_fn(pred, labels).backward()
        optimizer.step()
        optimizer.zero_grad()
    def train(model, optimizer, loss_fn, device, steps=1):
        for i in range(steps):
            train_one_step(model, optimizer, loss_fn, device)
    device = torch.device("npu:0")
    torch.npu.set_device(device) # Set the device
    model = resnet50(pretrained=False, num_classes=10).to(device) # Load the model
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-2) # Define the optimizer
    loss_fn = torch.nn.CrossEntropyLoss() # Define the loss function
    
    # Enable Python function call data collection
    msleaks.tracer.start()
    train(model, optimizer, loss_fn, device, steps=3) # Start training
    
    # Disable Python function call data collection
    msleaks.tracer.stop()
    
  2. After the collection is complete, msLeaks generates a result file in .db format.
  3. Download the .db file to your local machine.
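
Before opening the downloaded file, it can help to know how much memory the sample is expected to leak and to confirm that the file is readable. The following is a minimal sketch, not part of msLeaks: it assumes the .db file is a SQLite database, and the helper names expected_leak_bytes and list_tables as well as the file name leaks_dump.db are hypothetical. The size calculation follows directly from test.py (10 x 1024 x 1024 float32 elements per step, 4 bytes each).

```python
import sqlite3

BYTES_PER_FLOAT32 = 4


def expected_leak_bytes(steps, elements_per_step=1024 * 1024 * 10):
    """Bytes kept alive by leak_mem_list after the given number of steps."""
    return steps * elements_per_step * BYTES_PER_FLOAT32


def list_tables(db_path):
    """Return the table names found in a SQLite file, sorted by name."""
    # Assumption: the msLeaks .db output is a SQLite database.
    # mode=ro opens the file read-only and fails if it does not exist.
    conn = sqlite3.connect(f"file:{db_path}?mode=ro", uri=True)
    try:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    finally:
        conn.close()
    return [name for (name,) in rows]


if __name__ == "__main__":
    leaked_mib = expected_leak_bytes(steps=3) / (1024 * 1024)
    print(f"Expected leak after 3 steps: {leaked_mib:.0f} MiB")
    # "leaks_dump.db" is a hypothetical name; use the file you downloaded.
    print(list_tables("leaks_dump.db"))
```

With steps=3, the expected leak is 3 x 40 MiB = 120 MiB, which is the amount that should remain attributed to the leaks_mem owner in the collected data. The table names depend on the msLeaks version and are not assumed here.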