Accuracy Data Collection

Prerequisites

Performing Collection

  1. Create a configuration file.
    For example, create a file named config.json in the same directory as the training script and copy the following content into it:
    {
        "task": "tensor",
        "dump_path": "./dump_data",
        "rank": [],
        "step": [],
        "level": "L1",
    
        "tensor": {
            "scope": [], 
            "list": [],
            "data_mode": ["all"]
        }
    }
    
  2. Add the tool to the training script mindspore_main.py.
    You can copy the complete code from Code Sample for MindSpore Accuracy Data Collection and run it directly. The following excerpt shows only where to add the tool API calls in the script.
    ...
    from msprobe.mindspore import PrecisionDebugger
    debugger = PrecisionDebugger(config_path="./config.json")
    ...
    if __name__ == "__main__":
        step = 0
        # Train Model
        for data, label in ds.GeneratorDataset(generator_net(), ["data", "label"]):
            # Start data collection for this iteration.
            debugger.start(model)
            train_step(data, label)
            print(f"train step {step}")
            step += 1
            # Stop data collection, then mark the end of the current step.
            debugger.stop()
            debugger.step()
        print("train finish")
    

    Collected accuracy data consumes disk space, and the server may become unavailable if the disk fills up. The space required depends on the model parameters, the collection configuration, and the number of iterations collected. Ensure that the directory where accuracy data is written has sufficient free disk space.

  3. Run the training script. The tool collects accuracy data during model training.
    python mindspore_main.py

    If the following information is displayed, the data has been successfully collected. You can view the data once collection is complete.

    The cell hook function is successfully mounted to the model.
    The module statistics hook function is successfully mounted to the model.
    msprobe: debugger.start() is set successfully
    Dump switch is turned on at step 0.
    Dump data will be saved in /home/user1/dump/dump_data/step0.
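
Before running the steps above, it can help to generate the configuration file programmatically and confirm that the dump location has enough free space. The following is a minimal sketch using only the Python standard library, not part of the tool itself. The helper names (write_config, free_gib, preflight) and the 50 GiB threshold are hypothetical, and the sketch assumes that a relative dump_path is resolved against the current working directory.

```python
import json
import shutil
from pathlib import Path

# Same content as the config.json shown in step 1.
DUMP_CONFIG = {
    "task": "tensor",
    "dump_path": "./dump_data",
    "rank": [],
    "step": [],
    "level": "L1",
    "tensor": {"scope": [], "list": [], "data_mode": ["all"]},
}


def write_config(path="config.json", config=DUMP_CONFIG):
    """Write the dump configuration as JSON and return its path."""
    path = Path(path)
    path.write_text(json.dumps(config, indent=4), encoding="utf-8")
    return path


def free_gib(directory):
    """Return free space, in GiB, on the filesystem that holds `directory`."""
    directory = Path(directory).resolve()
    # dump_path may not exist yet, so walk up to the nearest existing parent.
    while not directory.exists():
        directory = directory.parent
    return shutil.disk_usage(directory).free / 2**30


def preflight(config_path, min_free_gib=50):
    """Return True if the configured dump_path has at least `min_free_gib` free.

    The default of 50 GiB is a hypothetical value; size it for your model,
    collection configuration, and number of iterations.
    """
    config = json.loads(Path(config_path).read_text(encoding="utf-8"))
    # Assumption: a relative dump_path is relative to the working directory.
    return free_gib(config["dump_path"]) >= min_free_gib
```

For example, calling preflight(write_config()) from the training script directory before launching python mindspore_main.py returns False when the free space is below the chosen threshold, so the run can be postponed until space is freed.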
    

Viewing Results

The following directory structure is generated in the path specified by dump_path. Select the data to analyze as required.

dump_data/
├── step0
│   └── rank
│       ├── construct.json           # Stores the module hierarchy information. Empty in the current scenario.
│       ├── dump.json                # Stores the input and output statistics and overflow information of the forward and backward APIs.
│       ├── dump_tensor_data         # Stores the actual input and output tensor data of the forward and backward APIs.
│       │   ├── Jit.Momentum.0.forward.input.1.0.npy
│       │   ├── Primitive.matmul.MatMul.1.forward.input.1.npy
│       │   ├── Mint.add.1.backward.input.0.npy
│       │   ├── Primitive.matmul.MatMul.1.forward.output.0.npy
│       ...
│       └── stack.json               # Stores the call stack information of the APIs.
├── step1
...
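
To confirm quickly that each step directory contains the files listed above, a short script can walk the dump directory. This is a minimal sketch using only the standard library. The function name summarize_dump is hypothetical, and the sketch assumes only the layout shown in the tree (step directories, a rank subdirectory in each, three JSON files, and .npy files under dump_tensor_data).

```python
from pathlib import Path

# JSON files expected in every rank directory, per the tree above.
EXPECTED_FILES = ("construct.json", "dump.json", "stack.json")


def summarize_dump(dump_path):
    """Map 'step/rank' to its missing JSON files and its .npy file count."""
    summary = {}
    for step_dir in sorted(Path(dump_path).glob("step*")):
        if not step_dir.is_dir():
            continue
        for rank_dir in sorted(p for p in step_dir.iterdir() if p.is_dir()):
            tensor_dir = rank_dir / "dump_tensor_data"
            tensor_count = (
                len(list(tensor_dir.glob("*.npy"))) if tensor_dir.is_dir() else 0
            )
            summary[f"{step_dir.name}/{rank_dir.name}"] = {
                "missing": [
                    name for name in EXPECTED_FILES
                    if not (rank_dir / name).is_file()
                ],
                "tensor_files": tensor_count,
            }
    return summary
```

Calling summarize_dump("./dump_data") returns one entry per step and rank directory. Individual .npy files can then be opened with numpy.load for closer inspection.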

Use tools such as those in Accuracy Pre-Check and Accuracy Comparison to further analyze the collected data.