Pre-Training Configuration Check

Following the samples in this document, add the tool APIs to your training script to run the configuration check.

The check must be run separately in the GPU environment and in the Ascend NPU environment.

Prerequisites

You have completed the steps in Environment Setup.
You have completed the steps in Model Development and Migration. Ensure that the training job runs properly in all sample environments.

Environment Setup

Run the following command to install msprobe in the Ascend NPU environment:

pip install mindstudio-probe
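
Before editing the training script, it can help to confirm that the package is importable from the Python interpreter that runs training. The following is a minimal sketch using only the standard library. It assumes the import name is msprobe, as used in the import statements later in this document. The helper name is_installed is illustrative and is not part of the tool.

```python
import importlib.util


def is_installed(module_name: str = "msprobe") -> bool:
    """Return True if a top-level module with this name can be located."""
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(module_name) is not None


if __name__ == "__main__":
    # Prints True only if the package is installed for this interpreter.
    print("msprobe importable:", is_installed())
```

If this prints False, the pip command was probably run for a different Python environment than the one used for training.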

Performing a Check

The procedure is as follows:

Collect a .zip package in each of the two environments. Each package contains the configurations that affect training accuracy: environment variables, third-party library versions, training hyperparameters, weights, datasets, and random operations.
Perform the following operations separately in the GPU environment and in the Ascend NPU environment, and give the two .zip packages different names.
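
One way to keep the two package names distinct is to derive the name from a label for the environment. The sketch below is only an illustration: the labels gpu and npu, the file name pattern, and the helper zip_path_for are assumptions, not names defined by the tool. The returned string is intended to be used as the output_zip_path value described in step 2 below.

```python
from pathlib import Path

# Hypothetical labels for the two environments; any two distinct names work.
ENV_LABELS = ("gpu", "npu")


def zip_path_for(env_label: str, out_dir: str = ".") -> str:
    """Build a distinct .zip path for the given environment label."""
    if env_label not in ENV_LABELS:
        raise ValueError(f"unknown environment label: {env_label!r}")
    return str(Path(out_dir) / f"config_check_pack_{env_label}.zip")


if __name__ == "__main__":
    for label in ENV_LABELS:
        print(label, "->", zip_path_for(label))
```

Using a fixed pattern like this also makes it easier to tell the packages apart after both are uploaded to one machine.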

You can copy the complete code from Code Sample for PyTorch Pre-Training Configuration Check and execute it directly. The following examples only show where to add the tool API in the script.
1. Insert the following code at the beginning of the first Python script executed in the training process:
      from msprobe.core.config_check import ConfigChecker
      ConfigChecker.apply_patches(fmk)
  fmk is a string parameter indicating the training framework, which can be set to pytorch or mindspore. In this example, pytorch is used.
2. Insert the following code after the model is initialized:
      from msprobe.core.config_check import ConfigChecker
      ConfigChecker(model, output_zip_path, fmk)
  - model indicates the initialized model. By default, the weight and dataset are not collected.
  - output_zip_path indicates the path of the output .zip package. The value is a string and must include the name of the .zip package. The default value is ./config_check_pack.zip.
  - fmk indicates the training framework. The value can be set to pytorch or mindspore. In this example, pytorch is used.
3. Run the training script.
```
python pytorch_main.py -a resnet50 -b 32 --gpu 1 --dummy
```
  After the collection is complete, a .zip package is generated, which contains the configurations that affect the accuracy. The data is stored by rank and step (micro_step).
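
To confirm that collection produced a package, you can list its entries with the Python standard library. This is a minimal sketch: the helper name list_package is illustrative, and the internal layout of the package is not assumed here beyond it being a regular .zip archive.

```python
import zipfile


def list_package(zip_path: str) -> list:
    """Return the sorted entry names stored in a .zip package."""
    with zipfile.ZipFile(zip_path) as zf:
        return sorted(zf.namelist())


if __name__ == "__main__":
    # Default package name from this document; adjust to your own path.
    for name in list_package("./config_check_pack.zip"):
        print(name)
```

An empty list or a zipfile.BadZipFile error suggests that the training run ended before the package was written.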
Upload the two .zip packages to the same environment and run the following command to compare them:
```
msprobe -f pytorch config_check -c bench_zip_path cmp_zip_path -o output_path 
```
bench_zip_path indicates the name of the .zip package collected on the benchmark side, and cmp_zip_path indicates the name of the .zip package collected on the comparison side.

The default value of output_path is ./config_check_result.
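
If the comparison is launched from a script, the command above can be assembled as an argument list. The sketch below only builds the list and mirrors the options shown in this document. The helper name build_compare_command and the sample file names are illustrative, and passing a value other than pytorch to -f is an assumption that is not confirmed here.

```python
import shlex


def build_compare_command(bench_zip_path, cmp_zip_path,
                          output_path="./config_check_result",
                          fmk="pytorch"):
    """Return the msprobe comparison command as a list of arguments."""
    return [
        "msprobe", "-f", fmk, "config_check",
        "-c", bench_zip_path, cmp_zip_path,
        "-o", output_path,
    ]


if __name__ == "__main__":
    # Hypothetical package names for the benchmark and comparison sides.
    cmd = build_compare_command("config_check_pack_gpu.zip",
                                "config_check_pack_npu.zip")
    print(shlex.join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) on a machine where msprobe is installed.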

After the preceding command is executed, the following data is generated in output_path:
- bench indicates the data packaged in bench_zip_path.
- cmp indicates the data packaged in cmp_zip_path.
- result.xlsx indicates the comparison results. There are multiple sheets. The summary sheet shows the overall results, and other sheets show the details of specific check items. The step is micro_step.
Check the results.
If the following five items are the same in both environments, the check passes. If they differ, adjust the environment accordingly.
- Environment variable
- Third-party library version
- Dataset
- Weight
- Training hyperparameter
See the following example:
| filename | pass_check |
| --- | --- |
| env | TRUE |
| pip | TRUE |
| dataset | TRUE |
| weights | TRUE |
| random | TRUE |
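
The pass rule described above can be expressed as a small check over the summary rows. The sketch below works on values typed in by hand from the example, represented as (item, passed) pairs. It does not read result.xlsx, and the helper name failed_items is illustrative.

```python
def failed_items(summary_rows):
    """Return the names of check items whose result is not TRUE."""
    return [name for name, passed in summary_rows if not passed]


# Rows copied from the example summary: every item is TRUE.
summary = [
    ("env", True),
    ("pip", True),
    ("dataset", True),
    ("weights", True),
    ("random", True),
]

if __name__ == "__main__":
    failures = failed_items(summary)
    if failures:
        print("Adjust the environment for:", ", ".join(failures))
    else:
        print("All items match; the check passes.")
```

Any item returned by failed_items points to the detail sheet to open first in result.xlsx.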
