Deterministic Computing Case

Case: The deterministic computing of a vision model is incorrect, that is, repeated training produces varying loss values.

Locating method:

Enable deterministic computing (operator deterministic computing + deterministic communication), set the random seed, disable dropout, and ensure that the dataset reading sequence is consistent.
You can use seed_all of the msprobe tool to automatically achieve the preceding purposes.
1 2
from msprobe.pytorch import seed_all seed_all(mode=True)
After the tool is enabled, the accuracy comparison result is improved. Some card results match, while others differ. Repeated training occurs randomly at varying cards and locations.

Use the dump tool to collect the mix-level data of step 0 and set summary_mode to md5 to highlight minor differences.

The config.json file is configured as follows:

{
    "task": "statistics",
    "dump_path": "/home/data_dump",
    "rank": [],
    "step": [0],
    "level": "mix",
    "enable_dataloader": false,
    "statistics": {
        "scope": [], 
        "list": [],
        "data_mode": ["all"],
        "summary_mode": "md5"
    }
}

The comparison of the data collected before and after shows that the input of masked_fill.23 is the first to be abnormal.

Figure 1 Comparison of the data collected before and after

Identify the input source using the stack.json call stack and the code in the dump results. The code refers to MMCV's MSDA.

Figure 2 Call stack of masked_fill.23

Figure 3 masked_fill.23 code

Confirm with the MSDA operator technical support that the operator does not support deterministic computing. It is recommended that small operators be used instead.

After the small operator is used, the input of masked_fill.23 is consistent, but the grid_sample is still different.

Figure 4 Comparison of the data collected before and after

Confirm with the grid_sample operator technical support that the operator does not support deterministic computing.

Solution: The MSDA operator is converted into small operators for concatenation, and the grid_sample operator is converted into the CPU.

Result: The randomness is fixed, and the repeated training results are the same.

Parent topic: Accuracy Locating Cases