Script Adaptation

For details about script adaptation, see Script Adaptation in the resumable training section. Ensure that the model script can still load model parameters correctly after the job scale is reduced. For a hybrid parallel model, in addition to adding the restoration code as described in Code Adaptation Example of the Hybrid Parallel Model Based on the Pangu_alpha Model, you also need to add a restoration policy check function. The following uses the pangu_alpha code from the r1.9 branch of the MindSpore repository as an example to describe how to add this function:

Adapt the train.py file by referring to the code in the hccl_check.py file in Code Repository. The following skeleton shows where the added code goes (the hccl_check and get_restore_strategy functions, and the get_restore_strategy() call in the main block):

...
"""
PanguAlpha train script
"""
# Import dependencies.
...
def run_train_pipeline(args_opt):
    ...

# Refer to the hccl_check.py code.
def hccl_check(need_device_num) -> bool:
    ...

def get_restore_strategy():
    ...

if __name__ == "__main__":
    get_restore_strategy()
    opt = get_args()
...
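
The skeleton above elides the bodies of hccl_check and get_restore_strategy; the actual code is in hccl_check.py. To illustrate the general idea of a restoration policy check, the following is a minimal sketch and not the hccl_check.py implementation. It assumes that the available devices are described by a rank-table-like dictionary with a server_list of servers, each holding a device list, and that the candidate job scales are known in advance. The names count_devices, has_enough_devices, pick_restore_device_num, and example_rank_table are hypothetical and exist only for this illustration.

```python
from typing import Iterable, Optional


def count_devices(rank_table: dict) -> int:
    """Count the devices listed in a rank-table-like dict (assumed layout)."""
    servers = rank_table.get("server_list", [])
    return sum(len(server.get("device", [])) for server in servers)


def has_enough_devices(need_device_num: int, rank_table: dict) -> bool:
    """Hypothetical stand-in for a device check: True if enough devices remain."""
    return count_devices(rank_table) >= need_device_num


def pick_restore_device_num(candidates: Iterable[int],
                            rank_table: dict) -> Optional[int]:
    """Return the largest candidate job scale that the remaining devices can
    satisfy, or None if no candidate fits."""
    for need in sorted(candidates, reverse=True):
        if has_enough_devices(need, rank_table):
            return need
    return None


# Illustrative data only: two servers with two devices each (4 devices total).
example_rank_table = {
    "server_list": [
        {"server_id": "node-0", "device": [{"device_id": "0"}, {"device_id": "1"}]},
        {"server_id": "node-1", "device": [{"device_id": "0"}, {"device_id": "1"}]},
    ]
}

if __name__ == "__main__":
    # Candidate scales of 8, 4, and 2 devices; only 4 devices are available.
    print(pick_restore_device_num([8, 4, 2], example_rank_table))  # prints 4
```

In this sketch the check runs before any argument parsing or training setup, mirroring the position of the get_restore_strategy() call at the start of the main block in the skeleton. How the selected scale maps to a parallel strategy and to the checkpoint files to load depends on the model script and is not covered here.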
