Script Adaptation Description
Corresponding Process
The fault recovery and dying gasp functions of resumable training can be used only after the model script is adapted. Adapt the script by following the process in Figure 1. Once the script can save and load checkpoints, the fault recovery function of resumable training is available. To also use the dying gasp function, continue with the related script adaptation.
1. Check whether the script saves checkpoints. If yes, go to 3. If no, go to 2.
2. Refer to the tutorial on the official MindSpore website to add checkpoint saving.
3. Check whether the script can load checkpoints. If yes, no further action is required. If no, go to 4.
4. Refer to the tutorial on the official MindSpore website to add checkpoint loading.
5. To use dying gasp, refer to the applicable Script Adaptation section to adapt the corresponding script.
6. Check whether the model is in hybrid parallel mode. If no, no further action is required. If yes, go to 7.
7. Check whether the restoration policy for loading dying gasp checkpoints needs to be enabled. If yes, go to 8. If no, no further action is required.
8. Refer to Code Adaptation Example of the Hybrid Parallel Model Based on the Pangu_alpha Model to adapt the restoration policy code.
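The decision flow in the steps above can be summarized as a short sketch. The function `adaptation_steps` and its boolean parameters are hypothetical names introduced only for illustration. They are not part of MindSpore or of the resumable training feature. The function returns the adaptation actions that a script still needs.

```python
from typing import List


def adaptation_steps(
    saves_checkpoint: bool,
    loads_checkpoint: bool,
    use_dying_gasp: bool = False,
    hybrid_parallel: bool = False,
    enable_restore_policy: bool = False,
) -> List[str]:
    """Return the adaptation actions still required, in order.

    Hypothetical helper that only mirrors the steps described above.
    """
    actions = []
    # Steps 1-2: the script must save checkpoints.
    if not saves_checkpoint:
        actions.append("add checkpoint saving")
    # Steps 3-4: the script must be able to load checkpoints.
    if not loads_checkpoint:
        actions.append("add checkpoint loading")
    # Step 5: dying gasp needs its own script adaptation.
    if use_dying_gasp:
        actions.append("adapt script for dying gasp")
        # Steps 6-8: only a hybrid parallel model may need the restoration policy.
        if hybrid_parallel and enable_restore_policy:
            actions.append("adapt restoration policy code")
    return actions


if __name__ == "__main__":
    print(adaptation_steps(True, False, use_dying_gasp=True, hybrid_parallel=True))
```

An empty result means that the script already meets the requirements for the functions you selected.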
Adaptation Description
The code shown in this section is open source. The user and user group of the related Python and shell scripts must be the same. For security purposes, you are advised to verify the input parameters, file directories, and file paths.
The verification items of the input parameters include but are not limited to the following:
- If external variables are used as part of a command, apply strict parameter verification and anti-injection measures.
- If external variables obtained from environment variables are used for command concatenation, apply strict verification and anti-injection measures.
- All processes should follow the principle of least privilege to avoid serious consequences caused by injection.
- External variables must not be used directly as commands in the code.
- Security specifications of various programming languages must be complied with.
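As an illustration of the input parameter checks above, the following minimal sketch validates an external value against an allowlist before it becomes part of a command, and passes the command as an argument list so that no shell parses it. The names `validate_param`, `build_command`, and `run_training`, the `--job-name` option, and the 64-character limit are assumptions made for this example, not part of any product interface.

```python
import re
import subprocess
import sys
from typing import List

# Assumption: a conservative allowlist. The first character must be
# alphanumeric so the value cannot be mistaken for a command-line option.
_SAFE_PARAM = re.compile(r"[A-Za-z0-9][A-Za-z0-9_.\-]{0,63}")


def validate_param(value: str) -> str:
    """Return value unchanged if it matches the allowlist, else raise ValueError."""
    if not isinstance(value, str) or _SAFE_PARAM.fullmatch(value) is None:
        raise ValueError(f"rejected parameter: {value!r}")
    return value


def build_command(script: str, job_name: str) -> List[str]:
    """Build an argument list; the external value is never parsed by a shell."""
    return [sys.executable, script, "--job-name", validate_param(job_name)]


def run_training(script: str, job_name: str) -> int:
    """Run the (hypothetical) training script and return its exit code."""
    # shell=False is the default, so arguments go directly to the process.
    completed = subprocess.run(build_command(script, job_name), check=False)
    return completed.returncode
```

The same check applies to values read from environment variables: validate the value returned by `os.environ.get` before it is added to the argument list.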
The verification items of the file paths include but are not limited to the following:
- The path length is limited.
- Special character filtering and anti-bypass mechanisms are provided for paths.
- No command injection exists.
- Processes must follow the principle of least privilege.
- No high-risk path exists in the trustlist.
- The authenticity of each file path is verified, and an exception is thrown if verification fails.
Note:
- Command injection is unexpected behavior that occurs when externally controllable variables are used in commands.
- The dying gasp function and recovery policy support only Python 3.7 and Python 3.9.
- During script adaptation, locate exceptions and handle them according to your service logic.
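The file path verification items listed above can be sketched as follows. This is a minimal example under stated assumptions: `validate_path`, `allowed_root`, and the `MAX_PATH_LENGTH` value are hypothetical and should be adapted to your environment. The function limits the path length, rejects null bytes, resolves symbolic links and `..` components to prevent bypass, confirms that the result stays inside an allowed directory, and throws an exception when any check fails.

```python
import os

MAX_PATH_LENGTH = 4096  # Assumption: a typical Linux limit; adjust as needed.


def validate_path(path: str, allowed_root: str) -> str:
    """Return the resolved path if it passes every check, else raise."""
    if not path or len(path) > MAX_PATH_LENGTH:
        raise ValueError("path is empty or too long")
    if "\x00" in path:
        raise ValueError("path contains a null byte")
    # Resolve symlinks and ".." so that a crafted path cannot bypass the check.
    resolved = os.path.realpath(path)
    root = os.path.realpath(allowed_root)
    if os.path.commonpath([resolved, root]) != root:
        raise PermissionError(f"path escapes the allowed directory: {resolved}")
    if not os.path.exists(resolved):
        raise FileNotFoundError(resolved)
    return resolved
```

Call `validate_path` on checkpoint directories and file names before the script opens them, and handle the raised exceptions according to your service logic.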
