Multi-Server and Multi-Device Training
This section uses the fine-tuning of a caption task in OPT as an example. For the configuration, see MindSpore Distributed Parallel Training Example.
The SERVER_ID variable in the model startup script must be modified based on the server in use. The other modifications are the same as those in Single-Server Multi-Device Training. For the complete modified script, visit the Gitee repository.
export SERVER_ID=0 # Set the value to 0 on the first server and 1 on the second server.
- For details about how to generate RANK_TABLE_FILE, see RANK_TABLE_FILE Generation Method.
- Set SERVER_ID on each server, for example, 0 for the first server and 1 for the second server.
- To ensure that run logs are correctly flushed to disk, set SERVER_ID and DEVICE_ID properly; DEVICE_ID must be a value in the range [0, 8).
- The startup script, model code, and rank_table_file.json of each server must be stored in the same path.
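The settings above can be sketched as a per-server launch loop. This is a minimal, hedged sketch: the `RANK_ID = SERVER_ID * 8 + DEVICE_ID` formula, the training entry point, and the log paths are assumptions for illustration, not taken from the actual OPT startup script.

```shell
# Hypothetical per-server launch loop, assuming 8 devices per server.
export SERVER_ID=0                                  # 0 on the first server, 1 on the second
export RANK_TABLE_FILE="$PWD/rank_table_file.json"  # same path on every server

DEVICE_ID=0
while [ "$DEVICE_ID" -lt 8 ]; do                    # keep DEVICE_ID in [0, 8)
    RANK_ID=$((SERVER_ID * 8 + DEVICE_ID))          # assumed global-rank convention
    export DEVICE_ID RANK_ID
    echo "launching rank ${RANK_ID} on device ${DEVICE_ID}"
    # python train.py > train_rank_${RANK_ID}.log 2>&1 &   # hypothetical entry point
    DEVICE_ID=$((DEVICE_ID + 1))
done
```

On the second server, setting `SERVER_ID=1` would yield ranks 8 through 15, so each process writes to a distinct per-rank log file.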
Parent topic: Distributed Training Scenarios