Network Planning
Figure 1 Deployment logic


Nodes related to training jobs on a deep learning platform include compute nodes and storage nodes. They have the following functions:
- Compute node: executes training and inference jobs. MindIO TFT is deployed only on compute nodes.
- Storage node: stores platform data and user data, such as platform logs, datasets uploaded by users, training scripts, and models output after training.
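The node list above sets the placement rule: the SDK goes on compute nodes and nowhere else. The following Python sketch illustrates that rule by filtering a node inventory by role. The `Role` and `Node` types, the hostnames, and the `sdk_targets` helper are hypothetical and are not part of MindIO TFT. A real inventory would come from your own cluster management tooling.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    """Node roles mentioned on this page (hypothetical model)."""
    MANAGEMENT = "management"
    COMPUTE = "compute"
    STORAGE = "storage"


@dataclass(frozen=True)
class Node:
    hostname: str  # hypothetical inventory field
    role: Role


def sdk_targets(inventory: list[Node]) -> list[str]:
    """Return the hostnames that need the MindIO TFT SDK: compute nodes only."""
    return [n.hostname for n in inventory if n.role is Role.COMPUTE]


inventory = [
    Node("cn-01", Role.COMPUTE),
    Node("cn-02", Role.COMPUTE),
    Node("st-01", Role.STORAGE),  # storage nodes get nothing installed
]
print(sdk_targets(inventory))  # ['cn-01', 'cn-02']
```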
The functions of each network plane are as follows:
- Service plane: carries cluster service management traffic and connects management nodes and compute nodes.
- Storage plane: provides access to storage nodes by connecting management nodes and compute nodes to them.
- Parameter plane: connects training nodes for distributed training and carries the parameters they exchange.
The logical deployment diagram gives an overview of the deep learning platform. MindIO TFT requires only a software development kit (SDK) to be deployed on each compute node. Nothing needs to be installed or deployed on storage nodes.
The MindIO TFT SDK instances on the compute nodes communicate with each other and exchange heartbeat packets, so a service plane network is required. The SDK is deployed in peer-to-peer mode on all compute nodes running LLM training, with no distinction between management nodes and compute nodes.
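The peer-to-peer heartbeat arrangement described above can be pictured with a small failure-detector sketch in Python. It assumes that each SDK instance records when it last received a heartbeat from every other compute node over the service plane, and that it suspects any peer whose last heartbeat is older than a timeout. The class name, the method names, and the 10-second timeout are hypothetical and do not describe the actual MindIO TFT implementation. Timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class PeerHeartbeatTable:
    """Heartbeat table kept by one SDK instance (hypothetical sketch).

    Every compute node runs the same code and tracks all of its peers;
    no node acts as a coordinator, mirroring the peer-to-peer deployment.
    """
    self_id: str
    peers: list[str]
    timeout_s: float = 10.0  # hypothetical value, not a MindIO TFT default
    last_seen: dict[str, float] = field(default_factory=dict)

    def start(self, now: float) -> None:
        # Treat every other peer as alive at start-up so each gets one full timeout window.
        for p in self.peers:
            if p != self.self_id:
                self.last_seen[p] = now

    def on_heartbeat(self, sender: str, now: float) -> None:
        # Record a heartbeat from a known peer; ignore unknown senders.
        if sender in self.last_seen:
            self.last_seen[sender] = now

    def suspected(self, now: float) -> list[str]:
        # Peers whose most recent heartbeat is older than the timeout.
        return sorted(p for p, t in self.last_seen.items() if now - t > self.timeout_s)


nodes = ["cn-01", "cn-02", "cn-03"]
table = PeerHeartbeatTable("cn-01", nodes, timeout_s=10.0)
table.start(now=0.0)
table.on_heartbeat("cn-02", now=8.0)
print(table.suspected(now=12.0))  # ['cn-03']
```

Because every node keeps an identical table and none is special, the sketch reflects the peer-to-peer model in which the SDK does not distinguish between management and compute nodes.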