Network Planning

Figure 1 Deployment logic

Nodes related to training jobs on a deep learning platform include compute nodes and storage nodes. They have the following functions:

  • Compute node: executes training and inference jobs. MindIO TFT is deployed only on compute nodes.
  • Storage node: stores platform data and user data, such as platform logs, datasets uploaded by users, training scripts, and models output after training.

The functions of each network plane are as follows:

  • Service plane: manages cluster services. It connects management and compute nodes.
  • Storage plane: provides access to storage nodes. It connects management and compute nodes to storage nodes.
  • Parameter plane: connects training nodes and carries the parameter exchange between them during distributed training.

Note the following:

  • The logical deployment diagram shows a complete deep learning platform. MindIO TFT itself requires only that its software development kit (SDK) be deployed on each compute node; nothing is installed or deployed on storage nodes.
  • The MindIO TFT SDK instances on compute nodes communicate with one another and generate heartbeat packets, which requires a service plane network. The SDK is deployed in peer-to-peer mode on all compute nodes running LLM training, without distinguishing between management and compute nodes.
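
The node roles and network planes described above can be written down as a small planning check. The following is a minimal Python sketch and is not part of MindIO TFT: the names PLANE_ROLES, Node, needs_sdk, check_plan, and EXAMPLE_PLAN, as well as the node names, are hypothetical and chosen only for illustration. It also assumes that the "training nodes" on the parameter plane are the compute nodes.

```python
from dataclasses import dataclass

# Node roles that each network plane connects, as described in this section.
# Assumption: "training nodes" on the parameter plane are compute nodes.
PLANE_ROLES = {
    "service": {"management", "compute"},
    "storage": {"management", "compute", "storage"},
    "parameter": {"compute"},
}


@dataclass(frozen=True)
class Node:
    """Hypothetical description of one node in a deployment plan."""
    name: str
    role: str            # "management", "compute", or "storage"
    planes: frozenset    # names of the planes this node is attached to


def needs_sdk(node):
    """The MindIO TFT SDK is deployed only on compute nodes."""
    return node.role == "compute"


def check_plan(nodes):
    """Return a list of problems found in a plan (empty if none)."""
    problems = []
    for node in nodes:
        for plane in sorted(node.planes):
            if plane not in PLANE_ROLES:
                problems.append(f"{node.name}: unknown plane '{plane}'")
            elif node.role not in PLANE_ROLES[plane]:
                problems.append(
                    f"{node.name}: {node.role} node should not be on the {plane} plane"
                )
        # SDK instances exchange heartbeats over the service plane.
        if needs_sdk(node) and "service" not in node.planes:
            problems.append(
                f"{node.name}: MindIO TFT SDK heartbeats require the service plane"
            )
    return problems


# Illustrative plan: compute-1 is missing its service plane attachment.
EXAMPLE_PLAN = [
    Node("compute-0", "compute", frozenset({"service", "storage", "parameter"})),
    Node("compute-1", "compute", frozenset({"storage", "parameter"})),
    Node("storage-0", "storage", frozenset({"storage"})),
]

if __name__ == "__main__":
    for problem in check_plan(EXAMPLE_PLAN):
        print(problem)
```

Running the sketch on EXAMPLE_PLAN reports only compute-1, because that node would host the SDK but has no service plane connection for heartbeat traffic. The storage node passes without a service plane connection because nothing is deployed on it.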