Network Planning

Figure 1 Deployment logic

Nodes related to training jobs on a deep learning platform include compute nodes and storage nodes. They have the following functions:

  • Compute node: executes training and inference jobs. MindIO ACP is deployed only on compute nodes.
  • Storage node: stores platform data and user data, such as platform logs, user-uploaded datasets, training scripts, and the models produced by training.

The functions of each network plane are as follows:

  • Service plane: manages cluster services. It connects management and compute nodes.
  • Storage plane: provides access to storage nodes. It connects management and compute nodes to storage nodes.
  • Parameter plane: connects training nodes and carries the parameters they exchange during distributed training.

Note:

  • The deployment logic diagram gives an overall view of a deep learning platform. MindIO ACP is deployed only on compute nodes, so it does not involve installing or deploying management nodes or storage nodes.
  • MindIO ACP is a single-node in-memory cache system. Checkpoint data is exchanged with MindIO ACP through shared memory, so network plane division does not apply to it.
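
To illustrate why access through shared memory needs no network plane, the following minimal sketch passes a payload between a writer and a reader on the same node using Python's standard multiprocessing.shared_memory module. It is not the MindIO ACP interface: the function names write_checkpoint, read_checkpoint, and roundtrip are hypothetical and exist only for this illustration.

```python
from multiprocessing import shared_memory


def write_checkpoint(payload: bytes) -> shared_memory.SharedMemory:
    """Create a shared-memory segment on this node and copy the payload into it.

    Hypothetical helper for illustration; the payload must not be empty.
    """
    segment = shared_memory.SharedMemory(create=True, size=len(payload))
    segment.buf[:len(payload)] = payload
    return segment


def read_checkpoint(name: str, size: int) -> bytes:
    """Attach to an existing segment by name and copy the payload out."""
    segment = shared_memory.SharedMemory(name=name)
    try:
        # Slice by the known size: the OS may round the segment up to a page.
        return bytes(segment.buf[:size])
    finally:
        segment.close()


def roundtrip(payload: bytes) -> bytes:
    """Write a payload and read it back; no socket or network interface is used."""
    writer = write_checkpoint(payload)
    try:
        return read_checkpoint(writer.name, len(payload))
    finally:
        writer.close()
        writer.unlink()  # release the segment once it is no longer needed


if __name__ == "__main__":
    data = b"step-100 weights"
    print(roundtrip(data) == data)
```

The writer and the reader find each other only through the segment name, which is valid on the local node alone. This is why a cache accessed this way is confined to a single node and sits outside the service, storage, and parameter planes.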