Network Requirements

Volcano, a core cluster scheduling component, is deployed on the Kubernetes management node. To keep the service stable, follow the management node deployment recommendations below, which are based on Kubernetes deployment requirements. You can adjust them to suit your service characteristics.
  • Separate the management node from the compute and storage nodes. You are advised to deploy the management node on an independent server.
  • If the cluster is large or high service reliability is required, deploy multiple management nodes (multi-node mode).
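
The two recommendations above can be checked against a node inventory before deployment. The following is a minimal sketch only: the Node type, the role names, the check_management_nodes function, and the sample inventory are all hypothetical and are not part of Volcano or Kubernetes. It also assumes that "multi-node mode" means at least two management nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical inventory entry; roles are 'management', 'compute' or 'storage'."""
    name: str
    roles: frozenset


def check_management_nodes(nodes, require_ha=False):
    """Return warnings for an inventory that departs from the recommendations."""
    warnings = []
    managers = [n for n in nodes if "management" in n.roles]
    if not managers:
        warnings.append("no management node defined")
    # Recommendation 1: a management node should not double as compute or storage.
    for n in managers:
        extra = sorted(n.roles - {"management"})
        if extra:
            warnings.append(
                f"{n.name}: management node also acts as {', '.join(extra)}"
            )
    # Recommendation 2 (assumed threshold): multi-node mode means 2 or more managers.
    if require_ha and len(managers) < 2:
        warnings.append("high reliability requested but fewer than 2 management nodes")
    return warnings


# Sample inventory with made-up host names.
inventory = [
    Node("master-0", frozenset({"management"})),
    Node("worker-0", frozenset({"compute"})),
    Node("worker-1", frozenset({"management", "compute"})),
    Node("storage-0", frozenset({"storage"})),
]

for warning in check_management_nodes(inventory):
    print(warning)
```

Here only worker-1 is reported, because it combines the management and compute roles. Pass require_ha=True when the cluster is large or needs high reliability.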

Deployment Logic

Figure 1 Deployment logic

Nodes in a data center cluster are classified into the following types:

  • Management node (master node): manages the cluster, distributes training or inference jobs to compute nodes for execution, and hosts the cluster scheduling components that run on the master node.
  • Compute node (worker node): executes training or inference jobs and hosts the cluster scheduling components that run on worker nodes.
  • Storage node: stores datasets and trained models.

The network planes are divided into the following types:

  • Service plane: manages Kubernetes cluster services.
  • Storage plane: reads training datasets from storage nodes. Because this traffic needs high bandwidth, you are advised to use an independent network to connect training nodes (management nodes or compute nodes) to storage nodes.
  • Parameter plane: exchanges parameters among training nodes during distributed training.
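
The advice to give the storage plane an independent network can be checked against a planned address layout. The following is a minimal sketch only: the subnets in PLANES and the planes_sharing_network function are hypothetical, and real addressing depends on your data center. It uses only the Python standard library ipaddress module.

```python
import ipaddress

# Hypothetical subnet plan, one CIDR block per network plane.
PLANES = {
    "service": "192.168.10.0/24",
    "storage": "192.168.20.0/24",
    "parameter": "192.168.30.0/24",
}


def planes_sharing_network(planes, plane="storage"):
    """Return the names of planes whose subnet overlaps that of `plane`."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in planes.items()}
    target = nets[plane]
    return sorted(
        name
        for name, net in nets.items()
        if name != plane and net.overlaps(target)
    )


# An empty list means the storage plane has its own address range in this plan.
print(planes_sharing_network(PLANES))
```

Separate subnets are only a proxy for an independent network. Whether the planes share physical links or switches has to be verified in the actual cabling and switch configuration.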