Network Requirements

Volcano, the core cluster scheduling component, is deployed on the Kubernetes master node. To keep the service stable, follow the recommendations below for master node deployment, which are based on Kubernetes deployment requirements. Adjust them to suit your service characteristics.

Separate the master node from the worker and storage nodes. Deploy the master node on a dedicated server where possible.
If the cluster is large or requires high service reliability, deploy the master node in multi-node mode.
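
The two recommendations above can be checked against a planned node layout before deployment. The following is a minimal sketch, not part of Volcano or Kubernetes: the `Node` class, the `check_master_placement` function, the role names, and the host names are all hypothetical, and "multi-node mode" is assumed here to mean at least two master nodes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    hostname: str      # hypothetical host name from your own plan
    roles: frozenset   # any of {"master", "worker", "storage"}


def check_master_placement(nodes, require_ha=False):
    """Return a list of warnings for a planned node layout."""
    warnings = []
    masters = [n for n in nodes if "master" in n.roles]
    if not masters:
        warnings.append("no master node is defined")
    for node in masters:
        # A master should not share its server with worker or storage roles.
        shared = node.roles - {"master"}
        if shared:
            warnings.append(
                f"{node.hostname}: master shares a server with "
                + ", ".join(sorted(shared))
            )
    # Assumption: multi-node mode means two or more master nodes.
    if require_ha and len(masters) < 2:
        warnings.append("high reliability requested but fewer than 2 master nodes planned")
    return warnings


# Example plan with one dedicated server per role.
plan = [
    Node("master-1", frozenset({"master"})),
    Node("worker-1", frozenset({"worker"})),
    Node("storage-1", frozenset({"storage"})),
]
print(check_master_placement(plan, require_ha=True))
```

An empty list means the plan follows both recommendations. Adjust the threshold for `require_ha` to match your own reliability requirements.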

Deployment Logic

Figure 1 Deployment logic

Nodes in a data center cluster are classified into the following types:

Master node: manages the cluster, distributes training or inference jobs to worker nodes for execution, and runs the cluster scheduling components intended for master nodes.
Worker node: executes training or inference jobs and runs the cluster scheduling components intended for worker nodes.
Storage node: stores datasets and trained models.

Divide the network into the following planes:

Service plane: manages Kubernetes cluster services.
Storage plane: carries the training datasets read from storage nodes. Because of its bandwidth requirements, you are advised to connect training nodes to storage nodes over an independent network.
Parameter plane: connects training nodes and carries the parameters they exchange during distributed training. For details, see the Ascend Training Solution 23.0.RC1 Networking Guide, which describes how to set up networking with Huawei training computing devices, including the Atlas 800 training server and the Atlas 900 PoD (model 9000).
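
One way to sanity-check a plane layout is to confirm that the planes do not share address space. The sketch below assumes that each plane is assigned its own IP subnet, which this guide does not mandate. The subnet values and the `overlapping_planes` function are hypothetical. The check uses only the Python standard library `ipaddress` module.

```python
import ipaddress

# Hypothetical subnets; replace them with your own addressing plan.
PLANES = {
    "service": "192.168.10.0/24",
    "storage": "192.168.20.0/24",
    "parameter": "192.168.30.0/24",
}


def overlapping_planes(planes):
    """Return pairs of plane names whose subnets overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in planes.items()}
    names = sorted(nets)
    return [
        (a, b)
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if nets[a].overlaps(nets[b])
    ]


print(overlapping_planes(PLANES))
```

An empty result only shows that the address ranges are disjoint. It does not prove that the storage plane runs on physically independent links or switches, which still has to be verified against the cabling plan.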
