Basic Performance Tuning for Kubernetes Clusters

The MindCluster cluster scheduling components are built on the Kubernetes ecosystem, so resumable training is available only when training jobs are scheduled on a Kubernetes platform. Resumable training supports the same Kubernetes versions as the MindCluster cluster scheduling components: 1.17.x to 1.34.x (1.19.x or later is recommended).

The following table lists configurations that are recommended for a cluster with 10,000 cards. You can adjust the configurations based on the actual cluster scale.

**Table 1** Configuration description
Configuration Item	Description	Recommended Configuration	Reference File Path
API server startup parameters	--max-requests-inflight and --max-mutating-requests-inflight cap the number of non-mutating (read) and mutating (write) requests, respectively, that the API server processes concurrently. If the values are too small, requests beyond the cap are rejected with errors indicating that the threshold was exceeded. If the values are too large, the API server occupies excessive memory.	--max-requests-inflight=20000 --max-mutating-requests-inflight=2000	/etc/kubernetes/manifests/kube-apiserver.yaml
API server startup parameters	--watch-cache enables the API server cache, and --watch-cache-sizes sets its size per resource type. When the API server reads etcd objects, it checks the local cache first. If the required object is not in the cache, the API server reads it from etcd and stores the result in the cache. When the cache reaches its upper limit, existing entries are overwritten. A properly sized cache reduces the number of reads that must go to etcd.	--watch-cache=true --watch-cache-sizes=node#1000,pod#2000,event#200,namespace#100,service#200	/etc/kubernetes/manifests/kube-apiserver.yaml
API server resources	The CPU resources configured for the API server determine its processing capability.	Set the CPU request of the API server to 35 cores. resources.requests.cpu: 35000m NOTE: This value is a resource request, not a limit, so it does not cap the overall CPU usage of the API server.	/etc/kubernetes/manifests/kube-apiserver.yaml
etcd startup parameters	--quota-backend-bytes sets the maximum storage space of the etcd backend database. The default value is 2 GB.	Change the value to 8 GB (8 × 1024³ bytes). --quota-backend-bytes=8589934592	/etc/kubernetes/manifests/etcd.yaml
etcd startup parameters	--auto-compaction-retention enables automatic compaction of the etcd key-value history to reduce resource usage.	Enable automatic compaction to reduce resource usage. --auto-compaction-retention NOTE: --auto-compaction-retention alone does not release disk space. To reclaim the space, you also need to run etcdctl compact and etcdctl defrag manually.	/etc/kubernetes/manifests/etcd.yaml
etcd resources	The CPU and memory resources configured for etcd determine its processing capability.	Set the CPU request of etcd to 20 cores and the memory request to about 10 GB. resources.requests.cpu: 20000m, resources.requests.memory: 10000Mi	/etc/kubernetes/manifests/etcd.yaml
Volcano resources	The CPU and memory resources configured for Volcano affect the Volcano processing capability.	Change the upper limit of CPU resources for Volcano requests to 20 cores and the upper limit of memory resources to 8 GB. resources: requests: cpu: 20000m memory: 4Gi	Reference configuration command: kubectl edit deployment -n volcano-system volcano-scheduler
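
As a rough illustration of where the API server and etcd settings in Table 1 go, the sketch below shows fragments of the two static Pod manifests. It assumes a kubeadm-style layout in which startup flags are entries in the container's command list. The metadata names and the omitted fields are assumptions, and only the flags and resource values come from the table. The etcd quota is written as 8 × 1024³ = 8589934592 bytes, the API server read limit uses the flag spelling --max-requests-inflight that kube-apiserver accepts, and --auto-compaction-retention appears only as a comment because the table does not give a value for it.

```yaml
# Fragment of /etc/kubernetes/manifests/kube-apiserver.yaml
# (kubeadm-style layout assumed; image, volumes, probes, and other fields omitted)
apiVersion: v1
kind: Pod
metadata:
  name: kube-apiserver
  namespace: kube-system
spec:
  containers:
  - name: kube-apiserver
    command:
    - kube-apiserver
    # ... keep the flags that are already present ...
    - --max-requests-inflight=20000           # concurrent read requests
    - --max-mutating-requests-inflight=2000   # concurrent write requests
    - --watch-cache=true
    - --watch-cache-sizes=node#1000,pod#2000,event#200,namespace#100,service#200
    resources:
      requests:
        cpu: 35000m   # a request, not a limit: actual CPU usage is not capped
---
# Fragment of /etc/kubernetes/manifests/etcd.yaml
# (same assumptions as above)
apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  containers:
  - name: etcd
    command:
    - etcd
    # ... keep the flags that are already present ...
    - --quota-backend-bytes=8589934592   # 8 * 1024^3 bytes
    # --auto-compaction-retention: add this flag with a retention value
    # chosen for your cluster; Table 1 does not specify one.
    resources:
      requests:
        cpu: 20000m
        memory: 10000Mi
```

The kubelet watches the static Pod manifest directory, so saving either file normally causes the corresponding Pod to be recreated with the new settings. Editing one control-plane node at a time keeps the cluster available while the changes roll out.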
