Basic Performance Tuning for Kubernetes Clusters
The MindCluster cluster scheduling components are functional modules built upon the Kubernetes ecosystem. Therefore, resumable training is available only when training job scheduling operates on the Kubernetes platform. The Kubernetes versions compatible with resumable training align with those supported by the MindCluster cluster scheduling components, ranging from 1.17.x to 1.34.x (version 1.19.x or later is recommended).
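The version range above can be checked mechanically before resumable training is enabled. The following is a minimal sketch and not part of MindCluster: `k8s_support_level` is a hypothetical helper that takes a version string, such as the server version reported by `kubectl version`, and classifies it against the 1.17.x to 1.34.x range stated above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not a MindCluster or kubectl command).
# Classifies a Kubernetes version string against the range stated above:
#   1.17.x to 1.34.x is supported, 1.19.x or later is recommended.
k8s_support_level() {
  local version="${1#v}"          # drop an optional leading "v"
  local major="${version%%.*}"    # text before the first dot
  local rest="${version#*.}"      # text after the first dot
  local minor="${rest%%.*}"       # text before the next dot
  minor="${minor%%[!0-9]*}"       # strip non-numeric suffixes such as "+"

  if [ "$major" != "1" ] || [ -z "$minor" ]; then
    echo "unsupported"
  elif [ "$minor" -ge 19 ] && [ "$minor" -le 34 ]; then
    echo "recommended"
  elif [ "$minor" -ge 17 ] && [ "$minor" -le 34 ]; then
    echo "supported"
  else
    echo "unsupported"
  fi
}
```

For example, `k8s_support_level v1.28.3` prints `recommended`, and `k8s_support_level 1.17.0` prints `supported`.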
The following table lists configurations that are recommended for a cluster with 10,000 cards. You can adjust the configurations based on the actual cluster scale.
| Configuration Item | Description | Recommended Configuration | Reference File Path |
|---|---|---|---|
| API server startup parameters | `--max-requests-inflight` and `--max-mutating-requests-inflight` limit the maximum number of read (non-mutating) and write (mutating) requests, respectively, that the API server processes concurrently. If the values are too small, requests fail with errors indicating that the number of requests exceeds the threshold. If the values are too large, the API server occupies excessive memory. | `--max-requests-inflight=20000 --max-mutating-requests-inflight=2000` | /etc/kubernetes/manifests/kube-apiserver.yaml |
| API server startup parameters | `--watch-cache` and `--watch-cache-sizes` enable the API server cache and set its size. When the API server obtains etcd objects, it checks the local cache first. If the required information is not in the cache, it reads from etcd and stores the etcd data in the cache. When the cache reaches its upper limit, existing entries are overwritten. A properly sized cache makes reading etcd data more efficient. | `--watch-cache=true --watch-cache-sizes=node#1000,pod#2000,event#200,namespace#100,service#200` | /etc/kubernetes/manifests/kube-apiserver.yaml |
| API Server resources | The CPU resources configured for the API Server affect its processing capability. | Set the CPU request of the API Server to 35 cores:<br>`resources:`<br>`  requests:`<br>`    cpu: 35000m`<br>NOTE: This parameter does not limit the overall CPU usage of the API Server. | /etc/kubernetes/manifests/kube-apiserver.yaml |
| etcd startup parameters | `--quota-backend-bytes` sets the maximum storage space of etcd. The default value is 2 GB. | Change the value to 8 GB:<br>`--quota-backend-bytes=8589934592` | /etc/kubernetes/manifests/etcd.yaml |
| etcd startup parameters | `--auto-compaction-retention` enables automatic compaction to reduce resource usage. | Enable automatic compaction to reduce resource usage:<br>`--auto-compaction-retention`<br>NOTE: `--auto-compaction-retention` alone does not release storage space. To reclaim the space, you must also manually run `etcdctl compact` and `etcdctl defrag`. | /etc/kubernetes/manifests/etcd.yaml |
| etcd resources | The CPU and memory resources configured for etcd affect its processing capability. | Set the CPU request of etcd to 20 cores and the memory request to 10 GB:<br>`resources:`<br>`  requests:`<br>`    cpu: 20000m`<br>`    memory: 10000Mi` | /etc/kubernetes/manifests/etcd.yaml |
| Volcano resources | The CPU and memory resources configured for Volcano affect its processing capability. | Set the CPU request of Volcano to 20 cores and the memory request to 8 GB:<br>`resources:`<br>`  requests:`<br>`    cpu: 20000m`<br>`    memory: 4Gi` | Reference configuration command: `kubectl edit deployment -n volcano-system volcano-scheduler` |
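The byte value passed to `--quota-backend-bytes` follows from a unit conversion. As a sketch, assuming that the sizes in the table are binary gigabytes (1 GiB = 1024³ bytes), the hypothetical helper `gib_to_bytes` reproduces the arithmetic:

```shell
#!/usr/bin/env bash
# Hypothetical helper: convert a whole number of GiB to bytes.
# Assumption: quota sizes are binary gigabytes (1 GiB = 1024^3 bytes).
gib_to_bytes() {
  echo $(( $1 * 1024 * 1024 * 1024 ))
}

gib_to_bytes 8   # prints 8589934592, the 8 GiB quota
gib_to_bytes 2   # prints 2147483648, the 2 GiB default
```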
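The manual space reclamation mentioned in the note on `--auto-compaction-retention` can be sketched as the sequence below. This is an illustrative sketch, not a MindCluster procedure. It assumes that etcdctl (v3 API) is installed and that the connection settings are supplied through the standard `ETCDCTL_*` environment variables. `extract_revision` and `compact_and_defrag` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Hypothetical helpers. Assumes etcdctl (v3 API) is on PATH and that
# ETCDCTL_ENDPOINTS, ETCDCTL_CACERT, ETCDCTL_CERT and ETCDCTL_KEY are
# set for the target cluster.

# Read the output of `etcdctl endpoint status --write-out=json` on stdin
# and print the first "revision" value found.
extract_revision() {
  grep -Eo '"revision":[0-9]+' | grep -Eo '[0-9]+' | head -n 1
}

# Compact the key history up to the current revision, then defragment
# so that the freed space is returned to the file system.
compact_and_defrag() {
  local rev
  rev="$(etcdctl endpoint status --write-out=json | extract_revision)"
  if [ -z "$rev" ]; then
    echo "could not determine the current etcd revision" >&2
    return 1
  fi
  etcdctl compact "$rev"   # discard key history older than this revision
  etcdctl defrag           # release the space freed by compaction
}
```

Calling `compact_and_defrag` runs both steps in order. `etcdctl defrag` acts only on the endpoints it is pointed at, so in a multi-member cluster it would need to be run against each member.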