Deploying a Single-Node Prefill-Decode Disaggregation Service Using kubectl
Restrictions
- This mode supports the deployment of Server, Coordinator, and Controller in prefill-decode disaggregation scenarios.
- For Atlas 800I A2 inference server, the NPU IP address needs to be configured and Ascend Operator needs to be installed.
- The current deployment script does not support rescheduling upon NPU faults.
Script Description
This section describes how to use the scripts in the MindIE Motor installation directory (examples/kubernetes_deploy_scripts) to deploy and uninstall a prefill-decode disaggregation cluster on MindIE in one-click mode. The cluster administrator can use Kubernetes kubectl to operate the cluster offline by referring to these scripts.
The cluster administrator only needs to write the startup script and configure the services and Kubernetes on the management node. Calling the deployment script then automatically delivers the service configuration and startup script, generates a global ranktable, and schedules pods to compute nodes.
- The MindIE installation and deployment scripts must be executed by the Kubernetes administrator to prevent malicious tampering with scripts or configurations, which may cause arbitrary command execution or container escape risks.
- The Kubernetes administrator must strictly control the write, update, and delete permissions of the MindIE ConfigMap. It is recommended that the permission on the installation directory be set to 750 and the permission on the file be set to 640. To prevent security risks caused by malicious modification and mounting to the pod, you are advised to use namespaces and RBAC to restrict permissions.
- If TLS certificates are enabled for the MindIE service, remove the three Kubernetes health probe configurations (readinessProbe, livenessProbe, and startupProbe, including their subfields) from the controller_init.yaml and coordinator_init.yaml files. Otherwise, Kubernetes may restart the service frequently. The MindIE service ensures high reliability through mechanisms such as active/standby switchover and pod rescheduling, so removing these health probes does not compromise service reliability.
- When requests are sent faster than they are processed, the Coordinator caches the unprocessed requests, leading to increased memory usage. As a result, request sending may be terminated because the memory usage reaches the upper limit. In this case, you need to appropriately increase the values of memory under the requests and limits fields in the coordinator_init.yaml file.
- The memory parameter under the requests field indicates the minimum memory required for Coordinator running.
- The memory parameter under the limits field indicates the upper limit of the memory available to the Coordinator.
To ensure that the Coordinator can reliably obtain the memory, the values of memory under the requests and limits fields should be the same (if possible). By default, the recommended memory specifications are as follows:
- 12Gi for a maximum of 10,000 concurrent requests
- 24Gi for a maximum of 20,000 concurrent requests
- 48Gi for a maximum of 40,000 concurrent requests
- 108Gi for a maximum of 90,000 concurrent requests
Maximum memory that can be occupied by the Coordinator = Value of body_limit × Value of max_requests × 120%
- body_limit: maximum size of a single request message body, in MB. You can configure and view the value in the ms_coordinator.json configuration file. For parameter details, see Parameters in the ms_coordinator.json Startup Configuration File.
- max_requests: maximum number of concurrent requests. You can configure and view the value in the ms_coordinator.json configuration file. For parameter details, see Parameters in the ms_coordinator.json Startup Configuration File.
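The formula above can be turned into a quick sizing check. The following is a minimal sketch and is not part of the MindIE scripts. The function name and the sample values (a body_limit of 1 MB and a max_requests of 10,000) are assumptions chosen for illustration only; read the real values from your ms_coordinator.json file.

```python
def coordinator_max_memory_mb(body_limit_mb: float, max_requests: int) -> float:
    """Upper bound on Coordinator memory in MB: body_limit x max_requests x 120%."""
    if body_limit_mb <= 0 or max_requests <= 0:
        raise ValueError("body_limit and max_requests must be positive")
    return body_limit_mb * max_requests * 120 / 100


if __name__ == "__main__":
    # Illustrative values only; they are not defaults taken from MindIE.
    estimate = coordinator_max_memory_mb(body_limit_mb=1, max_requests=10_000)
    print(f"Estimated maximum Coordinator memory: {estimate:.0f} MB")
```

Compare the result with the memory values under the requests and limits fields in coordinator_init.yaml, and raise those values if the estimate is higher.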
Directory structure of scripts:
├── boot_helper
│ ├── boot.sh
│ ├── gen_config_single_container.py
│ ├── get_group_id.py
│ ├── mindie_cpu_binding.py
│ ├── server_prestop.sh
│ └── update_mindie_server_config.py
├── chat.sh
├── conf
├── delete.sh
├── deploy.sh
├── deploy_ac_job.py
├── deployment
│ ├── controller_init.yaml
│ ├── coordinator_init.yaml
│ ├── mindie_ms_controller.yaml
│ ├── mindie_ms_coordinator.yaml
│ ├── mindie_server.yaml
│ ├── mindie_server_heterogeneous.yaml
│ ├── mindie_service_single_container.yaml
│ ├── mindie_service_single_container_base_A3.yaml
│ ├── server_init.yaml
│ └── single_container_init.yaml
├── gen_ranktable_helper
│ ├── gen_global_ranktable.py
│ └── global_ranktable.json
├── generate_stream.sh
├── log.sh
├── user_config.json
├── user_config_base_A3.json
└── utils
├── validate_config.py
└── validate_utils.py
The following table describes the key directories and files for single-node prefill-decode disaggregation deployment.
| Directory/File | Description |
|---|---|
| conf | Main service configuration files of the cluster management components and Server, used to manage scheduling policies and model configurations in the prefill-decode disaggregation scenario. |
| boot_helper | Contains the container startup script boot.sh, which modifies the configuration files in the conf directory (that is, generates several configuration files with unique port assignments from one Server config.json file), and the auxiliary script gen_config_single_container.py, which generates the corresponding global_ranktable. NOTE: MindIE depends on the jemalloc.so library file. Do not install invalid SO files with the same name in the /usr/ directory, as this may introduce security risks such as arbitrary command execution. |
| deployment | Defines a Kubernetes deployment task and configures the NPU resource usage, number of instances, and image. For single-node prefill-decode disaggregation deployment, only the mindie_service_single_container.yaml file is used. |
| chat.sh | Simple dialog example that uses curl to send HTTPS requests to the inference service. It is applicable to the prefix cache scenario. |
| generate_stream.sh | Streaming response example that uses curl to send an HTTPS request to the inference service. |
| deploy.sh | Deployment script, which starts all MindIE components in one-click mode. |
| delete.sh | Uninstallation script, which uninstalls all MindIE components in one-click mode. |
| log.sh | Queries the printed logs of all deployed pods. |
Procedure
This example deploys LLaMA3-8B with four instances, each configured with two NPUs. Perform the following operations in the deployment script path.
- Log in to the host of the cluster management node and, for the first deployment, create a namespace. The default namespace is mindie. Replace it as required.
kubectl create namespace mindie
- Configure the startup configuration file ms_controller.json of Controller. For details about the configuration file, see Configuration Description. Set deploy_mode to pd_disaggregation_single_container to enable single-node prefill-decode disaggregation deployment.
"deploy_mode": "pd_disaggregation_single_container"
- The request_coordinator_tls_enable, request_server_tls_enable, http_server_tls_enable, and cluster_tls_enable parameters in the ms_controller.json file specify whether to enable HTTPS. You are advised to enable HTTPS (by setting these parameters to true) to ensure communication security. If HTTPS is disabled, high network security risks exist.
- true: HTTPS enabled for MindIE components in the cluster. You need to import certificates to the container and configure the corresponding certificate paths.
- false: HTTP enabled for MindIE components in the cluster. You do not need to prepare certificates.
- All components for single-node prefill-decode disaggregation deployment are located in a single pod. The default port number of the http_server module is 1026, which conflicts with other port numbers. You need to change the port number to another non-conflicting port number. The recommended value is 1027. (If the port specified in the configuration file conflicts with an existing port, the program will automatically assign an available, non-conflicting port to ensure successful startup.)
- default_p_rate and default_d_rate in the ms_controller.json file control the ratio of prefill nodes to decode nodes in the cluster. Both default to 0, in which case the optimal ratio is determined automatically based on the model, hardware, and service information. You can also set them to the actual numbers of prefill nodes and decode nodes that suit your scenario. If the environment variables MINDIE_MS_P_RATE and MINDIE_MS_D_RATE are set, their values take precedence. For details, see Environment Variable.
- Configure the startup configuration file ms_coordinator.json of Coordinator. For details about the configuration file, see Configuration Description.
For single-node prefill-decode disaggregation deployment, set deploy_mode to pd_disaggregation_single_container.
"deploy_mode": "pd_disaggregation_single_container"
The controller_server_tls_enable, request_server_tls_enable, mindie_client_tls_enable, and mindie_mangment_tls_enable parameters in the ms_coordinator.json file specify whether to enable HTTPS. You are advised to enable HTTPS (by setting these parameters to true) to ensure communication security. If HTTPS is disabled, high network security risks exist.
- true: HTTPS enabled for MindIE components in the cluster. You need to import certificates to the container and configure the corresponding certificate paths.
- false: HTTP enabled for MindIE components in the cluster. You do not need to prepare certificates.
- Configure the config.json file for starting the Server service. Table 2 lists the parameters that need to be configured for the prefill-decode disaggregation deployment mode. For details about the parameters, see "Core Concepts and Configurations" > "Configuration Parameters (Serving)" in MindIE LLM Development Guide.
Table 2 Key parameters in config.json
| Parameter | Description |
|---|---|
| modelName | Model name, which is associated with the model weight file, for example, llama3-8b. |
| modelWeightPath | Directory of the model weight file, which must be set to the weight path mounted to the container specified in the mindie_service_single_container.yaml file. The default path is /mnt/mindie-service/ms/model. Ensure that the model files exist in this path on every Ascend compute node that the cluster can schedule to. |
| worldSize | Number of NPUs occupied by a prefill/decode instance. For example, if this parameter is set to 2, two NPUs are used. |
| npuDeviceIds | NPU IDs, starting from 0. The total number of IDs is the same as the value of worldSize, for example, [[0,1]]. |
| inferMode | Set this parameter to dmi. |
| tp | Number of tensor parallel processes on the entire network. Set it to the value of worldSize. This is a supplementary parameter and must be configured under the ModelConfig field. |
In single-node prefill-decode deployment, multiple Server processes run in one pod. Each Server process requires an independent configuration file. To simplify the configuration process, you can use the MINDIE_MS_GEN_SERVER_PORT environment variable in the mindie_service_single_container.yaml file for management. This environment variable supports two configuration modes:
- true (default value): The system automatically generates multiple configuration files (config1.json, config2.json, ..., config{server_num}.json) based on the user-defined config.json file. Each configuration file corresponds to a Server process. The system assigns non-conflicting port numbers to port, managementPort, metricsPort, and interCommPort in the configuration file to ensure that each process runs independently.
- false: You need to provide a set of configuration files (config1.json, config2.json, ..., config{server_num}.json) that match the number of Servers. In this mode, you can configure the parameters of each port. The system has a port conflict detection mechanism. If a port conflict occurs in the configuration file, the system automatically allocates other available port numbers to ensure that the program can be started properly.
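The automatic mode can be pictured as follows. This is a simplified sketch, not the actual logic in boot_helper. It assumes a flat config.json in which the four port fields sit at the top level (the real file may nest them), and the function name, the starting port, and the sample port values are hypothetical.

```python
import copy
import json

# Port fields named in the text above.
PORT_FIELDS = ("port", "managementPort", "metricsPort", "interCommPort")


def generate_server_configs(base_config: dict, server_num: int,
                            first_port: int = 10000) -> dict:
    """Return {"config1.json": {...}, ...} with a distinct port for every field."""
    configs = {}
    next_port = first_port  # hypothetical starting port, not a MindIE default
    for index in range(1, server_num + 1):
        config = copy.deepcopy(base_config)
        for field in PORT_FIELDS:
            config[field] = next_port
            next_port += 1
        configs[f"config{index}.json"] = config
    return configs


if __name__ == "__main__":
    # Placeholder base configuration; the port values are made up.
    base = {"modelName": "llama3-8b", "port": 1025, "managementPort": 1026,
            "metricsPort": 1027, "interCommPort": 1121}
    for name, cfg in generate_server_configs(base, server_num=4).items():
        print(name, json.dumps({field: cfg[field] for field in PORT_FIELDS}))
```

Handing out ports from one counter guarantees that no two Server processes in the pod share a port, which is the property the automatic mode is described as providing.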
- Configure the http_client_ctl.json configuration file, which is used to configure the HTTP client tool for cluster startup, liveness, and readiness probes. For details about the parameters, see Table 4.
tls_enable specifies whether to enable HTTPS. If the MindIE components in the cluster use the HTTPS interface, set tls_enable to true, import certificates to the container, and configure the corresponding certificate paths. To use the HTTP interface, set tls_enable to false. No certificate file is required.
You are advised to enable tls_enable to ensure communication security. If tls_enable is disabled, high network security risks exist.
- Configure the Kubernetes Deployment. Table 3 describes the main parameters.
Atlas 800I A2 inference server: Configure the mindie_service_single_container.yaml file in the deployment directory under the deployment script directory.
Atlas 800I A3 SuperPoD Server: Configure the mindie_service_single_container_base_A3.yaml file in the deployment directory under the deployment script directory.
- The script is for reference only. You need to ensure the security of the pod container. In the actual production environment, harden the security of the image and pod.
- When using kubectl to deploy a Deployment, you can modify the YAML configuration file of the Deployment. Do not use dangerous configurations. Use a secure image (running as a non-root user) and configure a secure pod security context.
- Mount only secure paths (no soft links, no dangerous system paths, and no service-sensitive paths) and set proper directory permissions. Do not mount public directories such as /home, because tampering by unauthorized users can lead to container escape.
- The mount path of the model weight on the host must be configured as required. The following is an example:
- name: model-path
  hostPath:
    path: /data/LLaMA3-8B
- Fields to be configured in the mindie_service_single_container.yaml file:
- huawei.com/Ascend910: total number of NPUs occupied by all prefill/decode instances, which equals the sum of the worldSize values in all Server configuration files.
- sp-block: size of a SuperPoD block, indicating the number of NPUs on a virtual SuperPoD. This parameter is configured only when Atlas 800I A3 SuperPoD Server is used (that is, the mindie_service_single_container_base_A3.yaml file is used). The value of this parameter must be the same as that of huawei.com/Ascend910.
- image: image name.
- MindIE and ATB Models used together: see MindIE and ATB Models Used Together.
- MindIE and MindSpore used together: see MindIE and MindSpore Used Together.
- startupProbe: startup probe, which checks the startup status every 180 seconds. If the startup probe fails 30 consecutive times, the pod is considered to have failed to start and is automatically restarted. Set a proper startup time as required.
- readinessProbe: readiness probe, which checks the readiness status every 180 seconds. If the probe fails, the pod stops receiving traffic until the check is passed. Set a proper triggering time as required.
- livenessProbe: liveness probe, which checks the liveness status every 180 seconds. It is used to detect the container health. If any process does not respond, the container is restarted. Set a proper triggering time as required.
- MINDIE_MS_GEN_SERVER_PORT: determines whether multiple configuration files of Server are automatically generated by the program based on the basic config.json file.
- MINDIE_MS_P_RATE: proportion of prefill nodes in prefill-decode disaggregation deployment mode.
- 0: The optimal proportion is automatically determined. If this parameter is set to 0, MINDIE_MS_D_RATE must also be set to 0.
- Non-zero value: The proportion of prefill nodes is specified. If this parameter is set to a non-zero value, MINDIE_MS_D_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system.
The priority is higher than that of default_p_rate in the ms_controller.json configuration file.
- MINDIE_MS_D_RATE: proportion of decode nodes in prefill-decode disaggregation deployment mode.
- 0: The optimal proportion is automatically determined. If this parameter is set to 0, MINDIE_MS_P_RATE must also be set to 0.
- Non-zero value: The proportion of decode nodes is specified. If this parameter is set to a non-zero value, MINDIE_MS_P_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system.
The priority is higher than that of default_d_rate in the ms_controller.json configuration file.
- MINDIE_LOG_TO_FILE: whether to write logs of each MindIE component to a file.
The default value is 1, indicating that logs are written into a file. The value range is [false, true] or [0, 1].
- MINDIE_LOG_TO_STDOUT: whether to print logs of each MindIE component.
The default value is 1, indicating that logs are printed. The value range is [false, true] or [0, 1].
- MINDIE_LOG_LEVEL: log level of each MindIE component.
The default value is INFO. The log levels are [CRITICAL, ERROR, WARN, INFO, DEBUG].
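When tuning the startup probe, it helps to know how long a container may spend starting before Kubernetes restarts it: roughly the probe period multiplied by the failure threshold. The following minimal sketch does that arithmetic for the values quoted above (a 180-second period and 30 failures). The function name is illustrative, and any initial delay configured on the probe is ignored.

```python
def startup_budget_seconds(period_seconds: int, failure_threshold: int) -> int:
    """Approximate time a container has to start before it is restarted."""
    if period_seconds <= 0 or failure_threshold <= 0:
        raise ValueError("period and failure threshold must be positive")
    return period_seconds * failure_threshold


if __name__ == "__main__":
    budget = startup_budget_seconds(period_seconds=180, failure_threshold=30)
    print(f"Startup budget: {budget} s (about {budget // 60} min)")
```

If loading the model weights takes longer than this budget in your environment, increase the probe period or the failure threshold in the YAML file.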
Table 3 Main configuration parameters
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| huawei.com/Ascend910 | Int | - | Total number of NPUs occupied by all prefill/decode instances, which equals the sum of the worldSize values in all Server configuration files. |
| sp-block | - | - | Size of a SuperPoD block, indicating the number of NPUs on a virtual SuperPoD. This parameter is configured only when Atlas 800I A3 SuperPoD Server is used (that is, the mindie_service_single_container_base_A3.yaml file is used). The value of this parameter must be the same as that of huawei.com/Ascend910. |
| image | - | - | Image name. MindIE and ATB Models used together: see MindIE and ATB Models Used Together. MindIE and MindSpore used together: see MindIE and MindSpore Used Together. |
| startupProbe | - | - | Startup probe, which checks the startup status every 180 seconds. If the startup probe fails 30 consecutive times, the pod is considered to have failed to start and is automatically restarted. Set a proper startup time as required. |
| readinessProbe | - | - | Readiness probe, which checks the readiness status every 180 seconds. If the probe fails, the pod stops receiving traffic until the check is passed. Set a proper triggering time as required. |
| livenessProbe | - | - | Liveness probe, which checks the liveness status every 180 seconds. It is used to detect the container health. If any process does not respond, the container is restarted. Set a proper triggering time as required. |
| MINDIE_MS_GEN_SERVER_PORT | - | - | Whether multiple configuration files of Server are automatically generated by the program based on the original config.json file. |
| MINDIE_MS_P_RATE | - | - | Proportion of prefill nodes in prefill-decode disaggregation deployment mode. 0: The optimal proportion is automatically determined; MINDIE_MS_D_RATE must also be set to 0. Non-zero value: The proportion of prefill nodes is specified; MINDIE_MS_D_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system. The priority is higher than that of default_p_rate in the ms_controller.json configuration file. |
| MINDIE_MS_D_RATE | - | - | Proportion of decode nodes in prefill-decode disaggregation deployment mode. 0: The optimal proportion is automatically determined; MINDIE_MS_P_RATE must also be set to 0. Non-zero value: The proportion of decode nodes is specified; MINDIE_MS_P_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system. The priority is higher than that of default_d_rate in the ms_controller.json configuration file. |
| MINDIE_LOG_TO_FILE | - | true or 1: Write logs to a file. false or 0: Do not write logs to a file. | Whether to write logs of each MindIE component to a file. The default value is 1. |
| MINDIE_LOG_TO_STDOUT | - | true or 1: Print logs. false or 0: Do not print logs. | Whether to print logs of each MindIE component. The default value is 1. |
| MINDIE_LOG_LEVEL | - | CRITICAL, ERROR, WARN, INFO, DEBUG | Log level of each MindIE component. The default value is INFO. |
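The constraints on MINDIE_MS_P_RATE and MINDIE_MS_D_RATE can be checked before deployment. The sketch below encodes only the rules stated above (both zero or both non-zero, and a sum no greater than the number of NPUs on the node). The function name is hypothetical, and the rejection of negative values is an added assumption that the source does not state.

```python
def validate_pd_rates(p_rate: int, d_rate: int, npu_count: int) -> None:
    """Raise ValueError if the P/D rate combination breaks the documented rules."""
    # Assumption: negative rates are treated as invalid.
    if p_rate < 0 or d_rate < 0:
        raise ValueError("rates must not be negative")
    # Both values must be 0 (automatic ratio) or both must be non-zero.
    if (p_rate == 0) != (d_rate == 0):
        raise ValueError("MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must both be 0 "
                         "or both be non-zero")
    # The sum must not exceed the number of NPUs in the single-node system.
    if p_rate + d_rate > npu_count:
        raise ValueError(f"sum {p_rate + d_rate} exceeds the {npu_count} NPUs "
                         "available on the node")


if __name__ == "__main__":
    validate_pd_rates(0, 0, npu_count=8)  # automatic ratio
    validate_pd_rates(2, 2, npu_count=8)  # explicit ratio within the limit
    print("rate settings are consistent")
```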
- Configure the boot.sh script. For details about the environment variables that can be configured, see Table 4.
Table 4 Environment variables
| Environment Variable | Type | Description |
|---|---|---|
| MINDIE_INFER_MODE | Prefill-decode disaggregation | Inference mode, indicating whether prefill-decode disaggregation is enabled. standard: prefill-decode hybrid deployment. dmi: prefill-decode disaggregation. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_DECODE_BATCH_SIZE | Public variable | Maximum decode batch size. Value range: [1, 5000]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_PREFILL_BATCH_SIZE | Public variable | Maximum prefill batch size. Value range: [1, MINDIE_DECODE_BATCH_SIZE - 1]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MAX_SEQ_LEN | Public variable | Maximum sequence length. Integer in the range (0, 4294967295]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MAX_ITER_TIMES | Public variable | Maximum output length. Integer in the range [1, MINDIE_MAX_SEQ_LEN - 1]. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MODEL_NAME | Public variable | Model name. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_MODEL_WEIGHT_PATH | Public variable | Path of the model weight file. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_ENDPOINT_HTTPS_ENABLED | Public variable | Whether to enable HTTPS on the prefill/decode instance. true: enabled. false: disabled. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| MINDIE_INTER_COMM_TLS_ENABLED | Public variable | Whether to enable TLS for communication between inference instances. true: enabled. false: disabled. This environment variable is not included in the default boot.sh script configuration. If required, add it to the script. |
| HSECEASY_PATH | Public variable | Path of the dependency library of the KMC decryption tool. |
| MINDIE_MS_CONTROLLER_CONFIG_FILE_PATH | Public variable | Configuration file path of the Controller component. |
| MINDIE_MS_COORDINATOR_CONFIG_FILE_PATH | Public variable | Configuration file path of the Coordinator component. |
| ATB_LLM_HCCL_ENABLE | Public variable | Whether to enable the HCCL communication backend. This variable is enabled by default. 1: enabled. 0: disabled. When the single-node prefill-decode disaggregation service is deployed on Atlas 800I A2 inference server to run dense models, you are advised to disable this environment variable to improve performance. |
| HCCL_OP_EXPANSION_MODE | Public variable | Location for expanding the orchestration of the communication algorithm. The default value is AIV. This environment variable takes effect when ATB_LLM_HCCL_ENABLE is set to 1. AI_CPU: AI CPU compute unit on the device. AIV: Vector Core compute unit on the device. HOST: CPU on the host; the device automatically selects a scheduler based on the hardware model. HOST_TS: CPU on the host; the host delivers tasks to the task scheduler on the device, which schedules and executes the tasks. For details about this environment variable, see "HCCL_OP_EXPANSION_MODE" in CANN Environment Variable Reference (Community Edition). |
| HCCL_RDMA_RETRY_CNT | Public variable | Number of retransmission times of the RDMA NIC. The value must be an integer ranging from 1 to 7. The default value is 7. |
| HCCL_RDMA_TIMEOUT | Public variable | Retransmission timeout of the RDMA NIC. The minimum retransmission timeout of the RDMA NIC is calculated as 4.096 μs × 2^timeout, where timeout is the value of this environment variable. The actual retransmission timeout also depends on the user network status. Set this environment variable to an integer in the range [5, 20]. The default value is 18. |
| HCCL_EXEC_TIMEOUT | Public variable | Synchronization wait time during task execution between devices. Within this configured time, each device process waits for other devices to perform communication synchronization. This variable is used to set the timeout interval of the first token. The value range is [0, 2147483647], in seconds. The default value is 60. The value 0 indicates that there is no timeout. |
Note: For details about log-related environment variables, see Log Configuration.
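To get a feel for HCCL_RDMA_TIMEOUT values, the formula in Table 4 (4.096 μs × 2^timeout) can be evaluated directly. The sketch below is illustrative only and the function name is hypothetical. It computes the minimum retransmission timeout, which for the default value of 18 is about 1.07 seconds.

```python
def rdma_min_retry_timeout_us(timeout: int) -> float:
    """Minimum RDMA retransmission timeout in microseconds: 4.096 us * 2^timeout."""
    if not 5 <= timeout <= 20:
        raise ValueError("HCCL_RDMA_TIMEOUT must be an integer in [5, 20]")
    return 4.096 * (2 ** timeout)


if __name__ == "__main__":
    for value in (5, 18, 20):
        microseconds = rdma_min_retry_timeout_us(value)
        print(f"HCCL_RDMA_TIMEOUT={value}: {microseconds / 1_000_000:.6f} s")
```

Each increment of the variable doubles the minimum timeout, so small changes near the upper end of the range have a large effect.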
- Start the single-node prefill-decode disaggregation service.
Configure the MindIE installation directory in the container. Change the value of MINDIE_USER_HOME_PATH based on the actual installation path during image creation. For example, if the installation path is /xxx/Ascend/mindie, set the value to /xxx.
export MINDIE_USER_HOME_PATH={Image installation path}
- Atlas 800I A2 inference server
Run the following command to start the cluster:
bash deploy.sh
or
bash deploy.sh 800i_a2
- Atlas 800I A3 SuperPoD Server
Run the following command to start the cluster:
bash deploy.sh 800i_a3
- Run the kubectl command to check the prefill-decode cluster status.
kubectl get pods -n mindie -o wide
Command output:
NAME                             READY   STATUS    RESTARTS   AGE    IP            NODE     NOMINATED NODE   READINESS GATES
mindie-server-7b795f8df9-vl9hv   1/1     Running   0          145m   xx.xx.xx.xx   ubuntu   <none>           <none>
The Controller, Coordinator, and Server components are started in the pods whose names start with mindie-server.
If the pod is in the Running status, the pod container has been successfully scheduled to a node and started. However, you need to further check whether the service program is started successfully.
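The status check can also be scripted. The following minimal sketch assumes that the pod list was obtained with kubectl get pods -n mindie -o json and loaded into a dictionary; the function name and the hand-written sample data are illustrative. It reports pods that are not in the Running phase or that have a container that is not ready. A clean result still does not prove that the service program started correctly, so check the logs as described below.

```python
def pods_not_ready(pod_list: dict) -> list:
    """Return names of pods that are not Running or have a container not ready."""
    not_ready = []
    for pod in pod_list.get("items", []):
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        running = status.get("phase") == "Running"
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if not (running and all_ready):
            not_ready.append(pod["metadata"]["name"])
    return not_ready


if __name__ == "__main__":
    # Hand-written sample shaped like the JSON that kubectl returns.
    sample = {
        "items": [
            {
                "metadata": {"name": "mindie-server-7b795f8df9-vl9hv"},
                "status": {
                    "phase": "Running",
                    "containerStatuses": [{"ready": True}],
                },
            }
        ]
    }
    problems = pods_not_ready(sample)
    print("all pods ready" if not problems else f"not ready: {problems}")
```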
- You can use the provided log.sh script to query the standard output logs of pods and check whether a program exception occurs.
bash log.sh
- To query the logs of a specific pod (for example, mindie-server-7b795f8df9-vl9hv), run the following command:
kubectl logs mindie-server-7b795f8df9-vl9hv -n mindie
- The obtained pod logs include the content of the global_ranktable.json file.
- The following is an example of the global_ranktable.json file generated by Atlas 800I A2 inference server. Table 5 describes the parameters in the example.
{
  "version": "1.0",
  "server_group_list": [
    {
      "group_id": "2",
      "server_count": "4",
      "server_list": [
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "0", "device_ip": "1.1.1.1", "rank_id": "0", "device_logical_id": "0" },
            { "device_id": "1", "device_ip": "1.1.1.2", "rank_id": "1", "device_logical_id": "1" }
          ]
        },
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "2", "device_ip": "1.1.1.3", "rank_id": "2", "device_logical_id": "2" },
            { "device_id": "3", "device_ip": "1.1.1.4", "rank_id": "3", "device_logical_id": "3" }
          ]
        },
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "4", "device_ip": "1.1.1.5", "rank_id": "4", "device_logical_id": "4" },
            { "device_id": "5", "device_ip": "1.1.1.6", "rank_id": "5", "device_logical_id": "5" }
          ]
        },
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "6", "device_ip": "1.1.1.7", "rank_id": "6", "device_logical_id": "6" },
            { "device_id": "7", "device_ip": "1.1.1.8", "rank_id": "7", "device_logical_id": "7" }
          ]
        }
      ]
    },
    {
      "group_id": "1",
      "server_count": "1",
      "server_list": [
        { "server_ip": "127.0.0.1" }
      ]
    },
    {
      "group_id": "0",
      "server_count": "1",
      "server_list": [
        { "server_ip": "127.0.0.1" }
      ]
    }
  ],
  "status": "completed"
}
- The following is an example of the global_ranktable.json file generated by Atlas 800I A3 SuperPoD Server. Table 5 describes the parameters in the example.
{
  "version": "1.0",
  "server_group_list": [
    {
      "group_id": "2",
      "server_count": "4",
      "server_list": [
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "0", "device_ip": "1.1.1.1", "super_device_id": "xxxxxxxxxx", "rank_id": "0", "device_logical_id": "0" },
            { "device_id": "1", "device_ip": "1.1.1.2", "super_device_id": "xxxxxxxxxx", "rank_id": "1", "device_logical_id": "1" }
          ]
        },
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "2", "device_ip": "1.1.1.3", "super_device_id": "xxxxxxxxxx", "rank_id": "2", "device_logical_id": "2" },
            { "device_id": "3", "device_ip": "1.1.1.4", "super_device_id": "xxxxxxxxxx", "rank_id": "3", "device_logical_id": "3" }
          ]
        },
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "4", "device_ip": "1.1.1.5", "super_device_id": "xxxxxxxxxx", "rank_id": "4", "device_logical_id": "4" },
            { "device_id": "5", "device_ip": "1.1.1.6", "super_device_id": "xxxxxxxxxx", "rank_id": "5", "device_logical_id": "5" }
          ]
        },
        {
          "server_id": "127.0.0.1",
          "server_ip": "127.0.0.1",
          "predict_port": "xxxx",
          "mgmt_port": "xxxx",
          "metric_port": "xxxx",
          "inter_comm_port": "xxxx",
          "device": [
            { "device_id": "6", "device_ip": "1.1.1.7", "super_device_id": "xxxxxxxxxx", "rank_id": "6", "device_logical_id": "6" },
            { "device_id": "7", "device_ip": "1.1.1.8", "super_device_id": "xxxxxxxxxx", "rank_id": "7", "device_logical_id": "7" }
          ]
        }
      ],
      "super_pod_list": [
        {
          "super_pod_id": "0",
          "server_list": [
            { "server_id": "127.0.0.1" }
          ]
        }
      ]
    },
    {
      "group_id": "1",
      "server_count": "1",
      "server_list": [
        { "server_ip": "127.0.0.1" }
      ]
    },
    {
      "group_id": "0",
      "server_count": "1",
      "server_list": [
        { "server_ip": "127.0.0.1" }
      ]
    }
  ],
  "status": "completed"
}
Table 5 Parameters in the global_ranktable.json file
| Parameter | Type | Description |
|---|---|---|
| version | String | Version number of Ascend Operator. |
| status | String | Cluster information status. Values: completed (deployment completed) and initializing. |
| group_id | String | ID of each component. 0: Coordinator deployment information. 1: Controller deployment information. 2: Server deployment information. |
| server_count | String | Total number of processes of each component. |
| server_list | JSON object array | Process deployment information of each component. Controller instance: The valid list length is [0, 1]. Coordinator instance: The valid list length is [0, 1]. Server instance: The valid list length is [0, npu_num], where npu_num indicates the number of NPUs. |
| server_id | String | Node host IP address of a component. |
| server_ip | String | Node IP address of a component. |
| predict_port | String | Port number bound to the service-plane RESTful interface provided by EndPoint. |
| mgmt_port | String | Port number bound to the internal interface provided by EndPoint. |
| metric_port | String | Port number of the service management and control metric interface (Prometheus format). |
| inter_comm_port | String | Communication port between instances in a cluster. |
| device | JSON object array | NPU device information. Only Server has this attribute. The valid list length is [1, 128]. |
| device_id | String | NPU device ID. |
| device_ip | String | NPU IP address. |
| super_device_id | String | ID of an NPU device on a SuperPoD. This parameter is involved only in Atlas 800I A3 SuperPoD Server. |
| rank_id | String | Logical ID of the NPU, that is, the sequence ID of the visible device in the pod where Server is located. |
| device_logical_id | String | Logical ID of the NPU, that is, the sequence ID of the visible device in the pod where Server is located. |
| super_pod_list | String | SuperPoD list. This parameter is involved only in Atlas 800I A3 SuperPoD Server. |
| super_pod_id | String | ID of the current SuperPoD. This parameter is involved only in Atlas 800I A3 SuperPoD Server. |
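Once the ranktable content has been copied out of the pod logs, a short script can confirm that it matches the intended deployment. The sketch below relies only on the fields described in Table 5; the function name, the output layout, and the trimmed sample are illustrative, not part of MindIE.

```python
# Mapping of group_id values to components, as listed in Table 5.
GROUP_NAMES = {"0": "Coordinator", "1": "Controller", "2": "Server"}


def summarize_ranktable(ranktable: dict) -> dict:
    """Summarize status, declared/listed process counts, and NPUs per component."""
    summary = {"status": ranktable.get("status"), "groups": {}}
    for group in ranktable.get("server_group_list", []):
        name = GROUP_NAMES.get(group.get("group_id"), "unknown")
        servers = group.get("server_list", [])
        summary["groups"][name] = {
            "declared": int(group.get("server_count", "0")),
            "listed": len(servers),
            "devices": sum(len(s.get("device", [])) for s in servers),
        }
    return summary


if __name__ == "__main__":
    # Trimmed, hand-written sample with two Server processes of two NPUs each.
    sample = {
        "version": "1.0",
        "status": "completed",
        "server_group_list": [
            {"group_id": "2", "server_count": "2", "server_list": [
                {"server_ip": "127.0.0.1",
                 "device": [{"device_id": "0"}, {"device_id": "1"}]},
                {"server_ip": "127.0.0.1",
                 "device": [{"device_id": "2"}, {"device_id": "3"}]},
            ]},
            {"group_id": "1", "server_count": "1",
             "server_list": [{"server_ip": "127.0.0.1"}]},
            {"group_id": "0", "server_count": "1",
             "server_list": [{"server_ip": "127.0.0.1"}]},
        ],
    }
    print(summarize_ranktable(sample))
```

For the four-instance example in this section, the Server group would be expected to show four declared processes, four listed processes, and eight devices, with a status of completed.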
- To access the container to search for more information, run the following command:
kubectl exec -it mindie-server-7b795f8df9-vl9hv -n mindie -- bash
- Use the provided generate_stream.sh script to initiate a streaming inference request. After the deployment is successful, port 31015 is opened on the node for the inference service interface. Change the IP address in generate_stream.sh to the host IP address of the cluster management node. If HTTPS is enabled for Coordinator, configure the related certificates. If HTTP is used, change HTTPS in the script to HTTP and delete the certificate configuration.
bash generate_stream.sh
HTTPS is recommended, as it is more secure than HTTP.
- Delete the prefill-decode cluster. To stop the prefill-decode service or modify the service configuration for instance redeployment, run the following command to delete the deployed instance. Then, redeploy an instance by referring to step 8 (starting the single-node prefill-decode disaggregation service).
bash delete.sh mindie ./
- mindie indicates the namespace created in step 1. Replace it as required.
- The uninstallation script delete.sh must be executed in the examples/kubernetes_deploy_scripts directory. Otherwise, the service cannot be stopped and an error is reported.