Deploying a Single-Node Prefill-Decode Disaggregation Service Using kubectl

Restrictions

  • This mode supports the deployment of Server, Coordinator, and Controller in prefill-decode disaggregation scenarios.
  • For Atlas 800I A2 inference server, the NPU IP address needs to be configured and Ascend Operator needs to be installed.
  • The current deployment script does not support rescheduling upon NPU faults.

Script Description

This section describes how to use the scripts in the MindIE Motor installation directory (examples/kubernetes_deploy_scripts) to deploy and uninstall a prefill-decode disaggregation cluster on MindIE in one-click mode. The cluster administrator can use Kubernetes kubectl to operate the cluster offline by referring to these scripts.

The cluster administrator only needs to write the startup script and configure the services and Kubernetes on the management node. Calling the deployment script then automatically delivers the service configuration and startup script, generates a global ranktable, and schedules pods to compute nodes.

  • The MindIE installation and deployment scripts must be executed by the Kubernetes administrator to prevent malicious tampering with scripts or configurations, which may cause arbitrary command execution or container escape risks.
  • The Kubernetes administrator must strictly control the write, update, and delete permissions on the MindIE ConfigMap. It is recommended that the permission on the installation directory be set to 750 and the permission on files be set to 640. To prevent security risks caused by malicious modification of files that are mounted to the pod, you are advised to use namespaces and RBAC to restrict permissions.
  • If TLS certificates are enabled for the MindIE service, you need to remove the three Kubernetes health probe configurations (readinessProbe, livenessProbe, and startupProbe, including their subfields) from the controller_init.yaml and coordinator_init.yaml files. Otherwise, the service may be frequently restarted by Kubernetes. The MindIE service ensures high reliability through mechanisms such as active/standby switchover and pod rescheduling. Therefore, removing these health probes does not compromise service reliability.
  • When requests are sent faster than they are processed, the Coordinator caches the unprocessed requests, leading to increased memory usage. As a result, request sending may be terminated because the memory usage reaches the upper limit. In this case, you need to appropriately increase the values of memory under the requests and limits fields in the coordinator_init.yaml file.
    • The memory parameter under the requests field indicates the minimum memory required for Coordinator running.
    • The memory parameter under the limits field indicates the upper limit of the memory available to the Coordinator.

    To ensure that the Coordinator can reliably obtain the memory, the values of memory under the requests and limits fields should be the same (if possible). By default, the recommended memory specifications are as follows:

    • 12Gi for a maximum of 10,000 concurrent requests
    • 24Gi for a maximum of 20,000 concurrent requests
    • 48Gi for a maximum of 40,000 concurrent requests
    • 108Gi for a maximum of 90,000 concurrent requests

    Maximum memory that can be occupied by the Coordinator = Value of body_limit x Value of max_requests x 120%
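
The formula above can be evaluated with a few lines of Python. This is a minimal sketch: the function name is hypothetical, and the assumption that body_limit is expressed in MB is illustrative only, so substitute the unit that your configuration actually uses.

```python
def coordinator_max_memory_mb(body_limit_mb: int, max_requests: int) -> float:
    """Upper bound on Coordinator memory: body_limit x max_requests x 120%.

    Assumption: body_limit is given in MB; the result is in the same unit.
    """
    if body_limit_mb <= 0 or max_requests <= 0:
        raise ValueError("body_limit_mb and max_requests must be positive")
    # Multiply by 120 and divide by 100 so integer inputs stay exact.
    return body_limit_mb * max_requests * 120 / 100


if __name__ == "__main__":
    # Hypothetical values: 1 MB per request body, 10,000 queued requests.
    print(coordinator_max_memory_mb(1, 10_000))
```

With these hypothetical inputs the bound is 12,000 MB, which is close to the 12Gi recommended above for 10,000 concurrent requests.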

Directory structure of scripts:

├── boot_helper
│   ├── boot.sh
│   ├── gen_config_single_container.py
│   ├── get_group_id.py
│   ├── mindie_cpu_binding.py
│   ├── server_prestop.sh
│   └── update_mindie_server_config.py
├── chat.sh
├── conf
├── delete.sh
├── deploy.sh
├── deploy_ac_job.py
├── deployment
│   ├── controller_init.yaml
│   ├── coordinator_init.yaml
│   ├── mindie_ms_controller.yaml
│   ├── mindie_ms_coordinator.yaml
│   ├── mindie_server.yaml
│   ├── mindie_server_heterogeneous.yaml
│   ├── mindie_service_single_container.yaml
│   ├── mindie_service_single_container_base_A3.yaml
│   ├── server_init.yaml
│   └── single_container_init.yaml
├── gen_ranktable_helper
│   ├── gen_global_ranktable.py
│   └── global_ranktable.json
├── generate_stream.sh
├── log.sh
├── user_config.json
├── user_config_base_A3.json
└── utils
    ├── validate_config.py
    └── validate_utils.py

The following table describes the key directories and files for single-node prefill-decode disaggregation deployment.

Table 1 Key directories and files

Directory/File

Description

conf

Main service configuration file of the cluster management component and Server, which is used to manage scheduling policies and model configurations in the prefill-decode disaggregation scenario.

boot_helper

Contains the container startup script boot.sh and the auxiliary script gen_config_single_container.py, which modifies the configuration files in the conf directory (that is, generates several configuration files with unique port assignments from one Server config.json file) and generates the corresponding global_ranktable.

NOTE:

MindIE depends on the jemalloc.so library file. Do not install invalid SO files with the same name in the /usr/ directory, as this may introduce security risks such as arbitrary command execution.

deployment

Defines a Kubernetes deployment task and configures the NPU resource usage, number of instances, and image. For single-node prefill-decode disaggregation deployment, only the mindie_service_single_container.yaml file is used.

chat.sh

Simple dialog example of using curl to send HTTPS requests to the inference service, which is applicable to the prefix cache scenario.

generate_stream.sh

Streaming response example for sending an HTTPS request to the inference service using curl.

deploy.sh

Deployment script, which is used to start all MindIE components in one-click mode.

delete.sh

Uninstallation script, which is used to uninstall all MindIE components in one-click mode.

log.sh

Queries the printed logs of all deployed pods.

Procedure

LLaMA3-8B with four instances, each configured with two NPUs, is used as the deployment example. Perform the following operations in the directory that contains the deployment scripts.

  1. Log in to the host of the cluster management node. If this is the first deployment, create a namespace. The default namespace is mindie; replace it as required.
    kubectl create namespace mindie
  2. Configure the startup configuration file ms_controller.json of Controller. For details about the configuration file, see Configuration Description.
    Set deploy_mode to pd_disaggregation_single_container to enable single-node prefill-decode disaggregation deployment.
    "deploy_mode": "pd_disaggregation_single_container"
    • The request_coordinator_tls_enable, request_server_tls_enable, http_server_tls_enable, and cluster_tls_enable parameters in the ms_controller.json file specify whether to enable HTTPS. You are advised to enable HTTPS (by setting these parameters to true) to ensure communication security. If HTTPS is disabled, high network security risks exist.
      • true: HTTPS enabled for MindIE components in the cluster. You need to import certificates to the container and configure the corresponding certificate paths.
      • false: HTTP enabled for MindIE components in the cluster. You do not need to prepare certificates.
    • All components for single-node prefill-decode disaggregation deployment are located in a single pod. The default port number of the http_server module is 1026, which conflicts with other port numbers. You need to change the port number to another non-conflicting port number. The recommended value is 1027. (If the port specified in the configuration file conflicts with an existing port, the program will automatically assign an available, non-conflicting port to ensure successful startup.)
    • default_p_rate and default_d_rate in the ms_controller.json file control the ratio of prefill nodes to decode nodes in the cluster. Both default to 0, in which case the optimal ratio is automatically determined based on the model, hardware, and service information. You can also set them to the actual numbers of prefill nodes and decode nodes as required by your scenario. If the environment variables MINDIE_MS_P_RATE and MINDIE_MS_D_RATE are set, their values take precedence. For details, see Environment Variable.
  3. Configure the startup configuration file ms_coordinator.json of Coordinator. For details about the configuration file, see Configuration Description.

    For single-node prefill-decode disaggregation deployment, set deploy_mode to pd_disaggregation_single_container.

    "deploy_mode": "pd_disaggregation_single_container"
    The controller_server_tls_enable, request_server_tls_enable, mindie_client_tls_enable, and mindie_mangment_tls_enable parameters in the ms_coordinator.json file specify whether to enable HTTPS. You are advised to enable HTTPS (by setting these parameters to true) to ensure communication security. If HTTPS is disabled, high network security risks exist.
    • true: HTTPS enabled for MindIE components in the cluster. You need to import certificates to the container and configure the corresponding certificate paths.
    • false: HTTP enabled for MindIE components in the cluster. You do not need to prepare certificates.
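
Before deployment, it can help to confirm that no HTTPS switch was left disabled in ms_controller.json or ms_coordinator.json. The following is a minimal sketch, not part of the MindIE scripts: the function name is hypothetical, and because the real nesting of these files is not shown here, it simply walks the whole JSON document and reports every key ending in _tls_enable whose value is not true.

```python
import json
from typing import Any, Iterator


def disabled_tls_flags(node: Any, path: str = "") -> Iterator[str]:
    """Yield the dotted path of every *_tls_enable key whose value is not true."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else key
            if key.endswith("_tls_enable") and value is not True:
                yield child
            else:
                # Descend into nested objects; scalars yield nothing.
                yield from disabled_tls_flags(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from disabled_tls_flags(item, f"{path}[{index}]")


if __name__ == "__main__":
    # Hypothetical fragment; the real file has more fields and may nest them.
    sample = json.loads("""
    {
      "deploy_mode": "pd_disaggregation_single_container",
      "request_server_tls_enable": true,
      "http_server_tls_enable": false
    }
    """)
    for flag in disabled_tls_flags(sample):
        print(f"HTTPS disabled: {flag}")
```

To check a real file, load it with json.load and pass the resulting object to the same function.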
  4. Configure the config.json file for starting the Server service. Table 2 lists the parameters that need to be configured for the prefill-decode disaggregation deployment mode. For details about the parameters, see "Core Concepts and Configurations" > "Configuration Parameters (Serving)" in MindIE LLM Development Guide.
    Table 2 Key parameters in config.json

    Parameter

    Description

    modelName

    Model name, which is associated with the model weight file, for example, llama3-8b.

    modelWeightPath

    Directory of the model weight file, which must be set to the path at which the weights are mounted in the container, as specified in the mindie_service_single_container.yaml file. The default path is /mnt/mindie-service/ms/model. Ensure that the model files exist in the corresponding host path on every Ascend compute node to which the cluster can schedule the pod.

    worldSize

    Number of NPUs occupied by a prefill/decode instance. For example, if this parameter is set to 2, two NPUs are used.

    npuDeviceIds

    NPU ID, starting from 0. The total number of IDs is the same as the value of worldSize, for example, [[0,1]].

    inferMode

    Set this parameter to dmi.

    tp

    Tensor parallelism degree of the entire network. The value must be the same as worldSize. This is a supplementary parameter and needs to be configured under the ModelConfig field.
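
The relationships among worldSize, npuDeviceIds, and tp described in Table 2 can be checked before deployment. The sketch below is illustrative only: the function name is hypothetical, and the values are passed in directly rather than read from config.json, because the exact nesting of these fields is not shown in this section.

```python
def check_parallel_settings(world_size: int, npu_device_ids: list, tp: int) -> list:
    """Return a list of problems; an empty list means the values are consistent."""
    problems = []
    # npuDeviceIds is a list of lists, for example [[0, 1]].
    flat = [dev for group in npu_device_ids for dev in group]
    if len(flat) != world_size:
        problems.append(
            f"npuDeviceIds lists {len(flat)} IDs but worldSize is {world_size}"
        )
    if len(set(flat)) != len(flat):
        problems.append("npuDeviceIds contains duplicate IDs")
    if tp != world_size:
        problems.append(f"tp is {tp} but worldSize is {world_size}")
    return problems


if __name__ == "__main__":
    # Values from the example in this section: two NPUs per instance.
    print(check_parallel_settings(2, [[0, 1]], 2))
```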

    In single-node prefill-decode deployment, multiple Server processes run in one pod. Each Server process requires an independent configuration file. To simplify the configuration process, you can use the MINDIE_MS_GEN_SERVER_PORT environment variable in the mindie_service_single_container.yaml file for management. This environment variable supports two configuration modes:

    • true (default value): The system automatically generates multiple configuration files (config1.json, config2.json, ..., config{server_num}.json) based on the user-defined config.json file. Each configuration file corresponds to a Server process. The system assigns non-conflicting port numbers to port, managementPort, metricsPort, and interCommPort in the configuration file to ensure that each process runs independently.
    • false: You need to provide a set of configuration files (config1.json, config2.json, ..., config{server_num}.json) that match the number of Servers. In this mode, you can configure the parameters of each port. The system has a port conflict detection mechanism. If a port conflict occurs in the configuration file, the system automatically allocates other available port numbers to ensure that the program can be started properly.
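
The idea behind automatic port assignment can be illustrated as follows. This is a sketch under stated assumptions, not the logic of gen_config_single_container.py: the function name, the flat dictionary layout, and the base port values are hypothetical, and the real config.json nests these keys inside larger sections.

```python
import copy

PORT_KEYS = ("port", "managementPort", "metricsPort", "interCommPort")


def assign_ports(base: dict, server_num: int) -> list:
    """Return server_num copies of base in which no port number is reused."""
    used = set()
    configs = []
    for _ in range(server_num):
        cfg = copy.deepcopy(base)
        for key in PORT_KEYS:
            port = base[key]
            # Move to the next free port if this one is already taken.
            while port in used:
                port += 1
            if port > 65535:
                raise ValueError("ran out of valid port numbers")
            used.add(port)
            cfg[key] = port
        configs.append(cfg)
    return configs


if __name__ == "__main__":
    # Hypothetical base ports for illustration only.
    base = {"port": 1025, "managementPort": 1026,
            "metricsPort": 1027, "interCommPort": 1121}
    for index, cfg in enumerate(assign_ports(base, 4), start=1):
        print(f"config{index}.json", {key: cfg[key] for key in PORT_KEYS})
```

Tracking every assigned port in one set guarantees uniqueness across all Server processes in the pod, which is the property the generated configuration files need.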
  5. Configure the http_client_ctl.json configuration file, which is used to configure the HTTP client tool for cluster startup, liveness, and readiness probes. For details about the parameters, see Table 4.

    tls_enable specifies whether to enable HTTPS. If the MindIE components in the cluster use the HTTPS interface, set tls_enable to true, import certificates to the container, and configure the corresponding certificate paths. To use the HTTP interface, set tls_enable to false. No certificate file is required.

    You are advised to enable tls_enable to ensure communication security. If tls_enable is disabled, high network security risks exist.

  6. Configure the Kubernetes Deployment. Table 3 describes the main parameters.

    Atlas 800I A2 inference server: Configure the mindie_service_single_container.yaml file in the deployment directory of the deployment scripts.

    Atlas 800I A3 SuperPoD Server: Configure the mindie_service_single_container_base_A3.yaml file in the deployment directory of the deployment scripts.

    • The script is for reference only. You need to ensure the security of the pod container. In the actual production environment, harden the security of the image and pod.
    • When using kubectl to deploy a Deployment, you can modify the YAML configuration file of the Deployment. Do not use dangerous configurations, and ensure that a secure image (non-root user) is used to configure secure pod contexts.
    • You must mount secure paths (non-soft links, non-dangerous system paths, and non-service sensitive paths) and set proper directory permissions to prevent public directories such as /home from being mounted and prevent container escape caused by tampering by unauthorized users.
    • The mount path of the model weight on the host must be configured as required. The following is an example:
      - name: model-path
        hostPath:
          path: /data/LLaMA3-8B
    • Fields to be configured in the mindie_service_single_container.yaml file:
      • huawei.com/Ascend910: total number of NPUs occupied by all prefill/decode instances, which is the same as the number specified by worldSize in all Server configuration files.
      • sp-block: size of a SuperPoD block, indicating the number of NPUs on a virtual SuperPoD. This parameter is configured only when Atlas 800I A3 SuperPoD Server is used (that is, the mindie_service_single_container_base_A3.yaml file is used). The value of this parameter must be the same as that of huawei.com/Ascend910.
      • image: image name.
      • startupProbe: startup probe, which checks the startup status every 180 seconds. If the startup probe fails 30 consecutive times, the pod is considered to have failed to start and is automatically restarted. Set a proper startup time as required.
      • readinessProbe: readiness probe, which checks the readiness status every 180 seconds. If the probe fails, the pod stops receiving traffic until the check is passed. Set a proper triggering time as required.
      • livenessProbe: liveness probe, which checks the liveness status every 180 seconds. It is used to detect the container health. If any process does not respond, the container is restarted. Set a proper triggering time as required.
      • MINDIE_MS_GEN_SERVER_PORT: determines whether multiple configuration files of Server are automatically generated by the program based on the basic config.json file.
      • MINDIE_MS_P_RATE: proportion of prefill nodes in prefill-decode disaggregation deployment mode.
        • 0: The optimal proportion is automatically determined. If this parameter is set to 0, MINDIE_MS_D_RATE must also be set to 0.
        • Non-zero value: The proportion of prefill nodes is specified. If this parameter is set to a non-zero value, MINDIE_MS_D_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system.

        The priority is higher than that of default_p_rate in the ms_controller.json configuration file.

      • MINDIE_MS_D_RATE: proportion of decode nodes in prefill-decode disaggregation deployment mode.
        • 0: The optimal proportion is automatically determined. If this parameter is set to 0, MINDIE_MS_P_RATE must also be set to 0.
        • Non-zero value: The proportion of decode nodes is specified. If this parameter is set to a non-zero value, MINDIE_MS_P_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system.

        The priority is higher than that of default_d_rate in the ms_controller.json configuration file.

      • MINDIE_LOG_TO_FILE: whether to write logs of each MindIE component to a file.

        The default value is 1, indicating that logs are written into a file. The value range is [false, true] or [0, 1].

      • MINDIE_LOG_TO_STDOUT: whether to print logs of each MindIE component.

        The default value is 1, indicating that logs are printed. The value range is [false, true] or [0, 1].

      • MINDIE_LOG_LEVEL: log level of each MindIE component.

        The default value is INFO. The log levels are [CRITICAL, ERROR, WARN, INFO, DEBUG].
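
The value of huawei.com/Ascend910 follows from the Server configuration files. The sketch below assumes that "the same as the number specified by worldSize in all Server configuration files" means the sum of worldSize over all instances; the function name is hypothetical.

```python
def required_ascend910(world_sizes: list) -> int:
    """Total NPUs to request: the sum of worldSize over all Server instances."""
    if not world_sizes or any(size <= 0 for size in world_sizes):
        raise ValueError("every instance needs a positive worldSize")
    return sum(world_sizes)


if __name__ == "__main__":
    # Example from this procedure: four instances with two NPUs each.
    print(required_ascend910([2, 2, 2, 2]))
```

For the example in this procedure (four instances, two NPUs each), the function returns 8.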

    Table 3 Main configuration parameters

    Parameter

    Value Type

    Value Range

    Description

    huawei.com/Ascend910

    Int

      

    Total number of NPUs occupied by all prefill/decode instances, which is the same as the number specified by worldSize in all Server configuration files.

    sp-block

         

    Size of a SuperPoD block, indicating the number of NPUs on a virtual SuperPoD.

    This parameter is configured only when Atlas 800I A3 SuperPoD Server is used (that is, the mindie_service_single_container_base_A3.yaml file is used). The value of this parameter must be the same as that of huawei.com/Ascend910.

    image

         

    Image name.

    startupProbe

         

    Startup probe, which checks the startup status every 180 seconds. If the startup probe fails 30 consecutive times, the pod is considered to have failed to start and is automatically restarted. Set a proper startup time as required.

    readinessProbe

         

    Readiness probe, which checks the readiness status every 180 seconds. If the probe fails, the pod stops receiving traffic until the check is passed. Set a proper triggering time as required.

    livenessProbe

         

    Liveness probe, which checks the liveness status every 180 seconds. It is used to detect the container health. If any process does not respond, the container is restarted. Set a proper triggering time as required.

    MINDIE_MS_GEN_SERVER_PORT

         

    Whether multiple configuration files of Server are automatically generated by the program based on the original config.json file.

    MINDIE_MS_P_RATE

         

    Proportion of prefill nodes in prefill-decode disaggregation deployment mode.

    • 0: The optimal proportion is automatically determined. If this parameter is set to 0, MINDIE_MS_D_RATE must also be set to 0.
    • Non-zero value: The proportion of prefill nodes is specified. If this parameter is set to a non-zero value, MINDIE_MS_D_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system.

    The priority is higher than that of default_p_rate in the ms_controller.json configuration file.

    MINDIE_MS_D_RATE

         

    Proportion of decode nodes in prefill-decode disaggregation deployment mode.

    • 0: The optimal proportion is automatically determined. If this parameter is set to 0, MINDIE_MS_P_RATE must also be set to 0.
    • Non-zero value: The proportion of decode nodes is specified. If this parameter is set to a non-zero value, MINDIE_MS_P_RATE must also be set to a non-zero value, and the sum of MINDIE_MS_P_RATE and MINDIE_MS_D_RATE must be less than or equal to the number of NPUs in a single-node system.

    The priority is higher than that of default_d_rate in the ms_controller.json configuration file.

    MINDIE_LOG_TO_FILE

      
    • true or 1: Write logs to a file.
    • false or 0: Do not write logs to a file.

    Whether to write logs of each MindIE component to a file. The default value is 1.

    MINDIE_LOG_TO_STDOUT

      
    • true or 1: Print logs.
    • false or 0: Do not print logs.

    Whether to print logs of each MindIE component. The default value is 1.

    MINDIE_LOG_LEVEL

      
    • CRITICAL
    • ERROR
    • WARN
    • INFO
    • DEBUG

    Log level of each MindIE component. The default value is INFO.
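
The constraints on MINDIE_MS_P_RATE and MINDIE_MS_D_RATE listed in Table 3 can be expressed as a small check. This is a minimal sketch with a hypothetical function name; it only encodes the rules stated above and does not reproduce any validation performed by MindIE itself.

```python
def check_pd_rates(p_rate: int, d_rate: int, npu_count: int) -> None:
    """Raise ValueError if the prefill/decode rates violate the documented rules."""
    if p_rate < 0 or d_rate < 0:
        raise ValueError("rates must not be negative")
    # Either both rates are 0 (automatic ratio) or both are non-zero.
    if (p_rate == 0) != (d_rate == 0):
        raise ValueError("P_RATE and D_RATE must both be 0 or both be non-zero")
    if p_rate + d_rate > npu_count:
        raise ValueError("P_RATE + D_RATE exceeds the number of NPUs on the node")


if __name__ == "__main__":
    check_pd_rates(0, 0, 8)   # automatic ratio
    check_pd_rates(1, 3, 8)   # explicit ratio within the NPU budget
    print("rates are valid")
```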

  7. Configure the boot.sh script. For details about the environment variables that can be configured, see Table 4.
    Table 4 Environment variables

    Environment Variable

    Type

    Description

    MINDIE_INFER_MODE

    Prefill-decode disaggregation

    Inference mode, indicating whether prefill-decode disaggregation is enabled.

    • standard: prefill-decode hybrid deployment
    • dmi: prefill-decode disaggregation

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_DECODE_BATCH_SIZE

    Public variable

    Maximum decode batch size.

    Value range: [1, 5000]

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_PREFILL_BATCH_SIZE

    Public variable

    Maximum prefill batch size.

    Value range: [1, MINDIE_DECODE_BATCH_SIZE - 1]

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_MAX_SEQ_LEN

    Public variable

    Maximum sequence length.

    Integer in the range of (0, 4294967295]

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_MAX_ITER_TIMES

    Public variable

    Maximum output length.

    Integer in the range of [1, MINDIE_MAX_SEQ_LEN - 1]

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_MODEL_NAME

    Public variable

    Model name.

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_MODEL_WEIGHT_PATH

    Public variable

    Path of the model weight file.

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_ENDPOINT_HTTPS_ENABLED

    Public variable

    Whether to enable HTTPS on the prefill/decode instance.

    • true: enabled
    • false: disabled

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    MINDIE_INTER_COMM_TLS_ENABLED

    Public variable

    Whether to enable TLS for communication between inference instances.

    • true: enabled
    • false: disabled

    This environment variable is not included in the default boot.sh script configuration. If required, add it to the script.

    HSECEASY_PATH

    Public variable

    Path of the dependency library of the KMC decryption tool.

    MINDIE_MS_CONTROLLER_CONFIG_FILE_PATH

    Public variable

    Configuration file path of the Controller component.

    MINDIE_MS_COORDINATOR_CONFIG_FILE_PATH

    Public variable

    Configuration file path of the Coordinator component.

    ATB_LLM_HCCL_ENABLE

    Public variable

    Whether to enable the HCCL communication backend. This variable is enabled by default.

    • 1: enabled
    • 0: disabled

    When the single-node prefill-decode disaggregation service is deployed using Atlas 800I A2 inference server to run dense models, you are advised to disable this environment variable to improve performance.

    HCCL_OP_EXPANSION_MODE

    Public variable

    Location for expanding the orchestration of the communication algorithm. The default value is AIV.

    This environment variable takes effect when ATB_LLM_HCCL_ENABLE is set to 1. The values are as follows:

    • AI_CPU: AI CPU compute unit on the device
    • AIV: Vector Core compute unit on the device
    • HOST: CPU on the host. The device automatically selects a scheduler based on the hardware model.
    • HOST_TS: CPU on the host. The host delivers tasks to the task scheduler on the device, and the task scheduler on the device schedules and executes the tasks.

    For details about this environment variable, see "HCCL_OP_EXPANSION_MODE" in CANN Environment Variable Reference (Community Edition).

    HCCL_RDMA_RETRY_CNT

    Public variable

    Number of retransmission attempts of the RDMA NIC. The value must be an integer ranging from 1 to 7. The default value is 7.

    HCCL_RDMA_TIMEOUT

    Public variable

    Retransmission timeout of the RDMA NIC.

    The formula for calculating the minimum retransmission timeout of the RDMA NIC is 4.096 μs * 2 ^ timeout. In the formula, timeout is the value of this environment variable, and the actual retransmission timeout is related to the user network status.

    Set this environment variable to an integer in the range [5, 20]. The default value is 18.

    HCCL_EXEC_TIMEOUT

    Public variable

    Synchronization wait time during task execution between devices. Within this configured time, each device process waits for other devices to perform communication synchronization. This variable is used to set the timeout interval of the first token.

    The value range is [0, 2147483647], in seconds. The default value is 60. The value 0 indicates that there is no timeout.

    Note: For details about log-related environment variables, see Log Configuration.
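
As a worked example of the HCCL_RDMA_TIMEOUT formula in Table 4 (4.096 μs * 2 ^ timeout), the following sketch computes the minimum retransmission timeout. The function name is hypothetical, and the result is returned in nanoseconds (4.096 μs = 4096 ns) so that the arithmetic stays in exact integers.

```python
def min_rdma_retry_timeout_ns(timeout: int) -> int:
    """Minimum RDMA NIC retransmission timeout in nanoseconds: 4096 ns * 2^timeout."""
    if not 5 <= timeout <= 20:
        raise ValueError("HCCL_RDMA_TIMEOUT must be an integer in [5, 20]")
    return 4096 * 2 ** timeout


if __name__ == "__main__":
    for value in (5, 18, 20):
        seconds = min_rdma_retry_timeout_ns(value) / 1e9
        print(f"HCCL_RDMA_TIMEOUT={value}: about {seconds:.6f} s")
```

With the default value 18, the minimum timeout is 1,073,741,824 ns, or roughly 1.07 seconds. As noted in the table, the actual timeout also depends on the network status.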

  8. Start the single-node prefill-decode disaggregation service.

    Before starting the service, you are advised to use the pre-check tool of MindStudio to verify the fields in the configuration file and check the validity of the configuration. For details, see Link.

    Configure the MindIE installation directory in the container. Change the value of MINDIE_USER_HOME_PATH based on the actual installation path during image creation. For example, if the installation path is /xxx/Ascend/mindie, set the value to /xxx.

    export MINDIE_USER_HOME_PATH={Image installation path}
    • Atlas 800I A2 inference server

      Run the following command to start the cluster:

      bash deploy.sh

      or

      bash deploy.sh 800i_a2
    • Atlas 800I A3 SuperPoD Server

      Run the following command to start the cluster:

      bash deploy.sh 800i_a3
  9. Run the kubectl command to check the prefill-decode cluster status.
    kubectl get pods -n mindie -o wide

    Command output:

    NAME                                     READY   STATUS    RESTARTS   AGE    IP               NODE       NOMINATED NODE   READINESS GATES
    mindie-server-7b795f8df9-vl9hv           1/1     Running   0          145m   xx.xx.xx.xx   ubuntu     <none>           <none>
    

    The Controller, Coordinator, and Server components are started in the pods whose names start with mindie-server.

    If the pod is in the Running status, the pod container has been successfully scheduled to a node and started. However, you need to further check whether the service program is started successfully.
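
As a first check before reading the logs, the pod status can also be evaluated programmatically. The sketch below is an illustration with a hypothetical function name: it takes the JSON text produced by kubectl get pods -n mindie -o json and reports pods that are not Running or whose containers are not all ready. It does not replace the log checks that follow, because a ready pod can still contain a failed service program.

```python
import json


def unready_pods(pod_list_json: str, prefix: str = "mindie-server") -> list:
    """Return names of matching pods that are not Running with all containers ready."""
    pods = json.loads(pod_list_json).get("items", [])
    result = []
    for pod in pods:
        name = pod["metadata"]["name"]
        if not name.startswith(prefix):
            continue
        status = pod.get("status", {})
        containers = status.get("containerStatuses", [])
        # A pod with no reported container statuses is treated as not ready.
        all_ready = bool(containers) and all(c.get("ready") for c in containers)
        if status.get("phase") != "Running" or not all_ready:
            result.append(name)
    return result


if __name__ == "__main__":
    # Hypothetical, heavily trimmed output of: kubectl get pods -n mindie -o json
    sample = json.dumps({"items": [{
        "metadata": {"name": "mindie-server-7b795f8df9-vl9hv"},
        "status": {"phase": "Running", "containerStatuses": [{"ready": True}]},
    }]})
    print(unready_pods(sample))
```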

    • You can use the provided log.sh script to query the standard output logs of pods and check whether a program exception occurs.
      bash log.sh
    • To query the logs of a specific pod (for example, mindie-server-7b795f8df9-vl9hv), run the following command:
      kubectl logs mindie-server-7b795f8df9-vl9hv -n mindie
    • The pod logs include the content of the generated global_ranktable.json file.
      • The following is an example of the global_ranktable.json file generated by Atlas 800I A2 inference server. Table 5 describes the parameters in the example.
        {
          "version": "1.0",
          "server_group_list": [
            {
              "group_id": "2",
              "server_count": "4",
              "server_list": [
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "0",
                       "device_ip": "1.1.1.1",
                       "rank_id": "0",
                       "device_logical_id": "0"
                     },
                    {
                       "device_id": "1",
                       "device_ip": "1.1.1.2",
                       "rank_id": "1",
                       "device_logical_id": "1"
                     }
                  ]
                },
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "2",
                       "device_ip": "1.1.1.3",
                       "rank_id": "2",
                       "device_logical_id": "2"
                     },
                    {
                       "device_id": "3",
                       "device_ip": "1.1.1.4",
                       "rank_id": "3",
                       "device_logical_id": "3"
                     }
                  ]
                },
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "4",
                       "device_ip": "1.1.1.5",
                       "rank_id": "4",
                       "device_logical_id": "4"
                     },
                    {
                       "device_id": "5",
                       "device_ip": "1.1.1.6",
                       "rank_id": "5",
                       "device_logical_id": "5"
                     }
                  ]
                },
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "6",
                       "device_ip": "1.1.1.7",
                       "rank_id": "6",
                       "device_logical_id": "6"
                     },
                    {
                       "device_id": "7",
                       "device_ip": "1.1.1.8",
                       "rank_id": "7",
                       "device_logical_id": "7"
                     }
                  ]
                }
              ]
            },
            {
              "group_id": "1",
              "server_count": "1",
              "server_list": [
                {
                  "server_ip": "127.0.0.1"
                }
              ]
            },
            {
              "group_id": "0",
              "server_count": "1",
              "server_list": [
                {
                  "server_ip": "127.0.0.1"
                }
              ]
            }
          ],
          "status": "completed"
        }
        
      • The following is an example of the global_ranktable.json file generated by Atlas 800I A3 SuperPoD Server. Table 5 describes the parameters in the example.
        {
          "version": "1.0",
          "server_group_list": [
            {
              "group_id": "2",
              "server_count": "4",
              "server_list": [
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "0",
                       "device_ip": "1.1.1.1",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "0",
                       "device_logical_id": "0"
                     },
                    {
                       "device_id": "1",
                       "device_ip": "1.1.1.2",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "1",
                       "device_logical_id": "1"
                     }
                  ]
                },
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "2",
                       "device_ip": "1.1.1.3",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "2",
                       "device_logical_id": "2"
                     },
                    {
                       "device_id": "3",
                       "device_ip": "1.1.1.4",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "3",
                       "device_logical_id": "3"
                     }
                  ]
                },
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "4",
                       "device_ip": "1.1.1.5",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "4",
                       "device_logical_id": "4"
                     },
                    {
                       "device_id": "5",
                       "device_ip": "1.1.1.6",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "5",
                       "device_logical_id": "5"
                     }
                  ]
                },
                {
                  "server_id": "127.0.0.1",
                  "server_ip": "127.0.0.1",
                  "predict_port": "xxxx",
                  "mgmt_port": "xxxx",
                  "metric_port": "xxxx",
                  "inter_comm_port": "xxxx",
                  "device": [
                    {
                       "device_id": "6",
                       "device_ip": "1.1.1.7",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "6",
                       "device_logical_id": "6"
                     },
                    {
                       "device_id": "7",
                       "device_ip": "1.1.1.8",
                       "super_device_id": "xxxxxxxxxx",
                       "rank_id": "7",
                       "device_logical_id": "7"
                     }
                  ]
                }
              ],
              "super_pod_list": [
                {
                  "super_pod_id": "0",
                  "server_list": [
                    {
                      "server_id": "127.0.0.1"
                     }
                   ]
                }
              ]
            },
            {
              "group_id": "1",
              "server_count": "1",
              "server_list": [
                {
                  "server_ip": "127.0.0.1"
                }
              ]
            },
            {
              "group_id": "0",
              "server_count": "1",
              "server_list": [
                {
                  "server_ip": "127.0.0.1"
                }
              ]
            }
          ],
          "status": "completed"
        }
      Table 5 Parameters in the global_ranktable.json file

      | Parameter | Type | Description |
      |---|---|---|
      | version | String | Version number of Ascend Operator. |
      | status | String | Cluster information status. Valid values: completed (deployment completed); initializing. |
      | group_id | String | ID of each component. 0: Coordinator deployment information; 1: Controller deployment information; 2: Server deployment information. |
      | server_count | String | Total number of processes of each component. |
      | server_list | JSON object array | Process deployment information of each component. Controller instance: the valid list length is [0, 1]. Coordinator instance: the valid list length is [0, 1]. Server instance: the valid list length is [0, npu_num], where npu_num is the number of NPUs. |
      | server_id | String | Node host IP address of a component. |
      | server_ip | String | Node IP address of a component. |
      | predict_port | String | Port number bound to the service-plane RESTful interface provided by EndPoint. |
      | mgmt_port | String | Port number bound to the internal interface provided by EndPoint. |
      | metric_port | String | Port number of the service management and control metric interface (Prometheus format). |
      | inter_comm_port | String | Communication port between instances in a cluster. |
      | device | JSON object array | NPU device information. Only Server has this attribute. The valid list length is [1, 128]. |
      | device_id | String | NPU device ID. |
      | device_ip | String | NPU IP address. |
      | super_device_id | String | ID of an NPU device on a SuperPoD. This parameter applies only to Atlas 800I A3 SuperPoD Server. |
      | rank_id | String | Logical ID of the NPU, that is, the sequence ID of the visible device in the pod where Server is located. |
      | device_logical_id | String | Logical ID of the NPU, that is, the sequence ID of the visible device in the pod where Server is located. |
      | super_pod_list | JSON object array | SuperPoD list. This parameter applies only to Atlas 800I A3 SuperPoD Server. |
      | super_pod_id | String | ID of the current SuperPoD. This parameter applies only to Atlas 800I A3 SuperPoD Server. |

    • To enter the container and inspect it further, run the following command (replace the pod name and namespace with your own):
      kubectl exec -it mindie-server-7b795f8df9-vl9hv -n mindie -- bash
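    The fields described in Table 5 can be cross-checked with a short script once a copy of global_ranktable.json is available. The following is only a sketch: the function name summarize_ranktable and the local file path in the usage comment are hypothetical, and the group ID mapping is taken from Table 5. It reports the number of servers and devices per component and flags a status other than completed or a server_count that does not match the length of server_list.

```python
import json

# Mapping of group_id values to components, as listed in Table 5.
GROUP_NAMES = {"0": "Coordinator", "1": "Controller", "2": "Server"}


def summarize_ranktable(ranktable: dict) -> dict:
    """Return per-component counts and any consistency problems found."""
    problems = []
    if ranktable.get("status") != "completed":
        problems.append(f"status is {ranktable.get('status')!r}, not 'completed'")

    summary = {}
    for group in ranktable.get("server_group_list", []):
        name = GROUP_NAMES.get(group.get("group_id"), "unknown")
        servers = group.get("server_list", [])
        # server_count is stored as a string in the file.
        if str(len(servers)) != group.get("server_count"):
            problems.append(f"{name}: server_count does not match server_list length")
        # Only Server entries carry a "device" array; others count as 0.
        devices = sum(len(server.get("device", [])) for server in servers)
        summary[name] = {"servers": len(servers), "devices": devices}
    return {"summary": summary, "problems": problems}


# Example use (assumes a local copy of the generated file):
#   with open("global_ranktable.json", encoding="utf-8") as f:
#       print(summarize_ranktable(json.load(f)))
```

    Note that Python's json module is strict and rejects trailing commas, so the file must be valid JSON for this check to run.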
  10. Use the provided generate_stream.sh script to initiate a streaming inference request.
    After the deployment succeeds, port 31015 is opened on the node for the inference service interface. Change the IP address in generate_stream.sh to the host IP address of the cluster management node. If HTTPS is enabled for Coordinator, configure the related certificates. If HTTP is used, change HTTPS in the script to HTTP and delete the certificate configuration.
    bash generate_stream.sh

    HTTPS is recommended, as it is more secure than HTTP.
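    For readers who prefer to issue the request from Python rather than from generate_stream.sh, the sketch below shows how the target URL is assembled from the scheme, the management node IP address, and port 31015. It only builds the request and does not send it. The request path /generate_stream and the payload layout are assumptions inferred from the script name, not confirmed by this guide; take the real values from generate_stream.sh. The function name build_stream_request is hypothetical.

```python
import json
import urllib.request

NODE_PORT = 31015  # port opened on the node after deployment


def build_stream_request(host, payload, use_https=True, path="/generate_stream"):
    """Build (but do not send) a POST request to the inference interface.

    `path` and the structure of `payload` are assumptions; check
    generate_stream.sh for the values your deployment expects.
    """
    scheme = "https" if use_https else "http"
    url = f"{scheme}://{host}:{NODE_PORT}{path}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request (not executed here):
#   import ssl
#   ctx = ssl.create_default_context(cafile="ca.pem")  # hypothetical CA file
#   with urllib.request.urlopen(req, context=ctx) as resp:
#       for line in resp:
#           print(line.decode("utf-8"), end="")
```

    With HTTP, pass use_https=False and omit the context argument when sending.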

  11. Delete the prefill-decode cluster.
    To stop the prefill-decode service, or to change the service configuration and redeploy, run the following command to delete the deployed instance. Then redeploy an instance by referring to step 8.
    bash delete.sh mindie ./
    • mindie indicates the namespace created in step 1. Replace it as required.
    • The uninstallation script delete.sh must be executed in the examples/kubernetes_deploy_scripts directory. Otherwise, the service cannot be stopped and an error is reported.
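    Because delete.sh fails when run from the wrong directory, an automation wrapper may want to check the working directory first. The following is a minimal sketch under that assumption; the helper names in_deploy_scripts_dir and delete_command are hypothetical, and the command itself is the one shown above.

```python
from pathlib import PurePosixPath

# Directory from which delete.sh must be executed.
REQUIRED_DIR = PurePosixPath("examples/kubernetes_deploy_scripts")


def in_deploy_scripts_dir(cwd: str) -> bool:
    """Return True if cwd ends with examples/kubernetes_deploy_scripts."""
    return PurePosixPath(cwd).parts[-2:] == REQUIRED_DIR.parts


def delete_command(namespace: str) -> list:
    """Return the argument list for the uninstallation command."""
    return ["bash", "delete.sh", namespace, "./"]


# Example use (not executed here):
#   import os, subprocess
#   if in_deploy_scripts_dir(os.getcwd()):
#       subprocess.run(delete_command("mindie"), check=True)
```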