Sequence Parallelism

Sequence parallelism (SP) splits the KV cache across SP ranks so that each rank stores a different part of it. This reduces per-device memory usage and makes longer sequences feasible.
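The effect on per-rank storage can be illustrated with a small sketch. It assumes an even, contiguous split of token positions across SP ranks; the actual layout used by the inference engine is not described here and may differ (for example, it may split by cache block). The function name split_token_ranges is hypothetical and used only for illustration.

```python
def split_token_ranges(seq_len: int, sp: int) -> list[tuple[int, int]]:
    """Return one half-open (start, end) token range per SP rank.

    Assumption: an even, contiguous split. This is an illustration only,
    not the engine's documented partitioning scheme.
    """
    if sp < 1:
        raise ValueError("sp must be a positive integer")
    base, extra = divmod(seq_len, sp)
    ranges, start = [], 0
    for rank in range(sp):
        # The first `extra` ranks take one additional token each.
        size = base + (1 if rank < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges


ranges = split_token_ranges(seq_len=2560, sp=8)
print(ranges[0], ranges[-1])  # prints (0, 320) (2240, 2560)
```

Under this assumption, each rank holds the KV entries for roughly seq_len / sp tokens instead of the full sequence.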

Constraints

  • This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
  • Currently, only the W8A8 quantized models of DeepSeek-R1, the W4A8 quantized models of DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 support this feature.
  • This feature is supported in the prefill-decode disaggregation and prefill-decode hybrid deployment scenarios.
  • The value of SP must be equal to that of TP.
  • In the prefill-decode hybrid deployment scenario:
    • This feature can be used together with DP and TP. The product of DP and TP must equal the value of worldSize.
    • This feature can be used together with asynchronous scheduling and Prefix Cache, and it can be used in scenarios where MTP equals 1.
  • In the prefill-decode disaggregation scenario:
    • SP can be enabled on prefill nodes only. This feature can be used together with DP and TP. The product of DP and TP must equal the value of worldSize.
    • This feature can be used together with MTP, asynchronous scheduling, and Prefix Cache.
  • This feature supports only FP16 and does not support BF16.
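The numeric constraints above (SP equal to TP, and DP multiplied by TP equal to worldSize) can be checked before deployment. The following is a minimal sketch; the helper name check_sp_constraints is hypothetical and is not part of the product.

```python
def check_sp_constraints(world_size: int, dp: int, tp: int, sp: int) -> list[str]:
    """Return a list of violated constraints (an empty list means the values are consistent)."""
    problems = []
    if sp != tp:
        problems.append(f"sp ({sp}) must equal tp ({tp})")
    if dp * tp != world_size:
        problems.append(f"dp * tp ({dp * tp}) must equal worldSize ({world_size})")
    return problems


# Values taken from the sample configuration in Running Inference.
print(check_sp_constraints(world_size=16, dp=2, tp=8, sp=8))  # prints []
```

The check covers only the arithmetic rules. Hardware, model, and data type (FP16) requirements still have to be verified separately.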

Parameters

Table 1 describes the serving parameters required for enabling the SP feature.

Table 1 Supplementary parameters of SP: ModelConfig in ModelDeployConfig

  Parameter    Value Type    Value Range    Description
  ---------    ----------    -----------    --------------------------------------------------
  sp           Integer       sp = tp        Number of parts obtained after KV cache splitting.

Running Inference

  1. Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
  2. Set serving parameters. Add the sp field to the ModelConfig section of the Server's config.json file, as in the following example. For details about the parameter fields, see Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific).
    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "ModelConfig" : [
            {
                "modelInstanceType" : "Standard",
                "modelName" : "DeepSeek-R1_w8a8",
                "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
                "worldSize" : 16,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false,
                "dp": 2,
                "sp": 8,
                "tp": 8,
                "moe_ep": 16,
                "moe_tp": 1
            }
        ]
    }
    
  3. Start the service.
    ./bin/mindieservice_daemon
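Step 2 can also be scripted. The following is a minimal sketch that sets sp to the value of tp for every ModelConfig entry in a parsed configuration. Where ModelDeployConfig sits inside config.json is not shown in this section, so the sketch searches for the key instead of assuming a path. The "BackendConfig" wrapper in the sample and the helper names find_model_deploy_config and enable_sp are assumptions made for illustration.

```python
import json


def find_model_deploy_config(node):
    """Depth-first search for the first "ModelDeployConfig" object in parsed JSON."""
    if isinstance(node, dict):
        if "ModelDeployConfig" in node:
            return node["ModelDeployConfig"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_model_deploy_config(child)
        if found is not None:
            return found
    return None


def enable_sp(config: dict) -> dict:
    """Set sp = tp on every ModelConfig entry, modifying config in place."""
    deploy = find_model_deploy_config(config)
    if deploy is None:
        raise KeyError("ModelDeployConfig not found")
    for model in deploy["ModelConfig"]:
        model["sp"] = model["tp"]  # raises KeyError if tp is not set
    return config


# Hypothetical, trimmed-down configuration; the "BackendConfig" wrapper is assumed.
sample = json.loads("""
{"BackendConfig": {"ModelDeployConfig": {"ModelConfig": [
    {"modelName": "DeepSeek-R1_w8a8", "worldSize": 16, "dp": 2, "tp": 8}
]}}}
""")
enable_sp(sample)
print(sample["BackendConfig"]["ModelDeployConfig"]["ModelConfig"][0]["sp"])  # prints 8
```

To apply this to a real deployment, load conf/config.json with json.load, call enable_sp, and write the result back with json.dump. Back up the file first and restart the service afterward.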