Context Parallelism

Context parallelism (CP) parallelizes the self-attention module along the sequence dimension. CP splits a long input sequence into parts, assigns the parts to different devices for parallel processing, and thereby reduces the time to first token. The CP implementation works as follows:

  1. Each device computes attention for its own part of the sequence, and the devices pass KV values to one another in a ring so that each device can complete its block-based computation. The overall principle is similar to Ring Attention.
  2. The FlashAttention-2 algorithm is used to perform the block-based computation and to correct the partial results as each new block is processed.
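The two steps above can be illustrated with a small single-process NumPy sketch. It is not the MindIE implementation: the function names are hypothetical, the ring is simulated with a loop instead of device-to-device transfers, and the causal mask is omitted for brevity. It only shows how a running maximum and running sum let each rank merge KV blocks one at a time and still reproduce full softmax attention.

```python
import numpy as np

def attention_reference(q, k, v):
    """Plain softmax attention over the whole sequence (no causal mask)."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    probs = np.exp(scores)
    probs /= probs.sum(axis=-1, keepdims=True)
    return probs @ v

def ring_attention(q_shards, k_shards, v_shards):
    """Simulate CP ranks that each own one Q/K/V shard and pass KV around a ring."""
    cp = len(q_shards)
    outputs = []
    for rank in range(cp):
        q = q_shards[rank]
        scale = 1.0 / np.sqrt(q.shape[-1])
        running_max = np.full((q.shape[0], 1), -np.inf)  # row-wise max seen so far
        running_sum = np.zeros((q.shape[0], 1))          # softmax denominator so far
        acc = np.zeros((q.shape[0], v_shards[0].shape[-1]))
        for step in range(cp):
            src = (rank - step) % cp  # KV block this rank holds at this ring step
            scores = (q @ k_shards[src].T) * scale
            new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
            # Rescale earlier partial results to the new maximum (the "correction").
            correction = np.exp(running_max - new_max)
            probs = np.exp(scores - new_max)
            running_sum = running_sum * correction + probs.sum(axis=-1, keepdims=True)
            acc = acc * correction + probs @ v_shards[src]
            running_max = new_max
        outputs.append(acc / running_sum)
    return np.concatenate(outputs, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((8, 4)) for _ in range(3))
    out = ring_attention(np.split(q, 2), np.split(k, 2), np.split(v, 2))
    print(np.allclose(out, attention_reference(q, k, v)))
```

Each simulated rank only ever touches one KV block at a time, which is why the per-device memory for attention scores scales with the block length rather than the full sequence length.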

Constraints

  • This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
  • Currently, only the W8A8 quantization models of DeepSeek-R1, W4A8 quantization models of DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 support this feature.
  • Currently, CP cannot be enabled independently. To enable CP, SP must be enabled at the same time.
  • This feature is supported in the prefill-decode disaggregation and prefill-decode hybrid deployment scenarios.
  • In the prefill-decode hybrid deployment scenario:
    • This feature can be used together with SP and TP. When CP is enabled, the value of DP must be 1, the value of SP must be equal to that of TP, and the product of CP, DP, and TP must be equal to the value of worldSize.
    • This feature can be used together with asynchronous scheduling and Prefix Cache, and it can be used in scenarios where MTP equals 1.
  • In the prefill-decode disaggregation scenario:
    • CP can be enabled on prefill nodes only. It can be used together with SP and TP. When CP is enabled, the value of DP must be 1, the value of SP must be equal to that of TP, and the product of CP, DP, and TP must be equal to the value of worldSize.
    • This feature can be used together with MTP, asynchronous scheduling, and Prefix Cache.
  • This feature supports only FP16 and does not support BF16.
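The parallelism constraints listed above can be checked before a configuration is deployed. The following is a minimal sketch, not part of MindIE: the function name check_cp_layout is hypothetical, and it encodes only the numeric rules stated in this section (cp is 1 or 2; when CP is enabled, dp is 1, sp equals tp, and cp × dp × tp equals worldSize).

```python
def check_cp_layout(world_size, dp, cp, sp, tp):
    """Return the list of violated constraints; an empty list means none were found."""
    problems = []
    if cp not in (1, 2):
        problems.append("cp must be 1 (disabled) or 2")
    if cp > 1:
        # The remaining rules apply only when CP is enabled.
        if dp != 1:
            problems.append("dp must be 1 when CP is enabled")
        if sp != tp:
            problems.append("sp must equal tp when CP is enabled")
        if cp * dp * tp != world_size:
            problems.append("cp * dp * tp must equal worldSize")
    return problems

if __name__ == "__main__":
    # Values taken from the sample configuration in this section.
    print(check_cp_layout(world_size=16, dp=1, cp=2, sp=8, tp=8))
```

For example, worldSize 16 with dp 1, cp 2, sp 8, and tp 8 satisfies every rule, because 2 × 1 × 8 = 16.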

Parameters

Table 1 describes the serving parameters required for enabling the CP feature.

Table 1 Supplementary parameters: ModelConfig in ModelDeployConfig

Parameter   Value Type   Value Range   Description
---------   ----------   -----------   -----------
cp          Integer      [1, 2]        Number of parts obtained after an input sequence is split.
                                       1: indicates that the CP feature is disabled.
                                       2: indicates that the input sequence is split into two parts.
                                       Currently, if the CP feature is enabled, the number of split parts can only be 2.

Running Inference

  1. Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
  2. Set serving parameters. Add the cp field to ModelConfig in the config.json file of the Server, as in the "cp": 2 entry of the following example. For details about the parameter fields, see Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific).
    "ModelDeployConfig" :
    {
        "maxSeqLen" : 2560,
        "maxInputTokenLen" : 2048,
        "truncation" : false,
        "ModelConfig" : [
            {
                "modelInstanceType" : "Standard",
                "modelName" : "DeepSeek-R1_w8a8",
                "modelWeightPath" : "/data/weights/DeepSeek-R1_w8a8",
                "worldSize" : 16,
                "cpuMemSize" : 5,
                "npuMemSize" : -1,
                "backendType" : "atb",
                "trustRemoteCode" : false,
                "dp": 1,
                "cp": 2,
                "sp": 8,
                "tp": 8,
                "moe_ep": 16,
                "moe_tp": 1
            }
        ]
    }
    
  3. Start the service.
    ./bin/mindieservice_daemon
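If the configuration is edited by a script rather than by hand, step 2 can be sketched with Python's standard json module. This is an illustrative helper, not a MindIE tool: the function names are hypothetical, and because the fragment above does not show where ModelDeployConfig sits inside config.json, the sketch searches the parsed document for that key instead of assuming a path.

```python
import json

def find_model_deploy_config(node):
    """Depth-first search for the first "ModelDeployConfig" object in parsed JSON."""
    if isinstance(node, dict):
        if "ModelDeployConfig" in node:
            return node["ModelDeployConfig"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_model_deploy_config(child)
        if found is not None:
            return found
    return None

def set_parallel_fields(config, **fields):
    """Write fields such as dp, cp, sp, and tp into every ModelConfig entry."""
    deploy = find_model_deploy_config(config)
    if deploy is None:
        raise KeyError("ModelDeployConfig not found")
    for model in deploy["ModelConfig"]:
        model.update(fields)
    return config

if __name__ == "__main__":
    # In-memory stand-in for the parsed config.json; only a few fields are shown.
    config = {
        "ModelDeployConfig": {
            "ModelConfig": [{"modelName": "DeepSeek-R1_w8a8", "worldSize": 16}]
        }
    }
    set_parallel_fields(config, dp=1, cp=2, sp=8, tp=8)
    print(json.dumps(config, indent=4))
```

To apply this to the real file, load conf/config.json with json.load, call set_parallel_fields, and write the result back with json.dump. Keeping a backup copy of the original file first is a sensible precaution.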