Data Parallelism

Data parallelism (DP) splits inference requests into multiple batches and assigns each batch to a different compute device. The devices process their batches in parallel, and the results are then merged.
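The split, process, and merge flow can be illustrated with a small conceptual sketch. It is not how MindIE schedules requests internally: the function names, the round-robin split, and the thread pool standing in for compute devices are all assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_batches(requests, dp):
    """Distribute requests round-robin into dp batches, one per device."""
    return [requests[rank::dp] for rank in range(dp)]


def fake_device_infer(batch):
    """Hypothetical stand-in for a device running the model on its batch."""
    return [f"reply:{prompt}" for prompt in batch]


def data_parallel_infer(requests, dp):
    """Run every batch in parallel, then merge results in request order."""
    batches = split_batches(requests, dp)
    with ThreadPoolExecutor(max_workers=dp) as pool:
        per_device = list(pool.map(fake_device_infer, batches))
    # Merge step: undo the round-robin split so that output i
    # corresponds to request i.
    merged = [None] * len(requests)
    for rank, outputs in enumerate(per_device):
        merged[rank::dp] = outputs
    return merged


if __name__ == "__main__":
    print(data_parallel_infer(["a", "b", "c", "d", "e"], dp=2))
```

Each device sees only its own batch, so throughput scales with the number of DP groups as long as there are enough concurrent requests to keep all of them busy.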

Constraints

  • This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
  • The Attention and MLP modules of all models, and the LM Head module of DeepSeek-V2 support DP.
  • DP can be used together with tensor parallelism in the same module.

Parameters

Table 1 describes the supplementary parameters that need to be set to enable DP.

Table 1 Supplementary parameters of the DP feature: ModelConfig in ModelDeployConfig

Parameter: tp
Value type: int32_t
Value range:
  • If dp is not set or is set to -1, the value is identical to that of the worldSize parameter.
  • When used together with dp, the value of tp × dp must be equal to that of the worldSize parameter.

    For example, if worldSize is set to 8 and dp is set to 2, the value of tp must be 4.

Description: Number of tensor parallelism processes on the entire network. (Optional) The default value is the value of worldSize.

Parameter: dp
Value type: int32_t
Value range:
  • If this parallelism mode is not used, the value is -1.
  • When used together with tp, the value of dp × tp must be equal to that of the worldSize parameter.

    For example, if worldSize is set to 8 and tp is set to 4, the value of dp must be 2.

Description: Number of DP processes in the Attention module. (Optional) The default value is -1, indicating that DP is not performed.

Parameter: cp
Value type: int32_t
Value range:
  • If this parallelism mode is not used, the value is 1.
  • When used together with sp, the value of dp × tp × cp must be equal to that of the worldSize parameter, and dp must be 1.

    For example, if worldSize is set to 16, tp is set to 8, and sp is set to 8, dp must be 1 and cp must be 2.

Description: Number of context parallelism processes in the Attention module. (Optional) The default value is 1, indicating that context parallelism is not performed.

Parameter: sp
Value type: int32_t
Value range:
  • If this parallelism mode is not used, the value is 1.
  • When used together with tp, the value of sp must be equal to that of tp.

    For example, if worldSize is set to 16, tp is set to 8, and dp is set to 2, sp must be 8.

Description: Number of sequence parallelism processes in the Attention module. (Optional) The default value is 1, indicating that sequence parallelism is not performed.

If the preceding supplementary parameters are not configured, the tp and moe_tp parallelism modes are used by default during inference.
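The value-range rules in Table 1 can be checked before the service is started. The following is a minimal sketch, not part of MindIE: the function name check_parallel_config is hypothetical, and the rules are one reading of Table 1 (dp = -1 is treated as a DP degree of 1, an unset tp falls back to worldSize, and dp must be 1 only when cp and sp are both enabled).

```python
def check_parallel_config(world_size, tp=None, dp=-1, cp=1, sp=1):
    """Return a list of violated rules; an empty list means no conflict found."""
    errors = []
    # dp = -1 means DP is disabled, which behaves like a DP degree of 1.
    effective_dp = 1 if dp == -1 else dp
    # tp is optional and defaults to worldSize.
    effective_tp = world_size if tp is None else tp

    # dp x tp x cp must equal worldSize (cp = 1 reduces this to dp x tp).
    if effective_dp * effective_tp * cp != world_size:
        errors.append(
            f"dp*tp*cp = {effective_dp * effective_tp * cp}, "
            f"expected worldSize = {world_size}"
        )
    # When sp is enabled, it must equal tp.
    if sp != 1 and sp != effective_tp:
        errors.append(f"sp = {sp} must equal tp = {effective_tp}")
    # When cp is used together with sp, dp must be 1.
    if cp > 1 and sp > 1 and effective_dp != 1:
        errors.append(f"dp = {effective_dp} must be 1 when cp and sp are both used")
    return errors


if __name__ == "__main__":
    # Examples taken from Table 1; each prints an empty list.
    print(check_parallel_config(8, tp=4, dp=2))
    print(check_parallel_config(16, tp=8, dp=1, cp=2, sp=8))
    print(check_parallel_config(16, tp=8, dp=2, sp=8))
```

Returning a list of messages rather than raising on the first problem makes it possible to report every conflicting setting in one pass.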

Running Inference

Before you start, ensure that CANN and MindIE have been installed. For details, see MindIE Installation Guide.

  1. Set environment variables for optimizing device memory allocation.
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=3
  2. Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
  3. Set serving parameters. Add the corresponding parameters to the config.json file of the Server based on Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific). The parameter configuration example is as follows:
    "ModelConfig" : [
        {
            "modelInstanceType" : "Standard",
            "modelName" : "deepseekv2",
            "modelWeightPath" : "/home/data/DeepSeek-V2-Chat-W8A8-BF16/",
            "worldSize" : 8,
            "cpuMemSize" : 5,
            "npuMemSize" : 1,
            "backendType" : "atb",
            "trustRemoteCode" : false,
            "tp": 1,
            "dp": 8,
            "cp": 1,
            "sp": 1
        }
    ]

    In the preceding parameter settings, eight devices are used for inference, the Attention module uses DP (dp = 8, tp = 1), and the MoE module uses tensor parallelism.

  4. Start the service.
    ./bin/mindieservice_daemon
  5. Send an inference request. For details, see Inference API.
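Steps 2 and 3 can also be scripted instead of editing config.json by hand. The sketch below is an illustration under stated assumptions, not a MindIE tool: the helper names are hypothetical, and because the full nesting of config.json is not shown here, the code searches the parsed JSON for any "ModelConfig" list rather than assuming a fixed path.

```python
import json


def find_model_configs(node):
    """Yield every dict inside any "ModelConfig" list in the parsed JSON."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "ModelConfig" and isinstance(value, list):
                yield from (item for item in value if isinstance(item, dict))
            else:
                yield from find_model_configs(value)
    elif isinstance(node, list):
        for item in node:
            yield from find_model_configs(item)


def apply_parallel_settings(config, tp, dp, cp=1, sp=1):
    """Set tp/dp/cp/sp on each ModelConfig entry; return the number updated."""
    effective_dp = 1 if dp == -1 else dp  # -1 means DP is disabled
    updated = 0
    for entry in find_model_configs(config):
        if tp * effective_dp * cp != entry["worldSize"]:
            raise ValueError(
                f"tp*dp*cp must equal worldSize ({entry['worldSize']})"
            )
        entry.update({"tp": tp, "dp": dp, "cp": cp, "sp": sp})
        updated += 1
    return updated


def update_config_file(path, tp, dp, cp=1, sp=1):
    """Read config.json, apply the settings, and write the file back."""
    with open(path, encoding="utf-8") as handle:
        config = json.load(handle)
    updated = apply_parallel_settings(config, tp, dp, cp, sp)
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(config, handle, indent=4)
    return updated
```

For example, update_config_file("conf/config.json", tp=1, dp=8) would write the same tp, dp, cp, and sp values as the configuration example in step 3. Back up the file first, since the script rewrites it in place.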