Data Parallelism
Data parallelism (DP) splits inference requests into multiple batches and assigns each batch to a different compute device. The devices process their batches in parallel, and the results are then merged.
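The split, process, and merge flow described above can be illustrated with a minimal Python sketch. It is only a conceptual illustration and not how MindIE schedules requests: the round-robin split, the thread pool standing in for devices, and the function names `split_into_batches`, `run_on_device`, and `data_parallel_infer` are all assumptions made for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_batches(requests, num_devices):
    """Deal requests round-robin into one batch per device, keeping original positions."""
    batches = [[] for _ in range(num_devices)]
    for index, request in enumerate(requests):
        batches[index % num_devices].append((index, request))
    return batches


def run_on_device(device_id, batch):
    """Hypothetical stand-in for a model replica: tags each request with its device."""
    return [(index, f"device{device_id}:{request}") for index, request in batch]


def data_parallel_infer(requests, num_devices):
    batches = split_into_batches(requests, num_devices)
    # Each worker thread plays the role of one compute device.
    with ThreadPoolExecutor(max_workers=num_devices) as pool:
        partial_results = list(pool.map(run_on_device, range(num_devices), batches))
    # Merge step: restore the original request order.
    merged = sorted(pair for part in partial_results for pair in part)
    return [output for _, output in merged]


results = data_parallel_infer(["a", "b", "c", "d", "e"], num_devices=2)
print(results)
```

Each device sees only its own batch, so every device needs a full copy of whatever module is being run under DP; the merge step only has to reassemble the outputs.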
Constraints
- This feature is supported by the Atlas 800I A2 inference server and Atlas 800I A3 SuperPoD Server.
- The Attention and MLP modules of all models, and the LM Head module of DeepSeek-V2 support DP.
- DP can be used together with tensor parallelism in the same module.
Parameters
Table 1 describes the supplementary parameters that need to be set to enable DP.
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| tp | int32_t | | Number of tensor parallelism processes on the entire network. (Optional) The default value is the value of worldSize. |
| dp | int32_t | | Number of DP processes in the Attention module. (Optional) The default value is -1, indicating that DP is not performed. |
| cp | int32_t | | Number of context parallelism processes in the Attention module. (Optional) The default value is 1, indicating that context parallelism is not performed. |
| sp | int32_t | | Number of sequence parallelism processes in the Attention module. (Optional) The default value is 1, indicating that sequence parallelism is not performed. |
If the preceding supplementary parameters are not configured, the tp and moe_tp parallelism modes are used by default during inference.
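As a rough illustration of the defaults listed in Table 1, the following Python sketch fills in missing values for a single model configuration entry. The function name `resolve_parallel_settings` and the `dp_configured` flag are hypothetical and exist only for this example; how the Server resolves these values internally is not described here.

```python
def resolve_parallel_settings(model_config):
    """Apply the Table 1 defaults to one model configuration entry (illustrative only)."""
    world_size = model_config["worldSize"]
    settings = {
        "tp": model_config.get("tp", world_size),  # default: value of worldSize
        "dp": model_config.get("dp", -1),          # default: -1, DP is not performed
        "cp": model_config.get("cp", 1),           # default: 1, no context parallelism
        "sp": model_config.get("sp", 1),           # default: 1, no sequence parallelism
    }
    # Hypothetical convenience flag: False while dp is still at its default of -1.
    settings["dp_configured"] = settings["dp"] != -1
    return settings


defaults_only = resolve_parallel_settings({"worldSize": 8})
with_dp = resolve_parallel_settings({"worldSize": 8, "tp": 1, "dp": 8})
print(defaults_only)
print(with_dp)
```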
- Set the environment variables that optimize NPU memory allocation.
    export PYTORCH_NPU_ALLOC_CONF=expandable_segments:True
    export ATB_WORKSPACE_MEM_ALLOC_ALG_TYPE=3
- Open the config.json file of the Server.
    cd {MindIE installation directory}/latest/mindie-service/
    vi conf/config.json
- Set serving parameters. Add the corresponding parameters to the config.json file of the Server based on Table 1. For details about the serving parameters, see Configuration Parameters (Service-Specific). The parameter configuration example is as follows:
    "ModelConfig" : [
        {
            "modelInstanceType" : "Standard",
            "modelName" : "deepseekv2",
            "modelWeightPath" : "/home/data/DeepSeek-V2-Chat-W8A8-BF16/",
            "worldSize" : 8,
            "cpuMemSize" : 5,
            "npuMemSize" : 1,
            "backendType" : "atb",
            "trustRemoteCode" : false,
            "tp": 1,
            "dp": 8,
            "cp": 1,
            "sp": 1
        }
    ]
  In the preceding parameter settings, eight devices are used for inference, the Attention module uses DP, and the MoE model uses tensor parallelism.
- Start the service.
./bin/mindieservice_daemon
- Send an inference request. For details, see Inference API.
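The step that adds tp, dp, cp, and sp to config.json can also be scripted. The sketch below is one possible approach, not an official tool: the helper names `find_model_configs` and `set_parallel_parameters` are hypothetical, and because the layout of config.json around "ModelConfig" is not shown here, the code searches for that key instead of assuming a fixed path. The "Example" wrapper key in the sample data is likewise made up for the demonstration.

```python
import json


def find_model_configs(node):
    """Return the first list stored under a key named "ModelConfig", searching nested data."""
    if isinstance(node, dict):
        if isinstance(node.get("ModelConfig"), list):
            return node["ModelConfig"]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_model_configs(child)
        if found is not None:
            return found
    return None


def set_parallel_parameters(config, tp, dp, cp=1, sp=1):
    """Write the Table 1 parameters into every ModelConfig entry, in place."""
    model_configs = find_model_configs(config)
    if model_configs is None:
        raise KeyError("no ModelConfig list found in the configuration")
    for entry in model_configs:
        entry.update({"tp": tp, "dp": dp, "cp": cp, "sp": sp})
    return config


# Sample data only; "Example" is a hypothetical wrapper key, not the real file layout.
sample = {"Example": {"ModelConfig": [{"modelName": "deepseekv2", "worldSize": 8}]}}
updated = set_parallel_parameters(sample, tp=1, dp=8)
print(json.dumps(updated, indent=2))
```

To apply this to a real file, load it with `json.load`, call `set_parallel_parameters`, and write the result back with `json.dump`, keeping a backup of the original config.json first.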