Configuration Parameters (Model Side)
On the model side, the configuration file is located in the atb-models installation directory at ${ATB_SPEED_HOME_PATH}/atb_llm/conf/config.json.
```json
{
  "llm": {
    "ccl": {
      "enable_mc2": "true"
    },
    "stream_options": {
      "micro_batch": "false"
    },
    "engine": {
      "graph": "cpp"
    },
    "parallel_options": {
      "o_proj_local_tp": -1,
      "dense_mlp_local_tp": -1,
      "lm_head_local_tp": -1,
      "hccl_buffer": 128,
      "hccl_moe_ep_buffer": 512,
      "hccl_moe_tp_buffer": 64
    },
    "pmcc_obfuscation_options": {
      "enable_model_obfuscation": false,
      "data_obfuscation_ca_dir": "",
      "kms_agent_port": 1024
    },
    "kv_cache_options": {
      "enable_nz": false
    },
    "weights_options": {
      "low_cpu_memory_mode": false
    },
    "enable_reasoning": "false",
    "tool_call_options": {
      "tool_call_parser": ""
    },
    "chat_template": "",
    "ep_level": 1,
    "communication_backend": {
      "prefill": "lccl",
      "decode": "lccl"
    }
  },
  "models": {
    "qwen_moe": {
      "eplb": {
        "level": 0,
        "expert_map_file": ""
      },
      "ep_level": 2
    },
    "deepseekv2": {
      "eplb": {
        "level": 0,
        "expert_map_file": "",
        "num_redundant_experts": 0,
        "aggregate_threshold": 128,
        "num_expert_update_ready_countdown": 16
      },
      "ep_level": 1,
      "enable_dispatch_combine_v2": true,
      "communication_backend": {
        "prefill": "lccl",
        "decode": "lccl"
      },
      "mix_shared_routing": false,
      "enable_gmmswigluquant": false,
      "enable_oproj_prefetch": false,
      "enable_mlapo_prefetch": false,
      "num_dangling_shared_experts": 0,
      "enable_swiglu_quant_for_shared_experts": false,
      "enable_init_routing_cutoff": false,
      "topk_scaling_factor": 1.0,
      "h3p": {
        "enable_qkvdown_dp": "true",
        "enable_gating_dp": "true",
        "enable_shared_expert_dp": "false",
        "enable_shared_expert_overlap": "false"
      }
    }
  }
}
```
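As a minimal sketch of how a script might locate and read this file, the Python below builds the path from ATB_SPEED_HOME_PATH and loads the JSON. The helper names (config_path, load_config, as_bool) are hypothetical and not part of atb-models. The sample file stores some flags as the strings "true" and "false" and others as JSON booleans, so as_bool accepts both; how atb-models itself interprets these values is not stated here.

```python
import json
import os
from pathlib import Path


def config_path(env=os.environ):
    """Build the config.json path from ATB_SPEED_HOME_PATH (hypothetical helper)."""
    home = env.get("ATB_SPEED_HOME_PATH")
    if not home:
        raise RuntimeError("ATB_SPEED_HOME_PATH is not set")
    return Path(home) / "atb_llm" / "conf" / "config.json"


def load_config(path):
    """Parse the configuration file into a plain dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def as_bool(value):
    """Normalize a flag given as a JSON boolean or as the string "true"/"false"."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value.lower() in ("true", "false"):
        return value.lower() == "true"
    raise ValueError(f"not a boolean flag: {value!r}")
```

A caller could then read a flag with as_bool(load_config(config_path())["llm"]["ccl"]["enable_mc2"]).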
Parameters in llm
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| enable_reasoning | Bool | | Whether to enable model output parsing. The output is parsed into the reasoning content and content fields. Mandatory. The default value is false. This function can be enabled only for the Qwen3-32B, Qwen3-30B-A3B, DeepSeek-R1-671B, and DeepSeek-V3.1 models. |
| chat_template | String | | Custom dialog template that replaces the model's default template. |
| tool_call_options | | | |
| tool_call_parser | String | | Parser used for tool calls when Function Call is enabled. |
| ccl | | | |
| enable_mc2 | Bool | | Whether to enable the communication-computing fused operator feature. |
| stream_options | | | |
| micro_batch | Bool | | Whether to enable the communication-computing dual-stream overlapping feature. |
| engine | | | |
| graph | String | | Graph implementation to use: the cpp graph or the Python graph. |
| parallel_options | | | |
| o_proj_local_tp | Integer | [1, worldSize / Number of nodes] | Split count for the Attention O matrix. |
| lm_head_local_tp | Integer | [1, worldSize / Number of nodes] | Tensor parallel split count for the LmHead layer. |
| hccl_buffer | Integer | ≥ 1 | Buffer size of the shared data in all communicators except the MoE communicators. |
| hccl_moe_ep_buffer | Integer | ≥ 512 | Buffer size of the shared data in the MoE EP communicator. |
| hccl_moe_tp_buffer | Integer | ≥ 64 | Buffer size of the shared data in the MoE TP communicator. |
| kv_cache_options | | | |
| enable_nz | Bool | | Whether to enable the NZ format for the KV cache. |
| weights_options | | | |
| low_cpu_memory_mode | Bool | | Whether to enable the low CPU memory usage mode. |
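The value ranges listed for parallel_options can be checked before a service is started. The following is a rough sketch, not the validation that atb-models performs: check_parallel_options is a hypothetical helper, world_size and num_nodes are supplied by the caller, and treating -1 (the value in the sample file) as "leave the split unset" is an assumption, because the table does not describe that value.

```python
def check_parallel_options(opts, world_size, num_nodes):
    """Return a list of range violations for llm.parallel_options."""
    errors = []
    max_tp = world_size // num_nodes  # upper bound given in the table

    for key in ("o_proj_local_tp", "lm_head_local_tp"):
        value = opts.get(key, -1)
        # Assumption: -1, as in the sample file, means the split is not configured.
        if value != -1 and not 1 <= value <= max_tp:
            errors.append(f"{key}={value} is outside [1, {max_tp}]")

    # Lower bounds taken from the Value Range column.
    minimums = {"hccl_buffer": 1, "hccl_moe_ep_buffer": 512, "hccl_moe_tp_buffer": 64}
    for key, minimum in minimums.items():
        if key in opts and opts[key] < minimum:
            errors.append(f"{key}={opts[key]} is below the minimum {minimum}")

    return errors
```

An empty list means that every value present falls inside its documented range.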
Parameters in models
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| deepseekv2 | map | - | deepseekv2 configuration. For details, see Parameters in deepseekv2. |
Parameters in deepseekv2
| Parameter | Value Type | Value Range | Description |
|---|---|---|---|
| ep_level | Integer | [1, 2] | Implementation form of expert parallelism (EP). 1: EP based on AllGather communication. 2: EP based on AllToAll and communication-computing fusion. |
| topk_scaling_factor | Float | (0, 1] | Top k result truncation parameter. |
| enable_init_routing_cutoff | Bool | | Whether to allow top k result truncation. |
| alltoall_ep_buffer_scale_factors | list[list[int, float]] | Each member of the list contains two numbers: a non-negative integer followed by a floating-point number greater than 0. Members are sorted in descending order of the first number. | Size of the AllToAll communication buffer. Each inner list contains two elements: the sequence length and the buffer coefficient. The sequence length is the condition for selecting the buffer coefficient. Example: [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]] |
| num_dangling_shared_experts | Integer | Positive integer | Number of external shared experts. Currently supported only on the 144-card Atlas 800I A3 SuperPoD Server with load balancing disabled. The recommended value is 32. The default value is 0 (disabling the feature). |
| enable_mlapo_prefetch | Bool | | Enables or disables mlapo prefetch. Default value: false |
| enable_oproj_prefetch | Bool | | Enables or disables oproj prefetch. For the Atlas 800I A2 inference server, you are advised not to enable this feature. For the Atlas 800I A3 SuperPoD Server, you are advised to enable this feature together with OprojTp, and to set OprojTp to 2. Default value: false |
| eplb | | | |
| level | Integer | [0, 3] | Default value: 0 |
| expert_map_file | String | The file path exists. | Path of the expert deployment table for static load balancing in redundancy mode. Default value: "" |
| num_redundant_experts | Integer | [0, n_routed_experts] | This parameter is not supported in the current version. Number of redundant experts. Default value: 0 |
| aggregate_threshold | Integer | ≥ 1 | This parameter is not supported in the current version. Frequency of triggering the dynamic EPLB algorithm, in decoding steps. For example, 50 means that the dynamic EPLB algorithm is triggered once every 50 decoding steps. If the algorithm determines that expert popularity exceeds a certain threshold, it adjusts the routing table to reduce that popularity. |
| buffer_expert_layer_num | Integer | [1, num_moe_layers] | This parameter is not supported in the current version. Number of layers transferred by dynamic EPLB each time. Because weight transfer is asynchronous, extra buffer memory is required to hold the new weights being transferred without affecting the ongoing decoding. When this parameter is set to 1, one layer is transferred at a time, and then the weights and routing table of that layer are updated. The affected memory is calculated as buffer_expert_layer_num × local_experts_num × 44 MB (44 MB is the size of an int8 expert). |
| num_expert_update_ready_countdown | Integer | ≥ 1 | This parameter is not supported in the current version. Frequency of checking whether the host-to-device transfer is complete, in decoding steps. Because weight transfer is asynchronous, the weights and routing table can be updated only after the transfer has completed on all EP cards, and this check requires communication. When many layers are being transferred, the check frequency can be reduced to lower the overhead on the EPLB framework side. |
| h3p | | | |
| enable_qkvdown_dp | Bool | | Whether to enable the "qkvdown dp" feature to reduce the computing and communication traffic and improve the performance in the prefill phase. Default value: true |
| enable_gating_dp | Bool | | Whether to enable the "gating dp" feature to reduce the computing and communication traffic and improve the performance in the prefill phase. Default value: true. This feature is supported only when ep_level is set to 1. |
| enable_shared_expert_dp | Bool | | Whether to enable the "shared expert dp" feature to improve the performance in the prefill phase. Default value: false |
| enable_shared_expert_overlap | Bool | | Whether to enable the communication-computing dual-stream overlapping feature for shared experts to improve the performance in the prefill phase in specific scenarios (the input sequence length is 2K to 16K). Default value: false |
| enable_dispatch_combine_v2 | Bool | | Whether to enable the v2 version of the dispatch and combine operators when ep_level is set to 2 to improve the performance in the decoding phase. Default value: true |
| mix_shared_routing | Bool | | Whether to merge shared experts and routed experts so that they are computed in parallel. |
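Two of the deepseekv2 entries describe small calculations that may be easier to follow as code. The sketch below is illustrative only and both function names are hypothetical. For alltoall_ep_buffer_scale_factors, it assumes that the first entry whose sequence-length threshold does not exceed the actual sequence length is the one selected; the table states only that the sequence length is the selection condition and that entries are sorted in descending order. For buffer_expert_layer_num, it applies the memory formula given in the table.

```python
def select_buffer_scale(factors, seq_len):
    """Pick a buffer coefficient from alltoall_ep_buffer_scale_factors.

    Assumption: entries are sorted by threshold in descending order and the
    first entry whose threshold is <= seq_len is selected.
    """
    for threshold, scale in factors:
        if seq_len >= threshold:
            return scale
    raise ValueError("no entry matches; the list is expected to end with threshold 0")


def eplb_buffer_mb(buffer_expert_layer_num, local_experts_num, expert_mb=44):
    """Extra memory in MB: buffer_expert_layer_num x local_experts_num x 44 MB."""
    return buffer_expert_layer_num * local_experts_num * expert_mb


# Example list copied from the table above.
EXAMPLE_FACTORS = [
    [1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8],
    [32768, 3.0], [8192, 5.2], [0, 8.0],
]
```

Under the stated assumption, a sequence length of 10000 falls in the 8192 bracket of the example list. With buffer_expert_layer_num set to 1 and a hypothetical 8 local experts per card, the formula gives 1 × 8 × 44 MB = 352 MB.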