Configuration Parameters (Model Side)

On the model side, the configuration file is located in the atb-models installation directory at ${ATB_SPEED_HOME_PATH}/atb_llm/conf/config.json.

The format of the model configuration file config.json is as follows:
{
  "llm": {
    "ccl": {
      "enable_mc2": "true"
    },
    "stream_options": {
      "micro_batch": "false"
    },
    "engine": {
      "graph": "cpp"
    },
    "parallel_options": {
      "o_proj_local_tp": -1,
      "dense_mlp_local_tp": -1,
      "lm_head_local_tp": -1,
      "hccl_buffer": 128,
      "hccl_moe_ep_buffer": 512,
      "hccl_moe_tp_buffer": 64
    },
    "pmcc_obfuscation_options": {
      "enable_model_obfuscation": false,
      "data_obfuscation_ca_dir": "",
      "kms_agent_port": 1024
    },
    "kv_cache_options": {
      "enable_nz": false
    },
    "weights_options": {
      "low_cpu_memory_mode": false
    },
    "enable_reasoning": "false",
    "tool_call_options": {
        "tool_call_parser": ""
    },
    "chat_template": "",
    "ep_level": 1,
    "communication_backend": {
        "prefill": "lccl",
        "decode": "lccl"
    }
  },
  "models": {
    "qwen_moe": {
      "eplb": {
        "level": 0,
        "expert_map_file": ""
      },
      "ep_level": 2
    },
    "deepseekv2": {
      "eplb": {
        "level": 0,
        "expert_map_file": "",
        "num_redundant_experts": 0,
        "aggregate_threshold": 128,
        "num_expert_update_ready_countdown": 16
      },
      "ep_level": 1,
      "enable_dispatch_combine_v2": true,
      "communication_backend": {
        "prefill":"lccl",
        "decode": "lccl"
      },
      "mix_shared_routing": false,
      "enable_gmmswigluquant": false,
      "enable_oproj_prefetch": false,
      "enable_mlapo_prefetch": false,
      "num_dangling_shared_experts": 0,
      "enable_swiglu_quant_for_shared_experts": false,
      "enable_init_routing_cutoff": false,
      "topk_scaling_factor": 1.0,
      "h3p":{
        "enable_qkvdown_dp": "true",
        "enable_gating_dp": "true",
        "enable_shared_expert_dp": "false",
        "enable_shared_expert_overlap": "false"
      }
    }
  }
}
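
As a minimal sketch, the file can be read with Python's standard json module. The helper names below (default_config_path, load_config, as_bool) are illustrative and are not part of atb-models; only the path layout is taken from the text above. The sample stores some switches as the strings "true"/"false" (for example enable_mc2) and others as JSON booleans (for example enable_nz), so the sketch accepts both forms.

```python
import json
import os
from pathlib import Path


def default_config_path(env=None):
    """Build the config.json path from ATB_SPEED_HOME_PATH."""
    env = os.environ if env is None else env
    return Path(env["ATB_SPEED_HOME_PATH"]) / "atb_llm" / "conf" / "config.json"


def load_config(path):
    """Parse the model-side config.json into a dict."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def as_bool(value):
    """Accept a JSON boolean or the strings "true"/"false"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"
```

With ATB_SPEED_HOME_PATH set, load_config(default_config_path()) returns the parsed file, and as_bool(config["llm"]["ccl"]["enable_mc2"]) reads one switch from it.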

Parameters in llm

Parameter

Value Type

Value Range

Description

enable_reasoning

Bool

  • true
  • false

Whether to parse the model output into separate reasoning content and content fields.

  • false: disable
  • true: enable

Mandatory. The default value is false.

This function can be enabled only for the Qwen3-32B, Qwen3-30B-A3B, DeepSeek-R1-671B, and DeepSeek-V3.1 models.

chat_template

String

  • File path in .jinja format.
  • ""

Path of a custom chat template that replaces the model's default template.

  • Default value: ""
  • For DeepSeek models, the default chat_template in tokenizer_config.json does not support tool calling. Use this parameter to supply a chat_template that does.
  • This parameter can be used to supply a custom template for DeepSeek, Qwen (large language model), ChatGLM, and Llama models.

tool_call_options

tool_call_parser

String

  • One of the registered ToolsCallProcessor names. For details, see Table 2.
  • ""

Tool-call parsing mode used when Function Call is enabled.

  • Default value: ""
  • If this parameter is not set or is set to an incorrect value, the default tool parsing mode of the current model will be used.
  • When DeepSeek V3.1 uses Function Call, this parameter must be set to deepseek_v31. For other models, use the default value.
  • This parameter is used together with chat_template. The corresponding ToolsCallProcessor is selected based on the Function Call format specified in chat_template.

ccl

enable_mc2

Bool

  • true
  • false

Whether to enable the communication-computing fused operator feature.

  • Default value: true
  • This feature cannot be enabled together with the communication-computing dual-stream overlapping feature.

stream_options

micro_batch

Bool

  • true
  • false

Whether to enable the communication-computing dual-stream overlapping feature.

  • This feature cannot be enabled together with the communication-computing fused operator feature.
  • This feature cannot be enabled together with the Python graph.
  • Only the Qwen2.5-14B, Qwen3-14B, DeepSeek-R1, and DeepSeek-V3.1 models support this feature.
  • Enabling this feature occupies extra graphics memory. In serving scenarios, a reduced number of KV caches affects scheduling and lowers throughput, so do not enable this feature when graphics memory is limited.
  • Default value: false
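
The exclusions listed for enable_mc2 and micro_batch can be checked before a service is started. The following is a minimal sketch, not the product's own validation logic; check_overlap_conflicts and as_bool are hypothetical helper names, and the fallback values are the defaults documented in this table.

```python
def as_bool(value):
    """Accept a JSON boolean or the strings "true"/"false"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def check_overlap_conflicts(llm):
    """Return the conflicts found in the "llm" section; an empty list means none."""
    mc2 = as_bool(llm.get("ccl", {}).get("enable_mc2", True))  # default: true
    micro_batch = as_bool(
        llm.get("stream_options", {}).get("micro_batch", False)  # default: false
    )
    graph = llm.get("engine", {}).get("graph", "cpp")  # default: cpp

    problems = []
    if micro_batch and mc2:
        problems.append("micro_batch and enable_mc2 cannot both be enabled")
    if micro_batch and graph == "python":
        problems.append("micro_batch cannot be used with the Python graph")
    return problems
```

Because enable_mc2 defaults to true, turning on micro_batch in this sketch also requires setting enable_mc2 to false.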

engine

graph

String

  • cpp
  • python

Selects the cpp graph or the Python graph.

  • Only the Llama3.1-8B, Qwen2.5-7B, Qwen3-14B, and Qwen3-32B models support the Python graph.
  • To enable the low CPU and memory usage mode (low_cpu_memory_mode), you need to enable the Python graph.
  • Default value: cpp

parallel_options

o_proj_local_tp

Integer

[1, worldSize / Number of nodes]

Split count for the Attention O matrix.

  • Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature.
  • Default value: -1, indicating that splitting is disabled

lm_head_local_tp

Integer

[1, worldSize / Number of nodes]

Tensor parallel split count for the LmHead layer.

  • Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature.
  • Default value: -1, indicating that splitting is disabled

hccl_buffer

Integer

≥ 1

Buffer size of the shared data in communicators except the MoE communicator.

  • Default value: 128
  • If the value is too large, an "out of memory" error is reported. The default value is recommended.

hccl_moe_ep_buffer

Integer

≥ 512

Buffer size of the shared data in the MoE EP communicator.

  • Default value: 512
  • If the value is too large, an "out of memory" error is reported. The default value is recommended.

hccl_moe_tp_buffer

Integer

≥ 64

Buffer size of the shared data in the MoE TP communicator.

  • Default value: 64
  • If the value is too large, an "out of memory" error is reported. The default value is recommended.
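
The value ranges listed for parallel_options can be checked with a few comparisons. This is a sketch under stated assumptions: check_parallel_options is a hypothetical helper, world_size and num_nodes are supplied by the caller, and "worldSize / Number of nodes" is treated as integer division.

```python
def check_parallel_options(opts, world_size, num_nodes):
    """Check parallel_options against the documented value ranges."""
    problems = []
    # ASSUMPTION: the upper bound is the integer number of devices per node.
    max_local_tp = world_size // num_nodes
    for key in ("o_proj_local_tp", "lm_head_local_tp"):
        value = opts.get(key, -1)
        # -1 is the documented default and means splitting is disabled.
        if value != -1 and not 1 <= value <= max_local_tp:
            problems.append(f"{key} must be -1 or in [1, {max_local_tp}]")

    minimums = {"hccl_buffer": 1, "hccl_moe_ep_buffer": 512, "hccl_moe_tp_buffer": 64}
    for key, minimum in minimums.items():
        if key in opts and opts[key] < minimum:
            problems.append(f"{key} must be >= {minimum}")
    return problems
```

The sketch checks only the lower bounds of the buffer sizes. No upper bound is documented, so an overly large value still has to be caught by watching for "out of memory" errors at run time.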

kv_cache_options

enable_nz

Bool

  • true
  • false

Specifies whether to enable the NZ format for the KV cache.

  • Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature. The NZ format is automatically enabled in the FA3 quantization scenario.
  • Default value: false

weights_options

low_cpu_memory_mode

Bool

  • true
  • false

Specifies whether to enable the low CPU and memory usage mode.

  • This feature must be enabled together with the Python graph.
  • Only the Qwen2.5-7B model supports this feature.
  • The default value is false (disabling the feature).
    NOTE:

    After this function is enabled, model parameters are loaded tensor by tensor in the weight loading phase, which significantly reduces CPU and memory usage. This function is especially suitable for memory-limited scenarios such as edge devices and low-specification servers. In an environment with sufficient CPU and memory resources, you are advised to disable this function to reduce the loading time.
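
Because low_cpu_memory_mode must be enabled together with the Python graph, both settings are best changed in one step. The following sketch shows one way to do that on a parsed config.json; enable_low_cpu_memory_mode is a hypothetical helper name, not an atb-models function.

```python
import copy


def enable_low_cpu_memory_mode(config):
    """Return a copy of config with low_cpu_memory_mode and the Python graph enabled."""
    updated = copy.deepcopy(config)
    llm = updated.setdefault("llm", {})
    # low_cpu_memory_mode must be enabled together with the Python graph.
    llm.setdefault("engine", {})["graph"] = "python"
    llm.setdefault("weights_options", {})["low_cpu_memory_mode"] = True
    return updated
```

The input dict is left unchanged, so the original settings remain available if the change needs to be rolled back. The model restrictions listed above still apply and are not checked by this sketch.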

Parameters in models

Parameter

Value Type

Value Range

Description

deepseekv2

map

-

deepseekv2 configuration. For details, see Parameters in deepseekv2.

Parameters in deepseekv2

Parameter

Value Type

Value Range

Description

ep_level

Integer

[1,2]

Implementation form of expert parallelism (EP).

1: EP based on AllGather communication

2: EP based on AllToAll and communication-computing fusion

topk_scaling_factor

Float

(0,1]

Top k result truncation parameter.

  • When ep_level is set to 1, the trailing part of hidden_states on each device contains invalid data. Set this truncation parameter to reduce the graphics memory overhead.
  • enable_init_routing_cutoff must also be set to true.

enable_init_routing_cutoff

Bool

  • true
  • false

Whether to allow top k result truncation.

  • The default value is false (disabling the feature).
  • This parameter can be set when ep_level is set to 1.
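
The relationship between topk_scaling_factor, enable_init_routing_cutoff, and ep_level can be sketched as a small check. check_topk_cutoff is a hypothetical helper. Treating a factor of 1.0 as "no truncation requested" and falling back to ep_level 1 when the key is absent are assumptions of this sketch, not documented behavior.

```python
def check_topk_cutoff(model_cfg):
    """Check the top-k truncation settings of one entry under "models"."""
    ep_level = model_cfg.get("ep_level", 1)  # ASSUMPTION: fallback when unset
    cutoff = model_cfg.get("enable_init_routing_cutoff", False)
    factor = model_cfg.get("topk_scaling_factor", 1.0)

    problems = []
    if not 0 < factor <= 1:
        problems.append("topk_scaling_factor must be in (0, 1]")
    if cutoff and ep_level != 1:
        problems.append("enable_init_routing_cutoff applies only when ep_level is 1")
    # ASSUMPTION: a factor below 1.0 means truncation is wanted.
    if factor < 1 and not cutoff:
        problems.append("topk_scaling_factor needs enable_init_routing_cutoff set to true")
    return problems
```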

alltoall_ep_buffer_scale_factors

list[list[int, float]]

Each member of the list contains two numbers. The first is a non-negative integer, and the second is a floating-point number greater than 0.

The members are sorted in descending order of the first number.

Controls the size of the AllToAll communication buffer. Each inner list contains two elements: a sequence length and a buffer coefficient. The sequence length is the condition for selecting the buffer coefficient. Example:

[[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]]

  • You are advised to configure this parameter when ep_level is set to 2 and you need fine-grained control over graphics memory.
  • This parameter does not take effect when ep_level is set to 1.
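
The format rules for alltoall_ep_buffer_scale_factors can be checked as sketched below. The text does not state exactly how the sequence length selects a coefficient. pick_coefficient therefore ASSUMES that the first entry whose sequence length is less than or equal to the actual length is used, which is consistent with the descending order and the [0, 8.0] catch-all in the example. Both function names are hypothetical.

```python
def validate_scale_factors(factors):
    """Check the documented shape: [non-negative int, float > 0], descending by the first number."""
    lengths = []
    for entry in factors:
        if len(entry) != 2:
            raise ValueError(f"expected [seq_len, coefficient], got {entry!r}")
        seq_len, coeff = entry
        if not isinstance(seq_len, int) or seq_len < 0:
            raise ValueError(f"sequence length must be a non-negative integer: {seq_len!r}")
        if not coeff > 0:
            raise ValueError(f"coefficient must be greater than 0: {coeff!r}")
        lengths.append(seq_len)
    if lengths != sorted(lengths, reverse=True):
        raise ValueError("entries must be sorted in descending order of sequence length")


def pick_coefficient(factors, seq_len):
    """ASSUMPTION: use the first entry whose threshold is <= seq_len."""
    for threshold, coeff in factors:
        if seq_len >= threshold:
            return coeff
    raise ValueError("no entry matches; add a [0, x] entry as a catch-all")


factors = [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8],
           [32768, 3.0], [8192, 5.2], [0, 8.0]]
validate_scale_factors(factors)
print(pick_coefficient(factors, 10000))  # 5.2 under the assumption above
```

Under this reading, shorter sequences get a larger coefficient, which matches the trend of the example values.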

num_dangling_shared_experts

Integer

Positive integer

Number of external shared experts.

Currently, this feature is supported only in a 144-card Atlas 800I A3 SuperPoD Server deployment with load balancing disabled. The recommended value is 32.

The default value is 0 (disabling the feature).

enable_mlapo_prefetch

Bool

  • true
  • false

Enables or disables mlapo prefetch.

  • true: enable
  • false: disable

Default value: false

enable_oproj_prefetch

Bool

  • true
  • false

Enables or disables oproj prefetch.

For the Atlas 800I A2 inference server, you are advised not to enable this feature. For the Atlas 800I A3 SuperPoD Server, you are advised to enable this feature and OprojTp at the same time, and set OprojTp to 2.

  • true: enable
  • false: disable

Default value: false

eplb

level

Integer

[0, 3]

  • 0: disables load balancing.
  • 1: enables static load balancing in redundancy mode.
  • 2: enables dynamic load balancing in redundancy mode (not supported currently).
  • 3: enables forcible load balancing.

Default value: 0

expert_map_file

String

Path of an existing file.

Path of the expert deployment table for static load balancing in redundancy mode.

Default value: ""

num_redundant_experts

Integer

[0, n_routed_experts]

This parameter is not supported in the current version.

Number of redundant experts.

Default value: 0

aggregate_threshold

Integer

≥ 1

This parameter is not supported in the current version.

Interval at which the dynamic EPLB algorithm is triggered, measured in decoding steps.

For example, 50 indicates that the dynamic EPLB algorithm is triggered once every 50 decoding steps. If the algorithm determines that the popularity exceeds a certain threshold, it adjusts the routing table to reduce that popularity.

buffer_expert_layer_num

Integer

[1, num_moe_layers]

This parameter is not supported in the current version.

Number of layers transferred by dynamic EPLB each time.

Because weight transfer is asynchronous, extra buffer memory is required to hold the new weights being transferred without affecting ongoing decoding. When this parameter is set to 1, only one layer is transferred at a time, and then the weights and routing table of that layer are updated.

The extra memory is calculated as follows: buffer_expert_layer_num × local_experts_num × 44 MB (44 MB is the size of an int8 expert).
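
As a worked example of this formula, the sketch below computes the buffer size for a hypothetical deployment with 8 local experts per device. The value 8 and the function name eplb_buffer_mb are illustrative only.

```python
EXPERT_SIZE_MB = 44  # size of one int8 expert, as stated above


def eplb_buffer_mb(buffer_expert_layer_num, local_experts_num):
    """Extra buffer memory in MB: layers per transfer x local experts x 44 MB."""
    return buffer_expert_layer_num * local_experts_num * EXPERT_SIZE_MB


# Hypothetical deployment with 8 local experts per device.
print(eplb_buffer_mb(1, 8))  # 352
print(eplb_buffer_mb(4, 8))  # 1408
```

Transferring four layers at a time instead of one quadruples the buffer, from 352 MB to 1408 MB in this example.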

num_expert_update_ready_countdown

Integer

≥ 1

This parameter is not supported in the current version.

Interval, in decoding steps, at which to check whether the host-to-device transfer is complete.

Because weight transfer is asynchronous, the weights and routing table can be updated only after the transfer has completed on all EP cards. This check introduces communication. When many layers are being transferred, checking less frequently lowers the overhead on the EPLB framework side.

h3p

enable_qkvdown_dp

Bool

  • true
  • false

Whether to enable the "qkvdown dp" feature, which reduces computation and communication traffic and improves performance in the prefill phase.

Default value: true

enable_gating_dp

Bool

  • true
  • false

Whether to enable the "gating dp" feature, which reduces computation and communication traffic and improves performance in the prefill phase.

Default value: true

This feature is supported only when ep_level is set to 1.

enable_shared_expert_dp

Bool

  • true
  • false

Whether to enable the "shared expert dp" feature to improve the performance in the prefill phase.

Default value: false

  • This feature is supported only when ep_level is set to 1.
  • Enabling this feature occupies extra graphics memory, which may cause an "out of memory" error. You are advised to retain the default value.

enable_shared_expert_overlap

Bool

  • true
  • false

Whether to enable the communication-computing dual-stream overlapping feature for shared experts to improve the performance in the prefill phase in specific scenarios (the input sequence length is 2K to 16K).

Default value: false

  • This feature is supported only when ep_level is set to 1 and enable_shared_expert_dp is set to true.
  • Enabling this feature occupies extra graphics memory, which may cause an "out of memory" error. You are advised to retain the default value.
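
The dependencies among the h3p switches can be sketched as a check on one model entry. check_h3p and as_bool are hypothetical helpers. The sketch flags only switches that are explicitly enabled in the file, and the fallback of ep_level to 1 is an assumption.

```python
def as_bool(value):
    """Accept a JSON boolean or the strings "true"/"false"."""
    if isinstance(value, bool):
        return value
    return str(value).strip().lower() == "true"


def check_h3p(model_cfg):
    """Return the h3p settings that conflict with the documented constraints."""
    h3p = model_cfg.get("h3p", {})
    ep_level = model_cfg.get("ep_level", 1)  # ASSUMPTION: fallback when unset

    problems = []
    # These three switches are documented as supported only when ep_level is 1.
    ep1_only = ("enable_gating_dp", "enable_shared_expert_dp",
                "enable_shared_expert_overlap")
    if ep_level != 1:
        for name in ep1_only:
            if as_bool(h3p.get(name, False)):
                problems.append(f"{name} is supported only when ep_level is 1")

    overlap = as_bool(h3p.get("enable_shared_expert_overlap", False))
    shared_dp = as_bool(h3p.get("enable_shared_expert_dp", False))
    if overlap and not shared_dp:
        problems.append("enable_shared_expert_overlap requires enable_shared_expert_dp")
    return problems
```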

enable_dispatch_combine_v2

Bool

  • true
  • false

Whether to enable the v2 version of the dispatch and combine operators when ep_level is set to 2 to improve the performance in the decoding phase.

Default value: true

mix_shared_routing

Bool

  • true
  • false

Whether to merge shared experts and routed experts so that they are computed in parallel.

  • This feature cannot be used together with the CP feature.
  • In the prefill-decode disaggregation scenario, this feature can be enabled only on the decode node.
  • Default value: false