Configuration Parameters (Model Side)

The model-side configuration file is located in the atb-models installation directory at ${ATB_SPEED_HOME_PATH}/atb_llm/conf/config.json.
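
As a minimal sketch of how a script might locate and read this file, the Python below builds the path from the ATB_SPEED_HOME_PATH environment variable and parses the JSON. The function names model_config_path and load_model_config are hypothetical helpers for illustration and are not part of atb_llm.

```python
import json
import os
from pathlib import Path


def model_config_path(home=None):
    """Return the config.json path under the atb-models installation directory."""
    # ATB_SPEED_HOME_PATH is the environment variable named in the text above.
    # Passing `home` explicitly is a convenience added for this sketch.
    home = home or os.environ["ATB_SPEED_HOME_PATH"]
    return Path(home) / "atb_llm" / "conf" / "config.json"


def load_model_config(home=None):
    """Parse config.json and return its contents as a dict."""
    with open(model_config_path(home), encoding="utf-8") as f:
        return json.load(f)
```

With the environment variable set, load_model_config()["llm"] would return the llm section described below.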

The format of the model configuration file config.json is as follows:

{
  "llm": {
    "ccl": {
      "enable_mc2": "true"
    },
    "stream_options": {
      "micro_batch": "false"
    },
    "engine": {
      "graph": "cpp"
    },
    "parallel_options": {
      "o_proj_local_tp": -1,
      "dense_mlp_local_tp": -1,
      "lm_head_local_tp": -1,
      "hccl_buffer": 128,
      "hccl_moe_ep_buffer": 512,
      "hccl_moe_tp_buffer": 64
    },
    "pmcc_obfuscation_options": {
      "enable_model_obfuscation": false,
      "data_obfuscation_ca_dir": "",
      "kms_agent_port": 1024
    },
    "kv_cache_options": {
      "enable_nz": false
    },
    "weights_options": {
      "low_cpu_memory_mode": false
    },
    "enable_reasoning": "false",
    "tool_call_options": {
      "tool_call_parser": ""
    },
    "chat_template": "",
    "ep_level": 1,
    "communication_backend": {
      "prefill": "lccl",
      "decode": "lccl"
    }
  },
  "models": {
    "qwen_moe": {
      "eplb": {
        "level": 0,
        "expert_map_file": ""
      },
      "ep_level": 2
    },
    "deepseekv2": {
      "eplb": {
        "level": 0,
        "expert_map_file": "",
        "num_redundant_experts": 0,
        "aggregate_threshold": 128,
        "num_expert_update_ready_countdown": 16
      },
      "ep_level": 1,
      "enable_dispatch_combine_v2": true,
      "communication_backend": {
        "prefill": "lccl",
        "decode": "lccl"
      },
      "mix_shared_routing": false,
      "enable_gmmswigluquant": false,
      "enable_oproj_prefetch": false,
      "enable_mlapo_prefetch": false,
      "num_dangling_shared_experts": 0,
      "enable_swiglu_quant_for_shared_experts": false,
      "enable_init_routing_cutoff": false,
      "topk_scaling_factor": 1.0,
      "h3p": {
        "enable_qkvdown_dp": "true",
        "enable_gating_dp": "true",
        "enable_shared_expert_dp": "false",
        "enable_shared_expert_overlap": "false"
      }
    }
  }
}
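
The sample above writes some Bool settings as strings ("true", "false") and others as bare JSON booleans (true, false). A script that inspects the file should not call bool() on the raw value, because in Python bool("false") is True. The helper below is a sketch that accepts both spellings. The name as_bool is hypothetical, and treating the two spellings as equivalent is an assumption based on the Bool value type in the tables, not a statement about how atb_llm itself parses the file.

```python
def as_bool(value):
    """Coerce a JSON boolean or a "true"/"false" string to a Python bool."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str):
        text = value.strip().lower()
        if text in ("true", "false"):
            return text == "true"
    # Anything else is rejected rather than guessed at.
    raise ValueError(f"not a boolean setting: {value!r}")
```

For example, as_bool("true") and as_bool(True) both give True, and as_bool("false") gives False.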

Parameters in llm

Parameter	Value Type	Value Range	Description
enable_reasoning	Bool	true false	Whether to parse the model output into separate reasoning content and content fields. true: enable. false: disable. Mandatory. Default value: false. This feature can be enabled only for the Qwen3-32B, Qwen3-30B-A3B, DeepSeek-R1-671B, and DeepSeek-V3.1 models.
chat_template	String	Path to a file in .jinja format, or ""	Custom chat template that replaces the model's default one. Default value: "". For DeepSeek models, the default chat_template in tokenizer_config.json does not support tool calling; use this parameter to supply a chat_template that does. Custom templates can be supplied for DeepSeek, Qwen (large language model), ChatGLM, and Llama models.
tool_call_options
tool_call_parser	String	A name registered for a ToolsCallProcessor (for details, see Table 2), or ""	Tool-call parsing mode used when Function Call is enabled. Default value: "". If this parameter is not set or is set to an invalid value, the default tool parsing mode of the current model is used. When DeepSeek-V3.1 uses Function Call, this parameter must be set to deepseek_v31; for other models, use the default value. This parameter is used together with chat_template: the ToolsCallProcessor is selected based on the Function Call format specified in chat_template.
ccl
enable_mc2	Bool	true false	Whether to enable the communication-computing fused operator feature. Default value: true. This feature cannot be enabled together with the communication-computing dual-stream overlapping feature.
stream_options
micro_batch	Bool	true false	Whether to enable the communication-computing dual-stream overlapping feature. Default value: false. This feature cannot be enabled together with the communication-computing fused operator feature or with the Python graph. Only the Qwen2.5-14B, Qwen3-14B, DeepSeek-R1, and DeepSeek-V3.1 models support it. Enabling it occupies extra graphics memory. In serving scenarios, the resulting reduction in KV cache capacity affects scheduling and lowers throughput, so you are advised not to enable it when graphics memory is limited.
engine
graph	String	cpp python	Selects the cpp graph or the Python graph. Default value: cpp. Only the Llama3.1-8B, Qwen2.5-7B, Qwen3-14B, and Qwen3-32B models support the Python graph. The low CPU and memory usage mode (low_cpu_memory_mode) requires the Python graph.
parallel_options
o_proj_local_tp	Integer	[1, worldSize / Number of nodes]	Split count for the Attention O matrix. Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature. Default value: -1, indicating that splitting is disabled
lm_head_local_tp	Integer	[1, worldSize / Number of nodes]	Tensor parallel split count for the LmHead layer. Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature. Default value: -1, indicating that splitting is disabled
hccl_buffer	Integer	≥ 1	Buffer size of the shared data in communicators other than the MoE communicator. Default value: 128. If the value is too large, an "out of memory" error is reported. The default value is recommended.
hccl_moe_ep_buffer	Integer	≥ 512	Buffer size of the shared data in the MoE EP communicator. Default value: 512. If the value is too large, an "out of memory" error is reported. The default value is recommended.
hccl_moe_tp_buffer	Integer	≥ 64	Buffer size of the shared data in the MoE TP communicator. Default value: 64. If the value is too large, an "out of memory" error is reported. The default value is recommended.
kv_cache_options
enable_nz	Bool	true false	Specifies whether to enable the NZ format for the KV cache. Only the DeepSeek-R1, DeepSeek-V3, and DeepSeek-V3.1 models support this feature. The NZ format is automatically enabled in the FA3 quantization scenario. Default value: false
weights_options
low_cpu_memory_mode	Bool	true false	Specifies whether to enable the low CPU and memory usage mode. This feature must be enabled together with the Python graph. Only the Qwen2.5-7B model supports this feature. The default value is false (disabling the feature). NOTE: After this function is enabled, model parameters will be loaded tensor by tensor in the weight loading phase, which significantly reduces the CPU and memory usage. This function is especially suitable for memory-limited scenarios such as edge devices and small-specification servers. In an environment with sufficient CPU and memory resources, you are advised to disable this function to reduce the loading time.
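
Several rows in this table constrain one another: enable_mc2 and micro_batch are mutually exclusive, micro_batch cannot be combined with the Python graph, and low_cpu_memory_mode requires the Python graph. The sketch below checks a parsed llm section against those three rules only. The function names are hypothetical, the defaults are the ones listed in the table, and model-specific support restrictions are not checked.

```python
def as_flag(value):
    # Accept both JSON booleans and the "true"/"false" strings used in the sample.
    return value is True or (isinstance(value, str) and value.lower() == "true")


def check_llm_options(llm):
    """Return a list of conflicts found in the "llm" section (empty if none)."""
    problems = []
    # Missing keys fall back to the default values documented in the table.
    mc2 = as_flag(llm.get("ccl", {}).get("enable_mc2", True))
    micro_batch = as_flag(llm.get("stream_options", {}).get("micro_batch", False))
    graph = llm.get("engine", {}).get("graph", "cpp")
    low_mem = as_flag(llm.get("weights_options", {}).get("low_cpu_memory_mode", False))

    if mc2 and micro_batch:
        problems.append("enable_mc2 and micro_batch cannot both be enabled")
    if micro_batch and graph == "python":
        problems.append("micro_batch cannot be enabled with the Python graph")
    if low_mem and graph != "python":
        problems.append("low_cpu_memory_mode requires the Python graph")
    return problems
```

Returning a list rather than raising on the first conflict lets a caller report every problem in one pass.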

Parameters in models

Parameter	Value Type	Value Range	Description
deepseekv2	map	-	deepseekv2 configuration. For details, see Parameters in deepseekv2.

Parameters in deepseekv2

Parameter	Value Type	Value Range	Description
ep_level	Integer	[1,2]	Implementation form of expert parallelism (EP). 1: EP based on AllGather communication. 2: EP based on AllToAll and communication-computing fusion.
topk_scaling_factor	Float	(0,1]	Truncation parameter for top-k results. When ep_level is set to 1, the latter part of hidden_states on each device is invalid data, so you can set this parameter to truncate it and reduce the graphics memory overhead. enable_init_routing_cutoff must also be set to true.
enable_init_routing_cutoff	Bool	true false	Whether to allow truncation of top-k results. Default value: false (feature disabled). This parameter can be set when ep_level is set to 1.
alltoall_ep_buffer_scale_factors	list[list[int, float]]	Each member of the list contains two numbers: a non-negative integer followed by a floating-point number greater than 0. Members are sorted in descending order of the first number.	Size of the AllToAll communication buffer. In each inner list, the first number is a sequence length and the second is a buffer coefficient; the sequence length is the condition for selecting the buffer coefficient. Example: [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8], [32768, 3.0], [8192, 5.2], [0, 8.0]]. You are advised to configure this parameter when ep_level is set to 2 and you need fine-grained control of graphics memory. This parameter does not take effect when ep_level is set to 1.
num_dangling_shared_experts	Integer	Positive integer	Number of external shared experts. Currently supported only on a 144-card Atlas 800I A3 SuperPoD Server with load balancing disabled. Recommended value: 32. Default value: 0 (feature disabled).
enable_mlapo_prefetch	Bool	true false	Whether to enable mlapo prefetch. true: enable. false: disable. Default value: false.
enable_oproj_prefetch	Bool	true false	Whether to enable oproj prefetch. true: enable. false: disable. Default value: false. On the Atlas 800I A2 inference server, you are advised not to enable this feature. On the Atlas 800I A3 SuperPoD Server, you are advised to enable this feature together with OprojTp, and to set OprojTp to 2.
eplb
level	Integer	[0, 3]	0: disables load balancing. 1: enables static load balancing in redundancy mode. 2: enables dynamic load balancing in redundancy mode (not supported currently). 3: enables forcible load balancing. Default value: 0
expert_map_file	String	The file path exists.	Path of the expert deployment table for static load balancing in redundancy mode. Default value: ""
num_redundant_experts	Integer	[0, n_routed_experts]	This parameter is not supported in the current version. Number of redundant experts. Default value: 0
aggregate_threshold	Integer	≥ 1	This parameter is not supported in the current version. Frequency of triggering the dynamic EPLB algorithm, measured in decoding steps. For example, 50 means that the algorithm is triggered once every 50 decoding steps. If the algorithm determines that the popularity exceeds a certain threshold, it adjusts the routing table to reduce it.
buffer_expert_layer_num	Integer	[1, num_moe_layers]	This parameter is not supported in the current version. Number of layers transferred by dynamic EPLB each time. Because weight transfer is asynchronous, an extra buffer memory is required to store the new weight that is being transferred without affecting the original decoding. When this parameter is set to 1, only one layer is transferred at a time, and then the weight and routing table of the layer are updated. The formula for calculating the affected memory is as follows: buffer_expert_layer_num × local_experts_num × 44 MB (44 MB is the size of an int8 expert).
num_expert_update_ready_countdown	Integer	≥ 1	This parameter is not supported in the current version. Frequency of checking whether the host-to-device transfer is complete, measured in decoding steps. Because weight transfer is asynchronous, the weights and routing table can be updated only after the transfer has finished on all EP cards, and checking this requires communication. When many layers are being transferred, you can check less often to lower the overhead on the EPLB framework side.
h3p
enable_qkvdown_dp	Bool	true false	Whether to enable the "qkvdown dp" feature to reduce the computing and communication traffic and improve the performance in the prefill phase. Default value: true
enable_gating_dp	Bool	true false	Whether to enable the "gating dp" feature to reduce the computing and communication traffic and improve the performance in the prefill phase. Default value: true. This feature is supported only when ep_level is set to 1.
enable_shared_expert_dp	Bool	true false	Whether to enable the "shared expert dp" feature to improve the performance in the prefill phase. Default value: false. This feature is supported only when ep_level is set to 1. Enabling it occupies extra graphics memory, which may cause an "out of memory" error. You are advised to retain the default value.
enable_shared_expert_overlap	Bool	true false	Whether to enable the communication-computing dual-stream overlapping feature for shared experts to improve the performance in the prefill phase in specific scenarios (input sequence length of 2K to 16K). Default value: false. This feature is supported only when ep_level is set to 1 and enable_shared_expert_dp is set to true. Enabling it occupies extra graphics memory, which may cause an "out of memory" error. You are advised to retain the default value.
enable_dispatch_combine_v2	Bool	true false	Whether to enable the v2 version of the dispatch and combine operators when ep_level is set to 2 to improve the performance in the decoding phase. Default value: true
mix_shared_routing	Bool	true false	Whether to merge shared experts and routed experts so that they are computed in parallel. This feature cannot be used together with the CP feature. In the prefill-decode disaggregation scenario, it can be enabled only on the decode node. Default value: false
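
Two rows in this table involve small calculations: the buffer memory formula given for buffer_expert_layer_num, and the lookup of a buffer coefficient in alltoall_ep_buffer_scale_factors. The sketch below illustrates both. The function names are hypothetical, and the matching rule in select_buffer_scale (use the first entry, in descending order, whose sequence length does not exceed the actual one) is an assumption: the table only states that the sequence length is the condition for selecting the coefficient.

```python
def eplb_buffer_mb(buffer_expert_layer_num, local_experts_num, expert_mb=44):
    """Extra buffer memory in MB: layers x local experts x 44 MB per int8 expert."""
    return buffer_expert_layer_num * local_experts_num * expert_mb


def select_buffer_scale(factors, seq_len):
    """Pick a coefficient from [[seq_len_threshold, coefficient], ...].

    ASSUMPTION: entries are scanned in their configured (descending) order and
    the first one whose threshold is <= seq_len is used.
    """
    for threshold, scale in factors:
        if seq_len >= threshold:
            return scale
    raise ValueError("no entry matches; a [0, x] entry acts as a catch-all")


# Example value copied from the alltoall_ep_buffer_scale_factors row.
FACTORS = [[1048576, 1.32], [524288, 1.4], [262144, 1.53], [131072, 1.8],
           [32768, 3.0], [8192, 5.2], [0, 8.0]]
```

Under that assumption, select_buffer_scale(FACTORS, 10000) gives 5.2, and eplb_buffer_mb(1, 8) gives 352 for a hypothetical device holding 8 local experts.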
