Configuration Parameters (Service-Specific)

  • The Server configuration file is config.json, which is stored in {MindIE installation directory}/latest/mindie-service/conf/config.json.
  • When reading the configuration file, the system checks the file size first. If the file size is not in the range (0 MB, 10 MB], the system fails to read the file.
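
The size check described above can be sketched in Python as follows. This is a minimal illustration, not the Server's actual implementation: the names `config_size_ok` and `load_config` are hypothetical, and 10 MB is assumed to mean 10 × 1024 × 1024 bytes.

```python
import json
import os

# Assumption: 1 MB is interpreted as 1024 * 1024 bytes.
MAX_CONFIG_BYTES = 10 * 1024 * 1024


def config_size_ok(size_bytes: int) -> bool:
    """Return True if the size lies in the range (0 MB, 10 MB]."""
    return 0 < size_bytes <= MAX_CONFIG_BYTES


def load_config(path: str) -> dict:
    """Check the file size first, then parse the JSON content."""
    size = os.path.getsize(path)
    if not config_size_ok(size):
        raise ValueError(f"config file size {size} bytes is outside (0 MB, 10 MB]")
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)
```

An empty file and a file one byte larger than 10 MB are both rejected before any parsing takes place.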

Parameters in the Configuration File

Parameter

Value Type

Value Range

Description

Version

std::string

"1.0.0"

Configuration file version, which is fixed at 1.0.0 and cannot be modified.

ServerConfig

map

-

Server configuration, such as the IP address and port, network request settings, and network security settings. For details, see Parameters in ServerConfig.

BackendConfig

map

-

Model backend configuration, including scheduling and model configurations. For details, see Parameters in BackendConfig.

LogConfig

map

-

Log-level configuration. For details, see Parameters in LogConfig.

EnableDynamicAdjustTimeoutConfig

bool

  • true
  • false

Dynamic configuration parameter of the timeout interval.

If this parameter is set to true, all inference-related timeout intervals are dynamically set to the maximum value.

Parameters in ServerConfig

Parameter

Value Type

Value Range

Description

ipAddress

std::string

IPv4 or IPv6 address.

Mandatory. The default value is 127.0.0.1.

IP address bound to the service-plane RESTful API provided by EndPoint.

  • If MIES_CONTAINER_IP exists, its value is used as the service-plane IP address.
  • If MIES_CONTAINER_IP does not exist, the value of ipAddress is used.
NOTE:

Listening on the all-zero address invalidates triple-plane isolation and does not meet security configuration requirements, so the IP address cannot be set to 0.0.0.0 by default. If the IP address must be set to 0.0.0.0, also set allowAllZeroIpListening to true in the configuration file.

managementIpAddress

std::string

IPv4 or IPv6 address.

Optional. The default value is 127.0.0.2.

IP address bound to the internal RESTful API provided by EndPoint.

  • If MIES_CONTAINER_MANAGEMENT_IP exists, its value is used as the internal API IP address.
  • Otherwise, if managementIpAddress exists, its value is used. If neither exists, the value of ipAddress is used as the internal API IP address.
  • If multiple IP addresses are used, the initial values of ipAddress and managementIpAddress must be changed accordingly.
NOTE:

Listening on the all-zero address invalidates triple-plane isolation and does not meet security configuration requirements, so the IP address cannot be set to 0.0.0.0 by default. If the IP address must be set to 0.0.0.0, also set allowAllZeroIpListening to true in the configuration file.
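
The precedence rules for ipAddress and managementIpAddress can be sketched as follows. This is an illustration only: it assumes that MIES_CONTAINER_IP and MIES_CONTAINER_MANAGEMENT_IP are environment variables and that an empty value counts as "does not exist". The function names are hypothetical.

```python
import os
from typing import Mapping, Optional


def resolve_service_ip(server_config: Mapping[str, str],
                       env: Optional[Mapping[str, str]] = None) -> str:
    """MIES_CONTAINER_IP takes priority; otherwise ipAddress is used."""
    env = os.environ if env is None else env
    return env.get("MIES_CONTAINER_IP") or server_config["ipAddress"]


def resolve_management_ip(server_config: Mapping[str, str],
                          env: Optional[Mapping[str, str]] = None) -> str:
    """Order: MIES_CONTAINER_MANAGEMENT_IP, managementIpAddress, ipAddress."""
    env = os.environ if env is None else env
    return (env.get("MIES_CONTAINER_MANAGEMENT_IP")
            or server_config.get("managementIpAddress")
            or server_config["ipAddress"])
```

Passing `env` explicitly makes the lookup easy to try out without touching the real process environment.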

port

int32_t

[1024, 65535]

Mandatory. The default value is 1025.

Port number bound to the service-plane RESTful API provided by EndPoint.

If the IP address of a physical machine (PM) or host is used for communication, ensure that the port number does not conflict.

managementPort

int32_t

[1024, 65535]

Optional. The default value is 1026.

Port number bound to the internal APIs provided by EndPoint. (For details about the internal APIs, see Table 1.)

The service plane and internal APIs can be deployed in one of four ways:
  • Multiple IP addresses with multiple port numbers (recommended)
  • Multiple IP addresses with a single port number
  • Single IP address with multiple port numbers
  • Single IP address with a single port number

metricsPort

int32_t

[1024, 65535]

Optional. The default value is 1027.

Port number of the service management and control metric API (Prometheus format). The value can be the same as or different from that of managementPort.

allowAllZeroIpListening

Bool

  • true
  • false

Mandatory. The default and recommended value is false. If the value is true, listening on the all-zero address may pose risks to the user environment, so the environment must provide its own protection.

Whether to allow listening on the all-zero IP address (0.0.0.0).

  • true: allows listening on the all-zero IP address.
  • false: does not allow listening on the all-zero IP address.

maxLinkNum

uint32_t

[1, 4096]

Mandatory. The default value is 1000.

Maximum number of concurrent RESTful requests supported by EndPoint.

With this setting, up to maxLinkNum requests are processed concurrently and up to 2 × maxLinkNum further requests wait in the queue. The request at position (3 × maxLinkNum + 1) is therefore rejected.

The recommended value is 300. This parameter is affected by model performance. Typically, 1,000 concurrent requests can be used only for a small model with short sequence lengths.
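
The arithmetic behind maxLinkNum can be illustrated with a short Python sketch. The function name and the returned labels are hypothetical, and `position` is assumed to be the 1-based position of a request among all requests currently held by EndPoint.

```python
def classify_request(position: int, max_link_num: int) -> str:
    """Classify the request at a 1-based position among in-flight requests."""
    if position < 1 or max_link_num < 1:
        raise ValueError("position and max_link_num must be positive")
    if position <= max_link_num:
        return "processing"          # first maxLinkNum requests run concurrently
    if position <= 3 * max_link_num:
        return "queued"              # next 2 x maxLinkNum requests wait
    return "rejected"                # position 3 x maxLinkNum + 1 and beyond
```

With the recommended value of 300, requests 1 to 300 are processed, requests 301 to 900 wait in the queue, and request 901 is rejected.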

httpsEnabled

Bool

  • true
  • false

Mandatory. The default and recommended value is true. You are advised to enable this function. If it is disabled, high network security risks exist.

Whether to enable HTTPS communication security authentication.

  • true: enables HTTPS communication.
  • false: disables HTTPS communication.

If this parameter is set to false, subsequent HTTPS communication parameters are ignored.

fullTextEnabled

Bool

  • true
  • false

Optional. The default value is false.

Whether the streaming API returns all historical results.

  • true: the streaming API returns all historical results.
  • false: the streaming API does not return all historical results.

tlsCaPath

std::string

The length range of the absolute file path is [1, 4096]. The actual path is the combined result of the project path and tlsCaPath.

Root certificate path. Only the relative path under the software package installation path is supported.

This parameter takes effect when httpsEnabled is set to true and is mandatory in this case. The default value is security/ca/.

tlsCaFile

std::set<std::string>

The length range of the absolute file path is [1, 4096]. The minimum number of elements in the list is 1 and the maximum number is 3.

List of root certificate names on the service plane.

This parameter takes effect when httpsEnabled is set to true and is mandatory in this case. The default value is ["ca.pem"].

tlsCert

std::string

The length range of the absolute file path is [1, 4096]. The actual path is the combined result of the project path and tlsCert.

Path of the service certificate file on the service plane. Only the relative path under the software package installation path is supported.

This parameter takes effect when httpsEnabled is set to true and is mandatory in this case. The default value is security/certs/server.pem.
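
The path rules for the certificate parameters can be sketched as follows. This is an illustration under stated assumptions: the "project path" is taken to be the software package installation directory, `resolve_under_install_dir` is a hypothetical helper, and rejecting paths that escape the installation directory is one possible reading of "only the relative path under the software package installation path is supported".

```python
import os

MAX_PATH_LEN = 4096  # upper bound of the documented range [1, 4096]


def resolve_under_install_dir(install_dir: str, relative_path: str) -> str:
    """Combine the installation directory with a relative path and validate it."""
    if os.path.isabs(relative_path):
        raise ValueError("only relative paths are supported")
    base = os.path.normpath(install_dir)
    full = os.path.normpath(os.path.join(base, relative_path))
    # Assumption: paths that leave the installation directory (e.g. via "..")
    # are not allowed.
    if os.path.commonpath([base, full]) != base:
        raise ValueError("path escapes the installation directory")
    if not 1 <= len(full) <= MAX_PATH_LEN:
        raise ValueError("combined path length is outside [1, 4096]")
    return full
```

For example, with a hypothetical installation directory `/opt/example`, the default tlsCert value `security/certs/server.pem` resolves to `/opt/example/security/certs/server.pem`.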

tlsPk

std::string

The length range of the absolute file path is [1, 4096]. The actual path is the combined result of the project path and tlsPk.

Path of the private key file of the service certificate on the service plane. Only the relative path under the software package installation path is supported. The length of the private key must be greater than or equal to 3072.

This parameter takes effect when httpsEnabled is set to true and is mandatory in this case. The default value is security/keys/server.key.pem.

tlsPkPwd

std::string

The length range of the absolute file path is [0, 4096]. The actual path is the combined result of the project path and tlsPkPwd.

Path of the file that stores the encryption password for the private key of the service certificate on the service plane. Only the relative path under the software package installation path is supported.

This parameter takes effect when httpsEnabled is set to true. This parameter is optional. The default value is security/pass/key_pwd.txt.

If the private key is encrypted but this file is not provided, the system prompts you to enter the private key encryption password in the interactive window when the system is started.

tlsCrlPath

std::string

The path length range of tlsCrlPath+tlsCrlFiles is [0, 4096]. The actual path is the combined result of the project path and tlsCrlPath.

Path of the service certificate CRL directory on the service plane. Only the relative path under the software package installation path is supported.

  • This parameter takes effect when httpsEnabled is set to true. This parameter is optional. The default value is security/certs/.
  • If httpsEnabled is false, the CRL is disabled.

tlsCrlFiles

std::set<std::string>

The path length range of tlsCrlPath+tlsCrlFiles is [1, 4096]. The minimum number of elements in the list is 1 and the maximum number is 3.

CRL name list on the service plane.

If httpsEnabled is set to true, this parameter is optional. The default value is ["server_crl.pem"].

managementTlsCaFile

std::set<std::string>

It is recommended that the length range of tlsCaPath+managementTlsCaFile be [0, 4096]. The minimum number of elements in the list is 1 and the maximum number is 3.

Name list of root certificates for internal APIs. The certificates of the internal APIs and service plane are stored in the same path (tlsCaPath).

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress, and is mandatory in this case. The default value is ["management_ca.pem"].

managementTlsCert

std::string

The length range of the file path is [1, 4096]. The actual path is the combined result of the project path and managementTlsCert.

Path of the service certificate file for internal APIs. Only the relative path under the software package installation path is supported.

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress, and is mandatory in this case. The default value is security/certs/management/server.pem.

managementTlsPk

std::string

The length range of the file path is [1, 4096]. The actual path is the combined result of the project path and managementTlsPk.

Path of the private key file of the service certificate for internal APIs. Only the relative path under the software package installation path is supported. The length of the private key must be greater than or equal to 3072.

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress, and is mandatory in this case. The default value is security/keys/management/server.key.pem.

managementTlsPkPwd

std::string

The length range of the file path is [0, 4096]. The actual path is the combined result of the project path and managementTlsPkPwd.

Path of the file that stores the encryption password for the private key of the service certificate for internal APIs.

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress. This parameter is optional. The default value is security/pass/management/key_pwd.txt.

If the private key is encrypted but this file is not provided, the system prompts you to enter the private key encryption password in the interactive window when the system is started.

managementTlsCrlPath

std::string

The path length range of managementTlsCrlPath+managementTlsCrlFiles is [1, 4096]. The actual path is the combined result of the project path and managementTlsCrlPath.

Path of the certificate CRL folder for internal APIs. Only the relative path under the software package installation path is supported.

  • This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress. This parameter is optional. The default value is security/management/certs/.
  • If httpsEnabled is false, the CRL is disabled.

managementTlsCrlFiles

std::set<std::string>

The path length range of managementTlsCrlPath+managementTlsCrlFiles is [1, 4096]. The minimum number of elements in the list is 1 and the maximum number is 3.

CRL name list for internal APIs.

If httpsEnabled is set to true, this parameter is optional. The default value is ["server_crl.pem"].

metricsTlsCaFile

std::set<std::string>

It is recommended that the length range of tlsCaPath+metricsTlsCaFile be [0, 4096]. The minimum number of elements in the list is 1 and the maximum number is 3.

Name list of root certificates for internal APIs. The certificates of the internal APIs and service plane are stored in the same path (tlsCaPath).

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress, and is mandatory in this case. The default value is ["metrics_ca.pem"].

metricsTlsCert

std::string

The length range of the file path is [1, 4096]. The actual path is the combined result of the project path and metricsTlsCert.

Path of the service certificate file for internal APIs. Only the relative path under the software package installation path is supported.

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress, and is mandatory in this case. The default value is security/certs/metrics/server.pem.

metricsTlsPk

std::string

The length range of the file path is [1, 4096]. The actual path is the combined result of the project path and metricsTlsPk.

Path of the private key file of the service certificate for internal APIs. Only the relative path under the software package installation path is supported. The length of the private key must be greater than or equal to 3072.

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress, and is mandatory in this case. The default value is security/keys/metrics/server.key.pem.

metricsTlsPkPwd

std::string

The length range of the file path is [0, 4096]. The actual path is the combined result of the project path and metricsTlsPkPwd.

Path of the file that stores the encryption password for the private key of the service certificate for internal APIs.

This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress. This parameter is optional. The default value is security/pass/metrics/key_pwd.txt.

If the private key is encrypted but this file is not provided, the system prompts you to enter the private key encryption password in the interactive window when the system is started.

metricsTlsCrlPath

std::string

The path length range of metricsTlsCrlPath+metricsTlsCrlFiles is [1, 4096]. The actual path is the combined result of the project path and metricsTlsCrlPath.

Path of the certificate CRL folder for internal APIs. Only the relative path under the software package installation path is supported.

  • This parameter takes effect when httpsEnabled = true and ipAddress != managementIpAddress. This parameter is optional. The default value is security/metrics/certs/.
  • If httpsEnabled is false, the CRL is disabled.

metricsTlsCrlFiles

std::set<std::string>

The path length range of metricsTlsCrlPath+metricsTlsCrlFiles is [1, 4096]. The minimum number of elements in the list is 1 and the maximum number is 3.

CRL name list for internal APIs.

If httpsEnabled is set to true, this parameter is optional. The default value is ["server_crl.pem"].

kmcKsfMaster

std::string

The length range of the file path is [1, 4096]. The actual path is the combined result of the project path and kmcKsfMaster.

Path of the KMC keystore file. Only the relative path under the software package installation path is supported.

This parameter takes effect when httpsEnabled is set to true and is mandatory in this case. The default value is tools/pmt/master/ksfa.

kmcKsfStandby

std::string

The length range of the file path is [1, 4096]. The actual path is the combined result of the project path and kmcKsfStandby.

Path of the KMC keystore backup file. Only the relative path under the software package installation path is supported.

This parameter takes effect when httpsEnabled is set to true and is mandatory in this case. The default value is tools/pmt/standby/ksfb.

inferMode

std::string

  • standard
  • dmi

Mandatory. The default value is standard.

Whether prefill-decode disaggregation is enabled.

  • standard: prefill-decode mixed mode
  • dmi: prefill-decode disaggregation

interCommTLSEnabled

Bool

  • true
  • false

Optional. The default value is true. You need to configure related certificates.

Whether to enable TLS for communication between instances in a cluster.

  • true: enable
  • false: disable

If the value is false or inferMode is standard, the parameters related to communication between instances in a cluster are ignored.

interCommPort

uint16_t

[1024, 65535]

Optional. The default value is 1121.

Communication port between instances in a cluster.

interCommTlsCaPath

std::string

The path length of interCommTlsCaPath+interCommTlsCaFiles depends on the OS configuration (PATH_MAX for Linux). The actual path is the combined result of the project path and interCommTlsCaPath.

Optional. The default value is security/grpc/ca/.

If TLS is enabled for communication between instances in a cluster, this parameter is used to specify the path of the CA file.

interCommTlsCaFiles

std::set<std::string>

The path length of interCommTlsCaPath+interCommTlsCaFiles depends on the OS configuration (PATH_MAX for Linux). The actual path is the combined result of the project path and interCommTlsCaFiles.

Optional. The default value is ["ca.pem"].

If TLS is enabled for communication between instances in a cluster, this parameter is used to specify the name of the CA file.

interCommTlsCert

std::string

The path length depends on the OS configuration (PATH_MAX for Linux). The actual path is the combined result of the project path and interCommTlsCert.

Optional. The default value is security/grpc/certs/server.pem.

If TLS is enabled for communication between instances in a cluster, the file specified by this parameter is used as the certificate.

interCommPk

std::string

The path length depends on the OS configuration (PATH_MAX for Linux). The actual path is the combined result of the project path and interCommPk.

Optional. The default value is security/grpc/keys/server.key.pem.

If TLS is enabled for communication between instances in a cluster, the file specified by this parameter is used as the private key.

interCommPkPwd

std::string

The path length depends on the OS configuration (PATH_MAX for Linux). The actual path is the combined result of the project path and interCommPkPwd.

Optional. The default value is security/grpc/pass/key_pwd.txt.

If TLS is enabled for communication between instances in a cluster, the file specified by this parameter is used as the private key password.

interCommTlsCrlPath

std::string

The path length of interCommTlsCrlPath+interCommTlsCrlFiles depends on the OS configuration (PATH_MAX for Linux). The actual path is the combined result of the project path and interCommTlsCrlPath.

Optional. The default value is security/grpc/certs/.

If TLS is enabled for communication between instances in a cluster, use this parameter to specify the path of the CRL file.

interCommTlsCrlFiles

std::set<std::string>

The path length of interCommTlsCrlPath+interCommTlsCrlFiles depends on the OS configuration (PATH_MAX for Linux). The actual path is the combined result of the project path and interCommTlsCrlFiles.

Optional. The default value is ["server_crl.pem"].

If TLS is enabled for communication between instances in a cluster, use this parameter to specify the name of the CRL file.

openAiSupport

std::string

String

Optional. The default value is vllm.

Which OpenAI interface version the /v1/chat/completions interface uses.

  • If the value is vllm or the field is missing, the /v1/chat/completions interface uses the OpenAI interface version compatible with vLLM.
  • If the value is any other string, the /v1/chat/completions interface uses the native OpenAI interface version.

This parameter supports hot update.
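
The selection rule for openAiSupport can be sketched as follows. The function name and the two returned labels are hypothetical and serve only to make the rule concrete.

```python
from typing import Mapping


def chat_completions_flavor(server_config: Mapping[str, object]) -> str:
    """Pick the /v1/chat/completions interface version from openAiSupport."""
    # A missing field behaves like the default value "vllm".
    value = server_config.get("openAiSupport", "vllm")
    return "vllm-compatible" if value == "vllm" else "native-openai"
```

Any string other than `vllm` selects the native OpenAI interface version in this sketch.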

tokenTimeout

uint32_t

[1, 3600]

Inference timeout interval of each token. The default value is 600, in seconds.

In the prefill-decode disaggregation scenario, this parameter must be set to the same value for prefill and decode nodes.

e2eTimeout

uint32_t

[1, 65535]

End-to-end (from request receiving to inference completion) timeout interval. The default value is 600, in seconds.

In the prefill-decode disaggregation scenario, this parameter must be set to the same value for prefill and decode nodes.

maxRequestLength

uint32_t

[1, 100]

Optional. The default value is 40, in MB.

Maximum size of the request body.

maxJsonDepth

uint32_t

[10, 100]

Optional. The default value is 10.

The maximum nesting depth of JSON in the input request.

distDPServerEnabled

Bool

  • true
  • false

Mandatory. The default value is false.

Whether to enable distributed deployment. This parameter takes effect only in MoE EP scenarios.

  • Disabling HTTPS communication (httpsEnabled = false) in an insecure network environment poses high network security risks.
  • If the compute node that runs the inference service is connected to both the WAN and the LAN, binding to 0.0.0.0 can break network isolation and introduce significant security vulnerabilities. Therefore, the EndPoint IP address cannot be bound to 0.0.0.0 by default. If you still need to use 0.0.0.0, ensure that the environment can protect an all-zero listening address and set allowAllZeroIpListening to true to explicitly allow it. You bear the security risks of enabling all-zero listening.
  • In addition to the tokenTimeout and e2eTimeout parameters in the configuration file, the time parameters related to inference timeout also include the timeout parameter from the client in some APIs (for example, Token Inference API). If either of the two timeout intervals is reached, a timeout occurs. When a request times out, the Server returns a timeout error to the client and terminates the inference process of the request.
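
How the timeout limits combine can be sketched as follows. This is an illustration only: the function name is hypothetical, the client-side timeout is assumed to be measured from the moment the request is received, and a limit is assumed to be reached when the elapsed time is greater than or equal to it.

```python
from typing import Optional


def request_timed_out(since_request_s: float,
                      since_last_token_s: float,
                      token_timeout_s: int = 600,
                      e2e_timeout_s: int = 600,
                      client_timeout_s: Optional[float] = None) -> bool:
    """Return True as soon as any applicable timeout limit is reached."""
    limits_reached = [
        since_last_token_s >= token_timeout_s,  # tokenTimeout (per token)
        since_request_s >= e2e_timeout_s,       # e2eTimeout (whole request)
    ]
    if client_timeout_s is not None:
        # Assumption: the client timeout covers the whole request.
        limits_reached.append(since_request_s >= client_timeout_s)
    return any(limits_reached)
```

The defaults mirror the documented default of 600 seconds for both tokenTimeout and e2eTimeout. A shorter client-side timeout takes effect first.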

Parameters in BackendConfig

Parameter

Value Type

Value Range

Description

backendName

std::string

The value is a string of 1 to 50 characters and can contain only lowercase letters and underscores (_). The value cannot start or end with an underscore (_).

Mandatory. Only mindieservice_llm_engine is supported.

Inference backend name. You can use this parameter to obtain the backend instance.

modelInstanceNumber

uint32_t

[1, 10]

Mandatory. The default value is 1.

Number of model instances.

In the single-model multi-node inference scenario, the value must be 1.

npuDeviceIds

std::vector<std::set<size_t>>

Set this parameter based on the model and environment.

Mandatory. The default value is [[0,1,2,3]].

Devices to be enabled. The NPU IDs allocated to each model instance are specified as logical processor IDs.

  • If ASCEND_RT_VISIBLE_DEVICES is not configured, you can run the npu-smi info -m command to query the logical ID of each device.
  • If ASCEND_RT_VISIBLE_DEVICES is configured, the logical IDs of visible devices start from 0 based on the sequence configured in ASCEND_RT_VISIBLE_DEVICES.

    For example:

    ASCEND_RT_VISIBLE_DEVICES=1,2,3,4

    The logical IDs of the visible devices are 0, 1, 2, and 3 in sequence.

    This parameter is invalid in multi-node inference scenarios. The value of npuDeviceIds used on each node is calculated based on ranktable.
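
The renumbering that happens when ASCEND_RT_VISIBLE_DEVICES is configured can be sketched as follows. The function name is hypothetical; the sketch only reproduces the rule that visible devices receive logical IDs starting from 0 in the configured order.

```python
def logical_ids(visible_devices: str) -> dict:
    """Map each device listed in ASCEND_RT_VISIBLE_DEVICES to its logical ID."""
    devices = [int(d) for d in visible_devices.split(",") if d.strip()]
    # Logical IDs start from 0 and follow the configured order.
    return {device: logical for logical, device in enumerate(devices)}


mapping = logical_ids("1,2,3,4")
# One model instance that uses all four visible devices.
npu_device_ids = [sorted(mapping.values())]
```

For `ASCEND_RT_VISIBLE_DEVICES=1,2,3,4`, the resulting `npu_device_ids` is `[[0, 1, 2, 3]]`, which matches the default value of npuDeviceIds.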

tokenizerProcessNumber

uint32_t

[1, 32]

Mandatory. The default value is 8.

Number of tokenizer processes.

When there are many CPU cores, you can increase the value to improve tokenizer performance.

multiNodesInferEnabled

Bool

  • true
  • false

Optional. The default value is false.

  • false: single-node inference
  • true: multi-node inference

multiNodesInferPort

int32_t

[1024, 65535]

Optional. The default value is 1120.

Port number for cross-machine communication, which is used in multi-node inference scenarios.

interNodeTLSEnabled

Bool

  • true
  • false

Optional. The default value is true. If this parameter is set to false, subsequent parameters can be ignored.

Whether to enable certificate security authentication for cross-machine communication during multi-node inference.

  • true: enables certificate security authentication.
  • false: disables certificate security authentication.

interNodeTlsCaPath

std::string

It is recommended that the path length of interNodeTlsCaPath+interNodeTlsCaFiles be less than or equal to 4096. The actual path is the combined result of the project path and interNodeTlsCaPath. The upper limit depends on the operating system. The minimum length is 1.

Path of the root certificate directory. Only the relative path under the software package installation path is supported.

This parameter takes effect when interNodeTLSEnabled is set to true and is mandatory in this case. The default value is security/grpc/ca/.

interNodeTlsCaFiles

std::set<std::string>

It is recommended that the path length of interNodeTlsCaPath+interNodeTlsCaFiles be less than or equal to 4096. The actual path is the combined result of the project path and interNodeTlsCaFiles. The upper limit depends on the operating system. The minimum length is 1.

Root certificate name list.

This parameter takes effect when interNodeTLSEnabled is set to true and is mandatory in this case. The default value is ["ca.pem"].

interNodeTlsCert

std::string

It is recommended that the length of the file path be less than or equal to 4096. The actual path is the combined result of the project path and interNodeTlsCert. The upper limit depends on the operating system. The minimum length is 1.

Path of the service certificate file. Only the relative path under the software package installation path is supported.

This parameter takes effect when interNodeTLSEnabled is set to true and is mandatory in this case. The default value is security/grpc/certs/server.pem.

interNodeTlsPk

std::string

It is recommended that the length of the file path be less than or equal to 4096. The actual path is the combined result of the project path and interNodeTlsPk. The upper limit depends on the operating system. The minimum length is 1.

Path of the private key file of the service certificate. Only the relative path under the software package installation path is supported.

This parameter takes effect when interNodeTLSEnabled is set to true and is mandatory in this case. The default value is security/grpc/keys/server.key.pem.

interNodeTlsPkPwd

std::string

It is recommended that the length of the file path be less than or equal to 4096. This parameter can be left empty. If this parameter is not left empty, the actual path is the combined result of the project path and interNodeTlsPkPwd. The upper limit depends on the operating system. The minimum length is 1.

Path of the file that stores the encryption password for the private key of the service certificate. Only the relative path under the software package installation path is supported.

This parameter takes effect when interNodeTLSEnabled is set to true and is mandatory in this case. The default value is security/grpc/pass/mindie_server_key_pwd.txt.

interNodeTlsCrlPath

std::string

It is recommended that the path length of interNodeTlsCrlPath+interNodeTlsCrlFiles be less than or equal to 4096. The actual path is the combined result of the project path and interNodeTlsCrlPath. The upper limit depends on the operating system. The minimum length is 1.

Optional. The default value is security/grpc/certs/.

Path of the service certificate CRL directory. This parameter takes effect when interNodeTLSEnabled is set to true.

interNodeTlsCrlFiles

std::set<std::string>

It is recommended that the path length of interNodeTlsCrlPath+interNodeTlsCrlFiles be less than or equal to 4096. The actual path is the combined result of the project path and interNodeTlsCrlFiles. The upper limit depends on the operating system. The minimum length is 1.

Optional. The default value is ["server_crl.pem"].

Service certificate CRL. This parameter takes effect when interNodeTLSEnabled is set to true.

interNodeKmcKsfMaster

std::string

It is recommended that the length of the file path be less than or equal to 4096. The actual path is the combined result of the project path and interNodeKmcKsfMaster. The upper limit depends on the operating system. The minimum length is 1.

Path of the KMC keystore file. Only the relative path under the software package installation path is supported.

This parameter takes effect when interNodeTLSEnabled is set to true and is mandatory in this case. The default value is tools/pmt/master/ksfa.

interNodeKmcKsfStandby

std::string

It is recommended that the length of the file path be less than or equal to 4096. The actual path is the combined result of the project path and interNodeKmcKsfStandby. The upper limit depends on the operating system. The minimum length is 1.

Path of the KMC keystore backup file. Only the relative path under the software package installation path is supported.

This parameter takes effect when interNodeTLSEnabled is set to true and is mandatory in this case. The default value is tools/pmt/standby/ksfb.

ModelDeployConfig

map

-

Configurations for model deployment. For details, see Parameters in ModelDeployConfig.

ScheduleConfig

map

-

Scheduling configurations. For details, see Parameters in ScheduleConfig.

Parameters in ModelDeployConfig

Parameter

Value Type

Value Range

Description

maxSeqLen

uint32_t

The upper limit is determined by the graphics memory and user requirements. The minimum value must be greater than 0.

Mandatory. The default value is 2560.

Maximum sequence length. Select a proper value for maxSeqLen based on the inference scenario.

If maxSeqLen is greater than the maximum sequence length supported by the model, the inference accuracy may be affected.

maxInputTokenLen

uint32_t

[1, 4194304]

Mandatory. The default value is 2048.

Maximum length of the input token ID.

maxInputTokenLen = min(maxInputTokenLen, maxSeqLen – 1)

  • truncation = true:

    inputLen in a request is automatically truncated, and the actual input length is calculated by the following formula: inputLen = min(inputLen, maxInputTokenLen).

  • truncation = false:

    If inputLen is greater than maxInputTokenLen, an error is returned.

truncation

Bool

  • true
  • false

Optional. The default value is false.

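Whether over-length input is truncated instead of being rejected by the parameter validity check.

  • false: the parameter validity check is applied, and over-length input is rejected.
  • true: the parameter validity check is skipped, and over-length input is truncated.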

maxInputTokenLen = min(maxInputTokenLen, maxSeqLen – 1)

  • truncation = true:

    inputLen in a request is automatically truncated, and the actual input length is calculated by the following formula: inputLen = min(inputLen, maxInputTokenLen).

  • truncation = false:

    If inputLen is greater than maxInputTokenLen, an error is returned.
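
The interaction between maxSeqLen, maxInputTokenLen, and truncation can be sketched as follows. The function names are hypothetical, and the sketch works on lengths only because the text does not state which end of the input is cut.

```python
def effective_max_input_len(max_input_token_len: int, max_seq_len: int) -> int:
    """maxInputTokenLen = min(maxInputTokenLen, maxSeqLen - 1)."""
    return min(max_input_token_len, max_seq_len - 1)


def check_input_len(input_len: int,
                    max_input_token_len: int = 2048,
                    max_seq_len: int = 2560,
                    truncation: bool = False) -> int:
    """Return the input length that is actually used, or raise if rejected."""
    limit = effective_max_input_len(max_input_token_len, max_seq_len)
    if input_len <= limit:
        return input_len
    if truncation:
        return limit  # inputLen = min(inputLen, maxInputTokenLen)
    raise ValueError(f"inputLen {input_len} exceeds maxInputTokenLen {limit}")
```

With the default values, a 3000-token input is cut to 2048 tokens when truncation is true and is rejected with an error when truncation is false.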

ModelConfig

map

-

Model configurations, including postprocessing parameters. For details, see Parameters in ModelConfig.

Parameters in ModelConfig

Parameter

Value Type

Value Range

Description

modelInstanceType

std::string

  • "Standard"
  • "StandardMock"

Optional. The default value is Standard.

Model type.

  • Standard: standard inference
  • StandardMock: fake model (In this mode, the model is not loaded and only the server runs.)

modelName

String

The value can contain a maximum of 256 characters, including uppercase letters, lowercase letters, digits, hyphens (-), periods (.), and underscores (_). It cannot start or end with a hyphen (-), period (.), or underscore (_).

Mandatory. The default value is llama_65b.

Model name.
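
The naming rule for modelName can be expressed as a regular expression. This is a sketch rather than the Server's actual validation: the helper name is hypothetical and a minimum length of 1 is assumed.

```python
import re

# First and last characters must be a letter or digit; the middle may also
# contain hyphens, periods, and underscores. Total length is at most 256.
_MODEL_NAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]{0,254}[A-Za-z0-9])?")


def is_valid_model_name(name: str) -> bool:
    """Return True if the name follows the documented modelName rule."""
    return _MODEL_NAME_RE.fullmatch(name) is not None
```

The default value `llama_65b` passes this check, whereas `_llama` and `llama.` do not.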

modelWeightPath

std::string

The maximum length of an absolute file path depends on the setting of the operating system (PATH_MAX in Linux). The minimum value is 1.

Mandatory. The default value is /data/atb_testdata/weights/llama1-65b-safetensors.

Path of the model weight file. The program reads the values of torch_dtype and vocab_size in the config.json file in the path. Ensure that the path and related fields exist.

Security verification is performed on the path. The owner group and permission of the path must be the same as those of the execution user.

worldSize

uint32_t

Set this parameter based on the actual situation of the model. The value of worldSize in each set of model parameters must be the same as the number of NPUs in use.

Mandatory. The default value is 4.

Number of inference cards to be used.

  • This parameter is invalid in distributed multi-node inference scenarios. The value of worldSize is calculated based on ranktable.
  • In prefill-decode disaggregation inference scenarios, the value must be the same as the number of cards set for role delivery.

cpuMemSize

uint32_t

The upper limit is determined by the graphics memory and user requirements. This parameter can be set to 0 only when maxPreemptCount is set to 0.

Mandatory. The default and recommended value is 5 GB.

Maximum size of the KV cache that can be allocated on a CPU.

npuMemSize

int32_t

  • -1
  • Integer in the range of (0, 2147483647]

Mandatory. The default and recommended value is -1. The unit is GB.

Maximum size of the KV cache that can be allocated on an NPU.

  • Automatic allocation of KV cache: If the value is -1, the KV cache is automatically allocated based on the available graphics memory.

    Formula for calculating the KV cache: npuMemSize = Total memory of a single NPU × Memory allocation ratio - Memory occupied by weights of a single NPU - Memory occupied by variables during running - Memory occupied by the system.

    • Total memory of a single NPU: Run the npu-smi info command to view the total graphics memory.
    • Memory allocation ratio: The default value is 0.8, which can be controlled by the environment variable NPU_MEMORY_FRACTION. When OOM occurs during weight loading, you can increase the allocation ratio or use more NPUs for inference.
    • Memory occupied by weights of a single NPU: Weight size × Type size (2 for the floating-point type; 1 for the int8 type)/Number of NPUs. Refer to the actual weights.
    • Memory occupied by variables during running refers to the memory occupied by model input variables, output variables, intermediate variables, and other variables.
    • Memory occupied by the system: You can run the npu-smi info command to view the graphics memory used in the static state.
  • Manual allocation of the KV cache: If the value is greater than 0, the KV cache size is fixed based on the configured value.
  • In the current version, some performance optimization algorithms may increase the device memory usage. If you have set npuMemSize to a fixed value in an earlier version and OOM occurs during service running after the version is updated, you are advised to change the value of npuMemSize to -1 or a smaller value.
NOTE:
  • For a multimodal model, npuMemSize cannot be set to -1 because space needs to be reserved for ViT. You can calculate the value of npuMemSize based on the following formula and round up the result: 4 × num_hidden_layers × num_key_value_heads × (hidden_size/num_attention_heads) × (maxPrefillBatchSize × maxSeqLen)/worldSize/(1024 × 1024 × 1024). In the formula, num_hidden_layers, num_key_value_heads, hidden_size, and num_attention_heads are parameters in the config.json configuration file in the weight path.
  • When backendType is set to ms, npuMemSize=-1 supports only ParallelLlamaForCausalLM in prefill-decode mixed deployment.
  • A formula is provided to quickly determine the optimal value range of the graphics memory. The result obtained using the formula is for reference only. To achieve the best performance, you can increase the value and perform a performance stress test.
  • If the value of npuMemSize exceeds the maximum graphics memory that can be allocated by the system, exceptions such as the inference service startup failure or suspension may occur. In this case, you need to decrease the value and try again.
  • In the prefill-decode disaggregation deployment scenario, this parameter can be set to -1 only when backendType is set to atb.
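The two npuMemSize calculations described above can be sketched in Python. This is an illustration only. The function names are hypothetical, the automatic estimate assumes that the weight, runtime-variable, and system terms are subtracted from the allocatable memory, and all sample values are made up rather than measured.

```python
import math


def estimate_auto_kv_cache_gb(total_gb, weights_gb, runtime_gb, system_gb,
                              memory_fraction=0.8):
    """Rough per-NPU KV cache budget when npuMemSize is -1 (values in GB).

    memory_fraction mirrors the default allocation ratio of 0.8
    (NPU_MEMORY_FRACTION in the text above).
    """
    return total_gb * memory_fraction - weights_gb - runtime_gb - system_gb


def multimodal_npu_mem_size_gb(num_hidden_layers, num_key_value_heads,
                               hidden_size, num_attention_heads,
                               max_prefill_batch_size, max_seq_len, world_size):
    """npuMemSize for a multimodal model, rounded up to a whole number of GB."""
    head_size = hidden_size / num_attention_heads
    numerator = (4 * num_hidden_layers * num_key_value_heads * head_size
                 * max_prefill_batch_size * max_seq_len)
    return math.ceil(numerator / world_size / (1024 * 1024 * 1024))


# Hypothetical model: 32 layers, 8 KV heads, hidden size 4096, 32 heads.
print(multimodal_npu_mem_size_gb(32, 8, 4096, 32,
                                 max_prefill_batch_size=50,
                                 max_seq_len=2560, world_size=4))
```

The automatic estimate is only a starting point. As noted above, the result is for reference and should be confirmed with a stress test.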

backendType

std::string

  • "atb"
  • "ms"

Mandatory. The default value is atb.

Backend type.

  • atb: ATB acceleration library
  • ms: MindSpore
NOTE:

If ms is selected as the inference engine backend, you need to install MindSpore and MindFormers and modify the MindIE startup configuration in advance.

trustRemoteCode

Bool

  • true
  • false

Optional. The default value is false.

Whether to trust remote code.

  • false: Remote code is not trusted.
  • true: Remote code is trusted.
NOTE:

If this parameter is set to true, remote code is trusted, which introduces the risk of malicious code injection. You are responsible for ensuring that the remote code is safe.

async_scheduler_wait_time

int32_t

Integer in the range of [1, 3600]

Optional. The default value is 120, in seconds.

Waiting time for asynchronous scheduling, which can be configured when the asynchronous scheduling function is enabled.

kv_trans_timeout

int32_t

The upper limit is determined by the graphics memory and user requirements. If the value is less than or equal to 0, it is automatically changed to 1.

Timeout interval for the decode node to pull the KV cache from the prefill node in the prefill-decode disaggregation scenario. This parameter needs to be set only on the decode node. The default value is 10, in seconds.

  • This parameter is used only in the prefill-decode disaggregation scenario. In other scenarios, this parameter does not take effect.
  • It is recommended that the value be greater than the number of network packet retransmissions multiplied by the timeout interval for each retransmission.
  • When setting this parameter, pay attention to environment variables HCCL_RDMA_RETRY_CNT and HCCL_RDMA_TIMEOUT. For details, see "Cluster Service Deployment" > "Prefill-Decode Disaggregation" > "Installation and Deployment" > "Deploying a Single-Node Prefill-Decode Disaggregation Service Using kubectl" in MindIE Motor Development Guide.

kv_link_timeout

int32_t

The upper limit is determined by the graphics memory and user requirements. If the value is less than or equal to 0, it is automatically changed to the default value 1080.

Timeout interval for establishing a communicator for KV cache transmission in the prefill-decode disaggregation scenario. If the communicator fails to be established, the system retries until it succeeds or the timeout interval expires. The default and recommended value is 1080 seconds.

  • This parameter is used only in the prefill-decode disaggregation scenario. In other scenarios, this parameter does not take effect.
  • If there is no network issue, you do not need to change the default value. If the cluster scale is small and the communicator fails to be established due to a network fault, you can reduce the timeout interval for quick debugging.

Parameters in ScheduleConfig

Due to the adjustment of the recomputation scheduling policy, the performance of different versions may fluctuate under the same scheduling parameters. For details about how to obtain the optimal performance, see Performance Tuning.

Parameter

Value Type

Value Range

Description

templateType

std::string

  • "Standard"
  • "Mix"

Mandatory. The default value is Standard.

Inference type.

  • Standard: In the prefill-decode mixed deployment scenario, the prefill and decode requests are grouped in different batches.
  • Mix: used with the SplitFuse feature. Prefill and decode requests can be grouped into the same batch.

This field does not take effect in prefill-decode disaggregation scenarios.

templateName

std::string

The value can only be Standard_LLM.

Mandatory. The default value is Standard_LLM.

Name of a scheduling workflow.

cacheBlockSize

uint32_t

[1, 128]

Size of a KV cache block, in tokens.

Mandatory. The default and recommended value is 128. Any other value must be a power of 2.

maxPrefillBatchSize

uint32_t

[1, maxBatchSize]

Mandatory. The default value is 50.

Maximum prefill batch size. A prefill batch is formed as soon as either the maxPrefillBatchSize limit or the maxPrefillTokens limit is reached.

This parameter is used when the batch size in the prefill phase needs to be limited. If the batch size does not need to be limited, you can set this parameter to 0 (the engine uses the value of maxBatchSize by default) or a value the same as that of maxBatchSize.

maxPrefillTokens

uint32_t

[1, 4194304]. The value must be greater than or equal to the value of maxInputTokenLen.

Mandatory. The default value is 8192.

During each prefill, the total number of input tokens in the current batch cannot exceed the value of maxPrefillTokens. A prefill batch is formed as soon as either the maxPrefillTokens limit or the maxPrefillBatchSize limit is reached.

You are advised not to set this parameter to a large value. If the graphics memory overflows, you can set this parameter to a smaller value.

prefillTimeMsPerReq

uint32_t

[0, 1000]

Mandatory. The default value is 150.

This parameter is used together with decodeTimeMsPerReq to determine whether prefill or decode should be selected for the next inference. This parameter is valid only when supportSelectBatch is set to true. The unit is ms. For details about the scheduling policy process, see Figure 1.

  • In the prefill-decode mixed deployment scenario, the following parameters are calculated to determine whether prefill or decode is selected for the next inference:
    • prefillWaitTime = prefillTimeMsPerReq * decodeReqNum: waiting time of a decode operation if prefill is selected.
    • accumulatedDecodeWasteTime = accumulatedDecodeWasteTime + decodeTimeMsPerReq * (maxBatchSize - decodeReqNum): time wasted for multiple consecutive decode operations.

    Compare the calculation results to determine the next inference operation:

    • prefillWaitTime > accumulatedDecodeWasteTime: Too many decoding requests are stacked, and the next inference phase is decode.
    • prefillWaitTime ≤ accumulatedDecodeWasteTime: The time wasted for multiple consecutive decode operations is too long, and the next inference phase is prefill.
  • This parameter is invalid in the prefill-decode disaggregation scenario. If the current instance is a prefill instance, prefill computing is performed first. If the current instance is a decode instance, decode computing is performed first.
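The comparison used in the mixed deployment scenario can be sketched in Python. This is a simplified model of the decision described above, not the scheduler's code. The function name choose_next_phase is hypothetical, and the text does not say when the accumulated value is reset, so the sketch simply returns the updated accumulator to the caller.

```python
def choose_next_phase(prefill_time_ms_per_req, decode_time_ms_per_req,
                      max_batch_size, decode_req_num,
                      accumulated_decode_waste_time):
    """Return ("prefill" or "decode", updated accumulatedDecodeWasteTime)."""
    # Waiting time imposed on the decode requests if prefill runs next.
    prefill_wait_time = prefill_time_ms_per_req * decode_req_num
    # Time wasted by running decode with a partially filled batch.
    accumulated_decode_waste_time += (
        decode_time_ms_per_req * (max_batch_size - decode_req_num))
    if prefill_wait_time > accumulated_decode_waste_time:
        return "decode", accumulated_decode_waste_time
    return "prefill", accumulated_decode_waste_time


# Default values from the table (150 ms, 50 ms, maxBatchSize 200) with a
# hypothetical load of 100 running decode requests.
waste = 0
for _ in range(3):
    phase, waste = choose_next_phase(150, 50, 200, 100, waste)
    print(phase, waste)
```

With these sample numbers, decode is chosen while the accumulated waste stays below the 15000 ms prefill waiting time, and prefill is chosen once the two values are equal.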

prefillPolicyType

uint32_t

0

Mandatory. The default value is 0.

Scheduling policy in the prefill phase. For details about the scheduling policy process, see Figure 2.

0: FCFS (first-come first-served)

decodeTimeMsPerReq

uint32_t

[0, 1000]

Mandatory. The default value is 50.

This parameter is used together with prefillTimeMsPerReq to determine whether prefill or decode should be selected for the next inference. This parameter is valid only when supportSelectBatch is set to true. The unit is ms. For details about the scheduling policy process, see Figure 1.

  • In the prefill-decode mixed deployment scenario, the following parameters are calculated to determine whether prefill or decode is selected for the next inference:
    • prefillWaitTime = prefillTimeMsPerReq * decodeReqNum: waiting time of a decode operation if prefill is selected.
    • accumulatedDecodeWasteTime = accumulatedDecodeWasteTime + decodeTimeMsPerReq * (maxBatchSize - decodeReqNum): time wasted for multiple consecutive decode operations.

    Compare the calculation results to determine the next inference operation:

    • prefillWaitTime > accumulatedDecodeWasteTime: Too many decoding requests are stacked, and the next inference phase is decode.
    • prefillWaitTime ≤ accumulatedDecodeWasteTime: The time wasted for multiple consecutive decode operations is too long, and the next inference phase is prefill.
  • This parameter is invalid in the prefill-decode disaggregation scenario. If the current instance is a prefill instance, prefill computing is performed first. If the current instance is a decode instance, decode computing is performed first.

decodePolicyType

uint32_t

0

Mandatory. The default value is 0.

Scheduling policy in the decode phase. For details about the scheduling policy process, see Figure 2.

0: FCFS (first-come first-served)

maxBatchSize

uint32_t

[1, 5000]. The value must be greater than or equal to the value of maxPreemptCount.

NOTE:

For the Atlas 300I Duo inference card, the value range is [1, 2000].

Mandatory. The default value is 200.

Maximum decode batch size. A suitable value can be estimated as follows:

  1. Calculate block_num: Total Block Num = Floor(NPU memory/(Number of model layers × cacheBlockSize × Number of model attention heads × Attention head size × Number of cache bytes × Number of caches)). The number of caches is 2. In tensor parallel mode, block_num × world_size is the actual number of allocated blocks.

    If there are multiple cards, the value of Number of model attention heads × Attention head size in the formula needs to be evenly distributed to each card, that is, Number of model attention heads × Attention head size/Number of cards.

    In the formula, Floor indicates that the calculation result is rounded down.

  2. Calculate the number of blocks allocated for each request: Block Num = Ceil(Number of input tokens/cacheBlockSize) + Ceil(Maximum number of output tokens/cacheBlockSize). Number of input tokens is the number of token IDs obtained after the input string is tokenized. Maximum number of output tokens is the smaller of the maximum number of iterations for model inference and the maximum output length.

    In the formula, Ceil indicates that the calculation result is rounded up.

  3. maxBatchSize = Total Block Num/Block Num
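The three steps above can be combined into a short Python sketch. It is an estimate under stated assumptions, not the engine's allocator: the function name is hypothetical, NPU memory is assumed to be given in bytes, the final division is rounded down, and the sample model dimensions are made up.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def estimate_max_batch_size(npu_mem_bytes, num_layers, cache_block_size,
                            num_heads, head_size, cache_dtype_bytes,
                            input_tokens, max_output_tokens, num_cards=1):
    """Estimate maxBatchSize from the KV cache block budget."""
    num_caches = 2  # fixed at 2 per step 1
    # With multiple cards, heads × head size is split evenly across cards.
    hidden_per_card = num_heads * head_size // num_cards
    bytes_per_block = (num_layers * cache_block_size * hidden_per_card
                       * cache_dtype_bytes * num_caches)
    # Step 1: total number of blocks (Floor).
    total_block_num = npu_mem_bytes // bytes_per_block
    # Step 2: blocks needed by one request (Ceil for input and output).
    block_num = (ceil_div(input_tokens, cache_block_size)
                 + ceil_div(max_output_tokens, cache_block_size))
    # Step 3: how many such requests fit at once (rounded down here).
    return total_block_num // block_num


# Hypothetical model: 32 layers, 32 heads of size 128, 2-byte cache type,
# 20 GiB of KV cache memory, 1024 input tokens, and 512 output tokens.
print(estimate_max_batch_size(20 * 1024**3, 32, 128, 32, 128, 2, 1024, 512))
```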

maxIterTimes

uint32_t

[1, maxSeqLen]

Mandatory. The default value is 512.

Global maximum output length of the model.

  • Maximum output length of a request: maxOutputLen = min(maxIterTimes, max_tokens) or maxOutputLen = min(maxIterTimes, max_new_tokens)
  • Actual output length of a request: outputLen = min(maxSeqLen – inputLen, maxOutputLen)
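The two output-length formulas can be expressed as a small Python helper. The function name resolve_output_len is hypothetical. Treating a request that specifies neither max_tokens nor max_new_tokens as limited only by maxIterTimes is an assumption of this sketch.

```python
def resolve_output_len(max_iter_times, max_seq_len, input_len, max_tokens=None):
    """Return the actual output length of a request.

    max_tokens stands for the request-level max_tokens or max_new_tokens.
    """
    # maxOutputLen = min(maxIterTimes, max_tokens)
    if max_tokens is None:
        max_output_len = max_iter_times
    else:
        max_output_len = min(max_iter_times, max_tokens)
    # outputLen = min(maxSeqLen - inputLen, maxOutputLen)
    return min(max_seq_len - input_len, max_output_len)


# Hypothetical request: a 2400-token input leaves room for only 160 output
# tokens when maxSeqLen is 2560, even though max_tokens asks for 1024.
print(resolve_output_len(512, 2560, 2400, max_tokens=1024))
```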

maxPreemptCount

uint32_t

[0, maxBatchSize]. If the value is greater than 0, the value of cpuMemSize cannot be 0.

Mandatory. The default value is 0.

Maximum number of requests that can be preempted in a batch, that is, the maximum number of requests that can be preempted in a round of scheduling. The maximum value is maxBatchSize. If the value is greater than 0, preemption is enabled.

supportSelectBatch

Bool

  • true
  • false

Mandatory. The default value is false.

Batch selection policy.

This field does not take effect in prefill-decode disaggregation scenarios.

  • false: Requests in the prefill phase are scheduled and executed first in each round of scheduling.
  • true: During each round of scheduling, the sequence of request scheduling and execution in the prefill and decode phases is adaptively adjusted based on the number of prefill and decode requests.

maxQueueDelayMicroseconds

uint32_t

[500, 1000000]

Mandatory. The default value is 5000.

Maximum waiting time of a request in the queue before the number of requests in the queue reaches the maximum value of maxBatchSize, maxPrefillBatchSize, or maxPrefillTokens. The unit is μs.

As long as the waiting time reaches the value of this parameter, the next inference is performed even if the number of requests does not reach the maximum value of maxBatchSize, maxPrefillBatchSize, or maxPrefillTokens.

maxFirstTokenWaitTime

uint32_t

[0, 3600000]

Optional. The default value is 2500 ms.

Maximum queuing time after a request arrives. After the waiting time reaches the value of this parameter, the current round of scheduling is allowed to preempt requests to reduce the Time to First Token (TTFT).

This field does not take effect in prefill-decode disaggregation scenarios.
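Several ScheduleConfig parameters constrain one another, so it can help to check them together before starting the service. The following Python sketch covers a subset of the ranges listed in this table. The function name check_schedule_config is hypothetical, and cpuMemSize, maxInputTokenLen, and maxSeqLen are passed in separately because they are configured outside ScheduleConfig.

```python
def check_schedule_config(schedule, cpu_mem_size, max_input_token_len,
                          max_seq_len):
    """Return a list of constraint violations (empty if none are found)."""
    errors = []
    block = schedule["cacheBlockSize"]
    if not (1 <= block <= 128) or (block & (block - 1)) != 0:
        errors.append("cacheBlockSize must be a power of 2 in [1, 128]")
    max_batch = schedule["maxBatchSize"]
    if not (1 <= max_batch <= 5000):
        errors.append("maxBatchSize must be in [1, 5000]")
    prefill_batch = schedule["maxPrefillBatchSize"]
    # 0 is accepted because the description says the engine then uses
    # the value of maxBatchSize.
    if prefill_batch != 0 and not (1 <= prefill_batch <= max_batch):
        errors.append("maxPrefillBatchSize must be in [1, maxBatchSize]")
    prefill_tokens = schedule["maxPrefillTokens"]
    if not (1 <= prefill_tokens <= 4194304):
        errors.append("maxPrefillTokens must be in [1, 4194304]")
    elif prefill_tokens < max_input_token_len:
        errors.append("maxPrefillTokens must be >= maxInputTokenLen")
    if not (1 <= schedule["maxIterTimes"] <= max_seq_len):
        errors.append("maxIterTimes must be in [1, maxSeqLen]")
    preempt = schedule["maxPreemptCount"]
    if not (0 <= preempt <= max_batch):
        errors.append("maxPreemptCount must be in [0, maxBatchSize]")
    elif preempt > 0 and cpu_mem_size == 0:
        errors.append("cpuMemSize cannot be 0 when maxPreemptCount > 0")
    if not (500 <= schedule["maxQueueDelayMicroseconds"] <= 1000000):
        errors.append("maxQueueDelayMicroseconds must be in [500, 1000000]")
    return errors


# Default values from this table; the last three arguments are sample values.
defaults = {
    "cacheBlockSize": 128, "maxBatchSize": 200, "maxPrefillBatchSize": 50,
    "maxPrefillTokens": 8192, "maxIterTimes": 512, "maxPreemptCount": 0,
    "maxQueueDelayMicroseconds": 5000,
}
print(check_schedule_config(defaults, cpu_mem_size=5,
                            max_input_token_len=2048, max_seq_len=2560))
```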

Parameters in LogConfig

Parameter

Value Type

Value Range

Description

dynamicLogLevel

String

  • critical
  • error
  • warn
  • info
  • debug

The parameter value is case insensitive.

Dynamic log level.

Optional. This parameter is left empty by default.

For details about log levels, see "Setting the Log Level" in MindIE Log Reference.

dynamicLogLevelValidHours

uint32_t

[1, 168]

Duration for which the dynamic log level remains in effect.

Mandatory. The default value is 2 hours.

The dynamic log level remains in effect from the start time specified by dynamicLogLevelValidTime until the duration specified by this parameter elapses. After that, the values of dynamicLogLevel, dynamicLogLevelValidHours, and dynamicLogLevelValidTime are automatically restored to their default values.

dynamicLogLevelValidTime

String

-

Start time of a dynamic log.

Optional. This parameter is left empty by default.

After the value of dynamicLogLevel or dynamicLogLevelValidHours is changed, the system automatically sets this parameter to the current modification time.
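The validity window of the dynamic log level can be illustrated with a short Python sketch. The function name is hypothetical, and the string format of dynamicLogLevelValidTime is not given in this table, so the sketch takes an already parsed datetime rather than guessing at the format.

```python
from datetime import datetime, timedelta


def dynamic_log_level_active(valid_time: datetime, valid_hours: int,
                             now: datetime) -> bool:
    """Check whether the dynamic log level is still in effect at `now`."""
    if not 1 <= valid_hours <= 168:
        raise ValueError("dynamicLogLevelValidHours must be in [1, 168]")
    # In effect from the start time until the configured duration elapses.
    return valid_time <= now < valid_time + timedelta(hours=valid_hours)


# Hypothetical timestamps with the default duration of 2 hours.
start = datetime(2024, 1, 1, 10, 0)
print(dynamic_log_level_active(start, 2, datetime(2024, 1, 1, 11, 59)))
print(dynamic_log_level_active(start, 2, datetime(2024, 1, 1, 12, 1)))
```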

Figure 1 Scheduling policy and execution sequence
Figure 2 Process of scheduling policies in the prefill and decode phases