ATB Models for Pure Models
Prerequisites
CANN, PyTorch, Torch-NPU, and ATB Models have been installed in the environment. For details, see MindIE Installation Guide.
The following installation path is used as an example:
Install ATB Models and initialize ATB Models environment variables. The ${ATB_SPEED_HOME_PATH} environment variable initialization is contained in the set_env.sh script of the model repository. Therefore, sourcing the set_env.sh script from the model repository also initializes the ${ATB_SPEED_HOME_PATH} environment variable.
Constraints
- If model initialization fails due to user-defined modifications when ATB Models is used for inference, manually end the process.
- When ATB Models is used for inference, ensure that other users do not have the write permission on the weight path and file.
README Document Interpretation
Currently, ATB Models provides three types of README documents to help you understand the inference process and the features supported by models, and offer basic commissioning and fault locating methods.

Document |
Description |
Content |
|---|---|---|
${ATB_SPEED_HOME_PATH}/README.md |
General entry for all ATB Models documents. |
|
${ATB_SPEED_HOME_PATH}/examples/models/{Model_name}/README.md |
Document of each model in ATB Models. For example, ${ATB_SPEED_HOME_PATH}/examples/models/llama/README.md describes the Llama and Llama 2 models and provides operation guidance. |
|
${ATB_SPEED_HOME_PATH}/examples/README.md |
Introduction to common capabilities and interfaces. |
|
Example
The following uses LLaMA3-8B as an example to describe how to perform dialog inference and performance test.
- Configure environment variables.
1 2 3 4 5 6
# Configure the CANN environment. By default, the CANN is installed in the /usr/local directory. source /usr/local/Ascend/cann/set_env.sh # Configure the ATB environment. source /usr/local/Ascend/nnal/atb/set_env.sh # Configure the model repository environment variables. source /usr/local/Ascend/atb-models/set_env.sh
- Download model weights from the Hugging Face official website and save the downloaded weight file in /data/Llama-3-8b.
- Run the following command to change the permission on the weight file:
chmod -R 755 /data/Llama-3-8b
- (Optional) Convert the weight file format. Currently, only the weight files in safetensor format can be loaded for ATB Models inference. If the downloaded weight file is in safetensor format, you do not need to convert the format. If the downloaded weight file is in bin format, perform the following operations:
1 2 3 4
# Go to the path of ATB-Models. cd ${ATB_SPEED_HOME_PATH} # Run the script to generate the weights in safetensor format. python examples/convert/convert_weights.py --model_path /data/Llama-3-8b
The output result is saved in the same directory where the weight file in bin format is saved.
- Test dialog inference.
1 2
cd ${ATB_SPEED_HOME_PATH} bash examples/models/llama/run_pa.sh /data/Llama-3-8b
The run_pa.sh script called by the preceding commands is the encapsulation of the run_pa.py script. The default inference content is "What's deep learning?" and the batch size is 1. You can modify the inference content by referring to 6.
- Customize the inference content.
- You can directly call the run_pa.py script and customize the inference content and inference mode by passing parameters.
For example, if the weights in the /data/Llama-3-8b path are used for 8-device inference of "What's deep learning?" and "Hello World," the batch size is 2.
1 2 3 4
# Specify the available logical NPU cores on the current host. Use commas (,) to separate multiple cores. export ASCEND_RT_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 # Start inference. torchrun --nproc_per_node 8 --master_port 20030 -m examples.run_pa --model_path /data/Llama-3-8b --input_texts "What's deep learning?" "Hello World." --max_batch_size 2
- You can pass token IDs to perform inference.
Create a .py script (for example, test.py) to generate token IDs.
from transformers import AutoTokenizer tokenizer = AutoTokenizer.from_pretrained( pretrained_model_name_or_path="{Path where tokenizer is located}", use_fast=False, padding_side='left', trust_remote_code="{Set the value yourself}") inputs = tokenizer("What's deep learning?", return_tensors="pt") token_id = inputs.data["input_ids"] print(token_id)Run the following command to generate token IDs:
python test.py
Run the following command to start inference. In the following example, the token ID 1,15043,2787 corresponds to the first inference content and the token ID 1,306,626,2691 corresponds to the second inference content. The inference content is separated by spaces.# Start inference. torchrun --nproc_per_node 8 --master_port 20030 -m examples.run_pa --model_path /data/Llama-3-8b --input_ids 1,15043,2787 1,306,626,2691 --max_batch_size 2
Table 2 Parameters in the run_pa.py script Parameter
Mandatory (Yes/No)
Type
Default Value
Description
--model_path
Yes
string
""
Path of the model weight file.
Security verification is performed on this path, which must be an absolute path and have the same owner group and permission as the user who starts inference.
--input_texts
No
string
"What's deep learning?"
Inference text or inference text path. Multiple inference texts are separated by spaces.
--input_ids
No
string
None
Token ID list obtained after the inference text is processed by the model tokenizer. Multiple inference requests are separated by spaces. Each token in a single inference request is separated by a comma (,).
--input_file
No
string
None
Only JSONL files are supported. Each line must be dialog data sorted by time in List[Dict] format. Each dictionary must contain at least the role and content fields.
--input_dict
No
parse_list_of_json
None
Inference text and the corresponding adapter name. Format example: '[{"prompt": "A robe takes 2 bolts of blue fiber and half that much white fiber. How many bolts in total does it take?", "adapter": "adapter1"}, {"prompt": "What is deep learning?", "adapter": "base"}]'
--max_prefill_batch_size
No
int or None
None
Maximum prefill batch size for model inference.
--max_position_embeddings
No
int or None
None
Maximum context length supported by the model. If this parameter is set to None, the length is read from the model weight file.
--max_input_length
No
int
1024
Maximum number of tokens in the inference text.
--max_output_length
No
int
20
Maximum number of tokens in the inference result.
--max_prefill_tokens
No
int
-1
Maximum number of tokens that are supported in the prefill phase for model inference. If the input is -1, max_prefill_tokens = max_batch_size × (max_input_length + max_output_length)
--max_batch_size
No
int
1
Maximum batch size for model inference.
--block_size
No
int
128
Maximum number of tokens stored in each KV cache block. The default value is 128.
--chat_template
No
string or None
None
Prompt template of the dialog model.
--ignore_eos
No
bool
store_true
Whether to end the inference when an EOS token (sentence end identifier) is encountered in the inference result. If this parameter is passed, the EOS token is ignored.
--is_chat_model
No
bool
store_true
Whether to support the dialog mode. If this parameter is passed, the dialog mode is entered.
--is_embedding_model
No
bool
store_true
Whether the model is an embedding model. By default, the model is a causal inference model. If this parameter is passed, the model is an embedding model.
--load_tokenizer
No
bool
True
Whether to load the tokenizer. If False is passed, input_ids is necessary, and the inference output is token ID.
--enable_atb_torch
No
bool
store_true
Whether to use the Python graph. By default, the C++ graph is used. If this parameter is passed, the Python graph is used.
--kw_args
No
string
""
Extended parameter, which can be used to extend functions.
--trust_remote_code
No
bool
store_true
Whether to trust the custom code file in the model weight path. This operation is not executed by default. If this parameter is passed, Transformers will execute the custom code files in the model weight path. The user is responsible for the security of these code files. Check the security in advance.
--dp
No
int
-1
Number of data parallel processes. By default, data parallelism is not performed.
--tp
No
int
-1
Number of tensor parallel processes on the entire network. If the value is -1, this number is the value of worldSize by default.
--sp
No
int
-1
Number of sequence parallel processes. By default, sequence parallelism is not performed. If sequence parallelism is enabled, the number of sequence parallel processes is generally the same as the number of tensor parallel processes.
--cp
No
int
-1
Number of text parallel processes. By default, text parallelism is not performed.
--moe_tp
No
int
-1
Number of tensor parallel processes in the MoE module of a sparse model. By default, the number is the same as the value of tp. If both tp and moe_tp are configured, the priority of moe_tp is higher than that of tp.
--moe_ep
No
int
-1
Number of expert parallel processes in the MoE module of a sparse model. By default, there is no expert parallel process.
--lora_modules
No
string
None
Name of the LoRA weight to be loaded and the corresponding LoRA weight path, for example, '{"adapter1": "/path/to/lora1", "adapter2": "/path/to/lora2"}'. By default, the LoRA weight is not loaded.
--max_loras
No
int
0
Maximum number of LoRAs that can be stored in LoRA scenarios. This parameter is mandatory in dynamic LoRA scenarios and optional in static LoRA scenarios. If the input value is too large, an out_of_memory error is reported because too much weight space is reserved. For example: "RuntimeError: NPU out of memory. Tried to allocate xxx GiB."
--max_lora_rank
No
int
0
Maximum LoRA rank in dynamic LoRA loading and unloading scenarios. This parameter is mandatory in dynamic LoRA scenarios and optional in static LoRA scenarios. If the input value is too large, an out_of_memory error is reported because too much weight space is reserved. For example: "RuntimeError: NPU out of memory. Tried to allocate xxx GiB."
The run_pa.py script in this section is used for quick pure model test. No strong verification is added to the script. If an exception occurs, an exception message will be thrown. For example:
- input_texts, input_ids, input_file, and input_dict contain the inference content. The data processing time of the program is in direct proportion to the input data volume. These inputs are converted into token IDs and transferred to the NPU. If the input data volume is too large, the NPU tensors may occupy too much memory. As a result, an error message, for example, "req: xx input length: xx is too long, max_prefill_tokens: xx", is displayed due to out of memory.
- chat_template supports two types of inputs: template text or template file path. When you input a long template text, the system may run slowly.
- The script allocates the inference input and KV cache based on parameters such as max_batch_size, max_input_length, max_output_length, max_prefill_batch_size, and max_prefill_tokens. If the input value is too large, an out of memory error may occur, for example, "RuntimeError: NPU out of memory. Tried to allocate xxx GiB.".
- The script allocates NPU tensors such as rotary position embedding and attention mask based on max_position_embeddings. If the input value is too large, an out of memory error may occur, for example, "RuntimeError: NPU out of memory. Tried to allocate xxx GiB.".
- If block_size is less than the number of attention heads allocated to each device in tensor parallelism mode, an error ("Setup fail, enable log: export ASDOPS_LOG_LEVEL=ERROR, export ASDOPS_LOG_TO_STDOUT=1 to find the first error. For more details, see the MindIE official document.") is reported due to shape mismatch. In this case, you need to enable the log function to view details.
- You can directly call the run_pa.py script and customize the inference content and inference mode by passing parameters.
- Test the performance.After the ATB_LLM_BENCHMARK_ENABLE environment variable is enabled, the first token, incremental token, and end-to-end inference latency of the model are collected.
# Enable the environment variable. export ATB_LLM_BENCHMARK_ENABLE=1 # Start inference. For details, see step 4 and step 5.
The time consumption result is displayed on the terminal and saved in the ./benchmark_result/benchmark.csv file.