Model Inference Acceleration

Configuration Description

TextEmbedding supports accelerated vectorization (embedding) inference in float16 precision for BERT, RoBERTa, and XLM-RoBERTa embedding models. To use this feature, install the operator module when you install the RAG SDK package, and enable the feature explicitly, because it is disabled by default. For details, see Example (with Inference Acceleration Enabled).

CLIP model acceleration supports only ViT-B-16, ViT-L-14, ViT-L-14-336, and ViT-H-14. After acceleration is enabled, the first inference triggers graph compilation, which takes about 1 to 2 minutes.
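Because the first inference pays the one-time graph compilation cost, it can help to issue a throwaway request before serving traffic or measuring latency. The following is a minimal sketch and not part of the RAG SDK: `timed_call` and `warm_up` are hypothetical helpers, and `embed_fn` stands for whatever inference callable is in use.

```python
import time
from typing import Any, Callable, Tuple


def timed_call(fn: Callable[..., Any], *args: Any) -> Tuple[Any, float]:
    """Call fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    return result, time.perf_counter() - start


def warm_up(embed_fn: Callable[[Any], Any], sample: Any) -> float:
    """Run one throwaway inference and return how long it took in seconds.

    With acceleration enabled, this first call is the one expected to
    include graph compilation.
    """
    _, elapsed = timed_call(embed_fn, sample)
    return elapsed


if __name__ == "__main__":
    # Stand-in for a real model call (hypothetical); replace it with the
    # actual inference method of the model object.
    def fake_embed(texts):
        return [[0.0] * 4 for _ in texts]

    print(f"warm-up took {warm_up(fake_embed, ['hello']):.3f} s")
```

Comparing the warm-up duration with the duration of a second call is a simple way to confirm that compilation happened only once.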

To configure model inference acceleration, import the adapter that matches the model:

```
from modeling_bert_adapter import enable_bert_speed
from modeling_roberta_adapter import enable_roberta_speed
from modeling_xlm_roberta_adapter import enable_xlm_roberta_speed
from modeling_clip_adapter import enable_clip_speed
```
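When the model family is only known at run time, the matching import can be chosen from a lookup table. This is a sketch under assumptions: the module and function names are the ones listed above, while the model-type keys (`"bert"`, `"roberta"`, `"xlm-roberta"`, `"clip"`) and the helpers `adapter_for` and `import_adapter` are illustrative and not part of the SDK.

```python
import importlib

# Module and function names come from the imports listed above.
# The dictionary keys are an assumption for illustration only.
ADAPTERS = {
    "bert": ("modeling_bert_adapter", "enable_bert_speed"),
    "roberta": ("modeling_roberta_adapter", "enable_roberta_speed"),
    "xlm-roberta": ("modeling_xlm_roberta_adapter", "enable_xlm_roberta_speed"),
    "clip": ("modeling_clip_adapter", "enable_clip_speed"),
}


def adapter_for(model_type: str):
    """Return (module_name, attribute_name) for a supported model type."""
    try:
        return ADAPTERS[model_type]
    except KeyError:
        raise ValueError(
            f"no acceleration adapter for model type {model_type!r}"
        ) from None


def import_adapter(model_type: str):
    """Import the adapter module and return the named attribute from it.

    This only performs the import; it requires the operator module to be
    installed in the environment.
    """
    module_name, attr = adapter_for(model_type)
    return getattr(importlib.import_module(module_name), attr)
```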

Set the ENABLE_BOOST environment variable to "True" to enable model inference acceleration, or to "False" to disable it.
```
import os

os.environ["ENABLE_BOOST"] = "True"
```
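Values stored in `os.environ` must be strings, so assigning the Python boolean `True` raises a `TypeError`. A small wrapper avoids that mistake; `set_boost` is a hypothetical helper, not an SDK function.

```python
import os


def set_boost(enabled: bool) -> str:
    """Write ENABLE_BOOST as the string "True" or "False" and return it."""
    value = "True" if enabled else "False"
    os.environ["ENABLE_BOOST"] = value
    return value


set_boost(True)
print(os.environ["ENABLE_BOOST"])
```

The example in this section sets the variable before it constructs TextEmbedding, so calling the helper at the same point is the safest placement.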

Environment variables related to model acceleration logging

ATB_LOG_TO_STDOUT: The value 1 indicates logging to standard output.
ATB_LOG_TO_FILE: The value 1 indicates logging to a file.
ATB_LOG_LEVEL: Log level, which can be TRACE, DEBUG, INFO, WARN, ERROR, or FATAL.
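
These variables can also be set from Python before the model is loaded. The sketch below validates the log level against the values listed above. The helper name `atb_log_env` is hypothetical, and writing "0" to switch an output off is an assumption, since only the value 1 is documented here.

```python
import os

VALID_LEVELS = ("TRACE", "DEBUG", "INFO", "WARN", "ERROR", "FATAL")


def atb_log_env(level: str = "INFO", to_stdout: bool = True,
                to_file: bool = False) -> dict:
    """Build the ATB logging variables as a dict of strings."""
    level = level.upper()
    if level not in VALID_LEVELS:
        raise ValueError(f"unsupported log level: {level}")
    return {
        # "1" enables the output; "0" for disabled is an assumption.
        "ATB_LOG_TO_STDOUT": "1" if to_stdout else "0",
        "ATB_LOG_TO_FILE": "1" if to_file else "0",
        "ATB_LOG_LEVEL": level,
    }


# Apply the settings to the current process environment.
os.environ.update(atb_log_env("DEBUG", to_stdout=True))
```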

For CLIP model inference on the Atlas 300I Duo inference card, the optimal batch size is 4 or less. Larger batch sizes do not yield performance gains and may even degrade performance.
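
One way to respect this limit is to split the input into chunks of at most four items before each inference call. The sketch below is generic Python rather than SDK code: `batched` and `MAX_CLIP_BATCH` are hypothetical names, and the file names are placeholders.

```python
from typing import Iterator, List, Sequence, TypeVar

T = TypeVar("T")

MAX_CLIP_BATCH = 4  # upper bound recommended above for the Atlas 300I Duo


def batched(items: Sequence[T],
            batch_size: int = MAX_CLIP_BATCH) -> Iterator[List[T]]:
    """Yield consecutive chunks of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield list(items[start:start + batch_size])


# Placeholder inputs; each chunk would be passed to one inference call.
images = [f"img_{i}.jpg" for i in range(10)]
for batch in batched(images):
    print(len(batch), batch[0])
```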

Binding CPU Cores to Improve Inference Performance

On Kunpeng servers, you can use numactl to bind the program to specific CPU cores to improve inference performance.

Run the npu-smi info command to obtain the <bus-id> of the NPU.
Run the lspci -vs <bus-id> command to query the NUMA node of the NPU.
```
lspci -vs 0000:83:00.0
```
Run the lscpu command to obtain the range of CPU cores that belong to that NUMA node.
```
lscpu | grep NUMA
```
Prefix the program's launch command with numactl -C <CPU core range>.
```
numactl -C 48-71 xxxx program
```
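
The last two steps can be scripted. The sketch below extracts the CPU list of each NUMA node from `lscpu` output and builds the `numactl` command line. The helper names, the sample output, and `infer.py` are illustrative assumptions, and real `lscpu` output differs between machines.

```python
import re
import shlex
from typing import Dict, List


def parse_numa_cpus(lscpu_output: str) -> Dict[int, str]:
    """Map each NUMA node number to its CPU list, as printed by lscpu."""
    pattern = re.compile(r"^[ \t]*NUMA node(\d+) CPU\(s\):\s*(\S+)",
                         re.MULTILINE)
    return {int(node): cpus for node, cpus in pattern.findall(lscpu_output)}


def numactl_command(cpus: str, program: List[str]) -> str:
    """Return a shell-quoted command that runs program bound to cpus."""
    return shlex.join(["numactl", "-C", cpus, *program])


# Illustrative sample only; on a real host, capture the output of `lscpu`.
SAMPLE = """\
NUMA node(s):        4
NUMA node0 CPU(s):   0-23
NUMA node1 CPU(s):   24-47
NUMA node2 CPU(s):   48-71
NUMA node3 CPU(s):   72-95
"""

cpus = parse_numa_cpus(SAMPLE)[2]  # NUMA node reported for the NPU
print(numactl_command(cpus, ["python", "infer.py"]))
```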

Example (with Inference Acceleration Enabled)

```
import os
import torch
import torch_npu
# Adapt to vectorized inference acceleration of BERT models.
from modeling_bert_adapter import enable_bert_speed
from mx_rag.embedding.local import TextEmbedding

# Enable vectorized inference acceleration (True: enabled; False: disabled).
os.environ["ENABLE_BOOST"] = "True"

device_id = 1
torch_npu.npu.set_device(f"npu:{device_id}")

embed = TextEmbedding(model_path="/path/to/model", dev_id=device_id)
print(embed.embed_documents(["What are the attractions in Beijing?"]))
print(embed.embed_query("What are the attractions in Beijing?"))
```
