BIOS Configuration
This section describes BIOS configuration procedures designed to optimize performance on the Ascend AI Processor.
Modifying the CPU Power Consumption Mode
Principle: Generally, two power policy configurations are available: Efficiency and Performance.
- Efficiency is the power-saving mode. The CPU uses dynamic voltage and frequency scaling and adjusts its operating frequency to the load.
- Performance is the performance mode. The CPU does not use dynamic frequency scaling and runs at its maximum frequency.
Optimization configuration: Set Power Policy to Performance for better performance.
Disadvantage: Performance mode increases power consumption.
Recommended scenario: In inference scenarios, you are advised to enable the performance mode to improve CPU performance and reduce CPU idle cycles, which improves model performance.
To study how the high-performance function impacts inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance before and after the high-performance function is enabled. The following table shows the test data based on MindIE. Because software and hardware configurations vary, the following test data is intended for reference only and should not be considered a performance standard.
| Model | Concurrency | Input Length | Experiment No. | Default: Efficiency (Tokens/s) | Performance (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | Experiment 1 | 75.4373 | 76.1809 | 0.99 |
| Llama-7B | 8 | 128 | Experiment 2 | 75.3953 | 76.0922 | 0.92 |
| Llama-7B | 8 | 128 | Experiment 3 | 75.4051 | 76.0719 | 0.88 |
| Llama-7B | 8 | 128 | Average value | 75.4126 | 76.1150 | 0.93 |
| Llama-7B | 8 | 256 | Experiment 1 | 76.5359 | 77.5444 | 1.32 |
| Llama-7B | 8 | 256 | Experiment 2 | 76.5362 | 77.3321 | 1.04 |
| Llama-7B | 8 | 256 | Experiment 3 | 77.0832 | 77.9778 | 1.16 |
| Llama-7B | 8 | 256 | Average value | 76.7184 | 77.6181 | 1.17 |
| Qwen2-7B | 8 | 128 | Experiment 1 | 83.5893 | 84.6158 | 1.23 |
| Qwen2-7B | 8 | 128 | Experiment 2 | 83.4479 | 84.4310 | 1.18 |
| Qwen2-7B | 8 | 128 | Experiment 3 | 83.3766 | 84.3732 | 1.20 |
| Qwen2-7B | 8 | 128 | Average value | 83.4713 | 84.4733 | 1.20 |
| Qwen2-7B | 8 | 256 | Experiment 1 | 84.7990 | 85.8575 | 1.25 |
| Qwen2-7B | 8 | 256 | Experiment 2 | 84.7267 | 85.5879 | 1.02 |
| Qwen2-7B | 8 | 256 | Experiment 3 | 84.8510 | 86.2523 | 1.65 |
| Qwen2-7B | 8 | 256 | Average value | 84.7922 | 85.8992 | 1.31 |
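The Performance Gains column appears to be the relative throughput change against the default setting, rounded to two decimals. That formula is inferred from the numbers in the table above, not stated in the source, and the function name `performance_gain` is hypothetical. A minimal sketch:

```python
def performance_gain(baseline_tps: float, tuned_tps: float) -> float:
    """Return the relative throughput change in percent, rounded to two decimals.

    Assumed formula (inferred from the table, not documented):
        (tuned - baseline) / baseline * 100
    """
    return round((tuned_tps - baseline_tps) / baseline_tps * 100, 2)


# Llama-7B, input length 128, Experiment 1: Efficiency vs. Performance.
print(performance_gain(75.4373, 76.1809))
```

A negative result means the second setting is slower than the baseline, which is how the same calculation should be read when the compared setting lowers throughput.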
Configuration method: Access the BIOS through the BMC and set Power Policy to Performance under Advanced > Performance Config > Power Policy.
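After rebooting with the new policy, one way to sanity-check the effect from the operating system is to compare each CPU's current frequency with its maximum. The sketch below reads the standard Linux cpufreq sysfs files; whether they are exposed depends on the kernel's cpufreq driver and on how the BIOS hands over frequency control, so treat their presence as an assumption about the environment. The helper name `read_cpu_freqs` is hypothetical.

```python
from pathlib import Path


def read_cpu_freqs(root: str = "/sys/devices/system/cpu") -> dict[str, tuple[int, int]]:
    """Map each CPU that exposes cpufreq files to (current_khz, max_khz)."""
    freqs = {}
    for cpu_dir in sorted(Path(root).glob("cpu[0-9]*")):
        cur = cpu_dir / "cpufreq" / "scaling_cur_freq"
        top = cpu_dir / "cpufreq" / "cpuinfo_max_freq"
        # CPUs without cpufreq files are skipped rather than treated as errors.
        if cur.is_file() and top.is_file():
            freqs[cpu_dir.name] = (int(cur.read_text()), int(top.read_text()))
    return freqs


if __name__ == "__main__":
    for name, (cur, top) in read_cpu_freqs().items():
        print(f"{name}: {cur} kHz of {top} kHz")
```

If the policy is in effect, the current frequency would be expected to stay at or near the maximum under load; an empty result only means the files are not exposed, not that the setting failed.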
Modifying the Memory Refresh Rate
Principle: DRAM stores each bit as charge in a capacitor. Because the capacitor leaks, the charge drains over time and the data cannot be retained for long, so the cells must be recharged periodically. This is called a refresh operation. A refresh cannot run at the same time as a read or write, so refreshing reduces memory performance. The BIOS provides an Auto setting for the memory refresh rate that adjusts the rate based on the current operating temperature and delivers better memory performance than the default 32 ms configuration.
Optimization configuration: Set the memory refresh rate to Auto.
Recommended scenario: With Auto, the memory refresh rate is adjusted dynamically, which can improve memory copy performance and model performance.
To study how different memory refresh rates impact the inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance. The following table shows the test data based on MindIE. Because software and hardware configurations vary, the following test data is intended for reference only.
| Model | Concurrency | Input Length | Request Count | Experiment No. | Default: 32 ms (Tokens/s) | Auto (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | 2000 | Experiment 1 | 75.4373 | 80.4640 | 6.66 |
| Llama-7B | 8 | 128 | 2000 | Experiment 2 | 75.3953 | 80.3319 | 6.55 |
| Llama-7B | 8 | 128 | 2000 | Experiment 3 | 75.4051 | 80.4814 | 6.73 |
| Llama-7B | 8 | 128 | 2000 | Average value | 75.4126 | 80.4258 | 6.65 |
| Llama-7B | 8 | 256 | 2000 | Experiment 1 | 76.5359 | 81.8636 | 6.96 |
| Llama-7B | 8 | 256 | 2000 | Experiment 2 | 76.5362 | 81.8073 | 6.89 |
| Llama-7B | 8 | 256 | 2000 | Experiment 3 | 77.0832 | 81.7618 | 6.07 |
| Llama-7B | 8 | 256 | 2000 | Average value | 76.7184 | 81.8109 | 6.64 |
| Qwen2-7B | 8 | 128 | 2000 | Experiment 1 | 83.5893 | 87.0385 | 4.13 |
| Qwen2-7B | 8 | 128 | 2000 | Experiment 2 | 83.4479 | 87.1340 | 4.42 |
| Qwen2-7B | 8 | 128 | 2000 | Experiment 3 | 83.3766 | 86.9090 | 4.24 |
| Qwen2-7B | 8 | 128 | 2000 | Average value | 83.4713 | 87.0272 | 4.26 |
| Qwen2-7B | 8 | 256 | 2000 | Experiment 1 | 84.7990 | 87.9068 | 3.66 |
| Qwen2-7B | 8 | 256 | 2000 | Experiment 2 | 84.7267 | 87.8995 | 3.74 |
| Qwen2-7B | 8 | 256 | 2000 | Experiment 3 | 84.8510 | 87.8859 | 3.58 |
| Qwen2-7B | 8 | 256 | 2000 | Average value | 84.7922 | 87.8974 | 3.66 |
Configuration method: Access the BIOS through the BMC and set Custom Refresh Rate to Auto under Advanced > Memory Config > Custom Refresh Rate.
Modifying the CPU Prefetching Configuration
Principle: When the CPU reads data from memory into its cache, it fetches not only the data requested by the current access but also, following the locality principle, the data adjacent to it. If the prefetched data is what the next access needs, performance improves.
Optimization configuration: Enable CPU prefetching.
Disadvantage: Prefetching pays off only when data accesses are concentrated and the prefetch hit ratio stays high. If the hit ratio is low, unused prefetches consume memory bandwidth and cache space, and you are advised to disable CPU prefetching.
Recommended scenario: In inference scenarios, you are advised to enable CPU prefetching to improve the CPU data read performance and model performance.
To study how the CPU prefetching function impacts inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance with CPU prefetching enabled (the default) and disabled. The following table shows the test data based on MindIE. A negative gain means that disabling CPU prefetching lowers throughput. Because software and hardware configurations vary, the following test data is intended for reference only.
| Model | Concurrency | Input Length | Experiment No. | CPU Prefetching Enabled by Default (Tokens/s) | CPU Prefetching Disabled (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | Experiment 1 | 75.4373 | 74.7984 | -0.85 |
| Llama-7B | 8 | 128 | Experiment 2 | 75.3953 | 74.6472 | -0.99 |
| Llama-7B | 8 | 128 | Experiment 3 | 75.4051 | 74.6587 | -0.99 |
| Llama-7B | 8 | 128 | Average value | 75.4126 | 74.7014 | -0.94 |
| Llama-7B | 8 | 256 | Experiment 1 | 76.5359 | 75.9913 | -0.71 |
| Llama-7B | 8 | 256 | Experiment 2 | 76.5362 | 75.9398 | -0.78 |
| Llama-7B | 8 | 256 | Experiment 3 | 77.0832 | 76.4258 | -0.85 |
| Llama-7B | 8 | 256 | Average value | 76.7184 | 76.1190 | -0.78 |
| Qwen2-7B | 8 | 128 | Experiment 1 | 83.5893 | 82.1724 | -1.70 |
| Qwen2-7B | 8 | 128 | Experiment 2 | 83.4479 | 81.9364 | -1.81 |
| Qwen2-7B | 8 | 128 | Experiment 3 | 83.3766 | 82.0396 | -1.60 |
| Qwen2-7B | 8 | 128 | Average value | 83.4713 | 82.0495 | -1.70 |
| Qwen2-7B | 8 | 256 | Experiment 1 | 84.7990 | 83.6620 | -1.34 |
| Qwen2-7B | 8 | 256 | Experiment 2 | 84.7267 | 83.0590 | -1.97 |
| Qwen2-7B | 8 | 256 | Experiment 3 | 84.8510 | 83.9055 | -1.11 |
| Qwen2-7B | 8 | 256 | Average value | 84.7922 | 83.5422 | -1.47 |
Configuration method: Access the BIOS through the BMC and set CPU Prefetching Configuration to Enabled under Advanced > MISC Config > CPU Prefetching Configuration.
Disabling SMMU
Principle: The SMMU (System Memory Management Unit) translates virtual addresses to physical addresses. This translation can add overhead and latency, which degrades system performance.
Optimization configuration: Disable the SMMU.
Recommended scenario: In non-VM scenarios (bare metal and Docker), disable the SMMU to improve model performance.
To study how the SMMU impacts inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance with the SMMU disabled (the default) and enabled. The following table shows the test data based on MindIE. A negative gain means that enabling the SMMU lowers throughput. Because software and hardware configurations vary, the following test data is intended for reference only.
| Model | Concurrency | Input Length | Experiment No. | SMMU Disabled by Default (Tokens/s) | SMMU Enabled (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | Experiment 1 | 75.4373 | 74.9460 | -0.65 |
| Llama-7B | 8 | 128 | Experiment 2 | 75.3953 | 74.7320 | -0.88 |
| Llama-7B | 8 | 128 | Experiment 3 | 75.4051 | 75.0313 | -0.50 |
| Llama-7B | 8 | 128 | Average value | 75.4126 | 74.9031 | -0.68 |
| Llama-7B | 8 | 256 | Experiment 1 | 76.5359 | 75.7005 | -1.09 |
| Llama-7B | 8 | 256 | Experiment 2 | 76.5362 | 75.6460 | -1.16 |
| Llama-7B | 8 | 256 | Experiment 3 | 77.0832 | 75.8298 | -1.63 |
| Llama-7B | 8 | 256 | Average value | 76.7184 | 75.7254 | -1.29 |
| Qwen2-7B | 8 | 128 | Experiment 1 | 83.5893 | 81.5155 | -2.48 |
| Qwen2-7B | 8 | 128 | Experiment 2 | 83.4479 | 81.7802 | -2.00 |
| Qwen2-7B | 8 | 128 | Experiment 3 | 83.3766 | 82.2530 | -1.35 |
| Qwen2-7B | 8 | 128 | Average value | 83.4713 | 81.8496 | -1.94 |
| Qwen2-7B | 8 | 256 | Experiment 1 | 84.7990 | 82.8234 | -2.33 |
| Qwen2-7B | 8 | 256 | Experiment 2 | 84.7267 | 82.7696 | -2.31 |
| Qwen2-7B | 8 | 256 | Experiment 3 | 84.8510 | 82.7741 | -2.45 |
| Qwen2-7B | 8 | 256 | Average value | 84.7922 | 82.7890 | -2.36 |
Configuration method: Access the BIOS through the BMC and set Support Smmu to Disabled under Advanced > MISC Config > Support Smmu.
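To check from the operating system whether any IOMMU instance is active after the change, one option is to list the Linux sysfs class directory `/sys/class/iommu`, where kernel IOMMU drivers (including the Arm SMMU drivers) register their devices. This is a sketch under that assumption: the helper name `list_iommu_devices` is hypothetical, and an empty result is consistent with the SMMU being disabled but can also mean the kernel has no SMMU driver loaded.

```python
from pathlib import Path


def list_iommu_devices(root: str = "/sys/class/iommu") -> list[str]:
    """Return the names of IOMMU instances registered with the kernel, if any."""
    base = Path(root)
    # The directory is absent or empty when no IOMMU driver has registered a device.
    if not base.is_dir():
        return []
    return sorted(entry.name for entry in base.iterdir())


if __name__ == "__main__":
    devices = list_iommu_devices()
    print("IOMMU instances:", devices if devices else "none found")
```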