BIOS Configuration

This section describes BIOS configuration procedures designed to optimize performance on the Ascend AI Processor.

Modifying the CPU Power Consumption Mode

Principle: On servers with the Ascend AI Processor, the BIOS generally provides two CPU power policies: Efficiency and Performance.

Efficiency is the power-saving mode. The CPU uses dynamic voltage and frequency scaling and adjusts its operating frequency to the load.
Performance is the performance mode. Dynamic frequency scaling is disabled and the CPU runs at its maximum frequency.

Optimization configuration: Set Power Policy to Performance.

Disadvantage: Performance mode increases power consumption.

Recommended scenario: In inference scenarios, you are advised to enable Performance mode. It raises CPU performance and reduces CPU idle cycles, which improves model performance.

To measure how Performance mode affects inference, the Llama-7B and Qwen2-7B models were deployed on an Atlas 800T A2 server running openEuler 22.03 SP4, and throughput was compared before and after Performance mode was enabled. Table 1 lists the results measured with MindIE. Because software and hardware configurations vary, the data is for reference only and should not be treated as a performance standard.

**Table 1** Experiment data
Model	Concurrency	Input Length	Experiment No.	Default: Efficiency (Tokens/s)	Performance (Tokens/s)	Performance Gains (%)
Llama-7B	8	128	Experiment 1	75.4373	76.1809	0.99
			Experiment 2	75.3953	76.0922	0.92
			Experiment 3	75.4051	76.0719	0.88
			Average value	75.4126	76.1150	0.93
	8	256	Experiment 1	76.5359	77.5444	1.32
			Experiment 2	76.5362	77.3321	1.04
			Experiment 3	77.0832	77.9778	1.16
			Average value	76.7184	77.6181	1.17
Qwen2-7B	8	128	Experiment 1	83.5893	84.6158	1.23
			Experiment 2	83.4479	84.4310	1.18
			Experiment 3	83.3766	84.3732	1.20
			Average value	83.4713	84.4733	1.20
	8	256	Experiment 1	84.7990	85.8575	1.25
			Experiment 2	84.7267	85.5879	1.02
			Experiment 3	84.8510	86.2523	1.65
			Average value	84.7922	85.8992	1.31
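
The Performance Gains column appears to be the relative throughput change against the Efficiency baseline. The following minimal sketch reproduces the Llama-7B rows for input length 128 under that assumption. The function name `gain_percent` is illustrative and is not part of MindIE.

```python
def gain_percent(baseline: float, tuned: float) -> float:
    """Relative throughput change versus the baseline, in percent."""
    return (tuned - baseline) / baseline * 100.0

# Llama-7B, concurrency 8, input length 128 (values copied from Table 1).
efficiency = [75.4373, 75.3953, 75.4051]
performance = [76.1809, 76.0922, 76.0719]

# Gain of each experiment, rounded to two decimals as in the table.
per_run = [round(gain_percent(e, p), 2) for e, p in zip(efficiency, performance)]

# The "Average value" row: average each column first, then compute the gain.
avg_eff = sum(efficiency) / len(efficiency)
avg_perf = sum(performance) / len(performance)
avg_gain = round(gain_percent(avg_eff, avg_perf), 2)

print(per_run)   # [0.99, 0.92, 0.88]
print(avg_gain)  # 0.93
```

The same calculation applies to the other tables in this section, where a negative result means the non-default setting was slower than the default.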

Configuration method: Access the BIOS through the BMC and set Power Policy to Performance under Advanced > Performance Config > Power Policy.

Figure 1 Example configuration
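
After the server restarts, the effect of the power policy can sometimes be observed from the operating system. The sketch below reads the standard Linux cpufreq sysfs files. It assumes that the platform firmware exposes this interface to the kernel, which is not guaranteed on every server, so missing files are reported as `None` rather than treated as errors. The function name `read_cpufreq` is hypothetical.

```python
from pathlib import Path

def read_cpufreq(cpu: int = 0, root: str = "/sys/devices/system/cpu") -> dict:
    """Return cpufreq values for one CPU; files that cannot be read map to None."""
    base = Path(root) / f"cpu{cpu}" / "cpufreq"
    result = {}
    for name in ("scaling_governor", "scaling_cur_freq", "cpuinfo_max_freq"):
        try:
            result[name] = (base / name).read_text().strip()
        except OSError:
            result[name] = None  # interface not exposed on this platform
    return result

if __name__ == "__main__":
    # Frequencies are reported by the kernel in kHz.
    for key, value in read_cpufreq().items():
        print(f"{key}: {value}")
```

If `scaling_cur_freq` stays close to `cpuinfo_max_freq` while the system is idle, that is consistent with a fixed maximum frequency, although it does not by itself prove which BIOS policy is active.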

Modifying the Memory Refresh Rate

Principle: DRAM stores each bit as charge in a capacitor. The capacitor leaks, so the charge decays over time and the data is lost unless the cell is periodically recharged. This recharging is called a refresh operation. A refresh cannot be performed at the same time as a read or write, so refreshing reduces memory performance. The BIOS provides an Auto setting for the memory refresh rate, which adjusts the rate based on the current operating temperature and delivers better memory performance than the default 32 ms setting.
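
To give a rough sense of why the refresh rate matters, the sketch below estimates the share of time a DRAM rank spends refreshing. All numbers other than the 32 ms default are assumptions for illustration: a refresh command duration of 350 ns, 8192 refresh commands per window, and a 64 ms window as an example of a slower refresh rate. The source does not state which intervals the Auto setting actually selects.

```python
def refresh_overhead(window_ms: float, trfc_ns: float = 350.0, commands: int = 8192) -> float:
    """Fraction of time a DRAM rank is busy refreshing.

    window_ms: time within which every row must be refreshed once.
    trfc_ns:   duration of one refresh command (ASSUMED value).
    commands:  refresh commands issued per window (ASSUMED value).
    """
    trefi_ns = window_ms * 1e6 / commands  # average interval between refresh commands
    return trfc_ns / trefi_ns

for window in (32.0, 64.0):
    share = refresh_overhead(window)
    print(f"{window:.0f} ms window: {share:.2%} of time spent refreshing")
```

Under these assumed values, doubling the window from 32 ms to 64 ms halves the refresh share from about 9% to about 4.5%, which frees that time for reads and writes.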

Optimization configuration: Set the memory refresh rate to Auto.

Recommended scenario: Dynamically adjusting the memory refresh rate improves memory copy performance and, in turn, model performance.

To measure how the memory refresh rate affects inference, the Llama-7B and Qwen2-7B models were deployed on an Atlas 800T A2 server running openEuler 22.03 SP4, and throughput was compared between the default 32 ms setting and the Auto setting. Table 2 lists the results measured with MindIE. Because software and hardware configurations vary, the data is for reference only.

**Table 2** Experiment data
Model	Concurrency	Input Length	Request Count	Experiment No.	Default: 32 ms (Tokens/s)	Auto (Tokens/s)	Performance Gains (%)
Llama-7B	8	128	2000	Experiment 1	75.4373	80.4640	6.66
				Experiment 2	75.3953	80.3319	6.55
				Experiment 3	75.4051	80.4814	6.73
				Average value	75.4126	80.4258	6.65
	8	256	2000	Experiment 1	76.5359	81.8636	6.96
				Experiment 2	76.5362	81.8073	6.89
				Experiment 3	77.0832	81.7618	6.07
				Average value	76.7184	81.8109	6.64
Qwen2-7B	8	128	2000	Experiment 1	83.5893	87.0385	4.13
				Experiment 2	83.4479	87.1340	4.42
				Experiment 3	83.3766	86.9090	4.24
				Average value	83.4713	87.0272	4.26
	8	256	2000	Experiment 1	84.7990	87.9068	3.66
				Experiment 2	84.7267	87.8995	3.74
				Experiment 3	84.8510	87.8859	3.58
				Average value	84.7922	87.8974	3.66

Configuration method: Access the BIOS through the BMC and set Custom Refresh Rate to Auto under Advanced > Memory Config > Custom Refresh Rate.

Figure 2 Example configuration

Modifying the CPU Prefetching Configuration

Principle: When the CPU reads data from memory into its cache, it fetches not only the data currently requested but also, following the principle of locality, the neighboring data. If the prefetched data is what the program accesses next, performance improves.
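
The effect of locality on prefetching can be illustrated with a toy model. The sketch below is not a description of the actual CPU prefetcher. It assumes a cache of unlimited capacity and a simple next-line prefetcher that loads line `n + 1` on every access to line `n`, then compares a sequential access trace with a scattered one.

```python
import random

def hit_ratio(trace, prefetch: bool) -> float:
    """Hit ratio of a toy cache with unlimited capacity.

    With prefetch=True, every access also loads the next cache line,
    which models a simple next-line prefetcher (illustrative only).
    """
    cached = set()
    hits = 0
    for line in trace:
        if line in cached:
            hits += 1
        cached.add(line)
        if prefetch:
            cached.add(line + 1)
    return hits / len(trace)

sequential = list(range(1000))                           # strong spatial locality
rng = random.Random(0)                                   # fixed seed for repeatability
scattered = [rng.randrange(10**9) for _ in range(1000)]  # almost no locality

for name, trace in (("sequential", sequential), ("scattered", scattered)):
    print(name, hit_ratio(trace, prefetch=False), hit_ratio(trace, prefetch=True))
```

In this model, prefetching turns almost every sequential access into a hit, while the scattered trace gains essentially nothing and the prefetched lines are wasted work.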

Optimization configuration: Enable CPU prefetching.

Disadvantage: Prefetching helps only when data access is concentrated and the prefetch hit ratio stays high. When the hit ratio is low, prefetched data goes unused and wastes memory bandwidth and cache space. In such scenarios, disable CPU prefetching.

Recommended scenario: In inference scenarios, you are advised to enable CPU prefetching to improve the CPU data read performance and model performance.

To measure how CPU prefetching affects inference, the Llama-7B and Qwen2-7B models were deployed on an Atlas 800T A2 server running openEuler 22.03 SP4, and throughput was compared between the default setting (prefetching enabled) and prefetching disabled. Table 3 lists the results measured with MindIE. Negative values in the Performance Gains column indicate a throughput loss relative to the default. Because software and hardware configurations vary, the data is for reference only.

**Table 3** Experiment data
Model	Concurrency	Input Length	Experiment No.	CPU Prefetching Enabled by Default (Tokens/s)	CPU Prefetching Disabled (Tokens/s)	Performance Gains (%)
Llama-7B	8	128	Experiment 1	75.4373	74.7984	-0.85
			Experiment 2	75.3953	74.6472	-0.99
			Experiment 3	75.4051	74.6587	-0.99
			Average value	75.4126	74.7014	-0.94
	8	256	Experiment 1	76.5359	75.9913	-0.71
			Experiment 2	76.5362	75.9398	-0.78
			Experiment 3	77.0832	76.4258	-0.85
			Average value	76.7184	76.1190	-0.78
Qwen2-7B	8	128	Experiment 1	83.5893	82.1724	-1.70
			Experiment 2	83.4479	81.9364	-1.81
			Experiment 3	83.3766	82.0396	-1.60
			Average value	83.4713	82.0495	-1.70
	8	256	Experiment 1	84.7990	83.6620	-1.34
			Experiment 2	84.7267	83.0590	-1.97
			Experiment 3	84.8510	83.9055	-1.11
			Average value	84.7922	83.5422	-1.47

Configuration method: Access the BIOS through the BMC and set CPU Prefetching Configuration to Enabled under Advanced > MISC Config > CPU Prefetching Configuration.

Figure 3 Example configuration

Disabling SMMU

Principle: The System Memory Management Unit (SMMU) translates virtual addresses to physical addresses for devices. This translation adds overhead and latency, which can degrade system performance.

Optimization configuration: Disable SMMU.

Recommended scenario: Disable SMMU in non-VM scenarios (bare metal and Docker) to improve model performance.

To measure how the SMMU affects inference, the Llama-7B and Qwen2-7B models were deployed on an Atlas 800T A2 server running openEuler 22.03 SP4, and throughput was compared between the default setting (SMMU disabled) and SMMU enabled. Table 4 lists the results measured with MindIE. Negative values in the Performance Gains column indicate a throughput loss relative to the default. Because software and hardware configurations vary, the data is for reference only.

**Table 4** Experiment data
Model	Concurrency	Input Length	Experiment No.	SMMU Disabled by Default (Tokens/s)	SMMU Enabled (Tokens/s)	Performance Gains (%)
Llama-7B	8	128	Experiment 1	75.4373	74.9460	-0.65
			Experiment 2	75.3953	74.7320	-0.88
			Experiment 3	75.4051	75.0313	-0.50
			Average value	75.4126	74.9031	-0.68
	8	256	Experiment 1	76.5359	75.7005	-1.09
			Experiment 2	76.5362	75.6460	-1.16
			Experiment 3	77.0832	75.8298	-1.63
			Average value	76.7184	75.7254	-1.29
Qwen2-7B	8	128	Experiment 1	83.5893	81.5155	-2.48
			Experiment 2	83.4479	81.7802	-2.00
			Experiment 3	83.3766	82.2530	-1.35
			Average value	83.4713	81.8496	-1.94
	8	256	Experiment 1	84.7990	82.8234	-2.33
			Experiment 2	84.7267	82.7696	-2.31
			Experiment 3	84.8510	82.7741	-2.45
			Average value	84.7922	82.7890	-2.36

Configuration method: Access the BIOS through the BMC and set Support Smmu to Disabled under Advanced > MISC Config > Support Smmu.

Figure 4 Example configuration
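
After the server restarts, one way to check from Linux whether device address translation is active is to count the IOMMU groups that the kernel exposes under `/sys/kernel/iommu_groups`. The sketch below assumes that a disabled SMMU results in an empty or missing directory, which is typical but may depend on the kernel configuration. The function name `count_iommu_groups` is hypothetical.

```python
from pathlib import Path

def count_iommu_groups(root: str = "/sys/kernel/iommu_groups") -> int:
    """Number of IOMMU groups the kernel exposes; 0 if the directory is absent."""
    path = Path(root)
    if not path.is_dir():
        return 0
    # Each group is a numbered subdirectory (0, 1, 2, ...).
    return sum(1 for entry in path.iterdir() if entry.is_dir())

if __name__ == "__main__":
    groups = count_iommu_groups()
    state = "appears active" if groups else "appears inactive"
    print(f"IOMMU groups: {groups} (device address translation {state})")
```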
