BIOS Configuration

This section describes BIOS configuration procedures designed to optimize performance on the Ascend AI Processor.

Modifying the CPU Power Consumption Mode

Principle: On servers with the Ascend AI Processor, the BIOS generally provides two CPU power policy settings: Efficiency and Performance.

  • Efficiency is the power saving mode. Dynamic frequency and voltage scaling is enabled, so the CPU adjusts its working frequency based on the load.
  • Performance is the performance mode. Dynamic frequency scaling is disabled, so the CPU always runs at the maximum frequency.
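The effect of the two policies can be spot-checked from the operating system. The following is a minimal sketch, assuming the Linux kernel exposes the standard cpufreq sysfs files (`scaling_cur_freq` and `cpuinfo_max_freq`) on the server. Whether these files are present depends on the kernel driver and firmware, and the helper names `read_cpu_freqs` and `cpus_below_max` are hypothetical.

```python
from pathlib import Path

# Standard Linux cpufreq sysfs location. It may be empty on some systems,
# in which case the result is simply an empty dictionary.
CPUFREQ_ROOT = Path("/sys/devices/system/cpu")


def read_cpu_freqs(root: Path = CPUFREQ_ROOT) -> dict:
    """Return {cpu_name: (current_khz, max_khz)} for CPUs exposing cpufreq."""
    freqs = {}
    for cpu_dir in sorted(root.glob("cpu[0-9]*")):
        cur = cpu_dir / "cpufreq" / "scaling_cur_freq"
        top = cpu_dir / "cpufreq" / "cpuinfo_max_freq"
        if cur.is_file() and top.is_file():
            freqs[cpu_dir.name] = (int(cur.read_text()), int(top.read_text()))
    return freqs


def cpus_below_max(freqs: dict, tolerance: float = 0.02) -> list:
    """List CPUs whose current frequency is more than `tolerance` below max."""
    return [name for name, (cur, top) in freqs.items()
            if cur < top * (1 - tolerance)]
```

Calling `cpus_below_max(read_cpu_freqs())` while a workload is running gives a rough indication: under the Performance policy the list would be expected to be empty, while under Efficiency lightly loaded cores may appear in it.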

Optimization configuration: Set Power Policy to Performance for better performance.

Disadvantage: The performance mode increases power consumption.

Recommended scenario: In inference scenarios, you are advised to enable the performance mode to improve CPU performance and reduce CPU idle cycles, which improves model performance.

To study how the high-performance function impacts inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance before and after the high-performance function is enabled. The following table shows the test data based on MindIE. Because software and hardware configurations vary, the following test data is intended for reference only and should not be considered a performance standard.

Table 1 Experiment data

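| Model | Concurrency | Input Length | Experiment No. | Default: Efficiency (Tokens/s) | Performance (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | Experiment 1 | 75.4373 | 76.1809 | 0.99 |
| Llama-7B | 8 | 128 | Experiment 2 | 75.3953 | 76.0922 | 0.92 |
| Llama-7B | 8 | 128 | Experiment 3 | 75.4051 | 76.0719 | 0.88 |
| Llama-7B | 8 | 128 | Average value | 75.4126 | 76.1150 | 0.93 |
| Llama-7B | 8 | 256 | Experiment 1 | 76.5359 | 77.5444 | 1.32 |
| Llama-7B | 8 | 256 | Experiment 2 | 76.5362 | 77.3321 | 1.04 |
| Llama-7B | 8 | 256 | Experiment 3 | 77.0832 | 77.9778 | 1.16 |
| Llama-7B | 8 | 256 | Average value | 76.7184 | 77.6181 | 1.17 |
| Qwen2-7B | 8 | 128 | Experiment 1 | 83.5893 | 84.6158 | 1.23 |
| Qwen2-7B | 8 | 128 | Experiment 2 | 83.4479 | 84.4310 | 1.18 |
| Qwen2-7B | 8 | 128 | Experiment 3 | 83.3766 | 84.3732 | 1.20 |
| Qwen2-7B | 8 | 128 | Average value | 83.4713 | 84.4733 | 1.20 |
| Qwen2-7B | 8 | 256 | Experiment 1 | 84.7990 | 85.8575 | 1.25 |
| Qwen2-7B | 8 | 256 | Experiment 2 | 84.7267 | 85.5879 | 1.02 |
| Qwen2-7B | 8 | 256 | Experiment 3 | 84.8510 | 86.2523 | 1.65 |
| Qwen2-7B | 8 | 256 | Average value | 84.7922 | 85.8992 | 1.31 |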
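The Performance Gains (%) column is consistent with the relative throughput change over the default setting. The following sketch reproduces the Llama-7B rows with input length 128 from Table 1. The formula is inferred from the table values rather than stated in this guide, and the function name `gain_percent` is hypothetical.

```python
def gain_percent(baseline: float, tuned: float) -> float:
    """Relative throughput change of `tuned` over `baseline`, in percent."""
    return (tuned - baseline) / baseline * 100.0


# Llama-7B, concurrency 8, input length 128 (values copied from Table 1).
efficiency = [75.4373, 75.3953, 75.4051]
performance = [76.1809, 76.0922, 76.0719]

# Gain of each individual experiment.
per_run = [round(gain_percent(b, t), 2) for b, t in zip(efficiency, performance)]

# Gain computed from the averaged throughput of the three experiments.
avg_gain = round(gain_percent(sum(efficiency) / 3, sum(performance) / 3), 2)

print(per_run)   # [0.99, 0.92, 0.88]
print(avg_gain)  # 0.93
```

The same calculation applies to the other tables in this section, where a negative value means that the non-default setting lowered throughput.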

Configuration method: Access the BIOS through the BMC and set Power Policy to Performance under Advanced > Performance Config > Power Policy.

Figure 1 Example configuration

Modifying the Memory Refresh Rate

Principle: DRAM stores each bit as charge in a capacitor. Because the capacitor leaks, the charge drains over time and the data would be lost, so the cells must be recharged periodically. This is called a refresh operation. A refresh and a read/write operation cannot be performed at the same time, so refreshing reduces memory performance. The BIOS provides an Auto setting for the memory refresh rate, which adjusts the rate automatically based on the current operating temperature and delivers better memory performance than the default 32 ms configuration.
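To illustrate why the refresh rate matters, the following sketch estimates the share of time a memory rank spends refreshing. All numbers are illustrative assumptions, not values from this guide: 8192 refresh commands per refresh window, 350 ns of blocked time per command, and a 64 ms window as an example of what Auto might select at lower temperatures.

```python
def refresh_busy_fraction(window_ms: float,
                          refresh_commands: int = 8192,
                          t_rfc_ns: float = 350.0) -> float:
    """Fraction of time a DRAM rank is busy refreshing.

    Assumptions (illustrative only): every refresh window issues
    `refresh_commands` commands, and each command blocks reads and
    writes for `t_rfc_ns` nanoseconds.
    """
    busy_ns = refresh_commands * t_rfc_ns
    return busy_ns / (window_ms * 1e6)  # 1 ms = 1e6 ns


for window_ms in (32, 64):
    print(f"{window_ms} ms window: {refresh_busy_fraction(window_ms):.1%} busy")
```

Under these assumptions, doubling the refresh window halves the time lost to refresh, which is the kind of headroom a temperature-aware Auto setting can recover.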

Optimization configuration: Set the memory refresh rate to Auto.

Recommended scenario: Use Auto so that the memory refresh rate is adjusted dynamically, which can improve memory copy performance and model performance.

To study how different memory refresh rates impact the inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance. The following table shows the test data based on MindIE. Because software and hardware configurations vary, the following test data is intended for reference only.

Table 2 Experiment data

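| Model | Concurrency | Input Length | Request Count | Experiment No. | Default: 32 ms (Tokens/s) | Auto (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | 2000 | Experiment 1 | 75.4373 | 80.4640 | 6.66 |
| Llama-7B | 8 | 128 | 2000 | Experiment 2 | 75.3953 | 80.3319 | 6.55 |
| Llama-7B | 8 | 128 | 2000 | Experiment 3 | 75.4051 | 80.4814 | 6.73 |
| Llama-7B | 8 | 128 | 2000 | Average value | 75.4126 | 80.4258 | 6.65 |
| Llama-7B | 8 | 256 | 2000 | Experiment 1 | 76.5359 | 81.8636 | 6.96 |
| Llama-7B | 8 | 256 | 2000 | Experiment 2 | 76.5362 | 81.8073 | 6.89 |
| Llama-7B | 8 | 256 | 2000 | Experiment 3 | 77.0832 | 81.7618 | 6.07 |
| Llama-7B | 8 | 256 | 2000 | Average value | 76.7184 | 81.8109 | 6.64 |
| Qwen2-7B | 8 | 128 | 2000 | Experiment 1 | 83.5893 | 87.0385 | 4.13 |
| Qwen2-7B | 8 | 128 | 2000 | Experiment 2 | 83.4479 | 87.1340 | 4.42 |
| Qwen2-7B | 8 | 128 | 2000 | Experiment 3 | 83.3766 | 86.9090 | 4.24 |
| Qwen2-7B | 8 | 128 | 2000 | Average value | 83.4713 | 87.0272 | 4.26 |
| Qwen2-7B | 8 | 256 | 2000 | Experiment 1 | 84.7990 | 87.9068 | 3.66 |
| Qwen2-7B | 8 | 256 | 2000 | Experiment 2 | 84.7267 | 87.8995 | 3.74 |
| Qwen2-7B | 8 | 256 | 2000 | Experiment 3 | 84.8510 | 87.8859 | 3.58 |
| Qwen2-7B | 8 | 256 | 2000 | Average value | 84.7922 | 87.8974 | 3.66 |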

Configuration method: Access the BIOS through the BMC and set Custom Refresh Rate to Auto under Advanced > Memory Config > Custom Refresh Rate.

Figure 2 Example configuration

Modifying the CPU Prefetching Configuration

Principle: When the CPU reads data from memory into its cache, it fetches not only the data being accessed but also, following the locality principle, prefetches the data adjacent to it. If the prefetched data is what the CPU needs next, it is already in the cache and performance improves.
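The dependence on locality can be illustrated with a toy model. The following sketch simulates a simple next-line prefetcher in front of an unbounded cache and compares sequential with scattered accesses. It is a deliberately simplified model with a hypothetical `hit_ratio` helper and an assumed 64-byte cache line, not a description of the actual CPU prefetcher.

```python
import random


def hit_ratio(addresses, line_size=64, prefetch=True):
    """Hit ratio of a toy, unbounded cache with optional next-line prefetch."""
    cached = set()
    hits = 0
    for addr in addresses:
        line = addr // line_size
        if line in cached:
            hits += 1
        cached.add(line)
        if prefetch:
            cached.add(line + 1)  # also fetch the neighbouring line
    return hits / len(addresses)


rng = random.Random(0)  # fixed seed for repeatable results
n = 10_000
sequential = [i * 64 for i in range(n)]                  # one access per line
scattered = [rng.randrange(1 << 40) for _ in range(n)]   # no locality

print(f"sequential, prefetch on:  {hit_ratio(sequential):.2%}")
print(f"sequential, prefetch off: {hit_ratio(sequential, prefetch=False):.2%}")
print(f"scattered,  prefetch on:  {hit_ratio(scattered):.2%}")
```

With sequential accesses, every access after the first hits a prefetched line, while without prefetching each new line misses. With scattered accesses, the prefetched lines are almost never used, so prefetching brings no benefit.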

Optimization configuration: Enable CPU prefetching.

Disadvantage: Prefetching helps only when data accesses are concentrated and the prefetch hit ratio stays high. When accesses are scattered and the hit ratio is low, the prefetched data goes unused and wastes memory bandwidth, so disable CPU prefetching in such scenarios.

Recommended scenario: In inference scenarios, you are advised to enable CPU prefetching to improve the CPU data read performance and model performance.

To study how the CPU prefetching function impacts inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance with CPU prefetching enabled (the default) and disabled. The following table shows the test data based on MindIE; negative gains indicate that disabling CPU prefetching lowers throughput. Because software and hardware configurations vary, the following test data is intended for reference only.

Table 3 Experiment data

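| Model | Concurrency | Input Length | Experiment No. | Default: CPU Prefetching Enabled (Tokens/s) | CPU Prefetching Disabled (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | Experiment 1 | 75.4373 | 74.7984 | -0.85 |
| Llama-7B | 8 | 128 | Experiment 2 | 75.3953 | 74.6472 | -0.99 |
| Llama-7B | 8 | 128 | Experiment 3 | 75.4051 | 74.6587 | -0.99 |
| Llama-7B | 8 | 128 | Average value | 75.4126 | 74.7014 | -0.94 |
| Llama-7B | 8 | 256 | Experiment 1 | 76.5359 | 75.9913 | -0.71 |
| Llama-7B | 8 | 256 | Experiment 2 | 76.5362 | 75.9398 | -0.78 |
| Llama-7B | 8 | 256 | Experiment 3 | 77.0832 | 76.4258 | -0.85 |
| Llama-7B | 8 | 256 | Average value | 76.7184 | 76.1190 | -0.78 |
| Qwen2-7B | 8 | 128 | Experiment 1 | 83.5893 | 82.1724 | -1.70 |
| Qwen2-7B | 8 | 128 | Experiment 2 | 83.4479 | 81.9364 | -1.81 |
| Qwen2-7B | 8 | 128 | Experiment 3 | 83.3766 | 82.0396 | -1.60 |
| Qwen2-7B | 8 | 128 | Average value | 83.4713 | 82.0495 | -1.70 |
| Qwen2-7B | 8 | 256 | Experiment 1 | 84.7990 | 83.6620 | -1.34 |
| Qwen2-7B | 8 | 256 | Experiment 2 | 84.7267 | 83.0590 | -1.97 |
| Qwen2-7B | 8 | 256 | Experiment 3 | 84.8510 | 83.9055 | -1.11 |
| Qwen2-7B | 8 | 256 | Average value | 84.7922 | 83.5422 | -1.47 |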

Configuration method: Access the BIOS through the BMC and set CPU Prefetching Configuration to Enabled under Advanced > MISC Config > CPU Prefetching Configuration.

Figure 3 Example configuration

Disabling SMMU

Principle: The SMMU (System Memory Management Unit) translates the virtual addresses used by devices into physical addresses. This translation adds overhead and latency to device memory accesses, which can degrade system performance.
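Whether address translation is active can be checked from the operating system. The following is a heuristic sketch, assuming a Linux kernel that lists IOMMU groups under `/sys/kernel/iommu_groups` when an IOMMU (the SMMU on Arm servers) is in use. Kernel boot parameters can also influence the result, and the helper names are hypothetical.

```python
from pathlib import Path

# Linux creates one numbered directory per IOMMU group here when an IOMMU
# is active in the kernel; the directory is empty or absent otherwise.
IOMMU_GROUPS = Path("/sys/kernel/iommu_groups")


def count_iommu_groups(root: Path = IOMMU_GROUPS) -> int:
    """Number of IOMMU groups; 0 if the directory is missing or empty."""
    if not root.is_dir():
        return 0
    return sum(1 for entry in root.iterdir() if entry.is_dir())


def smmu_looks_active(root: Path = IOMMU_GROUPS) -> bool:
    """Heuristic: any IOMMU group suggests device address translation is on."""
    return count_iommu_groups(root) > 0
```

After the SMMU is disabled in the BIOS and the server is rebooted, `smmu_looks_active()` would be expected to return `False` on such a system.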

Optimization configuration: Disable the SMMU.

Recommended scenario: Disable the SMMU in non-VM scenarios (bare metal and Docker) to improve model performance.

To study how the SMMU impacts inference performance, the lab utilizes an Atlas 800T A2 server running openEuler 22.03 SP4. Within this environment, the Llama-7B and Qwen2-7B models are deployed to provide a comparative analysis of execution performance before and after the SMMU is enabled. The following table shows the test data based on MindIE. Because software and hardware configurations vary, the following test data is intended for reference only.

Table 4 Experiment data

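| Model | Concurrency | Input Length | Experiment No. | Default: SMMU Disabled (Tokens/s) | SMMU Enabled (Tokens/s) | Performance Gains (%) |
|---|---|---|---|---|---|---|
| Llama-7B | 8 | 128 | Experiment 1 | 75.4373 | 74.9460 | -0.65 |
| Llama-7B | 8 | 128 | Experiment 2 | 75.3953 | 74.7320 | -0.88 |
| Llama-7B | 8 | 128 | Experiment 3 | 75.4051 | 75.0313 | -0.50 |
| Llama-7B | 8 | 128 | Average value | 75.4126 | 74.9031 | -0.68 |
| Llama-7B | 8 | 256 | Experiment 1 | 76.5359 | 75.7005 | -1.09 |
| Llama-7B | 8 | 256 | Experiment 2 | 76.5362 | 75.6460 | -1.16 |
| Llama-7B | 8 | 256 | Experiment 3 | 77.0832 | 75.8298 | -1.63 |
| Llama-7B | 8 | 256 | Average value | 76.7184 | 75.7254 | -1.29 |
| Qwen2-7B | 8 | 128 | Experiment 1 | 83.5893 | 81.5155 | -2.48 |
| Qwen2-7B | 8 | 128 | Experiment 2 | 83.4479 | 81.7802 | -2.00 |
| Qwen2-7B | 8 | 128 | Experiment 3 | 83.3766 | 82.2530 | -1.35 |
| Qwen2-7B | 8 | 128 | Average value | 83.4713 | 81.8496 | -1.94 |
| Qwen2-7B | 8 | 256 | Experiment 1 | 84.7990 | 82.8234 | -2.33 |
| Qwen2-7B | 8 | 256 | Experiment 2 | 84.7267 | 82.7696 | -2.31 |
| Qwen2-7B | 8 | 256 | Experiment 3 | 84.8510 | 82.7741 | -2.45 |
| Qwen2-7B | 8 | 256 | Average value | 84.7922 | 82.7890 | -2.36 |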

Configuration method: Access the BIOS through the BMC and set Support Smmu to Disabled under Advanced > MISC Config > Support Smmu.

Figure 4 Example configuration