Affinity-based Automatic CPU Core Binding Optimization

Principles

Compared with x86 servers, Arm servers usually have more CPU cores but weaker single-core performance, so the kernel load balancing policy is more likely to be triggered on Arm servers. This policy migrates processes to reduce pressure on busy processors. Process migration causes context switching, lowers the cache hit ratio, and introduces cross-NUMA memory access, all of which affect training performance.

You can run the script of the affinity-based automatic CPU core binding tool without any modification. It automatically binds the training processes to CPU cores on the Arm server, improving the CPU affinity of the training processes.
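
As a rough illustration of what binding a process to CPU cores means on Linux, the following minimal sketch uses the Python standard library. It is not the implementation of bind_core.py, whose internals are not described here; the helper names pin_to_cores and first_allowed_cores are hypothetical, and the choice of cores is arbitrary.

```python
import os


def pin_to_cores(pid, cores):
    """Restrict process `pid` (0 means the calling process) to the given core IDs.

    Hypothetical helper for illustration only; Linux only.
    """
    os.sched_setaffinity(pid, set(cores))
    # Read the mask back so the caller can confirm what the kernel applied.
    return os.sched_getaffinity(pid)


def first_allowed_cores(count):
    """Pick up to `count` core IDs from those this process may currently run on."""
    return sorted(os.sched_getaffinity(0))[:count]
```

For example, pin_to_cores(0, first_allowed_cores(2)) keeps the current process on at most two cores, so the kernel load balancer can no longer migrate it to other cores.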

Environment Setup

You have obtained the affinity-based automatic CPU core binding tool script bind_core.py and uploaded it to any directory in the environment, for example, $HOME/test. Click here to obtain bind_core.py.
The driver version must be 23.0.RC2 or later.

Application Scenarios

The load imbalance caused by process migration usually occurs in multi-card scenarios.

During multi-card training, run the htop command to check the usage of each CPU core. The number in each column indicates the CPU core ID, and the progress bar indicates the usage of that core. As shown in Figure 1, the usage of some CPU cores reaches 100%, while other cores are lightly used or idle. The resource allocation is unbalanced.

Figure 1 CPU core usage

In this scenario, you can use the affinity-based automatic CPU core binding tool to implement CPU load balancing.
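
If htop is not available, a similar per-core view can be approximated by sampling /proc/stat twice. The sketch below is an assumption-laden illustration, not part of the core binding tool: the function names are hypothetical, and idle time is taken to be the idle plus iowait counters.

```python
import time


def read_cpu_times(stat_text):
    """Return {core_id: (busy_ticks, total_ticks)} parsed from /proc/stat content."""
    times = {}
    for line in stat_text.splitlines():
        fields = line.split()
        # Per-core lines look like "cpu0 user nice system idle iowait irq ...".
        # The aggregate "cpu" line and non-CPU lines are skipped.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        core_id = int(fields[0][3:])
        values = [int(v) for v in fields[1:]]
        # user..steal; guest counters are already included in user/nice.
        total = sum(values[:8])
        idle = values[3] + (values[4] if len(values) > 4 else 0)
        times[core_id] = (total - idle, total)
    return times


def usage_percent(before, after):
    """Per-core busy percentage between two read_cpu_times() samples."""
    usage = {}
    for core, (busy1, total1) in after.items():
        busy0, total0 = before.get(core, (0, 0))
        elapsed = total1 - total0
        usage[core] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return usage


def sample_usage(interval=1.0, path="/proc/stat"):
    """Sample the live system twice, `interval` seconds apart (Linux only)."""
    with open(path) as f:
        before = read_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_cpu_times(f.read())
    return usage_percent(before, after)
```

Calling sample_usage() during training returns a dictionary of core ID to usage percentage. A few cores near 100% alongside many near 0% corresponds to the unbalanced pattern in Figure 1.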

Procedure

The affinity-based automatic CPU core binding tool has two modes:

launch: The core binding tool starts the training job itself and then binds the training processes to cores.

attach: The training job is started first, and the core binding tool is then started to bind the running training processes to cores.

launch mode

Go to the directory where the training script is stored and run the following command to run the core binding tool:
```
python3 $HOME/test/bind_core.py --application "bash xxx.sh"
```
--application or -app: training command. Replace bash xxx.sh with the actual training command. The parameter value must be enclosed in double quotation marks ("").

After the core binding tool is executed, the training process is started and automatic core binding is implemented. After core binding is complete, the core binding result is written to a log file in the current directory. The file is automatically generated with a name in the format bind_core_xxxx_xx_xx_xx_xx_xx.log, where xxxx_xx_xx_xx_xx_xx indicates the timestamp. As shown in Figure 2, core binding is complete.

Figure 2 Core binding completed in launch mode
After the core binding tool starts, it checks for the training process for up to 30 seconds. If the training process has not started within 30 seconds, an error is reported and core binding fails. Use -t or --time to set the delay based on the data preprocessing time required for model training. The command is as follows:
```
python3 $HOME/test/bind_core.py --time 60 --application "bash xxx.sh" # The core binding tool will start running in 60 seconds.
```
Run the following command to open the result file and view the core binding result, as shown in Figure 3:
```
cat bind_core_xxxx_xx_xx_xx_xx_xx.log
```
Figure 3 Core binding result in launch mode
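
Because the timestamp in the file name changes on every run, it can be convenient to locate the most recent result file programmatically. The following sketch is a hypothetical convenience helper, not a feature of the tool. It relies only on the bind_core_*.log naming described above and picks the newest file by modification time, so it makes no assumption about the exact timestamp layout.

```python
import glob
import os


def latest_bind_log(directory="."):
    """Return the most recently modified bind_core_*.log in `directory`, or None."""
    logs = glob.glob(os.path.join(directory, "bind_core_*.log"))
    return max(logs, key=os.path.getmtime) if logs else None


def read_latest_bind_log(directory="."):
    """Return the text of the newest result file, or None if there is none."""
    path = latest_bind_log(directory)
    if path is None:
        return None
    with open(path) as f:
        return f.read()
```

Run from the directory where the core binding tool was started, print(read_latest_bind_log()) displays the same content as the cat command without typing the timestamp.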

attach mode

Go to the directory where the training script is stored and run the following command to start training:
```
bash xxx.sh
```
xxx.sh: training script. Replace it based on the actual situation.
Open a new terminal window, go to the $HOME/test directory, and run the following command to start the core binding tool:
```
python3 bind_core.py
```
After the training process is complete, the core binding tool automatically stops running and writes the core binding result to a log file in the current directory. The file is automatically generated with a name in the format bind_core_xxxx_xx_xx_xx_xx_xx.log, where xxxx_xx_xx_xx_xx_xx indicates the timestamp. As shown in Figure 4, core binding is complete.

Figure 4 Core binding completed in attach mode
After the core binding tool starts, it checks for the training process for up to 30 seconds. If the training process has not started within 30 seconds, an error is reported and core binding fails. Use -t or --time to set the delay based on the data preprocessing time required for model training. The command is as follows:
```
python3 bind_core.py --time 60 # The core binding tool will start running in 60 seconds.
```
Run the following command to open the result file and view the core binding result, as shown in Figure 5:
```
cat bind_core_xxxx_xx_xx_xx_xx_xx.log
```
Figure 5 Core binding result in attach mode

In both launch and attach modes, run the htop command after the core binding command is executed. As shown in Figure 6, the output indicates that the CPU usage is balanced.

Figure 6 Balanced CPU core usage
