AI Core Parallelism
In the prototype definition of for_range, you can set the block_num parameter to enable parallel execution across block_num AI Cores. The following is a simple code example.
with tik_instance.for_range(0, 10, block_num=10) as i:
In the preceding example, the body of the for_range loop is expanded into 10 execution instances, which are distributed to the AI Cores for parallel execution. Each core is assigned one execution instance together with a unique block ID. If fewer than 10 cores are available, the execution instances are scheduled and executed in batches on those cores. If 10 or more cores are available, each execution instance runs on its own core.
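The batching behavior described above can be modeled with a short host-side sketch. This is plain Python, not TIK code, and schedule_blocks is a hypothetical helper introduced here only to illustrate how block_num execution instances map onto the physical cores:

```python
# Hypothetical helper (not part of the TIK API): models how block_num
# execution instances are batched onto the available physical AI Cores.
def schedule_blocks(block_num, core_num):
    """Return a list of batches; batch k holds the block IDs that
    execute concurrently in round k."""
    return [list(range(start, min(start + core_num, block_num)))
            for start in range(0, block_num, core_num)]

# 10 instances on 10 cores: a single batch, fully parallel.
print(schedule_blocks(10, 10))  # [[0, 1, ..., 9]]

# 10 instances on only 4 cores: executed in 3 batches.
print(schedule_blocks(10, 4))   # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

When block_num exceeds the physical core count, the last batch may be only partially filled, which is one reason the load-balancing advice below suggests choosing block_num as a multiple of the core count.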
# Set this parameter based on the Ascend AI Processor version.
soc_version = "xxx"
# Set the Ascend AI Processor version and type of the target core.
tbe.common.platform.set_current_compile_soc_info(soc_version)
# Query the number of AI Cores. Set the AI Processor type (the preceding call) before using this API.
tbe.common.platform.get_soc_spec("CORE_NUM")
Notes:
- The Global Memory is visible to all cores. When core parallelism is enabled, tensors in the Global Memory must be defined outside the for_range loop. In contrast, tensors stored in the Scalar Buffer and Unified Buffer are visible only to the core that owns them and therefore must be defined inside the loop.
- block_num defaults to 1, indicating that AI Core parallelism is disabled. Ensure that the value of block_num does not exceed the upper limit of 65535.
- For load balancing, set block_num to a multiple of the number of available cores. Assume that the AI Processor contains 32 AI Cores. For a tensor with shape (16, 2, 32, 32, 32), parallelizing along the first (outermost) dimension binds at most 16 AI Cores. To utilize more cores, reshape the tensor to (32, 32, 32, 32); tasks can then be scheduled to up to 32 AI Cores. Note that, due to the automatic memory allocation mechanism of the backend, the reshape must start from the outermost dimension.
- In an operator, the for_range loop can be called only once to implement AI Core parallelism (block_num >= 2). That is, AI Core parallelism cannot be enabled repeatedly.
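The reshape trick in the load-balancing note can be illustrated with NumPy on the host. This is illustrative only; the real tensor lives in Global Memory on the device, but the shape arithmetic is the same:

```python
import numpy as np

# The parallel dimension is the outermost one, so the number of AI Cores
# that can be bound is limited by its size.
t = np.empty((16, 2, 32, 32, 32), dtype=np.float32)
print(t.shape[0])  # 16 -> at most 16 cores can be bound

# Merging the two outermost dimensions (16 * 2 = 32) exposes 32-way
# parallelism; the element count and memory layout are unchanged.
t2 = t.reshape(32, 32, 32, 32)
print(t2.shape[0])  # 32 -> up to 32 cores can be bound
```

Because a row-major reshape that only merges the outermost dimensions does not move any data, it changes only how the work is partitioned across cores.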
Note that if the data a core writes to the Global Memory is not 32-byte aligned, the trailing non-aligned elements may overwrite another core's data when results are written back. As shown in Figure 1, two AI Cores process a 24-element 1D tensor of type float32 in parallel and output the result to the Global Memory. Subject to physical constraints, data movement to and from the Unified Buffer must be performed in units of 32 bytes (eight float32 elements). As shown in the figure, the dotted elements are moved to the Global Memory together with elements 8–11 of AI Core 1, overwriting the shaded elements that belong to AI Core 2 and affecting compute accuracy.
To ensure compute accuracy, address rollback is performed when the data to be moved into or out of the Unified Buffer is not a multiple of 32 bytes, preventing data overwrites. As shown in Figure 2, the Global Memory holds 24 float32 elements, tiled into two parts: GM[0]–GM[11] assigned to AI Core 1 and GM[12]–GM[23] assigned to AI Core 2. Take AI Core 1 as an example. Elements 0–7 in the Global Memory are moved into the first block (32 bytes) of the Unified Buffer. The second movement rolls the start address back to element 4 instead of element 8, so that the transfer still meets the 32-byte alignment requirement: elements 4–11 are moved into the second block, with elements 4–7 (marked in gray) moved repeatedly. After the Vector Unit completes the computation, the results of the two blocks in the Unified Buffer are moved back to the Global Memory in two separate transfers. Address rollback ensures that the repeatedly moved data overwrites the same address range (shaded in gray) in the Global Memory. In this way, a tail shorter than 32 bytes on each AI Core is moved correctly, and no other core's data is overwritten.
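The rollback arithmetic can be sketched in plain Python. This is a host-side model, not TIK code; core_write_ranges is a hypothetical helper that computes, for one core, the element ranges of each 32-byte transfer after address rollback:

```python
ELEM_BYTES = 4                      # bytes per float32
BLOCK_BYTES = 32                    # minimum data-move granularity
PER_BLOCK = BLOCK_BYTES // ELEM_BYTES  # 8 float32 elements per transfer

def core_write_ranges(start, count):
    """Model the transfers of one core that owns elements
    [start, start + count). Every transfer spans exactly PER_BLOCK
    elements; a tail shorter than one block is handled by rolling the
    start address back so the transfer still ends at start + count."""
    ranges = []
    pos = start
    while pos + PER_BLOCK <= start + count:
        ranges.append((pos, pos + PER_BLOCK))   # full aligned block
        pos += PER_BLOCK
    if pos < start + count:                     # tail < 32 bytes
        end = start + count
        ranges.append((end - PER_BLOCK, end))   # roll back start address
    return ranges

# 24 float32 elements split across two cores, 12 elements each.
print(core_write_ranges(0, 12))    # [(0, 8), (4, 12)]
print(core_write_ranges(12, 12))   # [(12, 20), (16, 24)]
```

Each core's final transfer ends exactly at its own boundary (element 12 for AI Core 1, element 24 for AI Core 2), so the duplicated elements overwrite only addresses the core already owns and never spill into the neighboring core's region.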

