AI CPU Programming
The AI CPU is an Arm64-based processor on the device. It has the same memory access capability as the AI Core and can directly access memory resources on the device. It can also perform the same kinds of data computation as the host CPU. It usually supplements the AI Core and mainly handles non-cube computation and logic-heavy, branch-intensive computation. The AI CPU runs in a basic Linux environment, so AI CPU code can use libc and the C++ standard library, including the STL. The following figure shows its hardware architecture.
AI CPU Kernel Function Definition
As with the AI Core, AI CPU programming requires you to define a function entry (kernel function) on the device. This function must be declared with the __aicpu__ identifier together with the __global__ identifier, which indicates that it can be called only by the host. The on-device implementation file of the AI CPU must use the .aicpu suffix (or be compiled with the -x aicpu option). This file contains the kernel function described above and the definitions of AI CPU common functions. AI CPU common functions do not need an execution space identifier.
The following is an example of an AI CPU "Hello World" program. The content of the hello_world.aicpu file is as follows:
// Header file required for calling the printf API
#include "aicpu_api.h"

__global__ __aicpu__ uint32_t hello_world(void *args)
{
    AscendC::printf("Hello World!!!\n");
    return 0;
}
Observe the following rules during programming:
- An __aicpu__ __global__ function cannot return void, and its input parameter can only be a pointer.
- An __aicpu__ __global__ function cannot be a member function of a class and cannot be defined in an anonymous namespace.
- Although the AI CPU kernel function has a return value, that value is used only by the Runtime component to report the operating status. You do not need to write any logic for the return value, and the return value cannot be used. For users, the AI CPU kernel function is therefore equivalent to a void function and cannot be used as an rvalue.
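Because the entry accepts only a pointer, all inputs and outputs are typically packed into one structure and passed by address. The following is a minimal sketch of that pattern, not the toolkit's prescribed layout. KernelArgs, its fields, and sum_kernel are hypothetical names, and the __global__ __aicpu__ identifiers are left out so that the snippet builds with a standard C++ compiler on the host.

```cpp
#include <cstddef>
#include <cstdint>
#include <numeric>

// Hypothetical argument pack: every input and output travels through one pointer.
struct KernelArgs {
    const int32_t *data;  // input buffer (would be device memory in a real run)
    size_t count;         // number of elements in data
    int64_t result;       // output slot written by the kernel
};

// Stand-in for an AI CPU entry. In a .aicpu file it would be declared as
//   __global__ __aicpu__ uint32_t sum_kernel(void *args)
uint32_t sum_kernel(void *args)
{
    // Recover the typed view of the single pointer parameter.
    auto *p = static_cast<KernelArgs *>(args);
    // Standard library algorithms can be used in AI CPU code.
    p->result = std::accumulate(p->data, p->data + p->count, int64_t{0});
    return 0;  // status value; only the Runtime component would read it
}
```

Results are written back through the structure rather than through the return value, which matches the rule that the return value is not available to the caller.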
AI CPU Kernel Function Call
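On the host, the kernel function is called with the <<<...>>> syntax, which takes blockDim, l2ctrl, and stream in that order:
hello_world<<<blockDim, nullptr, stream>>>(&args, sizeof(KernelArgs));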
- blockDim: The AI CPU does not currently support core allocation logic, so calling multiple cores from the host is meaningless. The value 1 is recommended.
- l2ctrl: Reserved parameter. The value is currently fixed to nullptr and can be ignored.
- stream: A stream of the aclrtStream type. A stream preserves the execution order of the asynchronous operations queued on it for the device. For details about management APIs for stream creation and other operations, see "Application Development APIs > AscendCL API (C&C++) > Runtime Management > Stream Management".
Observe the following rules when writing the calling code:
- The __aicpu__ __global__ function cannot be defined in the .asc file. It can only be declared there, and the declaration must use extern.
- When the __global__ __aicpu__ function is called on the host, the <<<...>>> heterogeneous call syntax must be used. In addition to the argument pointer, the call must pass the size of the data to be read from that pointer.
- AI Core operators and AI CPU operators that are called on the host with the kernel launch symbol <<<...>>> cannot use the same stream.
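The pointer-plus-size rule can be illustrated with a host-only simulation. This sketch does not use the real <<<...>>> syntax or any runtime API. mock_launch, add_kernel, g_last_sum, and the fields of KernelArgs are hypothetical, and the idea that the launcher stages exactly size bytes for the kernel is an assumption made for illustration, not documented behavior.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical argument pack, reusing the KernelArgs name from the call above.
struct KernelArgs {
    int32_t a;
    int32_t b;
};

using KernelFn = uint32_t (*)(void *);

// Last value computed by the stand-in kernel, kept so the caller can inspect it.
static int32_t g_last_sum = 0;

// Stand-in for a kernel entry: it receives only a pointer, never the size.
uint32_t add_kernel(void *args)
{
    KernelArgs local;
    std::memcpy(&local, args, sizeof(local));  // read the staged bytes
    g_last_sum = local.a + local.b;
    return 0;
}

// Hypothetical launcher imitating kernel<<<...>>>(ptr, size).
// ASSUMPTION: size tells the launcher how many bytes to stage for the kernel.
void mock_launch(KernelFn kernel, const void *args, size_t size)
{
    std::vector<unsigned char> staged(size);
    std::memcpy(staged.data(), args, size);
    kernel(staged.data());
}
```

Calling mock_launch(add_kernel, &args, sizeof(KernelArgs)) mirrors the shape of the real call: the kernel signature has a single pointer parameter, while the call site supplies both the pointer and the byte count.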
When loading and running operators, you need to use runtime APIs for runtime management and configuration. For details, see Operator Execution. For details about how to build AI CPU operators, see AI CPU Operator Compilation.
AI CPU Template Kernel Function
AI CPU kernel functions can also be templates. Define the template kernel function in the .aicpu file and instantiate it explicitly:
template<typename T, int BUFF_SIZE>
__global__ __aicpu__ uint32_t hello_world(void *args)
{
    AscendC::printf("Hello World!!!\n");
    AscendC::printf("buffer_size is %d\n", BUFF_SIZE);
    return 0;
}

template __global__ __aicpu__ uint32_t hello_world<KernelArgs, 4096>(void *args);
Add the extern declaration for template kernel function instantiation to the .asc file.
template<typename T, int BUFF_SIZE>
extern __global__ __aicpu__ uint32_t hello_world(void *args);
template extern __global__ __aicpu__ uint32_t hello_world<KernelArgs, 4096>(void *args);
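The split between a definition with explicit instantiation in one file and an extern declaration in another has a close analogue in standard C++. The sketch below shows that analogue in a single translation unit so that it compiles with an ordinary compiler. It uses the standard extern template spelling rather than the template extern form shown above, omits the __global__ __aicpu__ identifiers, and uses the hypothetical names report_buffer_size and an empty KernelArgs.

```cpp
#include <cstdint>

struct KernelArgs {};  // hypothetical placeholder for the argument type

// Declaration visible to the caller (the role the .asc file plays).
template <typename T, int BUFF_SIZE>
uint32_t report_buffer_size(void *args);

// Explicit instantiation declaration: states that this specialization is
// instantiated elsewhere, so the caller does not need the function body.
extern template uint32_t report_buffer_size<KernelArgs, 4096>(void *args);

// Definition (the role the .aicpu file plays).
template <typename T, int BUFF_SIZE>
uint32_t report_buffer_size(void *args)
{
    (void)args;  // unused in this sketch
    return static_cast<uint32_t>(BUFF_SIZE);
}

// Explicit instantiation definition: emits code for exactly this
// combination of template arguments.
template uint32_t report_buffer_size<KernelArgs, 4096>(void *args);
```

Only the combinations that are explicitly instantiated are available to the caller, so each template argument set used on the host needs its own instantiation and matching declaration.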