SIMD Built-In Keywords and APIs
Predefined Macros
Like other languages, some built-in macros are provided for users to write programs. The part describes some macros that are frequently used during heterogeneous programming.
- __NPU_ARCH__
__NPU_ARCH__ is a preprocessing macro in the AI Core code on the device, which is used to identify the architecture version of the AI processor. This macro consists of four digits. The first three digits indicate the type of the intellectual property core (IP core) of the AI Core, and the fourth digit indicates the configuration version of the same IP core of the AI Core. With this macro, you can differentially adapt and optimize code for different AI processors. The following table lists the mapping between Ascend AI Processor models and __NPU_ARCH__.
Table 1 Mapping between Ascend AI Processor models and __NPU_ARCH__ Ascend AI Processor Model
__NPU_ARCH__
Atlas A3 training products /Atlas A3 inference products 2201
Atlas A2 training products /Atlas A2 inference products 2201
Atlas 200I/500 A2 inference products 3002
Atlas inference products 2002
Atlas training products 1001
The following is an example of using __NPU_ARCH__ to control the rounding mode of the operator output value on different AI processors:1 2 3 4 5 6 7 8 9 10 11 12
__aicore__ static inline void CopyOut(uint64_t mulLen) { #if __NPU_ARCH__ == 2002 Cast(dstLocal, srcLocal, RoundMode::CAST_NONE, mulLen); // CAST_NONE indicates that the CAST_RINT mode is used when precision loss occurs during conversion. If precision loss is not involved, rounding is not performed. #elif __NPU_ARCH__ == 2201 Cast(dstLocal, srcLocal, RoundMode::CAST_RINT, mulLen); // CAST_RINT indicates the banker's rounding mode. #endif event_t eventVToMTE3 = static_cast<event_t>(GetTPipePtr()->FetchEventID(HardEvent::V_MTE3)); SetFlag<HardEvent::V_MTE3>(eventVToMTE3); WaitFlag<HardEvent::V_MTE3>(eventVToMTE3); CommonCopyOut<float>(dstLocal, mulLen); // Copy a LocalTensor to a GlobalTensor. }
- ASCEND_IS_AIV, ASCEND_IS_AIC
ASCEND_IS_AIV and ASCEND_IS_AIC are conditional statements implemented using C++ macros. They are used to implement conditional compilation of code in functions decorated with __aicore__. When developing a fused operator in separation mode (AIC and AIV cores are separated), the operator logic involves both AIV and AIC core processing logic, and inter-core synchronization is required. In this case, ASCEND_IS_AIV and ASCEND_IS_AIC are used to isolate the AIV and AIC core code.
When MatMul high-level APIs are used, the AIV and AIC core code has been isolated using the REGIST_MATMUL_OBJ macro. You do not need to use this macro for processing.
Take the MatmulNzCustom operator as an example. In separated mode, this operator needs to implement different logic on the AIV core and AIC core. Specifically, the AIV core is responsible for moving cube data into the Unified Buffer, rearranging the data (converting the cube data into the NZ format), and writing it to the global memory. The AIC core directly reads the rearranged NZ-format data from the global memory and performs cube multiplication (MatMul) computation. Because the code logic of the AIV core is different from that of the AIC core, the ASCEND_IS_AIV and ASCEND_IS_AIC macros are used to isolate the code to ensure that the code applicable to the AIV core and and that applicable to the AIC core are generated separately during compilation.
The example pseudocode is as follows:1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34
template <typename AType, typename BType, typename CType, typename BiasType> __aicore__ inline void MatmulKernel<AType, BType, CType, BiasType>::Process(AscendC::TPipe *pipe) { // Use the Vector Unit of the AIV core to implement ND2NZ format conversion. In the following code, MatrixBtoNZ is the function for performing ND2NZ format conversion on cube B. if ASCEND_IS_AIV { pipe->InitBuffer(ubBuf, TOTAL_UB_SIZE); MatrixBtoNZ<typename B_TYPE::T>(tempGM, bGMNZ, tiling, isTransB, ubBuf, tiling.baseK, tiling.baseN); // ND2NZ function implemented on the vector SyncAll(); // Synchronization between the AIC and AIV cores AscendC::CrossCoreSetFlag<0x2, PIPE_MTE3>(0x4); return; } if ASCEND_IS_AIC { AscendC::CrossCoreWaitFlag(0x4); // Wait for the AIV core to complete ND2NZ format conversion. } ... ... // Set the left cube A, right cube B, and bias. matmulObj.SetTail(tailM, tailN); matmulObj.SetTensorA(aGlobal, false); matmulObj.SetTensorB(bGlobal, false); if (tiling.isBias) { matmulObj.SetBias(biasGlobal); } // Complete the cube multiplication operation. matmulObj.IterateAll(cGlobal); // End the cube multiplication operation. matmulObj.End(); }
- ASCENDC_CUBE_ONLY
ASCENDC_CUBE_ONLY is a conditional statement implemented using C++ macros. It is used to implement conditional compilation of code in functions decorated with __aicore__.
When developing non-fused operators in separation mode, you can set ASCENDC_CUBE_ONLY to enable the cube-only mode for MatMul computation in scenarios where only cube computation is involved. This reduces the performance overhead of message communication and improves operator performance.
The ASCENDC_CUBE_ONLY macro must be set before #include "lib/matmul_intf.h".
Take the matmul_custom operator as an example. By default, the MatMul high-level API uses the MIX mode. That is, a user initiates a message from the AIV, and MatMul computation is performed on the AIC after the message is forwarded by the message communication framework. This message processing mechanism causes extra scalar performance overhead. Compared with the MIX mode, the cube-only mode can directly skip the message communication framework to complete MatMul computation, improving operator performance.
The example pseudocode is as follows:
1 2 3 4 5 6 7 8
#define ASCENDC_CUBE_ONLY #include "lib/matmul_intf.h" using A_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, AType>; using B_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BType>; using C_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, CType>; using BIAS_TYPE = AscendC::MatmulType<AscendC::TPosition::GM, CubeFormat::ND, BiasType>; AscendC::Matmul<A_TYPE, B_TYPE, C_TYPE, BIAS_TYPE, CFG_NORM> matmulObj;
Function Execution Space Qualifier
The function execution space qualifier indicates whether the function is executed on the host or device, and whether the function can be called from the host or device.
- __global__
The execution space qualifier __global__ declares a kernel function. The kernel function is executed on the device, and called only by the function on the host. __global__ only indicates the entrypoint of the function on the device, rather than the specific device type. The specific device type is marked by __aicore__. The restrictions are as follows:
- A __global__ function must return the void type, and cannot be a member function of class.
- The host must call the __global__ function using the <<<>>> heterogeneous call syntax.
- The __global__ call is asynchronous. The return of the function does not mean that the kernel function has been completely executed on the device. For explicit synchronization, use the runtime synchronization API, such as aclrtSynchronizeStream.
- __aicore__
The execution space qualifier __aicore__ declares a function with the following attributes:
- Executed on the device
- Called only by the __global__ function or other __aicore__ functions
// Only callable from device functions with same kind // of execution space __aicore__ void bar() {} // Define a kernel function execute on AI Core device __global__ __aicore__ void foo() { bar(); // OK. }
- __host__
The execution space qualifier __host__ declares a function with the following attributes:
- Executed only on the host
- Called only by functions on the host
- __global__ and __host__ cannot be used together.
The __host__ qualifier is optional. If a function is not defined with any execution space qualifier, it is a host function by default.
__aicore__ int f() {} // defines a host side function int foo() {} // defines a host side function __host__ int bar() { f(); // Error. foo(); // OK. } // Error. __global__ __host__ void kfunc() {}
- __inline__
The __inline__ qualifier declares a function with the following attributes:
- Indicates that the function on the device is forcibly inlined. This reduces the overhead of instruction stack push and pop caused by frequent function calls, but may increase the operator binary size.
- The main difference between __inline__ on the device and the C++ function modifier inline is that __inline__ indicates forcible inline on the device, while inline in C++ indicates selective inline based on compiler optimization.
- AI Core has restrictions on the function nesting depth. It is recommended that the nesting depth be less than or equal to four layers. Using forcible inline can reduce the number of call layers.
- __cube__
This qualifier indicates that the kernel function is executed only on the Cube core. This qualifier does not take effect for the hardware architecture in coupled mode.
__vector__ __global__ __aicore__ void add_custom(){}
- __vector__
This qualifier indicates that the kernel function is executed only on the Vector core. This qualifier does not take effect for the hardware architecture in coupled mode.
- __mix__(cube, vec)
This qualifier indicates that the kernel function is executed on both the Cube and Vector cores. (cube, vec) indicates the ratio of Cube cores to Vector cores started by the kernel function. The supported ratios are (1, 0), (0, 1), (1, 1), and (1, 2). This qualifier does not take effect for the hardware architecture in coupled mode.
Address Space Qualifiers
The AI Core has multi-level independent on-chip storage. Each address space has independent addressing and its own memory access instructions. Depending on the architecture, some storage spaces have a generic address space, while others do not. On-device programming allows address spaces to be used as valid type qualifiers based on syntax extensions, providing access to different address spaces and check of address space validity.
|
Address Space Qualifier |
AI Core Physical Storage Space |
|---|---|
|
__gm__ |
Global memory on the device |
|
__ubuf__ |
Vector Unified Buffer |
|
__ca__ |
Cube L0A Buffer |
|
__cb__ |
Cube L0B Buffer |
|
__cc__ |
Cube L0C Buffer |
|
__cbuf__ |
Cube L1 Buffer |
|
__fbuf__ |
FixPipe Buffer |
Address space qualifiers can be used in variable declarations to specify the region in which an object is allocated. If the type of an object is qualified by an address space name, the object will be allocated in the specified address space. Similarly, for pointers, the pointing type can be qualified by an address space to indicate the address space in which the pointed-to object resides.
// declares a pointer p in the __gm__ address space that
// points to an object(has int type) in the __gm__ address space
__gm__ int *p;
__global__ __aicore__ void foo(...)
{
// declares an array of 4 floats in the private address space.
float x[4];
}
Address space qualifiers cannot be used for non-pointer return types, non-pointer function parameters, or function types. Multiple address space qualifiers cannot be used for the same type.
// OK.
__aicore__ int f() {...}
// Error. Address space qualifier cannot be used with a non-pointer return type.
__ubuf__ int f() { ... }
// OK. Address space qualifier can be used with a pointer return type.
__ubuf__ int *f() { ... }
// Error. Multiple address spaces specified for a type.
__ubuf__ __gm__ int i;
// OK. The first address space qualifies the object pointed to and the second
// qualifies the pointer.
__ubuf__ int * __gm__ ptr;
Important: The pointer size may vary with the address space. For example, sizeof(__gm__ int *) may not always be equal to sizeof(__ubuf__ int *). The compiler may store __ubuf__ pointers in 32-bit mode on some systems.
- Private address space
The private address space is the default address space for most variables, especially local variables.
// m is in a specific kernel parameter address space, // it's physical location is implementation determined. __global__ __aicore__ void foo(int m) { // OK. i is an int variable allocated in private address space int i; } __aicore__ void bar(int k) { //OK. k is in private address space // OK. i is an int variable allocated in private address space int i; }
- __gm__ address space
The __gm__ address space qualifier is used to indicate objects allocated to the global memory on the device. Global memory objects can be declared as scalars or pointers to user-defined structures.
__gm__ int *var; // var point to an array of int elements typedef struct { float a[3]; int b[2]; } foo_t; __gm__ foo_t *info; // info point to an array of foo_t elements
- __ubuf__ address space
The __ubuf__ address space is used to describe variables stored in the UB storage space of the AI Core.
__global__ __aicore__ void foo() { // ptr is in private address space, point to __ubuf__ __ubuf__ int *ptr; }
- __ca__, __cb__, __cc__, and __cbuf__ address spaces
The preceding address spaces are mainly used for access by specific DMA instructions and do not support direct access by scalars.
class ObjTy{ ObjTy(){...} void print(){...} private: int a; int b; }; __global__ __aicore__ void foo(__ca__ int * ptr) { // Error. Cannot have __ca__ // qualifier in kernel arguments // OK __ca__ int *ptr; }
Built-in Constants
|
Constant |
Value |
Function |
|---|---|---|
|
constexpr int32_t g_coreType |
|
The constant value is automatically set by the framework. For the AIC core, set it to AscendC::AIC. For the AIV core, set it to AscendC::AIV. By checking the constant value, you can distinguish and isolate the AIV and AIC core code. The function is the same as that of directly using ASCEND_IS_AIV and ASCEND_IS_AIC. |
Built-in Variables
|
Variable |
API |
Function |
|---|---|---|
|
block_num |
Number of cores configured for the current task, which is used for multi-core logic control in the code. |
|
|
block_idx |
Index of the current core, which is used for multi-core logic control and multi-core offset computation in the code. |
Generally, you are advised to use the APIs corresponding to the built-in variables to obtain the required values. You are advised not to directly use the built-in variables. This is because a built-in variable reflects the configuration information of a single hardware resource. For functions of integrating hardware resources and extending hardware by the software stack, the values of built-in variables may not match the actual semantics.
For example, in
Built-in APIs
For details about the API list, see the CCE Intrinsic Development APIs.