Brcb

Applicability

Product

Supported/Unsupported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

x

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

x

Function Usage

Extracts eight elements from a given input tensor each time and fills them in eight data blocks (32 bytes) in the result tensor. Each element corresponds to a data block.

Prototype

1
2
template <typename T>
__aicore__ inline void Brcb(const LocalTensor<T>& dst, const LocalTensor<T>& src0, const uint8_t repeatTime, const BrcbRepeatParams& repeatParams)

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Operand data type.

For the Atlas A3 training products/Atlas A3 inference products, the supported data types are int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, and float.

For the Atlas A2 training products/Atlas A2 inference products, the supported data types are int16_t, uint16_t, int32_t, uint32_t, half, bfloat16_t, and float.

For the Atlas inference product's AI Core, the supported data types are int16_t, uint16_t, int32_t, uint32_t, half, and float.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

src0

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

Has the same data type as dst.

Eight elements are read from src0 in each iteration. Therefore, the number of elements in src0 must be greater than or equal to 8 x repeatTime.

repeatTime

Input

Number of instruction repeats. Eight data blocks are collected in each repeat. Data range: repeatTimes ∈ [0, 255]

repeatParams

Input

Controls parameters related to instruction iteration.

The type is BrcbRepeatParams. For details, see ${INSTALL_DIR}/include/ascendc/basic_api/interface/kernel_struct_brcb.h. Replace ${INSTALL_DIR} with the actual CANN component directory.

dstBlkStride and dstRepStride can be set by users. For details about the parameters, see Table 3.

Table 3 Parameters in the BrcbRepeatParams structure

Parameter

Meaning

dstBlkStride

Address stride of the vector destination operand between different data blocks in a single repeat

Note: When dstBlkStride is set to 0, the value 1 is used by default.

dstRepStride

Address stride of the vector destination operand for the same data block between adjacent repeats

blockNumber

Reserved. This parameter is reserved for future functions. You can use the default value.

src0BlkStride

src1BlkStride

src0RepStride

src1RepStride

repeatStrideMode

strideSizeMode

Returns

None

Restrictions

  • src0 and dst cannot be the same memory address.
  • For the Atlas Inference Series Product's AI Core, reserve an 8 KB as the temporary data storage of the API.

Examples

Example of brcb of the uint16_t data type
 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
#include "kernel_operator.h"
class VbrcbCase {
public:
    __aicore__ inline VbrcbCase()
    {}
    __aicore__ inline void Init(__gm__ uint8_t *x, __gm__ uint8_t *y)
    {
        x_gm.SetGlobalBuffer(reinterpret_cast<__gm__ uint16_t *>(x));
        y_gm.SetGlobalBuffer(reinterpret_cast<__gm__ uint16_t *>(y));
        tpipe.InitBuffer(vecIn, 1, 16 * sizeof(uint16_t));
        tpipe.InitBuffer(vecOut, 1, 256 * sizeof(uint16_t));
    }
    __aicore__ inline void Process()
    {
        CopyIn();
        Compute();
        CopyOut();
    }
    __aicore__ inline void CopyIn()
    {
        auto x_buf = vecIn.AllocTensor<uint16_t>();
        AscendC::DataCopy(x_buf, x_gm, 16);
        vecIn.EnQue(x_buf);
    }
    __aicore__ inline void Compute()
    {
        auto x_buf = vecIn.DeQue<uint16_t>();
        auto y_buf = vecOut.AllocTensor<uint16_t>();
        AscendC::Brcb(y_buf, x_buf, 2, {1,8});
        vecOut.EnQue(y_buf);
        vecIn.FreeTensor(x_buf);
    }
    __aicore__ inline void CopyOut()
    {
        auto y_buf = vecOut.DeQue<uint16_t>();
        AscendC::DataCopy(y_gm, y_buf, 256);
        vecOut.FreeTensor(y_buf);
    }
private:
    AscendC::GlobalTensor<uint16_t> x_gm;
    AscendC::GlobalTensor<uint16_t> y_gm;
    AscendC::TPipe tpipe;
    AscendC::TQue<AscendC::TPosition::VECIN, 1> vecIn;
    AscendC::TQue<AscendC::TPosition::VECOUT, 1> vecOut;
};
extern "C" __global__ __aicore__ void vbrcb_uint16_t_16(__gm__ uint8_t *x, __gm__ uint8_t *y)
{
    VbrcbCase op;
    op.Init(x, y);
    op.Process();
}
Result example:
Input data x_gm: [1 2 3... 16]
Output data y_gm: [1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 ... 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 15 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16 16]