Transpose

Applicability

Product

Supported/Unsupported

Atlas A3 training products/Atlas A3 inference products

Atlas A2 training products/Atlas A2 inference products

Atlas 200I/500 A2 inference products

Atlas inference product's AI Core

Atlas inference product's Vector Core

x

Atlas training products

Function Usage

Performs transpose on a 16 x 16 2D matrix data block, or between [N,C,H,W] and [N,H,W,C].

Prototype

  • Perform common transpose on a 16 x 16 2D matrix data block.
    1
    2
    template <typename T>
    __aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T>& src)
    
  • Perform enhanced transpose on a 16 x 16 2D matrix data block, or between [N,C,H,W] and [N,H,W,C].
    1
    2
    template <typename T>
    __aicore__ inline void Transpose(const LocalTensor<T>& dst, const LocalTensor<T> &src, const LocalTensor<uint8_t> &sharedTmpBuffer, const TransposeParamsExt &transposeParams)
    

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Operand data type.

  • Common transpose:

    For the Atlas inference product's AI Core, the supported data types are uint16_t, int16_t, and half.

    For the Atlas inference product's AI Core, the supported data types are uint16_t, int16_t, and half.

    For the Atlas 200I/500 A2 inference products, the supported data types are uint16_t, int16_t, and half.

    For the Atlas inference product's AI Core, the supported data types are uint16_t, int16_t, and half.

    For the Atlas training products, the supported data types are uint16_t, int16_t, and half.

  • Enhanced transpose:
    • transposeType = TRANSPOSE_ND2ND_B16:

      For the Ascend 610's AI Core, the supported data type is uint16_t.

      For the Atlas A2 training products/Atlas A2 inference products, the supported data type is uint16_t.

      For the Atlas 200I/500 A2 inference products, the supported data type is uint16_t.

      For the Atlas inference product's AI Core, the supported data type is uint16_t.

    • transposeType = TRANSPOSE_NCHW2NHWC or TRANSPOSE_NHWC2NCHW:

      For the Atlas A3 training products/Atlas A3 inference products, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

      For the Atlas A2 training products/Atlas A2 inference products, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

      For the Atlas inference product's AI Core, the supported data types are int8_t, uint8_t, int16_t, uint16_t, half, int32_t, uint32_t, and float.

Table 2 Parameters

Parameter

Input/Output

Meaning

dst

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

src

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

The data type must be the same as that of dst.

sharedTmpBuffer

Input

Shared temporary buffer. For details about the size of sharedTmpBuffer, see Table 4.

transposeParams

Input

Data structure for controlling Transpose. The structure contains the input shape information and the transposeType parameter. For details about the definition of this data structure, see Table 3.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
struct TransposeParamsExt {
    __aicore__ TransposeParamsExt() {}
    __aicore__ TransposeParamsExt(const uint16_t nSizeIn, const uint16_t cSizeIn, const uint16_t hSizeIn,
        const uint16_t wSizeIn, const TransposeType transposeTypeIn)
        : nSize(nSizeIn),
          cSize(cSizeIn),
          hSize(hSizeIn),
          wSize(wSizeIn),
          transposeType(transposeTypeIn)
    {}
    uint16_t nSize = 0;
    uint16_t cSize = 0;
    uint16_t hSize = 0;
    uint16_t wSize = 0;
    TransposeType transposeType = TransposeType::TRANSPOSE_ND2ND_B16;
};
Table 3 Parameters in the TransposeParamsExt structure

Parameter

Meaning

nSize

Length of the n axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, this parameter does not need to be passed in. The input value is invalid.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range is [0, 65535].

cSize

Length of the c axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, this parameter does not need to be passed in. The input value is invalid.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range is [0, 4095].

hSize

Length of the h axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, the value is fixed at 16.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range of hSize × wSize is [0, 4095]. The value of hSize × wSize × sizeof (datatype) must be 32-byte aligned.

wSize

Length of the w axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, the value is fixed at 16.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range of hSize × wSize is [0, 4095]. The value of hSize × wSize × sizeof (datatype) must be 32-byte aligned.

transposeType

Data layout and reshape type. The type is the TransposeType enumeration type. The default value is TRANSPOSE_ND2ND_B16.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
enum class TransposeType : uint8_t {
    TRANSPOSE_TYPE_NONE, // The API does not perform any operation.
    TRANSPOSE_NZ2ND_0213,          // Not supported currently.
    TRANSPOSE_NZ2NZ_0213,          // Not supported currently.
    TRANSPOSE_NZ2NZ_012_WITH_N,    // Not supported currently.
    TRANSPOSE_NZ2ND_012_WITH_N,    // Not supported currently.
    TRANSPOSE_NZ2ND_012_WITHOUT_N, // Not supported currently.
    TRANSPOSE_NZ2NZ_012_WITHOUT_N, // Not supported currently.
    TRANSPOSE_ND2ND_ONLY,          // Not supported currently.
    TRANSPOSE_ND_UB_GM,            // Not supported currently.
    TRANSPOSE_GRAD_ND_UB_GM,       // Not supported currently.
    TRANSPOSE_ND2ND_B16,           // Transpose the [16,16] two-dimensional matrix.
    TRANSPOSE_NCHW2NHWC,           // [N,C,H,W]->[N,H,W,C]
    TRANSPOSE_NHWC2NCHW            // [N,H,W,C]->[N,C,H,W]
};
Table 4 Size required by the enhanced transpose API sharedTmpBuffer

transposeType

Size of the sharedTmpBuffer

TRANSPOSE_ND2ND_B16

No temporary buffer is required.

TRANSPOSE_NCHW2NHWC

For the following models:

  • Atlas inference product's AI Core

No temporary buffer is required.

For the following models:

  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training products/Atlas A3 inference products

The size of the temporary buffer is calculated according to the following calculation rules (pseudocode):

1
2
3
auto h0 = 16; // When the bit width of the data type is 8, h0 = 32. In other cases, h0 = 16.
auto w0 = 32 / sizeof(type); // type indicates the data type.
auto tmpBufferSize = (cSize + 2)  * h0 * w0 * sizeof(type);

TRANSPOSE_NHWC2NCHW

For the following models:

  • Atlas inference product's AI Core

No temporary buffer is required.

For the following models:

  • Atlas A2 training products/Atlas A2 inference products
  • Atlas A3 training products/Atlas A3 inference products

The size of the temporary buffer is calculated according to the following calculation rules (pseudocode):

1
2
3
auto h0 = 16; // When the bit width of the data type is 8, h0 = 32. In other cases, h0 = 16.
auto w0 = 32 / sizeof(type); // type indicates the data type.
auto tmpBufferSize = (cSize  * 2 + 1)  * h0 * w0 * sizeof(type);

Returns

None

Restrictions

  • For details about the operand address alignment requirements, see General Address Alignment Restrictions.
  • The common transpose interface supports the reuse of src and dst.
  • The transpose interface is enhanced. When transposeType is set to TRANSPOSE_ND2ND_B16, the src and dst can be reused. When transposeType is set to TRANSPOSE_NCHW2NHWC or TRANSPOSE_NHWC2NCHW, the src and dst cannot be reused.

Examples

  • The following example transposes the [16,16] matrix of the half type by using a common API.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    #include "kernel_operator.h"
    
    class KernelTranspose {
    public:
        __aicore__ inline KernelTranspose() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
    
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
    
            AscendC::Transpose<half>(dstLocal, srcLocal);
    
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> outQueueDst;
    
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 256;
        int dstDataSize = 256;
    };
    
    extern "C" __global__ __aicore__ void transpose_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        KernelTranspose op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input (src_gm):
    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.]
     [ 16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.
       30.  31.]
     [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
       46.  47.]
     [ 48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.  60.  61.
       62.  63.]
     [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
       78.  79.]
     [ 80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.
       94.  95.]
     [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
      110. 111.]
     [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
      126. 127.]
     [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
      142. 143.]
     [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157.
      158. 159.]
     [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173.
      174. 175.]
     [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189.
      190. 191.]
     [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205.
      206. 207.]
     [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221.
      222. 223.]
     [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
      238. 239.]
     [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
      254. 255.]]
    
    Output (dst_gm):
    [[  0.  16.  32.  48.  64.  80.  96. 112. 128. 144. 160. 176. 192. 208.
      224. 240.]
     [  1.  17.  33.  49.  65.  81.  97. 113. 129. 145. 161. 177. 193. 209.
      225. 241.]
     [  2.  18.  34.  50.  66.  82.  98. 114. 130. 146. 162. 178. 194. 210.
      226. 242.]
     [  3.  19.  35.  51.  67.  83.  99. 115. 131. 147. 163. 179. 195. 211.
      227. 243.]
     [  4.  20.  36.  52.  68.  84. 100. 116. 132. 148. 164. 180. 196. 212.
      228. 244.]
     [  5.  21.  37.  53.  69.  85. 101. 117. 133. 149. 165. 181. 197. 213.
      229. 245.]
     [  6.  22.  38.  54.  70.  86. 102. 118. 134. 150. 166. 182. 198. 214.
      230. 246.]
     [  7.  23.  39.  55.  71.  87. 103. 119. 135. 151. 167. 183. 199. 215.
      231. 247.]
     [  8.  24.  40.  56.  72.  88. 104. 120. 136. 152. 168. 184. 200. 216.
      232. 248.]
     [  9.  25.  41.  57.  73.  89. 105. 121. 137. 153. 169. 185. 201. 217.
      233. 249.]
     [ 10.  26.  42.  58.  74.  90. 106. 122. 138. 154. 170. 186. 202. 218.
      234. 250.]
     [ 11.  27.  43.  59.  75.  91. 107. 123. 139. 155. 171. 187. 203. 219.
      235. 251.]
     [ 12.  28.  44.  60.  76.  92. 108. 124. 140. 156. 172. 188. 204. 220.
      236. 252.]
     [ 13.  29.  45.  61.  77.  93. 109. 125. 141. 157. 173. 189. 205. 221.
      237. 253.]
     [ 14.  30.  46.  62.  78.  94. 110. 126. 142. 158. 174. 190. 206. 222.
      238. 254.]
     [ 15.  31.  47.  63.  79.  95. 111. 127. 143. 159. 175. 191. 207. 223.
      239. 255.]]
  • The following example transposes [N,C,H,W] to [N,H,W,C] of the half type by using an enhanced API.
     1
     2
     3
     4
     5
     6
     7
     8
     9
    10
    11
    12
    13
    14
    15
    16
    17
    18
    19
    20
    21
    22
    23
    24
    25
    26
    27
    28
    29
    30
    31
    32
    33
    34
    35
    36
    37
    38
    39
    40
    41
    42
    43
    44
    45
    46
    47
    48
    49
    50
    51
    52
    53
    54
    55
    56
    57
    58
    59
    60
    61
    62
    63
    64
    65
    66
    67
    68
    69
    70
    71
    72
    73
    74
    #include "kernel_operator.h"
    
    template <typename T>
    class Kernel4dTrans {
    public:
        __aicore__ inline Kernel4dTrans() {}
        __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *dstGm)
        {
            inputSize = N * C * H * W;
            tmpBufferSize = (C + 2) * 16 * 16;
            srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
            dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
            pipe.InitBuffer(inQueueSrcVecIn, 1, inputSize*sizeof(T));
            pipe.InitBuffer(inQueueSrcVecOut, 1, inputSize*sizeof(T));
            pipe.InitBuffer(tmpQueue, 1, tmpBufferSize * sizeof(T));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>();
            AscendC::DataCopy(srcLocal, srcGlobal, inputSize);
            inQueueSrcVecIn.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.DeQue<T>();
            AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.AllocTensor<T>();
            AscendC::LocalTensor<uint8_t> stackBuffer = tmpQueue.AllocTensor<uint8_t>();
    
            AscendC::TransposeParamsExt transposeParams;
            transposeParams.nSize = N;
            transposeParams.cSize = C;
            transposeParams.hSize = H;
            transposeParams.wSize = W;
            transposeParams.transposeType = transposeType;
            AscendC::Transpose(dstLocal, srcLocal, stackBuffer, transposeParams);
            inQueueSrcVecOut.EnQue<T>(dstLocal);
            inQueueSrcVecIn.FreeTensor(srcLocal);
            tmpQueue.FreeTensor(stackBuffer);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.DeQue<T>();
            AscendC::DataCopy(dstGlobal, dstLocal, inputSize);
            inQueueSrcVecOut.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::TPosition::VECIN, 1> inQueueSrcVecIn;
        AscendC::TQue<AscendC::TPosition::VECOUT, 1> inQueueSrcVecOut;
        AscendC::TQue<AscendC::TPosition::VECCALC, 1> tmpQueue;
    
        AscendC::GlobalTensor<T> srcGlobal;
        AscendC::GlobalTensor<T> dstGlobal;
        uint32_t N = 3;
        uint32_t C = 3;
        uint32_t H = 2;
        uint32_t W = 8;
        uint32_t inputSize, tmpBufferSize;
        AscendC::TransposeType transposeType = AscendC::TransposeType::TRANSPOSE_NCHW2NHWC;
    };
    
    extern "C" __global__ __aicore__ void transpose_kernel(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
    {
        Kernel4dTrans<half>op;
        op.Init(srcGm, dstGm);
        op.Process();
    }