Transpose

Function Usage

Transposes a 16 x 16 2D matrix data block, or converts data between the [N,C,H,W] and [N,H,W,C] layouts.
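In index terms, the common transpose maps element (i, j) of the source block to (j, i) of the destination. The following host-side C++ sketch illustrates only these semantics; it is not AscendC device code, and the actual API operates on LocalTensor operands:

```cpp
#include <array>
#include <cstddef>

// Host-side sketch of the common transpose semantics (not AscendC device
// code): element (i, j) of a row-major 16 x 16 block lands at (j, i).
constexpr std::size_t kDim = 16;

std::array<int, kDim * kDim> Transpose16x16(const std::array<int, kDim * kDim>& src)
{
    std::array<int, kDim * kDim> dst{};
    for (std::size_t i = 0; i < kDim; ++i) {
        for (std::size_t j = 0; j < kDim; ++j) {
            dst[j * kDim + i] = src[i * kDim + j];
        }
    }
    return dst;
}
```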

Prototype

  • Perform common transpose on a 16 x 16 2D matrix data block.
    template <typename T>
    void Transpose(const LocalTensor<T>& dstLocal, const LocalTensor<T>& srcLocal)
    
  • Perform enhanced transpose on a 16 x 16 2D matrix data block, or between [N,C,H,W] and [N,H,W,C].
    template <typename T>
    void Transpose(const LocalTensor<T> &dstLocal, const LocalTensor<T> &srcLocal, const LocalTensor<uint8_t> &sharedTmpBuffer, const TransposeParamsExt &transposeParams)
    

Parameters

Table 1 Parameters in the template

Parameter

Description

T

Data type of the operand.

Common transpose API:

For the Atlas Training Series Product, the supported data types are uint16_t, int16_t, and half.

Enhanced transpose API:

See Table 4.

Table 2 Parameters

Parameter

Input/Output

Meaning

dstLocal

Output

Destination operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

srcLocal

Input

Source operand.

The type is LocalTensor, and the supported TPosition is VECIN, VECCALC, or VECOUT.

The start address of the LocalTensor must be 32-byte aligned.

The data type must be the same as that of dstLocal.

sharedTmpBuffer

Input

Tensor for the shared temporary buffer. For the size of sharedTmpBuffer, see Table 5.

transposeParams

Input

Data structure for controlling Transpose. The structure contains the input shape information and the transposeType parameter. For details about the definition of this data structure, see Table 3.

struct TransposeParamsExt {
    uint16_t nSize;
    uint16_t cSize;
    uint16_t hSize;
    uint16_t wSize;
    TransposeType transposeType;
};
Table 3 Parameters in the TransposeParamsExt struct

Parameter

Meaning

nSize

Length of the n axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, this parameter is not used; any value passed in is ignored.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range is [0, 65535].

cSize

Length of the c axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, this parameter is not used; any value passed in is ignored.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range is [0, 4095].

hSize

Length of the h axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, the value is fixed at 16.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range of hSize × wSize is [0, 4095]. The value of hSize × wSize × sizeof(datatype) must be 32-byte aligned.

wSize

Length of the w axis. Defaults to 0.

  • In the transpose of the 2D matrix data block, the value is fixed at 16.
  • In the transpose between [N,C,H,W] and [N,H,W,C], the value range of hSize × wSize is [0, 4095]. The value of hSize × wSize × sizeof(datatype) must be 32-byte aligned.

transposeType

Data layout and reshape type. The type is the TransposeType enumeration type.

enum class TransposeType : uint8_t {
    TRANSPOSE_TYPE_NONE,           // default value
    TRANSPOSE_NZ2ND_0213,          // Not supported currently.
    TRANSPOSE_NZ2NZ_0213,          // Not supported currently.
    TRANSPOSE_NZ2NZ_012_WITH_N,    // Not supported currently.
    TRANSPOSE_NZ2ND_012_WITH_N,    // Not supported currently.
    TRANSPOSE_NZ2ND_012_WITHOUT_N, // Not supported currently.
    TRANSPOSE_NZ2NZ_012_WITHOUT_N, // Not supported currently.
    TRANSPOSE_ND2ND_ONLY,          // Not supported currently.
    TRANSPOSE_ND_UB_GM,            // Not supported currently.
    TRANSPOSE_GRAD_ND_UB_GM,       // Not supported currently.
    TRANSPOSE_ND2ND_B16,           // Transpose a [16,16] two-dimensional matrix.
    TRANSPOSE_NCHW2NHWC,           // [N,C,H,W]->[N,H,W,C]
    TRANSPOSE_NHWC2NCHW            // [N,H,W,C]->[N,C,H,W]
};
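The layout conversions named by TRANSPOSE_NCHW2NHWC and TRANSPOSE_NHWC2NCHW correspond to moving element (n, c, h, w) of an [N,C,H,W] buffer to position (n, h, w, c) of an [N,H,W,C] buffer. A host-side C++ sketch of this index mapping (illustrative only; the device API performs it on LocalTensor data using the parameters in TransposeParamsExt):

```cpp
#include <cstdint>
#include <vector>

// Host-side sketch of the index mapping behind TRANSPOSE_NCHW2NHWC
// (not AscendC device code). Source index: ((n*C + c)*H + h)*W + w.
// Destination index: ((n*H + h)*W + w)*C + c.
std::vector<int> Nchw2Nhwc(const std::vector<int>& src,
                           uint16_t n, uint16_t c, uint16_t h, uint16_t w)
{
    std::vector<int> dst(src.size());
    for (uint16_t in = 0; in < n; ++in) {
        for (uint16_t ic = 0; ic < c; ++ic) {
            for (uint16_t ih = 0; ih < h; ++ih) {
                for (uint16_t iw = 0; iw < w; ++iw) {
                    dst[((in * h + ih) * w + iw) * c + ic] =
                        src[((in * c + ic) * h + ih) * w + iw];
                }
            }
        }
    }
    return dst;
}
```

TRANSPOSE_NHWC2NCHW is the inverse of this mapping.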
Table 4 Data types supported by the enhanced transpose API

transposeType

Supported Data Type

TRANSPOSE_ND2ND_B16

Note: To transpose a [16,16] two-dimensional matrix of the int16_t or half type, you can use the common transpose API.

TRANSPOSE_NCHW2NHWC

TRANSPOSE_NHWC2NCHW

Table 5 Size required by the enhanced transpose API sharedTmpBuffer

transposeType

Required sharedTmpBuffer Size

TRANSPOSE_ND2ND_B16

TRANSPOSE_NCHW2NHWC

TRANSPOSE_NHWC2NCHW

Availability

Atlas Training Series Product

Precautions

  • For details about the alignment requirements of the operand address offset, see General Restrictions.
  • This instruction cannot be iterated (that is, it cannot be repeated through repeatTimes).
  • The common transpose API supports srcLocal and dstLocal reuse.
  • The enhanced transpose API supports srcLocal and dstLocal reuse only when transposeType is set to TRANSPOSE_ND2ND_B16 rather than TRANSPOSE_NCHW2NHWC or TRANSPOSE_NHWC2NCHW.
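The 32-byte alignment requirement on hSize × wSize × sizeof(datatype) from Table 3 can be validated before calling the enhanced API. A minimal host-side sketch, using a hypothetical helper name that is not part of the AscendC API:

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not part of the AscendC API): checks the
// enhanced-transpose requirement that hSize * wSize * sizeof(datatype)
// is a multiple of 32 bytes.
bool IsHwBlockAligned(uint16_t hSize, uint16_t wSize, std::size_t elemBytes)
{
    constexpr std::size_t kAlignBytes = 32;
    return (static_cast<std::size_t>(hSize) * wSize * elemBytes) % kAlignBytes == 0;
}
```

For example, the enhanced example below uses H = 2, W = 8 with half (2 bytes), giving 2 × 8 × 2 = 32 bytes, which satisfies the constraint.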

Returns

None

Example

  • The following example transposes a [16,16] matrix of the half type by using the common API.
    #include "kernel_operator.h"
    
    class KernelTranspose {
    public:
        __aicore__ inline KernelTranspose() {}
        __aicore__ inline void Init(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
        {
            srcGlobal.SetGlobalBuffer((__gm__ half*)src);
            dstGlobal.SetGlobalBuffer((__gm__ half*)dstGm);
    
            pipe.InitBuffer(inQueueSrc, 1, srcDataSize * sizeof(half));
            pipe.InitBuffer(outQueueDst, 1, dstDataSize * sizeof(half));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.AllocTensor<half>();
            AscendC::DataCopy(srcLocal, srcGlobal, srcDataSize);
            inQueueSrc.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<half> srcLocal = inQueueSrc.DeQue<half>();
            AscendC::LocalTensor<half> dstLocal = outQueueDst.AllocTensor<half>();
    
            AscendC::Transpose<half>(dstLocal, srcLocal);
    
            outQueueDst.EnQue<half>(dstLocal);
            inQueueSrc.FreeTensor(srcLocal);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<half> dstLocal = outQueueDst.DeQue<half>();
            AscendC::DataCopy(dstGlobal, dstLocal, dstDataSize);
            outQueueDst.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrc;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> outQueueDst;
    
        AscendC::GlobalTensor<half> srcGlobal, dstGlobal;
        int srcDataSize = 256;
        int dstDataSize = 256;
    };
    
    extern "C" __global__ __aicore__ void transpose_kernel(__gm__ uint8_t* src, __gm__ uint8_t* dstGm)
    {
        KernelTranspose op;
        op.Init(src, dstGm);
        op.Process();
    }
    
    Input (src_gm):
    [[  0.   1.   2.   3.   4.   5.   6.   7.   8.   9.  10.  11.  12.  13.
       14.  15.]
     [ 16.  17.  18.  19.  20.  21.  22.  23.  24.  25.  26.  27.  28.  29.
       30.  31.]
     [ 32.  33.  34.  35.  36.  37.  38.  39.  40.  41.  42.  43.  44.  45.
       46.  47.]
     [ 48.  49.  50.  51.  52.  53.  54.  55.  56.  57.  58.  59.  60.  61.
       62.  63.]
     [ 64.  65.  66.  67.  68.  69.  70.  71.  72.  73.  74.  75.  76.  77.
       78.  79.]
     [ 80.  81.  82.  83.  84.  85.  86.  87.  88.  89.  90.  91.  92.  93.
       94.  95.]
     [ 96.  97.  98.  99. 100. 101. 102. 103. 104. 105. 106. 107. 108. 109.
      110. 111.]
     [112. 113. 114. 115. 116. 117. 118. 119. 120. 121. 122. 123. 124. 125.
      126. 127.]
     [128. 129. 130. 131. 132. 133. 134. 135. 136. 137. 138. 139. 140. 141.
      142. 143.]
     [144. 145. 146. 147. 148. 149. 150. 151. 152. 153. 154. 155. 156. 157.
      158. 159.]
     [160. 161. 162. 163. 164. 165. 166. 167. 168. 169. 170. 171. 172. 173.
      174. 175.]
     [176. 177. 178. 179. 180. 181. 182. 183. 184. 185. 186. 187. 188. 189.
      190. 191.]
     [192. 193. 194. 195. 196. 197. 198. 199. 200. 201. 202. 203. 204. 205.
      206. 207.]
     [208. 209. 210. 211. 212. 213. 214. 215. 216. 217. 218. 219. 220. 221.
      222. 223.]
     [224. 225. 226. 227. 228. 229. 230. 231. 232. 233. 234. 235. 236. 237.
      238. 239.]
     [240. 241. 242. 243. 244. 245. 246. 247. 248. 249. 250. 251. 252. 253.
      254. 255.]]
    
    Output (dst_gm):
    [[  0.  16.  32.  48.  64.  80.  96. 112. 128. 144. 160. 176. 192. 208.
      224. 240.]
     [  1.  17.  33.  49.  65.  81.  97. 113. 129. 145. 161. 177. 193. 209.
      225. 241.]
     [  2.  18.  34.  50.  66.  82.  98. 114. 130. 146. 162. 178. 194. 210.
      226. 242.]
     [  3.  19.  35.  51.  67.  83.  99. 115. 131. 147. 163. 179. 195. 211.
      227. 243.]
     [  4.  20.  36.  52.  68.  84. 100. 116. 132. 148. 164. 180. 196. 212.
      228. 244.]
     [  5.  21.  37.  53.  69.  85. 101. 117. 133. 149. 165. 181. 197. 213.
      229. 245.]
     [  6.  22.  38.  54.  70.  86. 102. 118. 134. 150. 166. 182. 198. 214.
      230. 246.]
     [  7.  23.  39.  55.  71.  87. 103. 119. 135. 151. 167. 183. 199. 215.
      231. 247.]
     [  8.  24.  40.  56.  72.  88. 104. 120. 136. 152. 168. 184. 200. 216.
      232. 248.]
     [  9.  25.  41.  57.  73.  89. 105. 121. 137. 153. 169. 185. 201. 217.
      233. 249.]
     [ 10.  26.  42.  58.  74.  90. 106. 122. 138. 154. 170. 186. 202. 218.
      234. 250.]
     [ 11.  27.  43.  59.  75.  91. 107. 123. 139. 155. 171. 187. 203. 219.
      235. 251.]
     [ 12.  28.  44.  60.  76.  92. 108. 124. 140. 156. 172. 188. 204. 220.
      236. 252.]
     [ 13.  29.  45.  61.  77.  93. 109. 125. 141. 157. 173. 189. 205. 221.
      237. 253.]
     [ 14.  30.  46.  62.  78.  94. 110. 126. 142. 158. 174. 190. 206. 222.
      238. 254.]
     [ 15.  31.  47.  63.  79.  95. 111. 127. 143. 159. 175. 191. 207. 223.
      239. 255.]]
  • The following example transposes data of the half type from [N,C,H,W] to [N,H,W,C] by using the enhanced API.
    #include "kernel_operator.h"
    
    template <typename T>
    class Kernel4dTrans {
    public:
        __aicore__ inline Kernel4dTrans() {}
        __aicore__ inline void Init(__gm__ uint8_t *srcGm, __gm__ uint8_t *dstGm)
        {
            inputSize = N * C * H * W;
            tmpBufferSize = (C + 2) * 16 * 16;
            srcGlobal.SetGlobalBuffer((__gm__ T *)srcGm);
            dstGlobal.SetGlobalBuffer((__gm__ T *)dstGm);
        pipe.InitBuffer(inQueueSrcVecIn, 1, inputSize * sizeof(T));
        pipe.InitBuffer(inQueueSrcVecOut, 1, inputSize * sizeof(T));
            pipe.InitBuffer(tmpQueue, 1, tmpBufferSize * sizeof(T));
        }
        __aicore__ inline void Process()
        {
            CopyIn();
            Compute();
            CopyOut();
        }
    private:
        __aicore__ inline void CopyIn()
        {
            AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.AllocTensor<T>();
            AscendC::DataCopy(srcLocal, srcGlobal, inputSize);
            inQueueSrcVecIn.EnQue(srcLocal);
        }
        __aicore__ inline void Compute()
        {
            AscendC::LocalTensor<T> srcLocal = inQueueSrcVecIn.DeQue<T>();
            AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.AllocTensor<T>();
            AscendC::LocalTensor<uint8_t> stackBuffer = tmpQueue.AllocTensor<uint8_t>();
    
            AscendC::TransposeParamsExt transposeParams;
            transposeParams.nSize = N;
            transposeParams.cSize = C;
            transposeParams.hSize = H;
            transposeParams.wSize = W;
            transposeParams.transposeType = transposetype;
            AscendC::Transpose(dstLocal, srcLocal, stackBuffer, transposeParams);
            inQueueSrcVecOut.EnQue<T>(dstLocal);
            inQueueSrcVecIn.FreeTensor(srcLocal);
            tmpQueue.FreeTensor(stackBuffer);
        }
        __aicore__ inline void CopyOut()
        {
            AscendC::LocalTensor<T> dstLocal = inQueueSrcVecOut.DeQue<T>();
            AscendC::DataCopy(dstGlobal, dstLocal, inputSize);
            inQueueSrcVecOut.FreeTensor(dstLocal);
        }
    private:
        AscendC::TPipe pipe;
        AscendC::TQue<AscendC::QuePosition::VECIN, 1> inQueueSrcVecIn;
        AscendC::TQue<AscendC::QuePosition::VECOUT, 1> inQueueSrcVecOut;
        AscendC::TQue<AscendC::QuePosition::VECCALC, 1> tmpQueue;
    
        AscendC::GlobalTensor<T> srcGlobal;
        AscendC::GlobalTensor<T> dstGlobal;
        uint32_t N = 3;
        uint32_t C = 3;
        uint32_t H = 2;
        uint32_t W = 8;
        uint32_t inputSize, tmpBufferSize;
        AscendC::TransposeType transposetype = AscendC::TransposeType::TRANSPOSE_NCHW2NHWC;
    };
    
    extern "C" __global__ __aicore__ void transpose_kernel(__gm__ uint8_t* srcGm, __gm__ uint8_t* dstGm)
    {
    Kernel4dTrans<half> op;
        op.Init(srcGm, dstGm);
        op.Process();
    }