Operator Basics
A deep learning algorithm consists of multiple compute units referred to as operators. In a network model, an operator describes the compute logic of a layer, for example, a convolution layer, or a fully-connected (FC) layer that multiplies its input by a weight matrix.
In mathematics, an operator is generally a mapping from a function space to another function space (O: X→Y).
In a broad sense, an operation performed on any function may be considered as an operator, for example, a differential operator or an indefinite integral operator.
This section introduces some basic operator terms.
Operator Name
An operator's name identifies the operator on a network, and as such it must be unique on the network. An example network has operators Conv1, Pool1, and Conv2. Conv1 and Conv2 are of the same convolution type, and each indicates a convolution operation.
Operator Type
Each operator on the network is matched to its implementation file by operator type, and operators of the same type share the same implementation logic. A network may include multiple operators of the same type. For example, the Conv1 and Conv2 operators in the preceding example are both convolution operators.
Tensor
A tensor is a container for operator computing data. A tensor descriptor (TensorDesc) describes the data in a tensor. Table 1 describes the attributes of the TensorDesc data structure.
| Attribute | Definition |
|---|---|
| name | Indexes a tensor and must be unique. |
| shape | Specifies the shape of a tensor, for example, (10,), (1024, 1024), or (2, 3, 4). For details, see Shape. Default value: none. Format: (i1, i2, ..., in), where i1 to in are positive integers. NOTE: Due to the restrictions of the AI Core operator compiler, empty tensors with a shape containing 0 are not supported and should be avoided during development. |
| dtype | Specifies the data type of a tensor object. Default value: none. Value range: float16, float32, int8, int16, int32, uint8, uint16, bfloat16, bool, and more. NOTE: Different computation operations support different data types. For details, see the API Reference. |
| format | Defines the tensor format. For details, see Format. |
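As a mental model of the attributes above, a tensor descriptor can be sketched as a small record type. This is an illustrative stand-in, not the actual TensorDesc API; the class and field names are assumptions for the example:

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class TensorDesc:
    """Illustrative stand-in for a tensor descriptor (names are hypothetical)."""
    name: str                 # indexes the tensor; must be unique
    shape: Tuple[int, ...]    # e.g. (2, 3, 4); no dimension may be 0
    dtype: str                # e.g. "float16", "float32", "int8"
    fmt: str = "ND"           # tensor format, e.g. "NCHW", "NHWC"

    def __post_init__(self):
        # Mirror the AI Core restriction: empty tensors are not supported.
        if any(d <= 0 for d in self.shape):
            raise ValueError("empty tensors (a 0 in the shape) are not supported")

desc = TensorDesc(name="x", shape=(2, 3, 4), dtype="float16", fmt="NCHW")
```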
Shape
The shape of a tensor is described in the format (D0, D1, ..., Dn – 1), where D0 to Dn – 1 are positive integers.
For example, the shape (3, 4) indicates a 3 x 4 matrix (3 rows and 4 columns), where the first dimension has three elements, and the second dimension has four elements.
The first element of a shape is the element count within the outermost square brackets of the tensor, the second element is the element count within the next level of square brackets, and so on. See the following examples.
| Tensor | Shape | Description |
|---|---|---|
| 1 | () | 0-dimensional tensor, that is, a scalar |
| [1, 2, 3] | (3,) | 1-dimensional tensor |
| [[1, 2], [3, 4]] | (2, 2) | 2-dimensional tensor |
| [[[1, 2], [3, 4]], [[5, 6], [7, 8]]] | (2, 2, 2) | 3-dimensional tensor |
The tensor shape has the following physical meanings:
A tensor with shape (4, 20, 20, 3) represents four images (the 4 in the shape), each 20 x 20 pixels (the two 20s in the shape), where each pixel contains red, green, and blue color components (the 3 in the shape).
In programming, the shape can be understood simply as a loop of each layer of a tensor. For example, for tensor A with shape (4, 20, 20, 3), the loop statement is as follows.
produce A {
  for (i, 0, 4) {
    for (j, 0, 20) {
      for (p, 0, 20) {
        for (q, 0, 3) {
          A[(((i*20 + j)*20 + p)*3 + q)] = a_tensor[(((i*20 + j)*20 + p)*3 + q)]
        }
      }
    }
  }
}
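The nested loops above can be written out in plain Python; `a_tensor` here is a hypothetical flat buffer of 4 * 20 * 20 * 3 elements, and the index expression matches the one in the pseudocode:

```python
# Walk a (4, 20, 20, 3) tensor stored as a flat buffer, mirroring the
# nested loops above. The flat index of element (i, j, p, q) is
# ((i*20 + j)*20 + p)*3 + q.
a_tensor = list(range(4 * 20 * 20 * 3))  # stand-in input buffer
A = [0] * len(a_tensor)

for i in range(4):
    for j in range(20):
        for p in range(20):
            for q in range(3):
                idx = ((i * 20 + j) * 20 + p) * 3 + q
                A[idx] = a_tensor[idx]
```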
Axis
An axis is denoted by the index of a tensor dimension. For a 2D tensor with five rows and six columns — shape (5, 6) — axis 0 represents the first dimension of the tensor (the rows) and axis 1 represents the second dimension (the columns).
For example, for tensor [[[1, 2],[3, 4]], [[5, 6],[7, 8]]] with shape (2, 2, 2):
- Axis 0 represents the data in the first dimension, that is, matrices [[1, 2],[3, 4]] and [[5, 6],[7, 8]].
- Axis 1 represents the data in the second dimension, that is, arrays [1, 2], [3, 4], [5, 6], and [7, 8].
- Axis 2 represents the data in the third dimension, that is, numbers 1, 2, 3, 4, 5, 6, 7, and 8.
A negative axis is interpreted as indexing from the end.
The axes of an n-dimensional tensor include 0, 1, 2, ..., and n – 1.
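The axis convention can be demonstrated with NumPy, which is used here purely as an illustration (TBE's reduction APIs take an equivalent axis argument):

```python
import numpy as np

t = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]]])  # shape (2, 2, 2)

# Summing along an axis removes that dimension.
assert np.array_equal(t.sum(axis=0), [[6, 8], [10, 12]])  # adds the two matrices
assert np.array_equal(t.sum(axis=2), [[3, 7], [11, 15]])  # adds each innermost array
assert np.array_equal(t.sum(axis=-1), t.sum(axis=2))      # negative axis counts from the end
```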
Weight
In a compute unit, each input is multiplied by an associated weight value. For example, a two-input operator allocates a weight to each of its inputs. Generally, more important data is assigned a greater weight, and data with a weight of zero contributes nothing, so the features it represents are effectively ignored.
As shown in Figure 4, in the compute unit, input X1 is multiplied by its associated weight W1 (X1 * W1).
Bias
A bias is another linear component to be applied to the input data, in addition to a weight. The bias is added to the product of the input and its weight.
As shown in Figure 5, in the compute unit, input X1 is multiplied by its associated weight W1 and then added to its associated bias B1, producing the result (X1 * W1 + B1).
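The compute unit described above reduces to a one-line affine transform. A minimal sketch, with variable names X1, W1, and B1 following the figures:

```python
def compute_unit(x1: float, w1: float, b1: float) -> float:
    """Multiply the input by its weight, then add the bias: X1 * W1 + B1."""
    return x1 * w1 + b1

# 2.0 * 0.5 + 1.0 = 2.0
y = compute_unit(2.0, 0.5, 1.0)
```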
Broadcast
Broadcasting makes arrays with different shapes compatible for arithmetic operations. TBE requires that each dimension of the array to be broadcast have a size of either 1 or the same size as the corresponding dimension of the target shape; TBE expands one-element dimensions only.
For example, if the shape of an original array is (2, 1, 64) and the target shape is (2, 128, 64), the array can be broadcast to a new array with the shape (2, 128, 64).
The compute APIs of TBE do not support automatic broadcasting; the two input tensors must have the same shape. Consequently, you need to compute the target shape and broadcast the input tensors yourself before calling the arithmetic operation.
For example, before adding tensor A with shape (4, 3, 1, 5) to tensor B with shape (1, 1, 2, 1), you must perform the following steps:
- Compute the target shape, that is, shape C.
Figure 6 Example 1
Use the larger of the corresponding dimensions as the target dimensions (4, 3, 2, 5).
- Call the broadcast API to broadcast Tensor A and Tensor B to the target shape C.
The size of each dimension of the tensor to be broadcast must be either 1 or the same as that of the target shape.
If this rule is not met, the broadcast cannot be performed. As shown in the following figure, the size of the fourth dimension of Tensor B' is 3, which is not 1 and is not equal to the size of the fourth dimension of Tensor C (which is 5). Therefore, the broadcast cannot be performed.

- Call the compute API to sum Tensor A and Tensor B.
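The three steps above (compute the target shape, broadcast explicitly, then add) can be sketched with NumPy's `np.broadcast_to`, which, like TBE, expands only dimensions of size 1:

```python
import numpy as np

a = np.ones((4, 3, 1, 5))  # Tensor A
b = np.ones((1, 1, 2, 1))  # Tensor B

# Step 1: the target shape C takes the larger of each pair of dimensions.
target = tuple(max(da, db) for da, db in zip(a.shape, b.shape))  # (4, 3, 2, 5)

# Step 2: broadcast both inputs to the target shape explicitly.
a_b = np.broadcast_to(a, target)
b_b = np.broadcast_to(b, target)

# Step 3: with identical shapes, the elementwise sum is well defined.
c = a_b + b_b

# A dimension that is neither 1 nor equal to the target cannot be broadcast:
# np.broadcast_to(np.ones((1, 1, 2, 3)), (4, 3, 2, 5)) raises ValueError.
```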
Reduction
Reduction is an operation that removes one or more dimensions from a tensor by performing a computation (such as a sum) across those dimensions. Reduction operators include Sum, Min, Max, All, and Mean in TensorFlow, and Reduction in Caffe. The Caffe Reduction operator is used as an example below.
- Attributes of the Reduction operator
- ReductionOp: operation type. Four operation types are supported.
Table 3 Operation types supported by the Reduction operator

| Operation Type | Description |
|---|---|
| SUM | Computes the sum of elements across specified dimensions of a tensor. |
| ASUM | Computes the sum of the absolute values of elements across specified dimensions of a tensor. |
| SUMSQ | Computes the sum of the squares of elements across specified dimensions of a tensor. |
| MEAN | Computes the mean of elements across specified dimensions of a tensor. |
- axis: first dimension to reduce. The value range is [–N, N – 1], where N is the number of dimensions of the input tensor.
For example, for an input tensor with shape (5, 6, 7, 8):
- If axis = 3, the shape of the output tensor is (5, 6, 7).
- If axis = 2, the shape of the output tensor is (5, 6, 8).
- If axis = 1, the shape of the output tensor is (5, 7, 8).
- If axis = 0, the shape of the output tensor is (6, 7, 8).
- coeff: a scalar for the scaling factor. The value 1 indicates that the output is not scaled.
The following example shows how dimensions are reduced for a SUM reduction of the 2 x 3 tensor [[1, 1, 1], [1, 1, 1]] with shape (2, 3):
- If axis = 0, compute the sum of elements across the rows, obtaining [2, 2, 2]. That is, the 2D matrix is reduced to 1D.
- If axis = 1, compute the sum of elements across the columns, obtaining [3, 3].
- If axis = [0, 1], perform reduction with axis = 0 to obtain [2, 2, 2], then reduce again to obtain 6, resulting in a 0D scalar.
- If axis = [], no reduction is performed and the dimensions are retained.
- If axis is NULL, all dimensions are reduced, resulting in a 0D scalar.
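The axis behavior above can be checked with NumPy; `np.sum` stands in for the Reduction operator's SUM mode, and `axis=None` plays the role of NULL:

```python
import numpy as np

t = np.ones((2, 3))  # [[1, 1, 1], [1, 1, 1]]

assert np.array_equal(t.sum(axis=0), [2, 2, 2])  # rows collapsed -> 1D
assert np.array_equal(t.sum(axis=1), [3, 3])     # columns collapsed -> 1D
assert t.sum(axis=(0, 1)) == 6                   # both axes -> 0D scalar
assert t.sum(axis=None) == 6                     # None reduces all dimensions

x = np.ones((5, 6, 7, 8))
assert x.sum(axis=3).shape == (5, 6, 7)  # reducing one axis removes it
assert x.sum(axis=0).shape == (6, 7, 8)
```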
Transformation Operators
Transformation operators transform tensor attributes such as the data type and format to streamline the computation of upstream and downstream operators in a graph.
Since the graph has gone through a series of processing operations, such as offload, optimization, and fusion, the tensor attributes of the nodes may have changed. In this case, transformation operators are needed. During the network topology building, FE automatically inserts transformation operators, saving manual conversion workload.
- Format conversion
In the TensorFlow network, the format for Conv2D and MatMul operators is HWCN, and that for other operators is NCHW or NHWC.
In the AI Core, the format for Conv2D and MatMul operators is FZ, and that for other operators is NC1HWC0.
The format used by TensorFlow operators of AI CPU is the same as that used in the TensorFlow network.
During network execution, different operators are executed by different modules. Therefore, format conversion is required. Examples are as follows.
Figure 7 Format conversion
- Data type cast
Some operators on the Ascend AI Processor support only float16. When the data type of an input variable is float32, transformation operators must be inserted to cast the data type.
To optimize variable update, a variable operator supporting float16 is added to the training network. The usage example is as follows.
Figure 8 Data type cast
The float16 variable is directly used for forward and backward propagation.
During gradient update, both float32 and float16 variables are involved in the calculation, and data of type float32 is used as the benchmark.
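The mixed-precision update pattern described above can be sketched with NumPy casts standing in for the inserted Cast operators; the variable names are illustrative:

```python
import numpy as np

# The master copy of the variable is kept in float32 (the benchmark type).
var_fp32 = np.array([0.1, 0.2, 0.3], dtype=np.float32)

# A float16 copy is what forward and backward propagation actually use.
var_fp16 = var_fp32.astype(np.float16)

# The gradient arrives in float16 from backpropagation...
grad_fp16 = np.array([0.01, 0.01, 0.01], dtype=np.float16)

# ...but the update is performed in float32: cast up, update, cast back down.
lr = 0.5
var_fp32 = var_fp32 - lr * grad_fp16.astype(np.float32)
var_fp16 = var_fp32.astype(np.float16)
```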