MindStudio Advisor Overview
MindStudio Advisor is a tool used to locate the top performance issues of models and operators, identify and analyze bottlenecks, and output tuning suggestions, thereby improving development efficiency.
To use the following functions, prepare input data according to Input Data Description. Roofline model-based operator bottleneck identification and tuning suggestion, timeline-based AI CPU operator tuning, operator fusion recommendation, and TransData operator identification are performed according to MindStudio Advisor Entry, and operator tuning analysis is performed according to Operator Project Entry.
Roofline Model-based Operator Bottleneck Identification and Tuning Suggestion
As shown in Figure 1, the horizontal axis is in FLOP/Byte and represents the operational intensity, that is, the number of operations performed for each byte of data moved. A larger value indicates higher memory-movement utilization. The vertical axis is in FLOP/s and represents the computation speed. A larger value indicates faster computation.
A larger horizontal coordinate means more operations are performed for each byte of data moved. However, the number of operations per second cannot exceed the upper limit of the hardware performance π, that is, the green line in the figure. When the horizontal coordinate falls below a certain threshold (the blue point Imax in the figure), the moved data cannot keep the hardware computing at its maximum attainable performance. In this case, the attainable performance is β·x, where β is the hardware bandwidth, that is, the red slanted line in the figure.
The blue point divides the Roofline model into two parts: the red part is the memory-bound region, and the green part is the compute-bound region. The closer the actual working point is to the red or green line, the more severe the bottleneck.
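In other words, the attainable performance at operational intensity x is min(π, β·x). The following minimal sketch illustrates this relationship; the peak performance and bandwidth values are placeholders, not actual Ascend hardware specifications.

```python
# Minimal Roofline sketch. PEAK_FLOPS and BANDWIDTH are placeholder values,
# not actual Ascend hardware specifications.
PEAK_FLOPS = 16e12    # pi: peak compute performance, FLOP/s (placeholder)
BANDWIDTH = 1e12      # beta: peak memory bandwidth, Byte/s (placeholder)
I_MAX = PEAK_FLOPS / BANDWIDTH   # ridge point (blue point in Figure 1), FLOP/Byte

def attainable_performance(intensity: float) -> float:
    """Attainable performance (FLOP/s) at a given operational intensity (FLOP/Byte)."""
    return min(PEAK_FLOPS, BANDWIDTH * intensity)

def classify(intensity: float) -> str:
    """Classify a working point as memory bound or compute bound."""
    return "memory bound" if intensity < I_MAX else "compute bound"

# Example: an operator performing 2 GFLOP while moving 0.5 GB of data.
intensity = 2e9 / 0.5e9   # 4 FLOP/Byte
print(classify(intensity), attainable_performance(intensity))
```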
Timeline-Based Tuning of AI CPU Operators
The AI CPU is a compute unit of the Ascend AI Processor. Because of its performance limitations, operators running on the AI CPU affect the model execution time. Therefore, tuning of AI CPU operators requires special attention.
During model development and conversion, AI CPU operators may be introduced. When other tasks have to wait for their serial execution, the overall model execution time is affected.
Timeline analysis shows that performance bottlenecks are generally caused by such serial execution waiting of operators. Currently, the timeline information is displayed by stream and time interval, so the serial/parallel relationships between operators cannot be read directly.
As shown in Figure 2, model execution serially waits for the AI CPU operator in Task 1 (PTCopy) in the AI CPU timeline; a bottleneck analysis model is required to identify this type of bottleneck. The execution time of Task 2 in the AI CPU timeline is hidden within the AI Core execution time, so this AI CPU operator can be ignored.
Timeline-based AI CPU operator tuning uses the Profiling Task Scheduler data file (task_time_xxxx.json) and the offline OM model file as input data to automatically identify AI CPU operators that are executed serially and to provide tuning suggestions for improving the overall model performance.
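As a rough illustration of this check, the sketch below scans a trace-style task timeline and flags AI CPU tasks whose execution does not overlap any AI Core task. The field names (ts, dur, args) and the engine labels are assumptions based on the common Chrome trace format; the actual task_time_xxxx.json layout may differ.

```python
import json

# Sketch only: assumes task_time_xxxx.json is a Chrome-trace-style event list with
# "ts" (start, us), "dur" (duration, us), and an engine label inside "args".
# Adapt the field names and labels to the actual Profiling output.
def load_tasks(path, engine_keyword):
    with open(path) as f:
        events = json.load(f)
    return [(e["ts"], e["ts"] + e.get("dur", 0), e.get("name", ""))
            for e in events if engine_keyword in str(e.get("args", {}))]

def serial_aicpu_tasks(aicpu_tasks, aicore_tasks):
    """Return AI CPU tasks that do not overlap any AI Core task (serial waiting)."""
    serial = []
    for start, end, name in aicpu_tasks:
        overlapped = any(s < end and start < e for s, e, _ in aicore_tasks)
        if not overlapped:
            serial.append(name)
    return serial

aicpu = load_tasks("task_time_0.json", "AI_CPU")    # engine labels are assumptions
aicore = load_tasks("task_time_0.json", "AI_CORE")
print(serial_aicpu_tasks(aicpu, aicore))
```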
Operator Fusion Recommendation
Operator fusion recommendation includes UB operator fusion, input-layer operator fusion, and L2 operator fusion (dynamic batch tiling).
UB operator fusion
During model conversion, operator fusion may fail for various reasons, for example, the data computed by the operators exceeds the UB size or the current operator fusion pattern is not covered. If the OM model is very complex, for example, it contains thousands of compute nodes, locating fusible operators manually takes a long time. In addition, the rules of the model conversion tool cannot cover all scenarios, so some fusion patterns are missed.
UB operator fusion recommendation uses the offline OM model file as input data to automatically discover operators in the OM model that can be fused, identify scenarios where operator fusion is missed, and provide corresponding operator fusion suggestions.
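The core check behind such a recommendation can be illustrated with a simplified sketch: consecutive element-wise operators are fusion candidates only if their combined working data fits in the UB. The UB size, operator list, and greedy grouping below are illustrative assumptions, not the actual Advisor implementation.

```python
# Simplified illustration of the UB footprint check behind fusion recommendation.
# UB_SIZE and the operator list are illustrative; the real rules in the model
# conversion tool also depend on fusion patterns, data types, and tiling.
UB_SIZE = 256 * 1024  # bytes (illustrative)

# Each entry: (operator name, bytes of UB working data it needs when fused)
elementwise_chain = [("Add_1", 96 * 1024), ("Relu_1", 96 * 1024), ("Mul_1", 128 * 1024)]

def recommend_fusion(chain):
    """Greedily group consecutive operators whose combined UB footprint fits in the UB."""
    groups, current, used = [], [], 0
    for name, size in chain:
        if used + size > UB_SIZE and current:
            groups.append(current)
            current, used = [], 0
        current.append(name)
        used += size
    if current:
        groups.append(current)
    return [g for g in groups if len(g) > 1]  # only multi-operator groups are candidates

print(recommend_fusion(elementwise_chain))
```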
Input-layer operator fusion
With load3dv2, the input channel count does not need to be aligned to 16 or 32, so four-channel (small-channel) convolution is supported, which greatly reduces MTE2 data movement and Cube computation. In non-AIPP scenarios, Cast and TransData operators are required at the input layer to convert the data type and data layout format of the images. When C0 is equal to 4, the performance of the TransData operator deteriorates severely, and so does the performance of the entire network. In AIPP scenarios, however, AIPP+Conv or AIPP+Conv+MaxPooling fusion is performed at the input layer of the network. Generally, the 16-channel convolution at the input layer takes about 10% of the time of the entire network, so enabling the small-channel mode greatly improves network performance.
Input-layer operator fusion uses the offline OM model file as input data to automatically discover data preprocessing operators in the OM model that can be tuned and recommends enabling AIPP to improve performance.
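The benefit of the small-channel mode can be seen from the padded data volume of the 5D layout: a 3-channel image padded to C0 = 16 moves roughly four times the data of the same image padded to C0 = 4. The sketch below only illustrates this arithmetic with an assumed 224 x 224 input; it is not Advisor output.

```python
import math

# Illustrative arithmetic: data volume of a 3-channel 224x224 uint8 image after
# padding the channel dimension to C0 in the 5D layout (N, C1, H, W, C0).
def padded_bytes(n, c, h, w, c0, dtype_bytes=1):
    c1 = math.ceil(c / c0)
    return n * c1 * h * w * c0 * dtype_bytes

small_channel = padded_bytes(1, 3, 224, 224, c0=4)    # small-channel mode
default_mode = padded_bytes(1, 3, 224, 224, c0=16)    # default 16-aligned mode
print(small_channel, default_mode, default_mode / small_channel)  # ratio = 4.0
```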
L2 fusion (dynamic batch tiling)
On a network, some operator layers adopt L2 fusion. When the data volume (including input and output data) is large and exceeds the L2 space, DDR writeback is triggered (DDR read and write bandwidth statistics are collected and displayed in a table in Analysis Summary). However, the DDR bandwidth is much lower than the L2 bandwidth. As a result, the MTE2 bound becomes severe and pipeline problems may occur.
Take resnet50_int8_8batch as an example:
Theoretically, the single-operator computation time with 8batch should be at most twice that with 4batch. However, if the operators at each layer of the 8batch model adopt L2 fusion, the data volume increases, the L2 cache space becomes insufficient, DDR writeback occurs, and the operator performance deteriorates.
Table 1 Comparison of operator performance between 8batch and 4batch

| Operator | 8batch Performance (μs) | 4batch Performance (μs) | 8batch/4batch Ratio |
|---|---|---|---|
| res2a_branch2c | 200.63 | 38.284 | 5.24 |
| res2b_branch2c | 189.97 | 40.527 | 4.69 |
| res2c_branch2c | 147.84 | 38.533 | 3.84 |
| res3a_branch1 | 74.00 | 31.109 | 2.38 |
| Time consumed by all network operators (μs) | 2031 | 954 | - |
| Percentage of the listed operators in the network-wide operator time | 30.2% | 15.5% | - |
As shown in Table 1, the computation time of 8batch operators is much greater than twice that of the 4batch operators. The operator performance deteriorates severely.
Advisor analyzes the operators at each layer of the OM model and reads the op_summary.csv data to identify and output operators whose L2 fusion causes performance bottlenecks. These operators need to be tiled so that the data of the operators at each layer does not exceed the L2 space, preventing DDR writeback.
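A rough version of the batch-scaling check illustrated in Table 1 can be expressed as follows. The column names ("Op Name", "Task Duration(us)") are assumptions about the op_summary.csv layout; adjust them to match the actual Profiling output.

```python
import csv

# Sketch: flag operators whose 8batch time is more than twice their 4batch time.
# The column names below are assumptions; check the actual op_summary.csv header.
def load_durations(path, name_col="Op Name", dur_col="Task Duration(us)"):
    with open(path, newline="") as f:
        return {row[name_col]: float(row[dur_col]) for row in csv.DictReader(f)}

def batch_scaling_outliers(summary_8batch, summary_4batch, threshold=2.0):
    d8 = load_durations(summary_8batch)
    d4 = load_durations(summary_4batch)
    return sorted(
        ((name, d8[name] / d4[name])
         for name in d8 if name in d4 and d8[name] > threshold * d4[name]),
        key=lambda item: -item[1],
    )

# Hypothetical file names for the two profiling runs.
for name, ratio in batch_scaling_outliers("op_summary_8batch.csv", "op_summary_4batch.csv"):
    print(f"{name}: 8batch/4batch = {ratio:.2f}")
```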
TransData Operator Identification
Format conversion in models is a key factor that affects model performance. Format conversion is introduced mainly by the NPU Cube unit. In CV networks, a large number of conversions between 4D and 5D formats are performed. In NLP networks, a large number of conversions between 4D and NZ formats are performed. A large number of TransData operators degrade model performance.
TransData operator identification uses analysis models that cover common TransData scenarios, identifies the performance bottlenecks caused by conversion operators, analyzes the bottlenecks at the operator layer, adaptation layer, and model layer, and selects a proper tuning solution to reduce the number of TransData calls.
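Before applying the analysis models, a simple first check is how many TransData operators exist and what share of the total operator time they take, which can be read from op_summary.csv. The column names below are assumptions about the file layout.

```python
import csv

# Sketch: estimate how much of the total operator time is spent in TransData.
# "OP Type" and "Task Duration(us)" are assumed column names; verify against
# the actual op_summary.csv header.
def transdata_share(path, type_col="OP Type", dur_col="Task Duration(us)"):
    total = trans = 0.0
    count = 0
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            dur = float(row[dur_col])
            total += dur
            if row[type_col] == "TransData":
                trans += dur
                count += 1
    return count, (trans / total if total else 0.0)

count, share = transdata_share("op_summary.csv")
print(f"{count} TransData operators, {share:.1%} of total operator time")
```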
Operator Tuning Analysis
Main operator tuning scenario: the operator performance does not meet requirements after operator development is complete or during network-wide execution. Operator tuning places high demands on developers, who need to understand the underlying hardware and the framework and have operator tuning experience. Operator tuning analysis helps developers quickly locate performance bottlenecks and provides corresponding tuning methods, improving operator tuning efficiency.
Operator tuning analysis examines the dump file generated during operator simulation (provided as input) from four dimensions: vector operations, scalar operations, pipeline interruption, and memory bound, and provides the analysis data and corresponding tuning suggestions.
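The kind of verdict this analysis produces can be illustrated with a toy classifier over per-dimension cycle ratios. The metric names, example values, and the 0.8 threshold below are purely illustrative and do not reflect the actual analysis model or dump file format.

```python
# Toy illustration of a four-dimension verdict. The metric names and the 0.8
# threshold are illustrative only and do not reflect the actual analysis model.
def dominant_bottleneck(metrics):
    """metrics: fraction of cycles attributed to each dimension (0.0-1.0)."""
    dimension, value = max(metrics.items(), key=lambda kv: kv[1])
    if value < 0.8:
        return "no single dominant bottleneck"
    return f"{dimension} bound ({value:.0%} of cycles)"

# Example ratios from a hypothetical simulation dump.
print(dominant_bottleneck({
    "vector": 0.10, "scalar": 0.04, "pipeline_stall": 0.03, "memory": 0.83,
}))
```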

