cube/vector

In this scenario, program performance is determined by the execution time of the compute kernels. You can check whether the program is Cube/Vector bound in the following ways:

  1. Run the npu-smi info command to check the usage. If the usage is high, the program is probably Cube/Vector bound. Usage that is low for a single inference may exceed 80% (generally considered high) when multiple inferences run concurrently. The value reported by this command is the ratio of the cycles spent on the Cube core to the cycles available at the clock frequency, and it does not include the Vector Core. Treat it as a preliminary check only and calculate the cycles from the profiling data.

    The following figure shows how to calculate the utilization through profiling.

  2. Filter by operator type. For example, select AI_CORE to calculate the Cube utilization.
  3. Sort by start time and calculate the difference between the first and last timestamps. The difference is the total time consumption of the entire segment.
  4. The total number of Cube cycles is the sum of the aic_total_cycles values.
  5. The utilization is ③/(②/1000000 × 20 × 1650 × 1000000), where ③ is the total number of Cube cycles and ② is the total time consumption. The unit of ② is μs, so divide ② by 1000000 to convert it to seconds. 20 is the number of AI Cores on the chip. 1650 is the frequency of the current chip in MHz, so multiply 1650 by 1000000 to convert it to Hz.

    In this example, the calculated overall Cube utilization is 78.9%.

  6. The proportion of each stage of the computation can be calculated from its time consumption. For example, if MTE2 accounts for 70.18% of the total AI Core time, memory movement takes up a large proportion of the time spent on the AI Core.
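
The profiling steps above can be sketched in Python. This is a minimal illustration, not the profiler's own tooling: only aic_total_cycles comes from the text, while the row layout and the keys task_type, start_time_us, mte2_time_us, and aicore_time_us are hypothetical names chosen for the example. The defaults of 20 AI Cores and 1650 MHz are the values quoted above and must be adjusted for other chips.

```python
def cube_utilization(rows, num_cores=20, freq_mhz=1650):
    """Estimate overall Cube utilization from profiling rows.

    rows: list of dicts. "aic_total_cycles" is the field named in the text;
    "task_type" and "start_time_us" are hypothetical key names.
    """
    # Filter by operator type (AI_CORE operators only).
    core_rows = [r for r in rows if r["task_type"] == "AI_CORE"]
    if len(core_rows) < 2:
        raise ValueError("need at least two AI_CORE rows")

    # Sort by start time; the elapsed time is last minus first timestamp (us).
    starts = sorted(r["start_time_us"] for r in core_rows)
    elapsed_us = starts[-1] - starts[0]
    if elapsed_us <= 0:
        raise ValueError("elapsed time must be positive")

    # Total Cube cycles is the sum of aic_total_cycles.
    total_cycles = sum(r["aic_total_cycles"] for r in core_rows)

    # Cycles available: seconds x number of cores x frequency in Hz.
    available = elapsed_us / 1_000_000 * num_cores * freq_mhz * 1_000_000
    return total_cycles / available


def stage_time_share(rows, stage_key, total_key):
    """Share of one stage's time in the total AI Core time (both key names hypothetical)."""
    total = sum(r[total_key] for r in rows)
    if total <= 0:
        raise ValueError("total time must be positive")
    return sum(r[stage_key] for r in rows) / total
```

A result close to 1.0 from cube_utilization suggests the segment is Cube bound, and a large stage_time_share for a data-movement stage such as MTE2 points to memory movement rather than computation.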
Optimization methods in the cube/vector bound scenario:
  • Graph optimization, for example, torch.compile in PyTorch or XLA in TensorFlow.
  • Operator fusion, which reduces the operator startup overhead and the intermediate reads and writes between operators.
  • Data type. For example, changing the data type from float32 to HF32 reduces the amount of computation for the same data. This may affect the accuracy, so an accuracy test must be performed. The HF32 data type takes effect only for Conv and Matmul operators.
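
The accuracy test mentioned for the data type change can be sketched as a comparison between a reference result and a reduced-precision result. The example below is only an illustration under stated assumptions: truncate_mantissa is a hypothetical stand-in that zeroes the low mantissa bits of a float32 value and does not reproduce the actual HF32 format, and the choice of 10 kept bits is arbitrary. In practice, the two results would come from running the same model with float32 and with HF32 enabled.

```python
import struct


def truncate_mantissa(x, kept_bits):
    """Round x to float32, then zero all but the top kept_bits mantissa bits.

    Hypothetical stand-in for a reduced-precision format, not the HF32 definition.
    """
    if not 0 <= kept_bits <= 23:
        raise ValueError("kept_bits must be in 0..23")
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    mask = 0xFFFFFFFF & ~((1 << (23 - kept_bits)) - 1)
    return struct.unpack("<f", struct.pack("<I", bits & mask))[0]


def dot(a, b, quantize=lambda v: v):
    """Dot product with an optional per-element quantization of the inputs."""
    return sum(quantize(x) * quantize(y) for x, y in zip(a, b))


def max_relative_error(reference, candidate, eps=1e-12):
    """Largest element-wise relative error of candidate against reference."""
    return max(abs(r - c) / max(abs(r), eps) for r, c in zip(reference, candidate))


a = [0.1, 0.2, 0.3]
b = [0.4, 0.5, 0.6]
reference = dot(a, b)
reduced = dot(a, b, quantize=lambda v: truncate_mantissa(v, 10))
error = max_relative_error([reference], [reduced])
print(f"max relative error: {error:.2e}")
```

Whether the measured error is acceptable depends on the model and the task, so the threshold should be set from the accuracy requirements of the application.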