Matmul Features

In addition to the basic computing capabilities described in the preceding sections, Basics and Operator Implementation, Matmul cube programming provides further features for specific scenarios. The following tables list these features. Each one is described in detail in its own section.

**Table 1** Matmul features
| Category | Feature | Description |
| --- | --- | --- |
| Function implementation | Multi-core aligned tiling | In a multi-core scenario, matrix data can be tiled along the M, N, and K axes. This feature implements parallel matrix multiplication across cores in the aligned scenario, where M, N, and K are evenly divisible by singleCoreM, singleCoreN, and singleCoreK, respectively. |
| Function implementation | Multi-core non-aligned tiling | In a multi-core scenario, matrix data can be tiled along the M, N, and K axes. This feature handles the non-aligned scenario, also called the tail block scenario, where M, N, or K is not evenly divisible by singleCoreM, singleCoreN, or singleCoreK, respectively. |
| Function implementation | Asynchronous scenario processing | In the MIX scenario (which includes both matrix computation and vector computation), other computations can proceed without waiting for matrix multiplication to complete. |
| Function implementation | Custom data movement | The data movement performed before and after matrix multiplication can be customized. You can define how the left matrix A and right matrix B are moved from the global memory to A1 and B1, respectively, and how the output matrix C is moved from CO1 to the global memory. |
| Function implementation | Channel split of matrix multiplication outputs | Channel split of matrix multiplication outputs, or ChannelSplit, stores the output matrix C in 16 × 8 fractals when C has the float data type and the NZ data format. |
| Function implementation | General matrix-vector multiplication | General matrix-vector multiplication, or GEMV, is matrix multiplication in which M = 1 and K > 1. In other words, the left matrix A has shape (1, K). |
| Function implementation | Upper/Lower triangular matrix multiplication | Computation is skipped for elements in the lower (or upper) triangular part of the matrix and performed only for elements in the upper (or lower) triangular part. |
| Function implementation | Matrix multiplication with TSCM inputs | Matrix multiplication is performed on a left matrix A or right matrix B whose logical memory position is TSCM. |
| Function implementation | N-direction alignment of matrix multiplication outputs | N-direction alignment of matrix multiplication outputs, also called ND_ALIGN output, applies when the output matrix C uses the ND_ALIGN data format: C is automatically padded to 32-byte alignment in the N direction when it is output. |
| Function implementation | Partial output for a single matrix multiplication | Partial output for a single matrix multiplication, also called Partial Output, outputs the computation results directly, without accumulating them along the K direction within a single core. |
| Function implementation | Independent running mechanism of AIC and AIV | The independent running mechanism of AIC and AIV is also called the dual-master mode. In the MIX scenario (which includes both matrix computation and vector computation), the AIC and AIV cores run independently, without relying on the message mechanism. |
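
The difference between the aligned and non-aligned tiling rows above comes down to simple arithmetic on one axis. The following is a minimal Python sketch of that arithmetic only, not the Matmul tiling API: the helper names `split_axis` and `is_aligned` are hypothetical, and the sizes 1024, 1000, and 256 are example values chosen for illustration.

```python
def split_axis(total, single_core):
    """Split the range [0, total) into (offset, length) blocks of at most single_core.

    Hypothetical helper for illustration; `total` stands for M, N, or K and
    `single_core` for singleCoreM, singleCoreN, or singleCoreK.
    """
    if total <= 0 or single_core <= 0:
        raise ValueError("total and single_core must be positive")
    blocks = []
    for offset in range(0, total, single_core):
        # The last block is shorter when total is not a multiple of single_core.
        blocks.append((offset, min(single_core, total - offset)))
    return blocks


def is_aligned(total, single_core):
    """Return True when the axis divides evenly, i.e. there is no tail block."""
    return total % single_core == 0


# Aligned scenario: M = 1024 divides evenly into blocks of singleCoreM = 256.
aligned_blocks = split_axis(1024, 256)

# Non-aligned scenario: M = 1000 leaves a tail block of 1000 - 3 * 256 = 232 rows.
tail_blocks = split_axis(1000, 256)

print(aligned_blocks)
print(tail_blocks)
```

In the non-aligned case, the last block on the axis is the tail block, and whichever core processes it works on fewer rows (or columns) than the others.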

**Table 2** Matmul features
| Category | Feature | Description |
| --- | --- | --- |
| Function implementation | Quantization and dequantization of matrix multiplication outputs | When the matrix multiplication result is moved from CO1 to the global memory, the matrix elements are quantized or dequantized. |
| Function implementation | 4:2 sparse matrix multiplication | 4:2 sparse matrix multiplication, also called Sparse Matmul, is matrix multiplication of a sparse left matrix A by a right matrix B that has undergone 4:2 densification. |
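
To give a rough idea of what 4:2 densification means, the Python sketch below keeps two values out of every group of four and records where they came from. It is a conceptual illustration under stated assumptions: the function names `densify_4_2` and `expand_4_2` are hypothetical, grouping consecutive elements of a flat row is an assumption, and the index layout shown here is not the format required by the Matmul API, which this table does not describe.

```python
def densify_4_2(row):
    """Compress a row in which every group of 4 elements has at most 2 nonzeros.

    Returns (values, indices): two kept values per group and their positions
    (0..3) inside the group. Hypothetical layout, for illustration only.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    values, indices = [], []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        nonzero = [i for i, v in enumerate(group) if v != 0]
        if len(nonzero) > 2:
            raise ValueError("group has more than 2 nonzero elements")
        # Fill up with zero-valued positions so that exactly two slots are kept.
        zeros = [i for i in range(4) if i not in nonzero]
        keep = sorted((nonzero + zeros)[:2])
        values.extend(group[i] for i in keep)
        indices.extend(keep)
    return values, indices


def expand_4_2(values, indices):
    """Inverse of densify_4_2: rebuild the original sparse row."""
    row = []
    for g in range(len(values) // 2):
        group = [0, 0, 0, 0]
        for j in range(2):
            group[indices[2 * g + j]] = values[2 * g + j]
        row.extend(group)
    return row


sparse_row = [0, 3, 0, 5, 7, 0, 0, 0]
dense_values, kept_indices = densify_4_2(sparse_row)
print(dense_values, kept_indices)
```

The densified data is half the length of the original, which is where the saving in storage and computation comes from. The indices are what allow the kept values to be matched with the correct elements of the other operand.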

**Table 3** BatchMatmul features
| Category | Feature | Description |
| --- | --- | --- |
| Function implementation | Basic functions of Batch Matmul | Batch Matmul processes Matmul computations in batches. A single call to the IterateBatch API computes multiple matrices C, each of size singleCoreM × singleCoreN. |
| Function implementation | Batch Matmul reusing bias matrix | A bias matrix without a batch axis is reused for the Matmul computation of every batch. |
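
The bias reuse described above can be pictured with a small reference computation. This is a plain Python sketch of the mathematics only and does not call IterateBatch or any other Ascend C API. The function names are hypothetical, and treating the bias as a length-N vector added to every row of C is an assumption made for the example.

```python
def matmul_bias(a, b, bias):
    """Compute C = A x B + bias for one batch, with bias added to every row of C."""
    m, k, n = len(a), len(b), len(b[0])
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) + bias[j] for j in range(n)]
        for i in range(m)
    ]


def batch_matmul_shared_bias(batch_a, batch_b, bias):
    """Apply matmul_bias to every batch, reusing the same bias (no batch axis)."""
    if len(batch_a) != len(batch_b):
        raise ValueError("A and B must have the same number of batches")
    return [matmul_bias(a, b, bias) for a, b in zip(batch_a, batch_b)]


# Two batches; A has shape (1, 2) and B has shape (2, 2) in each batch.
batch_a = [[[1, 2]], [[3, 4]]]
batch_b = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
shared_bias = [10, 20]  # one bias of length N = 2, shared by both batches

result = batch_matmul_shared_bias(batch_a, batch_b, shared_bias)
print(result)  # [[[11, 22]], [[16, 28]]]
```

Because the bias has no batch axis, only one copy needs to be stored and supplied, regardless of how many batches are computed.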
