Prerequisites

CANN, ATB Models, and msModelSlim have been installed in the environment. For details, see the MindIE Installation Guide.

The following installation path is used as an example:

Install ATB Models and initialize its environment variables. The set_env.sh script in the model repository sets the ${ATB_SPEED_HOME_PATH} environment variable, so sourcing that script is enough to initialize it.
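As a quick sanity check after sourcing set_env.sh, the sketch below confirms that ${ATB_SPEED_HOME_PATH} is set and points to an existing directory. The helper name check_atb_env is hypothetical and is not part of ATB Models. The path to the model repository is left as a placeholder because it depends on your installation.

```shell
# Hypothetical helper (not shipped with ATB Models): verify that sourcing
# set_env.sh actually exported ATB_SPEED_HOME_PATH.
check_atb_env() {
    if [ -z "${ATB_SPEED_HOME_PATH:-}" ]; then
        echo "ATB_SPEED_HOME_PATH is not set; source set_env.sh first" >&2
        return 1
    fi
    if [ ! -d "${ATB_SPEED_HOME_PATH}" ]; then
        echo "ATB_SPEED_HOME_PATH is not a directory: ${ATB_SPEED_HOME_PATH}" >&2
        return 1
    fi
    echo "ATB_SPEED_HOME_PATH=${ATB_SPEED_HOME_PATH}"
}

# Typical use (the repository path is a placeholder):
#   source <model-repository>/set_env.sh
#   check_atb_env
```

The function returns a non-zero status on failure, so it can gate later steps, for example: check_atb_env && bash generate_quant_weight.sh.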

The common capabilities of ATB Models support the following quantization modes:

W8A8
W4A8 hybrid quantization
W8A16
W8A8SC sparse quantization
W16A16SC sparse quantization
KV cache INT8
FA3 quantization
Anti-Outlier processing
Attention quantization
PDMIX quantization

Each model supports different quantization modes. For details, see the feature support matrix in the README file of the model in the ${ATB_SPEED_HOME_PATH}/examples/models/ directory.

The quantization feature supports rollback (unquantization): selected weights in the model are left unquantized, and the original floating-point weights are used for MatMul computation. Rollback is supported at the granularity of individual linear layers. Rollback can improve the accuracy of the quantized model. The rollback configuration varies with the model and the quantization mode. For details, see the configuration in each model's quantized weight generation script.

Taking LLaMA as an example, the rollback layers for each quantization mode are defined in ${ATB_SPEED_HOME_PATH}/examples/models/llama/generate_quant_weight.sh. In the W8A16 quantization scenario, no rollback layers are set explicitly (lmhead is rolled back by default). In all other quantization scenarios, every down_proj layer is rolled back. The layer names are generated as follows:

get_down_proj_disable_name() {
    local num_layer=$1
    local disable_names=""
    for ((i=0; i<$num_layer; i++)); do
        disable_names="$disable_names model.layers.$i.mlp.down_proj"
    done
    echo "$disable_names"
}
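The following sketch shows what the function returns and one way its output could be consumed. It reproduces the function so that the snippet runs on its own. The layer count of 3 is an illustrative value only, and the array handling is an assumption for demonstration; it does not show how generate_quant_weight.sh itself passes the names on to the quantization tool.

```shell
# Requires bash (C-style for loop, arrays, here-strings).
# Function reproduced from the excerpt above.
get_down_proj_disable_name() {
    local num_layer=$1
    local disable_names=""
    for ((i=0; i<$num_layer; i++)); do
        disable_names="$disable_names model.layers.$i.mlp.down_proj"
    done
    echo "$disable_names"
}

# Illustrative value only; use the real layer count of your model.
num_layers=3

# read -a splits on whitespace, which also drops the leading space
# produced by the string concatenation inside the function.
read -r -a layers <<< "$(get_down_proj_disable_name "$num_layers")"

echo "rolled-back layers: ${#layers[@]}"
printf '%s\n' "${layers[@]}"
```

Note that the raw return value is a single space-separated string that begins with a space, so callers that compare it as a string rather than splitting it into words need to account for that.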
