Model Quantization

Model quantization is a model compression technique that reduces a model's storage and computation requirements by lowering the numerical precision of its weights and activation values. Quantization tools typically convert high-bit floating-point numbers into low-bit fixed-point numbers, directly reducing the size of the model weights.

The model quantization tool takes as input a model and data that run correctly, and outputs usable quantized weights and quantization factors.
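As a concrete illustration of the idea, the following Python sketch quantizes a weight matrix to INT8 with one scale (quantization factor) per output channel and restores it at 16-bit precision, mirroring the W8A16 scheme used later in this section. It is a minimal sketch of the general technique, not the msModelSlim implementation; all names in it are illustrative.

    # Minimal per-channel weight quantization sketch (illustrative only,
    # not the msModelSlim implementation).
    import numpy as np

    def quantize_per_channel(w):
        # Symmetric INT8 quantization: one scale (quantization factor) per row,
        # chosen so that the largest |value| in the row maps to 127.
        scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
        scale = np.maximum(scale, 1e-8)           # avoid division by zero
        q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
        return q, scale

    def dequantize_w8a16(q, scale):
        # W8A16 inference keeps activations in 16-bit float; the INT8 weights
        # are restored with the stored scales before (or fused into) the matmul.
        return q.astype(np.float16) * scale.astype(np.float16)

    w = np.random.randn(4, 8).astype(np.float32)  # toy weight matrix
    q, scale = quantize_per_channel(w)
    w_restored = dequantize_w8a16(q, scale)
    print("max abs quantization error:", np.abs(w - w_restored).max())

Storing the INT8 values (1 byte each) plus a small scale vector in place of the original 16-bit weights is what roughly halves the weight storage.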

Prerequisites

  • The development environment has been deployed. You can deploy it using a Docker image by referring to "Installing MindIE > Containerized Deployment and Image Creation > Pulling Images" in the MindIE Installation Guide.
  • The msit tool package has been installed. Install msit by referring to the msit installation guide; the source code installation mode is recommended.
  • The msModelSlim software has been installed. Download the msModelSlim software package from the msModelSlim website.

Procedure

  1. Click Download to download the Llama-3.1-8B-Instruct weight and model files to the local PC.
    Figure 1 Downloading the files to a local PC
  2. Run the following command to go to the Llama directory:
    cd ${HOME}/msit/msmodelslim/example/Llama

    ${HOME} indicates the custom installation path of msit.

  3. Run the quantization script to generate the quantized weight files and save them to a custom storage path. The following example command performs W8A16 quantization.
    python3 quant_llama.py --model_path ${model_path} --save_directory ${save_directory} --device_type npu --w_bit 8 --a_bit 16 

    In the preceding command, model_path indicates the path where the downloaded model files are saved, and save_directory indicates the path where the generated quantized weight files are saved. For other model quantization cases, refer to the LLAMA Quantization Cases.

  4. After the quantization is complete, the result is shown in Figure 2. The .safetensors file size is reduced from 15.1 GB to 8.5 GB, roughly consistent with the weight tensors dropping from 16-bit to 8-bit storage while the quantization scales and any unquantized tensors remain at higher precision.
    Figure 2 Quantized result
  5. The generated W8A16 quantization weight files are as follows; a short inspection sketch appears after the listing.
    ├── config.json   # Configuration file.
    ├── generation_config.json   # Configuration file.
    ├── quant_model_description_w8a16.json   # Weight description file after w8a16 quantization.
    ├── quant_model_weight_w8a16.safetensors # Weight file after w8a16 quantization.
    ├── tokenizer.json   # Tokenizer of the model file.
    ├── tokenizer_config.json   # Tokenizer configuration file of the model file.
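
To sanity-check the output, you can list the tensors and their data types in the generated files. The sketch below assumes the file names shown above and a placeholder save_directory; the exact schema of the description JSON is not documented here, so treat this as an illustration rather than a guaranteed interface.

    # Hypothetical inspection of the quantized output; the path is a placeholder.
    import json
    from safetensors import safe_open

    save_directory = "/path/to/save_directory"    # your --save_directory value

    # The description file records per-tensor quantization information
    # (exact schema may differ between msModelSlim versions).
    with open(f"{save_directory}/quant_model_description_w8a16.json") as f:
        print(list(json.load(f).items())[:5])

    # Weight tensors should be stored as int8; quantization scales are kept
    # at higher precision.
    with safe_open(f"{save_directory}/quant_model_weight_w8a16.safetensors",
                   framework="np") as f:
        for name in list(f.keys())[:5]:
            t = f.get_tensor(name)
            print(name, t.dtype, t.shape)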