Model Quantization
Model quantization is a model compression technique that reduces a model's storage and computation requirements by lowering the numerical precision of its weights and activations. Quantization tools typically convert high-bit floating-point numbers into low-bit fixed-point numbers, directly shrinking the size of the model weights.
The input to the model quantization tool is a model that runs correctly, together with calibration data; the output is a set of usable quantized weights and their quantization factors (scales).
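As a minimal sketch of the underlying idea, the example below performs symmetric per-tensor INT8 quantization of a weight matrix and then dequantizes it. This is for intuition only and is not the algorithm msModelSlim uses; production tools add refinements such as per-channel scales and calibration-based clipping.
import numpy as np

def quantize_int8(w):
    # Symmetric per-tensor quantization: w ~= q * scale, with q stored as INT8.
    scale = np.abs(w).max() / 127.0               # the quantization factor
    q = np.clip(np.round(w / scale), -128, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale

w = np.random.randn(1024, 1024).astype(np.float32)  # stand-in for a weight tensor
q, scale = quantize_int8(w)
print("reconstruction error:", np.abs(w - dequantize(q, scale)).max())
print("bytes per value:", w.itemsize, "->", q.itemsize)  # 4 -> 1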
Prerequisites
- The development environment has been deployed. You can deploy the development environment from a Docker image by referring to "Installing MindIE > Containerized Deployment and Image Creation > Pulling Images" in the MindIE Installation Guide.
- The msit tool package has been installed. Install msit by referring to the msit installation guide; source code installation is recommended.
- The msModelSlim software has been installed. Download the msModelSlim software package from the msModelSlim website.
Procedure
- Click Download to download the Llama-3.1-8B-Instruct weight and model files to the local PC. If you prefer to fetch the weights from the command line, see the sketch after this step.
Figure 1 Downloading the files to a local PC
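The following sketch downloads the weights programmatically with the huggingface_hub library. It is not part of the official procedure; the repo id, target directory, and token are assumptions, and the Llama 3.1 repository on Hugging Face is gated, so an approved access token is required.
from huggingface_hub import snapshot_download

# Hypothetical target directory; replace the token with your own.
snapshot_download(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",
    local_dir="./Llama-3.1-8B-Instruct",
    token="hf_xxx",  # gated repository: an approved access token is required
)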
- Run the following command to go to the Llama directory:
cd ${HOME}/msit/msmodelslim/example/Llama
${HOME} indicates the custom path where msit is installed.
- Run the quantization script to generate the quantized weight files and save them to a custom path. The example below is a W8A16 quantization command (8-bit weights, 16-bit activations).
python3 quant_llama.py --model_path ${model_path} --save_directory ${save_directory} --device_type npu --w_bit 8 --a_bit 16
In the preceding command, model_path indicates the path of the downloaded model files, and save_directory indicates the path where the generated quantized weight files are saved. For other model quantization cases, refer to the Llama Quantization Cases.
- After the quantization is complete, the result is shown in Figure 2. The .safetensors file size is reduced from 15.1 GB to 8.5 GB, roughly the 2x saving expected when FP16 weights are stored as INT8, with the remainder accounted for by quantization factors and any layers left unquantized.
- The generated W8A16 quantized weight directory contains the following files.
├── config.json                             # Configuration file.
├── generation_config.json                  # Configuration file.
├── quant_model_description_w8a16.json      # Weight description file after W8A16 quantization.
├── quant_model_weight_w8a16.safetensors    # Weight file after W8A16 quantization.
├── tokenizer.json                          # Tokenizer of the model file.
├── tokenizer_config.json                   # Tokenizer configuration file of the model file.
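As a quick sanity check, the sketch below inspects the generated files. It is an assumption rather than part of the official tooling: the save_directory value is hypothetical, and the exact tensor and field names depend on the msModelSlim version.
import json
from safetensors import safe_open

save_directory = "./llama3.1-8b-w8a16"  # hypothetical --save_directory value

# The description file records how each weight tensor was quantized.
with open(f"{save_directory}/quant_model_description_w8a16.json") as f:
    desc = json.load(f)
for name, info in list(desc.items())[:5]:   # show the first few entries
    print(name, "->", info)

# The safetensors file holds the INT8 weights plus their quantization factors.
with safe_open(f"{save_directory}/quant_model_weight_w8a16.safetensors",
               framework="np") as st:
    for name in list(st.keys())[:5]:
        t = st.get_tensor(name)
        print(name, t.dtype, t.shape)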
