Model Quantization

  1. Click Download to download the Llama-3.1-8B-Instruct weight and model files to the local PC.
    Figure 1 Downloading the files to a local PC
  2. Run the following command to go to the Llama directory:
    cd ${HOME}/msmodelslim/example/Llama

    ${HOME} is the custom path where msit is installed.

  3. Run the quantization script to generate the quantized weight files and save them to a custom storage path. The following example command performs W8A16 quantization (8-bit weights, 16-bit activations).
    python3 quant_llama.py --model_path ${model_path} --save_directory ${save_directory} --device_type npu --w_bit 8 --a_bit 16 

    In the preceding command, model_path is the path where the downloaded model files are stored, and save_directory is the path where the generated quantized weight files are saved. For other quantization configurations, see LLAMA Quantization Cases.

    If the quantized weight files will be deployed on MindIE 2.1.RC1 or earlier, add the --mindie_format parameter to the quantization command:

    python3 quant_llama.py --model_path ${model_path} --save_directory ${save_directory} --device_type npu --w_bit 8 --a_bit 16 --mindie_format
  4. After quantization completes, the result is displayed as shown in Figure 2. The .safetensors file size is reduced from 15.1 GB to 8.5 GB.
    Figure 2 Quantized result
  5. The generated W8A16 quantized weight files are as follows:
    ├── config.json                          # Configuration file
    ├── generation_config.json               # Configuration file
    ├── quant_model_description.json         # Weight description file after W8A16 quantization
    ├── quant_model_weight_w8a16.safetensors # Weight file after W8A16 quantization
    ├── tokenizer.json                       # Tokenizer of the model file
    └── tokenizer_config.json                # Tokenizer configuration file of the model file
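
The procedure above can be scripted. The following is a minimal sketch, not part of msmodelslim: the helper names build_quant_cmd and check_quant_output are hypothetical, and the sketch assumes paths without spaces. The first helper only prints the step 3 command (with --mindie_format appended on request) so it can be reviewed before it is run from the Llama example directory. The second checks that the six files listed in step 5 exist in the save directory.

```shell
#!/usr/bin/env bash
# Sketch only: helper names are hypothetical and not provided by msmodelslim.

# Print the W8A16 quantization command from step 3.
# $1 = model_path, $2 = save_directory,
# $3 = "yes" to append --mindie_format (MindIE 2.1.RC1 or earlier).
build_quant_cmd() {
    local model_path="$1" save_directory="$2" mindie_format="${3:-no}"
    local cmd="python3 quant_llama.py --model_path ${model_path} --save_directory ${save_directory} --device_type npu --w_bit 8 --a_bit 16"
    if [ "${mindie_format}" = "yes" ]; then
        cmd="${cmd} --mindie_format"
    fi
    printf '%s\n' "${cmd}"
}

# Check that every file listed in step 5 exists in the given directory.
# Prints each missing file name; returns 0 if all are present, 1 otherwise.
check_quant_output() {
    local dir="$1" missing=0 f
    for f in config.json generation_config.json quant_model_description.json \
             quant_model_weight_w8a16.safetensors tokenizer.json tokenizer_config.json; do
        if [ ! -f "${dir}/${f}" ]; then
            echo "missing: ${f}"
            missing=1
        fi
    done
    return "${missing}"
}
```

For example, build_quant_cmd "${model_path}" "${save_directory}" yes prints the command with --mindie_format appended, and check_quant_output "${save_directory}" can be called after quantization finishes to confirm that the expected files were written.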