Knowledge Base Building

Procedure

  1. Compile the retrieval operator to implement the retrieval function.
    cd $MX_INDEX_INSTALL_PATH/tools/ && \
    python3 aicpu_generate_model.py -t <chip_type> && \
    python3 flat_generate_model.py -d <dim> -t <chip_type> && \
    cp op_models/* $MX_INDEX_MODELPATH
    • The MX_INDEX_INSTALL_PATH and MX_INDEX_MODELPATH variables are already set in ~/.bashrc, so you do not need to set them again. To check their values, see ~/.bashrc.
    • -d <dim> specifies the dimension of the vectors output by the embedding model. The acge_text_embedding model outputs 1024-dimensional vectors, so set this parameter to -d 1024.
    • -t <chip_type> specifies the processor type. How to obtain the value depends on the product:
        • Atlas 300I Duo inference card: Run the npu-smi info command on the server where the Ascend AI Processor is installed, and then delete the last digit of the Name value. The result is the value of <chip_type>.
        • Atlas 800I A2 inference server: Run the npu-smi info command on the server where the Ascend AI Processor is installed. The Name value is the value of <chip_type>.
        • Atlas 800I A3 SuperPoD server: Run the npu-smi info -t board -i 0 -c 0 command to obtain the NPU Name value. The value of <chip_type> is 910_<NPU Name>.
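The three rules for deriving <chip_type> can be sketched as a small helper. This is a minimal illustration, not part of the product tooling: the function name derive_chip_type and the product labels used as keys are hypothetical, and the caller is assumed to have already read the Name (or NPU Name) value from the npu-smi output by hand.

```python
def derive_chip_type(product: str, name: str) -> str:
    """Turn the Name value reported by npu-smi into the -t <chip_type> argument.

    `product` is a hypothetical label chosen for this sketch; `name` is the
    Name (or NPU Name) value copied from the npu-smi output.
    """
    name = name.strip()
    if not name:
        raise ValueError("Name value is empty")

    if product == "Atlas 300I Duo":
        # Rule 1: delete the last digit of Name.
        if not name[-1].isdigit():
            raise ValueError(f"expected Name to end with a digit: {name!r}")
        return name[:-1]
    if product == "Atlas 800I A2":
        # Rule 2: Name is used unchanged.
        return name
    if product == "Atlas 800I A3":
        # Rule 3: prefix the NPU Name with "910_".
        return f"910_{name}"
    raise ValueError(f"unknown product: {product!r}")
```

For example, derive_chip_type("Atlas 300I Duo", "310P3") returns "310P". The Name value here is an illustrative input, so check it against the actual npu-smi output on your server.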
  2. Create a domain-specific knowledge document.

    Create a UTF-8 encoded file named gaokao.txt in the /home/HwHiAiUser directory with the following content:

    Composition Test of the 2024 National College Entrance Examination
    New Course Standard (I)
    Read the following materials and write a composition. (60 points)
    With the popularization of the Internet and artificial intelligence, more and more questions can be answered quickly. So, will we have fewer and fewer questions?
    What are your thoughts on the above material? Write a composition of no fewer than 800 words.
    Requirements: Select a proper angle and style to describe your opinions. Prepare your own title. Do not copy other articles, and do not disclose personal information.

    The training data cutoff of the selected model is earlier than 2024, so the model itself has no knowledge of the composition test of the 2024 National College Entrance Examination.
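If you prefer to create the file from a script rather than an editor, the following minimal sketch writes the text with an explicit UTF-8 encoding and then decodes it again as a sanity check. The helper name write_knowledge_file is hypothetical, and the sample text is passed in by the caller, not hard-coded.

```python
from pathlib import Path


def write_knowledge_file(directory: str, text: str, name: str = "gaokao.txt") -> Path:
    """Write `text` to <directory>/<name> as UTF-8 and return the file path."""
    path = Path(directory) / name
    # Pass the encoding explicitly so the result does not depend on the locale.
    path.write_text(text, encoding="utf-8")
    # Decoding the raw bytes raises UnicodeDecodeError if the file is not valid UTF-8.
    path.read_bytes().decode("utf-8")
    return path
```

To use it, store the sample text from this step in a string and call write_knowledge_file("/home/HwHiAiUser", text).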

  3. Build a domain-specific knowledge base.

    Following the Demo, run rag_demo_knowledge.py. Modify the default parameters in the code, such as the file path and model path, as required. For details, see the README.md file.

    python3 rag_demo_knowledge.py --file_path "/path/to/gaokao.txt"
  4. Run the program and check the result.
    If the sample code prints the list of uploaded file names, as in the following output, the knowledge base has been built successfully.
    ['gaokao.txt']
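When the demo runs inside a larger script, this check can be automated. The sketch below assumes that the demo prints the file names as a single Python list literal, as in the sample output above. The helper name knowledge_base_built is hypothetical and is not part of the demo.

```python
import ast


def knowledge_base_built(printed_line: str, expected_file: str = "gaokao.txt") -> bool:
    """Return True if `printed_line` is a list literal that contains `expected_file`."""
    try:
        # literal_eval parses Python literals only and never executes code.
        names = ast.literal_eval(printed_line.strip())
    except (ValueError, SyntaxError):
        # The line is not a Python literal, for example an error message.
        return False
    return isinstance(names, list) and expected_file in names
```

For example, knowledge_base_built("['gaokao.txt']") returns True, while an empty list or an error message returns False.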