build_graph

Function

Creates a text index and generates a knowledge graph of the corresponding text.

Prototype

def build_graph(lang=Lang.EN, pad_token="", conceptualize=False, **kwargs)

Parameters

Parameter

Data Type

Required/Optional

Description

lang

Lang

Optional

Corpus language. The default value is Lang.EN (English corpus).

pad_token

String

Optional

Padding token of the LLM. The default value is an empty string. The string length ranges from 0 to 255 characters.

conceptualize

Bool

Optional

Whether to conceptualize nodes. The default value is False.

kwargs

Dict

Optional

Extended parameters:

  • max_workers: number of threads for building a knowledge graph. The default value is 5.
  • batch_size: batch size for node vectorization, retrieval, and other operations. The default value is 32.
  • top_k: number of most similar concepts returned by vector retrieval during clustering of graph concepts. The value ranges from 1 to 100 and defaults to 5.
  • threshold: vector similarity threshold. Results whose similarity is below this value are filtered out. The value ranges from 0.0 to 1.0 and defaults to 0.3.
  • triple_instructions: instructions that guide the LLM in extracting relationships from documents. The value is a dictionary and defaults to None, in which case the built-in instructions for the corpus language are used (TRIPLE_INSTRUCTIONS_CN for Chinese and TRIPLE_INSTRUCTIONS_EN for English). To override the defaults, provide a dictionary that contains the following keys, each of which holds the instruction for one extraction task:
    • entity_relation: instruction for entity relationship extraction. The value is a string of 1 to 1,048,576 characters.
    • event_entity: instruction for event entity extraction. The value is a string of 1 to 1,048,576 characters.
    • event_relation: instruction for event relationship extraction. The value is a string of 1 to 1,048,576 characters.

  • conceptualizer_prompts: prompts that guide the LLM in conceptualization. The value is a dictionary and defaults to None, in which case the built-in prompts for the corpus language are used. To override the defaults, provide a dictionary that contains the following keys:
    • entity: prompt for conceptualizing entities in the graph. The value is a string of 1 to 1,048,576 characters. If conceptualizer_prompts is None, ENTITY_PROMPT_CN is used for Chinese and ENTITY_PROMPT_EN for English.
    • event: prompt for conceptualizing events in the graph. The value is a string of 1 to 1,048,576 characters. If conceptualizer_prompts is None, EVENT_PROMPT_CN is used for Chinese and EVENT_PROMPT_EN for English.
    • relation: prompt for conceptualizing relationships in the graph. The value is a string of 1 to 1,048,576 characters. If conceptualizer_prompts is None, RELATION_PROMPT_CN is used for Chinese and RELATION_PROMPT_EN for English.
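
The documented ranges for the extended parameters can be checked before build_graph is called. The following is a minimal sketch, not part of the library: check_build_graph_kwargs and _check_prompt_dict are hypothetical helper names, the instruction strings are made-up examples, and the checks only mirror the ranges listed above (the library's own validation may differ).

```python
MAX_PROMPT_LEN = 1_048_576  # documented upper bound for each instruction or prompt string

# Keys that an override dictionary must contain, as listed in the parameter description.
TRIPLE_KEYS = {"entity_relation", "event_entity", "event_relation"}
CONCEPT_KEYS = {"entity", "event", "relation"}


def _check_prompt_dict(name, value, required_keys):
    """Hypothetical helper: validate an override dictionary (None keeps the defaults)."""
    if value is None:
        return
    if not isinstance(value, dict):
        raise TypeError(f"{name} must be a dict or None")
    missing = required_keys - set(value)
    if missing:
        raise ValueError(f"{name} is missing keys: {sorted(missing)}")
    for key in sorted(required_keys):
        text = value[key]
        if not isinstance(text, str) or not 1 <= len(text) <= MAX_PROMPT_LEN:
            raise ValueError(
                f"{name}[{key!r}] must be a string of 1 to {MAX_PROMPT_LEN} characters"
            )


def check_build_graph_kwargs(**kwargs):
    """Hypothetical helper: check extended parameters against the documented ranges."""
    top_k = kwargs.get("top_k", 5)
    if isinstance(top_k, bool) or not isinstance(top_k, int) or not 1 <= top_k <= 100:
        raise ValueError("top_k must be an integer from 1 to 100")

    threshold = kwargs.get("threshold", 0.3)
    if isinstance(threshold, bool) or not isinstance(threshold, (int, float)):
        raise ValueError("threshold must be a number from 0.0 to 1.0")
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be a number from 0.0 to 1.0")

    _check_prompt_dict("triple_instructions", kwargs.get("triple_instructions"), TRIPLE_KEYS)
    _check_prompt_dict("conceptualizer_prompts", kwargs.get("conceptualizer_prompts"), CONCEPT_KEYS)
    return kwargs


# Example: override the extraction instructions and tighten the similarity filter.
extended = check_build_graph_kwargs(
    max_workers=5,
    batch_size=32,
    top_k=10,
    threshold=0.5,
    triple_instructions={
        "entity_relation": "Extract (head, relation, tail) triples between entities.",
        "event_entity": "List the entities that take part in each event.",
        "event_relation": "Extract relationships between events.",
    },
)

# `pipeline` stands for whatever object exposes build_graph in your setup (assumption):
# pipeline.build_graph(conceptualize=True, **extended)
```

Checking the values up front reports an out-of-range setting before any LLM or vectorization work starts.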

Return Value

None.

After this method is executed, the intermediate files listed in Table 1 are generated in work_dir.

Table 1

File

Description

"{graph_name}.json"

Stores the graph. If graph_type is set to networkx, the graph is loaded from this file during retrieval.

"{graph_name}_relations.json"

Stores entity relationship information.

"{graph_name}_concepts.json"

Stores concept information.

"{graph_name}_synset.json"

Stores the categories produced by concept clustering.

"{graph_name}_node_vectors.index"

Stores vector indexes of entities.

"{graph_name}_concept_vectors.index"

Stores vector indexes of concepts.
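
To confirm that a build produced its output, the file names in Table 1 can be checked against work_dir. The following is a minimal sketch: expected_outputs and missing_outputs are hypothetical helper names, and the sketch assumes that all six files are expected after every build. This page does not state whether the concept-related files are written only when conceptualize is True, so adjust the list to your configuration.

```python
from pathlib import Path

# File name patterns from Table 1; {graph_name} is filled in per graph.
OUTPUT_PATTERNS = (
    "{graph_name}.json",
    "{graph_name}_relations.json",
    "{graph_name}_concepts.json",
    "{graph_name}_synset.json",
    "{graph_name}_node_vectors.index",
    "{graph_name}_concept_vectors.index",
)


def expected_outputs(work_dir, graph_name):
    """Return the paths that Table 1 lists for the given graph name."""
    return [Path(work_dir) / p.format(graph_name=graph_name) for p in OUTPUT_PATTERNS]


def missing_outputs(work_dir, graph_name):
    """Return the expected paths that do not exist as regular files."""
    return [p for p in expected_outputs(work_dir, graph_name) if not p.is_file()]
```

For example, missing_outputs("/path/to/work_dir", "my_graph") returns an empty list when every file is present. Both the directory and the graph name here are placeholders.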