evaluate

Function

Evaluates a given evaluation dataset that is supplied in dictionary format. To display the logs printed by RAGAS, set the environment variable DISABLE_RAGAS_LOGGING to 0.
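A minimal sketch of enabling the RAGAS logs from Python. It assumes the variable is read when the evaluation library is loaded or called, so it is set before anything else runs; that ordering is an assumption, not something this page states.

```python
import os

# Assumption: the variable must be in place before the evaluation
# library is imported or called, so set it at the top of the script.
os.environ["DISABLE_RAGAS_LOGGING"] = "0"

# Read the value back to confirm what the process will see.
ragas_logging_flag = os.environ["DISABLE_RAGAS_LOGGING"]
```

The same effect can be had by exporting the variable in the shell that launches the script.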

Prototype

def evaluate(metrics, dataset, language, prompts_path, show_progress)

Parameters

Parameter

Data Type

Required/Optional

Description

metrics

list[str]

Required

Set of evaluation metrics. For details, see Table 1.

The set must contain 1 to 14 metrics. Each metric name must be 1 to 50 characters long, and metric names must be unique. If the answer_similarity metric is used, its score is returned under the key semantic_similarity.

dataset

Dict[str, Any]

Required

User evaluation dataset. The dictionary must contain 1 to 4 keys. The dictionary format is as follows:

  • user_input: format: List[str]; list length range: [0, 128]; length of each string: [1, 1000000] characters.
  • response: format: List[str]; list length range: [0, 128]; length of each string: [1, 1000000] characters.
  • retrieved_contexts: format: List[List[str]]; length range of the outer list: [1, 128]; length range of each inner list: [0, 128]; length of each string: [1, 1000000] characters.
  • reference: format: List[str]; list length range: [0, 128]; length of each string: [1, 1000000] characters.

The list lengths of user_input, response, and reference must be the same as the length of the outer list of retrieved_contexts.

language

String

Optional

Language used for evaluation. If this parameter is specified, the evaluation is performed in the specified language.

The default value is None. If this parameter is not set, the default prompt provided by RAGAS is used.

The value can be chinese or english.

prompts_path

String

Optional

Path to the localized prompts. If this parameter is specified, the system searches the prompt_dir directory for the prompt file that matches the specified language. If the prompt file is found, the evaluation process can be accelerated. Each file in the directory cannot exceed 4 MB, the directory depth cannot exceed 64 levels, and the total number of files cannot exceed 512.

The default value is None.

The string length must be within [1, 255].

show_progress

Bool

Optional

Whether to display the progress bar during evaluation. By default, the progress bar is not displayed.
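The limits in the parameter table can be checked before calling evaluate. The following is a sketch only: check_metrics and check_dataset are hypothetical helpers, not part of the documented API, and the sample question and answers are made up for illustration. It treats the dataset "length range of [1, 4]" as the number of dictionary keys, which is an assumption.

```python
from typing import Any, Dict, List

MAX_METRICS = 14
MAX_METRIC_NAME_LEN = 50
MAX_LIST_LEN = 128
MAX_TEXT_LEN = 1_000_000


def check_metrics(metrics: List[str]) -> None:
    """Raise ValueError if the metric list violates the documented limits."""
    if not 1 <= len(metrics) <= MAX_METRICS:
        raise ValueError("metrics must contain 1 to 14 entries")
    if len(set(metrics)) != len(metrics):
        raise ValueError("metrics must be unique")
    for name in metrics:
        if not 1 <= len(name) <= MAX_METRIC_NAME_LEN:
            raise ValueError(f"metric name length out of range: {name!r}")


def check_dataset(dataset: Dict[str, Any]) -> None:
    """Raise ValueError if the dataset violates the documented shape limits."""
    # Assumption: the documented dataset length is the number of keys.
    if not 1 <= len(dataset) <= 4:
        raise ValueError("dataset must contain 1 to 4 keys")
    # All columns must have the same number of samples.
    lengths = {key: len(column) for key, column in dataset.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"list lengths differ: {lengths}")
    for key, column in dataset.items():
        if len(column) > MAX_LIST_LEN:
            raise ValueError(f"{key} has more than {MAX_LIST_LEN} samples")
        for item in column:
            # retrieved_contexts holds a list of strings per sample;
            # the other columns hold a single string per sample.
            texts = item if key == "retrieved_contexts" else [item]
            if len(texts) > MAX_LIST_LEN:
                raise ValueError(f"{key} has an inner list that is too long")
            for text in texts:
                if not 1 <= len(text) <= MAX_TEXT_LEN:
                    raise ValueError(f"{key} has a string of invalid length")


# Illustrative inputs; the metric names are the ones used as examples on this page.
metrics = ["answer_correctness", "context_precision"]
dataset = {
    "user_input": ["What is the capital of France?"],
    "response": ["Paris is the capital of France."],
    "retrieved_contexts": [["Paris is the capital and largest city of France."]],
    "reference": ["Paris"],
}

check_metrics(metrics)
check_dataset(dataset)
```

Validating up front gives a specific error message for each violated limit, which can be easier to act on than a failed evaluation.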

Return Value

Data Type

Description

Optional[Dict[str, List[float]]]

A dictionary is returned:

  • Keys: metric name (string), for example, answer_correctness and context_precision.
  • Values: a list of floating-point numbers for each key. Each element in the list is the score of that metric for one sample in the dataset.
  • None is returned if an exception occurs during the evaluation.
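Because the return value may be None, callers need to check it before reading scores. The sketch below averages each metric over the samples; summarize_scores is a hypothetical helper and example_scores is made-up data shaped like the documented return value, not real evaluation output.

```python
from statistics import mean
from typing import Dict, List, Optional


def summarize_scores(scores: Optional[Dict[str, List[float]]]) -> Dict[str, float]:
    """Return the mean score per metric, or an empty dict if evaluation failed."""
    if scores is None:
        # evaluate returned None, meaning an exception occurred.
        return {}
    # Skip metrics with no per-sample scores to avoid averaging an empty list.
    return {metric: mean(values) for metric, values in scores.items() if values}


# Hypothetical result with two samples per metric.
example_scores = {
    "answer_correctness": [0.5, 1.0],
    "context_precision": [1.0, 1.0],
}
summary = summarize_scores(example_scores)
```

Returning an empty dictionary on failure is one design choice; raising an error instead may suit pipelines that should stop when evaluation fails.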