Overview

RAGAS is widely used for RAG evaluation and offers a comprehensive, evolving suite of metrics. Adapting RAGAS for Ascend makes the module more flexible and easier to use, and improves its performance and compatibility on Chinese-language tasks. The figure below illustrates the evaluation process.

Table 1 describes the evaluation metrics. For more information, visit the official RAGAS website.

Table 1 Evaluation metrics

Metric

Description

Required Parameter

Evaluation Score Description

answer_correctness

Measures the correctness of the model answer against the reference answer by combining factual accuracy and semantic similarity.

[user_input, response, reference]

The value range is [0, 1]. A larger value indicates higher correctness.

This metric focuses on two aspects:

  • Factual accuracy: whether the answer is consistent with the facts
  • Semantic similarity: whether the answer is semantically similar to the reference answer

An answer is deemed correct only if it is factually accurate, verifiable, and aligned with the provided reference answer.

A higher score indicates that the answer output by the model is both factually accurate and semantically similar to the reference answer.

answer_relevancy

Measures how relevant the answer is to the given question. Answers that are incomplete or that contain redundant information receive lower scores.

[user_input, response]

The score ranges from 0 to 1, with 1 being the best.

  • This metric focuses on whether the answer is closely related to the question, and penalizes irrelevant, redundant, or missing information.
  • Higher scores signify a concise and comprehensive answer.
  • Lower scores reflect tangential, repetitive, or off-topic content.

answer_similarity

Measures the semantic similarity between the generated answer and the reference answer, quantified using a cross-encoder score.

[reference, response]

  • This metric evaluates the semantic similarity between the reference answer and the generated answer, without requiring exact word-for-word consistency.
  • A cross-encoder is a deep learning model that scores the semantic similarity between two pieces of text.
  • The score ranges from 0 to 1, where a higher value signifies greater semantic similarity.

context_precision

Measures the proportion of relevant chunks in the retrieved context by averaging precision@k over the retrieved chunks. Here, precision@k is the ratio of relevant chunks to all retrieved chunks within the first k results.

[user_input, reference, retrieved_contexts]

The value range is [0, 1]. A larger value indicates higher relevance.

context_recall

Measures how many of the relevant documents (or information fragments) the system successfully retrieves, with a focus on not missing important content. A higher recall indicates that fewer relevant documents are missing.

[user_input, reference, retrieved_contexts]

The value range is [0, 1]. A larger value indicates higher recall.

context_entity_recall

Measures recall based on the entities in the reference answer and the retrieved context. Specifically, it calculates the proportion of entities in the reference answer that are also present in the retrieved context.

[reference, retrieved_contexts]

The value range is [0, 1]. A larger value indicates higher relevance.

context_utilization

Measures the relevance between the answer and the retrieved contexts.

[user_input, response, retrieved_contexts]

The value range is [0, 1]. A larger value indicates higher relevance.

faithfulness

Measures the factual consistency between the system response and the retrieved context.

[user_input, response, retrieved_contexts]

The score range is [0, 1]. A higher score indicates better consistency.

noise_sensitivity

Measures how often the system gives incorrect answers when it uses relevant or irrelevant retrieved documents.

[user_input, reference, response, retrieved_contexts]

The score range is [0, 1]. A lower score indicates better system performance.

nv_accuracy

Measures the consistency between the model's answer and the reference answer. Specifically, each of two independent "LLM judges" assigns the model's answer a score of 0, 2, or 4. These raw scores are then normalized to the [0, 1] interval and averaged to produce the final score.

[user_input, response, reference]

The score range is [0, 1]. A higher score indicates that the model's answer is closer to the reference answer.

nv_context_relevance

Evaluates whether the retrieved context (snippet or paragraph) is relevant to the user input (question). Specifically, each of two independent "LLM judges" assigns a relevance score of 0, 1, or 2. These raw scores are then normalized to the [0, 1] interval and averaged to produce the final score.

[user_input, retrieved_contexts]

The score range is [0, 1]. A higher score indicates stronger relevance between the retrieved context and the user question.

nv_response_groundedness

Measures the extent to which the model's answer is grounded in the retrieved context. It evaluates whether each assertion or piece of information in the answer is fully or partially supported by the retrieved context.

[response, retrieved_contexts]

The score range is [0, 1]. A higher score indicates that the answer is more robustly supported by the retrieved context.
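
The Required Parameter column of Table 1 can be turned into a simple pre-check that runs before an evaluation. The following is a minimal sketch, not part of the RAGAS API: the names REQUIRED_FIELDS and missing_fields and the sample record are hypothetical, and only the field lists are taken from Table 1.

```python
# Required fields per metric, copied from the "Required Parameter" column of Table 1.
REQUIRED_FIELDS = {
    "answer_correctness": ["user_input", "response", "reference"],
    "answer_relevancy": ["user_input", "response"],
    "answer_similarity": ["reference", "response"],
    "context_precision": ["user_input", "reference", "retrieved_contexts"],
    "context_recall": ["user_input", "reference", "retrieved_contexts"],
    "context_entity_recall": ["reference", "retrieved_contexts"],
    "context_utilization": ["user_input", "response", "retrieved_contexts"],
    "faithfulness": ["user_input", "response", "retrieved_contexts"],
    "noise_sensitivity": ["user_input", "reference", "response", "retrieved_contexts"],
    "nv_accuracy": ["user_input", "response", "reference"],
    "nv_context_relevance": ["user_input", "retrieved_contexts"],
    "nv_response_groundedness": ["response", "retrieved_contexts"],
}


def missing_fields(metric, sample):
    """Return the required fields that are absent or empty in `sample`."""
    return [field for field in REQUIRED_FIELDS[metric] if not sample.get(field)]


# Hypothetical sample record; the values are illustrative only.
sample = {
    "user_input": "Where is the Eiffel Tower located?",
    "response": "The Eiffel Tower is located in Paris.",
    "retrieved_contexts": ["The Eiffel Tower is a landmark in Paris, France."],
}

print(missing_fields("faithfulness", sample))        # []
print(missing_fields("answer_correctness", sample))  # ['reference']
```

A record without a reference answer can therefore be scored with faithfulness, but not with answer_correctness.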
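
The context_precision row describes an average of precision@k values. The sketch below shows one way to compute it, assuming each retrieved chunk already has a binary relevance label (1 for relevant, 0 for irrelevant) in retrieval order. Averaging only over the ranks that hold a relevant chunk is an assumption made for this example, and the function names are hypothetical.

```python
def precision_at_k(relevance, k):
    """Fraction of the first k retrieved chunks that are relevant."""
    return sum(relevance[:k]) / k


def context_precision(relevance):
    """Average precision@k over the ranks k that hold a relevant chunk.

    Assumption: ranks holding an irrelevant chunk do not contribute to the
    average. Returns 0.0 when no chunk is relevant.
    """
    hits = [
        precision_at_k(relevance, k)
        for k, is_relevant in enumerate(relevance, start=1)
        if is_relevant
    ]
    return sum(hits) / len(hits) if hits else 0.0


# Hypothetical labels: the 1st and 3rd retrieved chunks are relevant.
relevance = [1, 0, 1]
print(round(context_precision(relevance), 3))  # 0.833
```

Here precision@1 is 1/1 and precision@3 is 2/3, so the average is about 0.833. Ranking the two relevant chunks first would raise the score to 1.0.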
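
The context_entity_recall row reduces to a set ratio once the entities have been extracted. The following sketch assumes that entity extraction has already been done elsewhere. The function name and the entity sets are hypothetical, and returning 0.0 for an empty reference set is a choice made for this example.

```python
def context_entity_recall(reference_entities, context_entities):
    """Share of reference-answer entities that also appear in the retrieved context."""
    reference = set(reference_entities)
    if not reference:
        return 0.0  # Assumption: no reference entities means nothing can be recalled.
    return len(reference & set(context_entities)) / len(reference)


# Hypothetical entity sets, assumed to come from an upstream extraction step.
reference_entities = {"Eiffel Tower", "Paris", "1889"}
context_entities = {"Eiffel Tower", "Paris", "France"}

print(round(context_entity_recall(reference_entities, context_entities), 3))  # 0.667
```

Two of the three reference entities appear in the retrieved context, so the score is about 0.667. Extra entities in the context ("France") do not affect the result.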
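
The nv_accuracy and nv_context_relevance rows both normalize the raw scores of two LLM judges to [0, 1] and then average them. The sketch below illustrates that arithmetic, assuming the normalization is a division by the maximum raw score (4 for nv_accuracy, 2 for nv_context_relevance). The function name is hypothetical, and the judge scores are made-up inputs.

```python
def normalized_judge_score(raw_scores, max_score):
    """Normalize each judge's raw score to [0, 1], then average the results.

    Assumption: normalization is a linear division by `max_score`.
    """
    if not raw_scores:
        raise ValueError("at least one judge score is required")
    for score in raw_scores:
        if not 0 <= score <= max_score:
            raise ValueError(f"score {score} is outside [0, {max_score}]")
    return sum(score / max_score for score in raw_scores) / len(raw_scores)


# nv_accuracy: the two judges assign 4 and 2 on the 0/2/4 scale.
print(normalized_judge_score([4, 2], max_score=4))  # 0.75

# nv_context_relevance: the two judges assign 2 and 1 on the 0/1/2 scale.
print(normalized_judge_score([2, 1], max_score=2))  # 0.75
```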