Overview
RAGAS is widely used for RAG evaluation and offers a comprehensive, evolving suite of metrics. Adapting RAGAS for Ascend improves the module's flexibility and usability and strengthens its performance on, and compatibility with, Chinese-language tasks. The figure below illustrates the evaluation process.

Table 1 describes the evaluation metrics. For more information, see the official RAGAS website.

| Metric | Description | Required Parameters | Score Description |
|---|---|---|---|
| answer_correctness | Measures the correctness of the model's answer against the reference answer by combining factual accuracy and semantic similarity. | [user_input, response, reference] | The value range is [0, 1]. A larger value indicates higher correctness. This metric focuses on two aspects: factual accuracy and semantic similarity. An answer is deemed correct only if it is factually accurate, verifiable, and aligned with the provided reference answer. A higher score indicates that the model's answer is both accurate and semantically similar to the reference answer. |
| answer_relevancy | Measures how relevant the answer is to the given question. Points are deducted for answers that are incomplete or that contain redundant information. | [user_input, response] | The score ranges from 0 to 1, with 1 being the best. |
| answer_similarity | Measures the semantic similarity between the generated answer and the reference answer, quantified with a cross-encoder score. | [reference, response] | |
| context_precision | Measures the proportion of relevant chunks in the retrieved context by averaging precision@k over the chunks. precision@k is the ratio of relevant chunks to all retrieved chunks among the first k retrieved results. | [user_input, reference, retrieved_contexts] | The value range is [0, 1]. A larger value indicates higher relevance. |
| context_recall | Measures how many of the relevant documents (or information fragments) the system successfully retrieves, with a focus on retaining important content. A higher recall indicates that fewer relevant documents are missing. | [user_input, reference, retrieved_contexts] | The value range is [0, 1]. A larger value indicates higher relevance. |
| context_entity_recall | Measures recall based on the entities in the reference answer and the retrieved context. Specifically, it calculates the proportion of entities in the reference answer that are also present in the retrieved context. | [reference, retrieved_contexts] | The value range is [0, 1]. A larger value indicates higher relevance. |
| context_utilization | Measures the relevance between the answer and the retrieved contexts. | [user_input, response, retrieved_contexts] | The value range is [0, 1]. A larger value indicates higher relevance. |
| faithfulness | Measures the factual consistency between the system response and the retrieved context. | [user_input, response, retrieved_contexts] | The score range is [0, 1]. A higher score indicates better consistency. |
| noise_sensitivity | Measures how often the system gives incorrect answers when it uses relevant or irrelevant retrieved documents. | [user_input, reference, response, retrieved_contexts] | The score range is [0, 1]. A lower score indicates better system performance. |
| nv_accuracy | Measures the consistency between the model's answer and the reference answer. Each of two independent LLM judges assigns a score of 0, 2, or 4 to the model's answer. The raw scores are then normalized to the [0, 1] interval and averaged to produce the final score. | [user_input, response, reference] | The score range is [0, 1]. A higher score indicates that the model's answer is closer to the reference answer. |
| nv_context_relevance | Evaluates whether the retrieved context (snippet or paragraph) is relevant to the user input (question). Each of two independent LLM judges assigns a relevance score of 0, 1, or 2. The raw scores are then normalized to the [0, 1] interval and averaged to produce the final score. | [user_input, retrieved_contexts] | The score range is [0, 1]. A higher score indicates stronger relevance between the retrieved context and the user question. |
| nv_response_groundedness | Measures the extent to which the model's answer is grounded in the retrieved context. It evaluates whether each claim or piece of information in the answer is fully or partially supported by the retrieved context. | [response, retrieved_contexts] | The score range is [0, 1]. A higher score indicates that the answer is more strongly supported by the retrieved context. |
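
The arithmetic behind two of the descriptions in the table can be illustrated with a minimal Python sketch. It is not the RAGAS implementation: the function names are hypothetical, the 0/1 relevance labels and the raw judge scores are passed in directly (how they are produced is outside the sketch), and averaging precision@k only over the ranks that hold a relevant chunk is an assumption about how the "average value of precision@k" is weighted.

```python
from typing import Sequence


def precision_at_k(relevance: Sequence[int], k: int) -> float:
    """Ratio of relevant chunks to all chunks among the first k retrieved results."""
    return sum(relevance[:k]) / k


def context_precision(relevance: Sequence[int]) -> float:
    """Average precision@k over the ranks that hold a relevant chunk.

    `relevance` holds one 0/1 label per retrieved chunk, in retrieval order.
    Restricting the average to relevant ranks is an assumption of this sketch.
    """
    values = [
        precision_at_k(relevance, k)
        for k, is_relevant in enumerate(relevance, start=1)
        if is_relevant
    ]
    return sum(values) / len(values) if values else 0.0


def normalize_judge_scores(raw_scores: Sequence[int], max_score: int) -> float:
    """Scale each judge's raw score to [0, 1], then average across judges."""
    return sum(score / max_score for score in raw_scores) / len(raw_scores)


# Chunks 1 and 3 are relevant, chunk 2 is not:
# precision@1 = 1/1 and precision@3 = 2/3, so the average is 5/6.
cp = context_precision([1, 0, 1])

# nv_accuracy style: two judges score on the 0/2/4 scale.
acc = normalize_judge_scores([4, 2], max_score=4)

# nv_context_relevance style: two judges score on the 0/1/2 scale.
rel = normalize_judge_scores([2, 1], max_score=2)

print(round(cp, 4), acc, rel)
```

Under these assumptions, a relevant chunk ranked first contributes more than the same chunk ranked last, which is why context_precision rewards retrievers that place relevant chunks early.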