Description
[x] I checked the documentation and related resources and couldn't find an answer to my question.
Your Question
I'm encountering issues when trying to evaluate my RAG system using Ragas. The answer_correctness and answer_similarity metrics calculate correctly, but other metrics such as context_recall, faithfulness, and context_precision consistently result in nan. The evaluation run also logs frequent RagasOutputParserException and TimeoutError errors.
I suspect this is related either to the choice of LLM (likely a local model via Ollama (qwen2.5:14b), as indicated by init_ragas_ollama_components) or to how Ragas parses the output of this specific LLM.
Code Snippet
import json
from pathlib import Path

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    answer_correctness,
    answer_similarity,
    context_precision,
    context_recall,
    faithfulness,
)

# `args` and init_ragas_ollama_components are defined elsewhere in the script.
with open(Path(args.base_dir) / "retrival_results.json", encoding="utf-8") as f:
    retrieval_results = json.load(f)

embed_wrapper, llm_wrapper = init_ragas_ollama_components(args)

# Attach the local LLM and embedding wrappers to every metric that uses them.
metrics = [answer_correctness, answer_similarity, context_recall, faithfulness, context_precision]
for metric in metrics:
    if hasattr(metric, "llm"):
        metric.llm = llm_wrapper
    if hasattr(metric, "embeddings"):
        metric.embeddings = embed_wrapper

modes = ["naive", "local", "global"] if args.mode == "all" else [args.mode]
all_results = {}
for mode in modes:
    samples = retrieval_results.get(mode, [])
    questions = []
    ground_truths = []
    predictions = []
    contexts = []
    for sample in samples:
        questions.append(sample.get("question", ""))
        ground_truths.append(sample.get("ground_truth", ""))
        predictions.append(sample.get("prediction", ""))
        # Ragas expects a list of context strings per sample.
        contexts.append([sample.get("retrieval_context", "")])
    dataset = Dataset.from_dict({
        "question": questions,
        "ground_truth": ground_truths,
        "answer": predictions,
        "contexts": contexts,
    })
    results = evaluate(dataset, metrics=metrics)
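The loop above fills any missing field with an empty string via sample.get(..., ""), so an incomplete record would reach evaluate without any warning. As a sanity check before evaluation, something like the following stdlib-only sketch could flag such records. The field names are taken from the snippet; find_incomplete_samples is a hypothetical helper and not part of Ragas.

```python
from typing import Any

# Field names as read by the snippet above.
REQUIRED_FIELDS = ("question", "ground_truth", "prediction", "retrieval_context")


def find_incomplete_samples(samples: list[dict[str, Any]]) -> dict[int, list[str]]:
    """Map sample index -> required fields that are missing or blank.

    Hypothetical helper for illustration; not part of Ragas.
    """
    problems: dict[int, list[str]] = {}
    for index, sample in enumerate(samples):
        missing = [
            field
            for field in REQUIRED_FIELDS
            if not str(sample.get(field) or "").strip()
        ]
        if missing:
            problems[index] = missing
    return problems


# Made-up records, only to show the call.
demo_samples = [
    {"question": "q", "ground_truth": "g", "prediction": "p", "retrieval_context": "c"},
    {"question": "q", "prediction": "  "},
]
print(find_incomplete_samples(demo_samples))
```

An empty result would rule out blank inputs as a cause, leaving the LLM output format and timeouts as the remaining suspects.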
Error Messages:
Evaluating: 10% | 5/50 [01:41<20:45, 27.69s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[12]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 18% | 9/50 [02:48<10:41, 15.66s/it]
ERROR:ragas.executor:Exception raised in Job[2]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[3]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[4]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[5]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[7]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[8]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[9]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[13]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[14]: TimeoutError()
Evaluating: 38% | 19/50 [03:01<01:37, 3.13s/it]
ERROR:ragas.executor:Exception raised in Job[17]: TimeoutError()
Evaluating: 42% | 21/50 [03:02<01:14, 2.56s/it]
ERROR:ragas.executor:Exception raised in Job[18]: TimeoutError()
Evaluating: 46% | 23/50 [03:03<00:57, 2.12s/it]
ERROR:ragas.executor:Exception raised in Job[19]: TimeoutError()
Evaluating: 48% | 24/50 [03:04<00:49, 1.92s/it]
ERROR:ragas.executor:Exception raised in Job[20]: TimeoutError()
Evaluating: 50% | 25/50 [04:41<07:17, 17.52s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[32]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 54% | 27/50 [05:24<06:47, 17.74s/it]
ERROR:ragas.executor:Exception raised in Job[22]: TimeoutError()
Evaluating: 56% | 28/50 [05:33<05:47, 15.77s/it]
ERROR:ragas.executor:Exception raised in Job[23]: TimeoutError()
Evaluating: 58% | 29/50 [05:46<05:15, 15.04s/it]
ERROR:ragas.executor:Exception raised in Job[24]: TimeoutError()
Evaluating: 60% | 30/50 [05:48<03:52, 11.63s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[34]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 64% | 32/50 [05:55<02:15, 7.53s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[29]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 66% | 33/50 [05:58<01:46, 6.27s/it]
ERROR:ragas.executor:Exception raised in Job[25]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[27]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[28]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[30]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[33]: TimeoutError()
Evaluating: 68% | 34/50 [06:00<01:16, 4.78s/it]
ERROR:ragas.executor:Exception raised in Job[35]: TimeoutError()
Evaluating: 78% | 39/50 [06:01<00:19, 1.79s/it]
ERROR:ragas.executor:Exception raised in Job[37]: TimeoutError()
Evaluating: 80% | 40/50 [06:02<00:16, 1.60s/it]
ERROR:ragas.executor:Exception raised in Job[38]: TimeoutError()
Evaluating: 82% | 41/50 [06:03<00:13, 1.51s/it]
ERROR:ragas.executor:Exception raised in Job[39]: TimeoutError()
Evaluating: 86% | 43/50 [06:16<00:27, 3.92s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[49]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 88% | 44/50 [07:52<02:40, 26.72s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[43]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 90% | 45/50 [08:17<02:11, 26.36s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_recall_classification_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[47]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 94% | 47/50 [08:24<00:46, 15.39s/it]
ERROR:ragas.executor:Exception raised in Job[42]: TimeoutError()
Evaluating: 96% | 48/50 [08:24<00:22, 11.13s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt n_l_i_statement_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[48]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 98% | 49/50 [08:31<00:09, 9.88s/it]
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt fix_output_format failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.prompt.pydantic_prompt:Prompt context_precision_prompt failed to parse output: The output parser failed to parse the output including retries.
ERROR:ragas.executor:Exception raised in Job[44]: RagasOutputParserException(The output parser failed to parse the output including retries.)
Evaluating: 100% | 50/50 [08:31<00:00, 10.23s/it]
{'answer_correctness': 0.7917, 'semantic_similarity': 0.7538, 'context_recall': nan, 'faithfulness': nan, 'context_precision': nan}
INFO:main:=== Evaluation finished ===
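To see which metrics the failed jobs belong to, the "Exception raised in Job[N]" lines in the log can be tallied. The sketch below is stdlib-only and rests on one assumption that the log does not confirm: jobs are numbered sample by sample in the order the metrics were passed to evaluate, so job N maps to metrics[N % 5]. The prompt names in the log agree with that mapping (for example Job[12] with context_recall_classification_prompt, Job[43] with n_l_i_statement_prompt). tally_failures is a hypothetical helper, not a Ragas function.

```python
import re
from collections import Counter

# Metric order as passed to evaluate() in the snippet above.
METRICS = [
    "answer_correctness",
    "answer_similarity",
    "context_recall",
    "faithfulness",
    "context_precision",
]

# Matches e.g. "Exception raised in Job[12]: RagasOutputParserException(...)".
JOB_ERROR = re.compile(r"Exception raised in Job\[(\d+)\]: (\w+)")


def tally_failures(log_text, metrics=METRICS):
    """Count failed jobs per (metric, exception type).

    ASSUMPTION: job i belongs to metrics[i % len(metrics)].
    """
    counts = Counter()
    for job_id, exc_name in JOB_ERROR.findall(log_text):
        metric = metrics[int(job_id) % len(metrics)]
        counts[(metric, exc_name)] += 1
    return counts


# Three lines copied from the log above, only to show the call.
sample_log = """\
ERROR:ragas.executor:Exception raised in Job[12]: RagasOutputParserException(The output parser failed to parse the output including retries.)
ERROR:ragas.executor:Exception raised in Job[2]: TimeoutError()
ERROR:ragas.executor:Exception raised in Job[5]: TimeoutError()
"""
for (metric, exc_name), n in sorted(tally_failures(sample_log).items()):
    print(f"{metric:20s} {exc_name:30s} {n}")
```

Applied to the full log under that assumption, all 10 jobs for each of context_recall, faithfulness, and context_precision failed, 5 of the 10 answer_correctness jobs timed out, and no answer_similarity job failed. That would be consistent with exactly those three metrics reporting nan.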
Additional context:
- I am using a local LLM, likely via Ollama (based on the init_ragas_ollama_components function name).
- The errors suggest the LLM might not be returning output in the expected format for Ragas' parsers, or it might be timing out during the evaluation process for certain metrics.
- answer_correctness and answer_similarity rely more on embeddings and simpler LLM calls, while context_recall, faithfulness, and context_precision typically require the LLM to perform more complex reasoning and to produce structured output (such as JSON). This difference in required LLM capability might explain why some metrics work and others fail.
- Could there be specific prompt or parsing issues with certain local LLMs? Or are there parameters I need to adjust for timeouts or parser robustness?
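One way to test the output-format hypothesis outside of Ragas is to send a prompt to the model directly and check whether its raw reply contains parseable JSON at all. The following is a minimal stdlib-only sketch of such a check. It is not how Ragas parses output; extract_json is a hypothetical diagnostic helper, and the sample replies are made up.

```python
import json
import re

# Build the Markdown fence marker programmatically to keep this block readable.
TICKS = "`" * 3
FENCE = re.compile(TICKS + r"(?:json)?\s*(.*?)" + TICKS, re.DOTALL)


def extract_json(raw: str):
    """Best effort: return the first JSON object or array found in `raw`, else None.

    Fenced blocks are tried first, then the whole reply.
    """
    decoder = json.JSONDecoder()
    for text in FENCE.findall(raw) + [raw]:
        # Try every position where an object or array could start.
        for match in re.finditer(r"[{\[]", text):
            try:
                value, _end = decoder.raw_decode(text, match.start())
                return value
            except json.JSONDecodeError:
                continue
    return None


# Made-up replies in the styles a local model might produce.
fenced_reply = "Sure! Here is the result:\n" + TICKS + 'json\n{"verdict": 1}\n' + TICKS
chatty_reply = 'Answer: {"a": [1, 2]} Hope this helps.'
print(extract_json(fenced_reply))
print(extract_json(chatty_reply))
print(extract_json("I cannot answer that."))
```

If replies collected this way often return None, or return JSON wrapped in extra commentary, that would support the parsing explanation. If they parse cleanly, the timeouts become the more likely cause.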