Description
Right now, if a single task run raises an error, the whole `dataset.evaluate` run fails with that error. We should handle both task errors and evaluator errors gracefully, and should have built-in support for retrying tasks and evaluators (e.g. for something like `LLMJudge`, where you might get an intermittent failure).
This will be closed by #2295, but that PR may introduce breaking changes, so I want to get it merged before V1.
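
To make the requested behavior concrete, here is a minimal sketch of per-case error capture with retries. It is not the implementation in #2295 and does not use the real evaluation API: `CaseResult`, `run_with_retries`, and `evaluate_all` are hypothetical names, and the retry policy (a fixed number of immediate attempts, no backoff) is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List, Optional


@dataclass
class CaseResult:
    """Outcome of one case: either an output or a captured error (hypothetical type)."""
    name: str
    output: Any = None
    error: Optional[Exception] = None
    attempts: int = 0


def run_with_retries(name: str, fn: Callable[[], Any], max_attempts: int = 3) -> CaseResult:
    """Call fn up to max_attempts times; return the error instead of raising it."""
    last_error: Optional[Exception] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return CaseResult(name=name, output=fn(), attempts=attempt)
        except Exception as exc:  # assumption: any exception is treated as retryable
            last_error = exc
    return CaseResult(name=name, error=last_error, attempts=max_attempts)


def evaluate_all(cases: Dict[str, Callable[[], Any]], max_attempts: int = 3) -> List[CaseResult]:
    """Run every case; one failing case does not abort the others."""
    return [run_with_retries(name, fn, max_attempts) for name, fn in cases.items()]
```

The same wrapper could be applied to evaluators as well as tasks, so that an intermittent judge failure is retried and, if it still fails, recorded on that case rather than raised out of the whole run.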