adds mmlu-pro #1031
Conversation
Pull Request Overview
This PR adds support for the MMLU Pro benchmark, a multiple-choice question answering task from the TIGER-Lab/MMLU-Pro dataset.
- Introduces a new MMLU Pro task configuration
- Implements a custom prompt function for MMLU Pro questions
- Configures evaluation on the test split, using the validation split for few-shot examples
Comments suppressed due to low confidence (8)
src/lighteval/tasks/tasks/mmlu_pro.py:74 - The task configuration is missing the `generation_size` parameter, which is required for generative metrics like `gpqa_instruct_metric`. Based on similar tasks using this metric (e.g., gpqa.py lines 57, 73, 89), a value like `generation_size=30` or `generation_size=32768` should be specified, depending on whether reasoning traces are expected.

src/lighteval/tasks/tasks/mmlu_pro.py:74 - The task configuration is missing the `stop_sequence` parameter. Based on the generative nature of the task and similar configurations (e.g., gpqa.py lines 59, 75, 91), `stop_sequence=[]` should be explicitly set to use the EOS token.
src/lighteval/tasks/tasks/mmlu_pro.py:23 - Import of 'LogLikelihoodAccMetric' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:25 - Import of 'LogProbCharNorm' is not used. Import of 'LogProbPMINorm' is not used. Import of 'LogProbTokenNorm' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:27 - Import of 'get_metrics_for_formulation' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:29 - Import of 'get_mcq_prompt_function' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:34 - Import of 'CFFormulation' is not used. Import of 'HybridFormulation' is not used. Import of 'MCFFormulation' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:35 - Import of 'Language' is not used.
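For the two `generation_size` / `stop_sequence` comments above, a minimal sketch of how the configuration might be extended. It assumes only the field names visible in this diff and mirrors the gpqa.py values the comments cite; the remaining dataset, split, and metric fields are elided.

```python
from lighteval.tasks.lighteval_task import LightevalTaskConfig

mmlu_pro = LightevalTaskConfig(
    name="mmlu_pro",
    prompt_function=mmlu_pro_prompt_function,  # defined elsewhere in this PR
    # 30 tokens suffices for a short "Answer: X" completion; use a large
    # budget such as 32768 if reasoning traces are expected (both values
    # taken from the gpqa.py examples cited above).
    generation_size=30,
    # Empty list: stop only on the EOS token, as in gpqa.py.
    stop_sequence=[],
    # ... dataset, splits, and metric fields as in the PR ...
)
```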
```python
mmlu_pro = LightevalTaskConfig(
    name="mmlu_pro",
    prompt_function=mmlu_pro_prompt_function,
```
Copilot AI commented on Oct 31, 2025:
The variable choices on line 59 is a string (created by str.join() on line 49), so len(choices) returns the string length rather than the number of options. This should be len(line["options"]) to correctly slice the uppercase letters corresponding to the actual number of answer choices.
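To make the failure concrete, here is a small sketch; the example row and the joining logic are assumptions modeled on the comment, not the PR's exact code:

```python
from string import ascii_uppercase

# Hypothetical row mirroring the TIGER-Lab/MMLU-Pro schema.
line = {"options": ["Paris", "London", "Berlin", "Madrid"]}

# `choices` is a single joined string, as described in the comment.
choices = "\n".join(
    f"{letter}. {option}"
    for letter, option in zip(ascii_uppercase, line["options"])
)

# Buggy: len(choices) is the length of the joined string (38 here),
# so the slice returns all 26 letters instead of 4.
buggy = ascii_uppercase[: len(choices)]

# Fixed: slice by the number of options instead.
fixed = ascii_uppercase[: len(line["options"])]
print(buggy, fixed)  # ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCD
```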
```python
Answer:""".strip()
```
Copilot AI commented on Oct 31, 2025:
The template hardcodes 'ABCD' as the possible letters, but MMLU Pro typically has 10 answer choices (A-J). The instruction should be updated to reflect the actual range of possible letters, such as 'where LETTER is one of A through J' or made dynamic based on the number of options.
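One way the instruction could be made dynamic, as suggested; this is an illustrative sketch, and the helper name and its wiring into TEMPLATE are assumptions:

```python
from string import ascii_uppercase

def format_instruction(num_options: int) -> str:
    """Build the answer-format instruction for `num_options` choices."""
    letters = ascii_uppercase[:num_options]
    return (
        "Answer the following multiple choice question. The last line of "
        "your response should be of the following format: 'Answer: $LETTER' "
        f"(without quotes) where LETTER is one of {letters[0]} through "
        f"{letters[-1]}. Think step by step before answering."
    )

print(format_instruction(10))
# ... where LETTER is one of A through J. ...
```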
```python
https://arxiv.org/abs/2406.01574
"""
from string import ascii_uppercase

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.

{question}

{choices}
```
Copilot AI commented on Oct 31, 2025:
Multiple imports are unused: LogLikelihoodAccMetric, LogProbCharNorm, LogProbPMINorm, LogProbTokenNorm, get_metrics_for_formulation, get_mcq_prompt_function, CFFormulation, HybridFormulation, MCFFormulation, and Language. These should be removed to keep the imports clean and maintainable.
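After dropping those names, the module's import block would reduce to the imports that actually appear in the hunk above, roughly:

```python
# Trimmed import block for mmlu_pro.py, keeping only the names in use
# (the LogProb* norms, formulation helpers, and Language are dropped
# per the review comments above):
from string import ascii_uppercase

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
```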
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
to run: