adds mmlu-pro #1031
Conversation
Pull Request Overview
This PR adds support for the MMLU Pro benchmark, a multiple-choice question answering task from the TIGER-Lab/MMLU-Pro dataset.
- Introduces a new MMLU Pro task configuration
- Implements a custom prompt function for MMLU Pro questions
- Configures evaluation on the test split, using the validation split for few-shot examples
Comments suppressed due to low confidence (8)
src/lighteval/tasks/tasks/mmlu_pro.py:74 - The task configuration is missing the `generation_size` parameter, which is required for generative metrics like `gpqa_instruct_metric`. Based on similar tasks using this metric (e.g., gpqa.py lines 57, 73, 89), a value like `generation_size=30` or `generation_size=32768` should be specified, depending on whether reasoning traces are expected.

src/lighteval/tasks/tasks/mmlu_pro.py:74 - The task configuration is missing the `stop_sequence` parameter. Based on the generative nature of the task and similar configurations (e.g., gpqa.py lines 59, 75, 91), `stop_sequence=[]` should be explicitly set to use the EOS token.
src/lighteval/tasks/tasks/mmlu_pro.py:23 - Import of 'LogLikelihoodAccMetric' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:25 - Import of 'LogProbCharNorm' is not used. Import of 'LogProbPMINorm' is not used. Import of 'LogProbTokenNorm' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:27 - Import of 'get_metrics_for_formulation' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:29 - Import of 'get_mcq_prompt_function' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:34 - Import of 'CFFormulation' is not used. Import of 'HybridFormulation' is not used. Import of 'MCFFormulation' is not used.

src/lighteval/tasks/tasks/mmlu_pro.py:35 - Import of 'Language' is not used.
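For the two `generation_size` / `stop_sequence` comments above, a minimal sketch of how the configuration might be extended. It assumes only the field names visible in this diff and mirrors the gpqa.py values the comments cite; the remaining dataset, split, and metric fields are elided.

```python
from lighteval.tasks.lighteval_task import LightevalTaskConfig

mmlu_pro = LightevalTaskConfig(
    name="mmlu_pro",
    prompt_function=mmlu_pro_prompt_function,  # defined elsewhere in this PR
    # 30 tokens suffices for a short "Answer: X" completion; use a large
    # budget such as 32768 if reasoning traces are expected (both values
    # taken from the gpqa.py examples cited above).
    generation_size=30,
    # Empty list: stop only on the EOS token, as in gpqa.py.
    stop_sequence=[],
    # ... dataset, splits, and metric fields as in the PR ...
)
```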
```python
mmlu_pro = LightevalTaskConfig(
    name="mmlu_pro",
    prompt_function=mmlu_pro_prompt_function,
```
Copilot AI commented on Oct 31, 2025:
The variable choices on line 59 is a string (created by str.join() on line 49), so len(choices) returns the string length rather than the number of options. This should be len(line["options"]) to correctly slice the uppercase letters corresponding to the actual number of answer choices.
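To make the failure concrete, here is a small sketch; the example row and the joining logic are assumptions modeled on the comment, not the PR's exact code:

```python
from string import ascii_uppercase

# Hypothetical row mirroring the TIGER-Lab/MMLU-Pro schema.
line = {"options": ["Paris", "London", "Berlin", "Madrid"]}

# `choices` is a single joined string, as described in the comment.
choices = "\n".join(
    f"{letter}. {option}"
    for letter, option in zip(ascii_uppercase, line["options"])
)

# Buggy: len(choices) is the length of the joined string (38 here),
# so the slice returns all 26 letters instead of 4.
buggy = ascii_uppercase[: len(choices)]

# Fixed: slice by the number of options instead.
fixed = ascii_uppercase[: len(line["options"])]
print(buggy, fixed)  # ABCDEFGHIJKLMNOPQRSTUVWXYZ ABCD
```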
```python
Answer:""".strip()
```
Copilot AI commented on Oct 31, 2025:
The template hardcodes 'ABCD' as the possible letters, but MMLU Pro typically has 10 answer choices (A-J). The instruction should be updated to reflect the actual range of possible letters, such as 'where LETTER is one of A through J' or made dynamic based on the number of options.
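One way the instruction could be made dynamic, as suggested; this is an illustrative sketch, and the helper name and its wiring into TEMPLATE are assumptions:

```python
from string import ascii_uppercase

def format_instruction(num_options: int) -> str:
    """Build the answer-format instruction for `num_options` choices."""
    letters = ascii_uppercase[:num_options]
    return (
        "Answer the following multiple choice question. The last line of "
        "your response should be of the following format: 'Answer: $LETTER' "
        f"(without quotes) where LETTER is one of {letters[0]} through "
        f"{letters[-1]}. Think step by step before answering."
    )

print(format_instruction(10))
# ... where LETTER is one of A through J. ...
```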
```python
https://arxiv.org/abs/2406.01574
"""
from string import ascii_uppercase

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


TEMPLATE = """
Answer the following multiple choice question. The last line of your response should be of the following format: 'Answer: $LETTER' (without quotes) where LETTER is one of ABCD. Think step by step before answering.

{question}

{choices}
```
Copilot AI commented on Oct 31, 2025:
Multiple imports are unused: LogLikelihoodAccMetric, LogProbCharNorm, LogProbPMINorm, LogProbTokenNorm, get_metrics_for_formulation, get_mcq_prompt_function, CFFormulation, HybridFormulation, MCFFormulation, and Language. These should be removed to keep the imports clean and maintainable.
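After dropping those names, the module's import block would reduce to the imports that actually appear in the hunk above, roughly:

```python
# Trimmed import block for mmlu_pro.py, keeping only the names in use
# (the LogProb* norms, formulation helpers, and Language are dropped
# per the review comments above):
from string import ascii_uppercase

from lighteval.metrics.metrics import Metrics
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc
```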
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
to run: