Skip to content

Conversation

anivar
Copy link
Contributor

@anivar anivar commented Jul 20, 2025

What's the issue?

Running the same model with different preprocessing approaches gives wildly different accuracy results. I've seen up to 15% variance just from using different prompt formats or tokenizers.

What this PR does

Adds minimal preprocessing documentation for:

  • Llama 3.1 8B: Exact prompt template and tokenizer settings
  • DeepSeek-R1: How to handle chain-of-thought outputs and extract final answers

Why it matters

Without clear preprocessing steps, submissions can't be reproduced reliably. This makes it hard to compare results fairly.

Testing

Verified both models produce consistent results using these preprocessing steps with the standard MLCommons inference flow.

Fixes #2245

@anivar anivar requested a review from a team as a code owner July 20, 2025 10:23
Copy link
Contributor

github-actions bot commented Jul 20, 2025

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

hanyunfan
hanyunfan previously approved these changes Jul 21, 2025
Copy link
Contributor

@hanyunfan hanyunfan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, more info added for readme files

@arjunsuresh
Copy link
Contributor

@hanyunfan This is a template not actual information. We should pass this to the respective task forces and get the details.

@mrmhodak
Copy link
Contributor

WG Meeting: Will look at this later.

- Created PREPROCESSING.md template for standardized documentation
- Added comprehensive preprocessing documentation for Llama3.1-8b
- Added comprehensive preprocessing documentation for DeepSeek-r1
- Documented current preprocessing gaps and missing reproducibility steps
- Established standard template for future model documentation
- Based documentation on successful llama2-70b/processorca.py patterns

Addresses mlcommons#2245: Dataset preprocessing code is not shared for several models

This maintenance contribution improves preprocessing transparency by:
1. Documenting existing preprocessing patterns
2. Identifying gaps in current documentation
3. Providing template for consistent future documentation
4. Enabling better adaptation across different tokenizers/models
@anivar anivar force-pushed the fix/preprocessing-documentation branch from 79cc505 to 4e425a0 Compare July 24, 2025 15:48
@anivar
Copy link
Contributor Author

anivar commented Aug 3, 2025

I've simplified this PR based on the successful pattern from #2300. Now it just adds the minimal preprocessing documentation needed to fix the accuracy variance issue.

The changes are:

  • Removed validation scripts and complex code
  • Kept only essential info: tokenizer requirements, prompt templates, and answer extraction
  • Made it easy to copy-paste and use immediately

This should make it much easier to review and merge. Let me know if anything else is needed!

@anivar
Copy link
Contributor Author

anivar commented Aug 17, 2025

Hi @arjunsuresh @mrmhodak,

I see this needs task force input. What's the decision from the WG meeting?

Should I wait for task force details or close this PR?

@arjunsuresh
Copy link
Contributor

@anivar Since this is a template but still under specific benchmark folder I think we need to fill in as much as details as possible to make it useful. If you can join the WG meetings you can get contacts for the Taskforce members who can give you the required information. Inference WG meetings are at 15:30 GMT, every Tuesday.

Update PREPROCESSING.md files with correct information based on actual code.

- DeepSeek-R1: Use apply_chat_template, 32K context
- Llama 3.1-8B: Use instruction template for summarization
- Add general preprocessing guide and examples
@anivar anivar force-pushed the fix/preprocessing-documentation branch from 4e64901 to 75dc325 Compare August 21, 2025 04:34
@anivar
Copy link
Contributor Author

anivar commented Aug 21, 2025

Thanks @arjunsuresh for the feedback. I've now updated the preprocessing documentation with actual implementation details rather than templates.

After reviewing the codebase, I found the existing PREPROCESSING.md files had incorrect information that didn't match the actual code. For example:

  • DeepSeek-R1 was documented with a custom format that doesn't exist - the actual implementation uses apply_chat_template
  • Llama 3.1-8B was shown with chat templates when it actually uses simple instruction format

I've corrected these files based on the actual code in utils/tokenization.py and the preprocessing scripts. The documentation now matches what's actually implemented, so developers can reproduce
the benchmarks correctly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Dataset preprocessing code is not shared for several models
4 participants