Add Tokenizer Comparison Script #150

pandora-s-git · 2025-11-10T15:10:08Z

This script compares the basic `.encode` tokenization between Hugging Face and Mistral Common tokenizers across multiple datasets.
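
To illustrate the kind of check described above, here is a minimal sketch, not the script from this PR. It assumes each tokenizer can be wrapped as a plain function from text to a list of token ids; the names `first_mismatch`, `compare_encoders`, and the two toy encoders are hypothetical and only stand in for the real Hugging Face and Mistral Common tokenizers.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Hypothetical type alias: any function mapping text to token ids.
Encoder = Callable[[str], List[int]]


def first_mismatch(a: List[int], b: List[int]) -> Optional[int]:
    """Return the index of the first differing token id, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # Same prefix but different lengths: the mismatch starts where the shorter one ends.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_encoders(
    encode_a: Encoder, encode_b: Encoder, texts: Iterable[str]
) -> List[Tuple[str, int]]:
    """Encode each text with both encoders and collect (text, index) per mismatch."""
    mismatches = []
    for text in texts:
        idx = first_mismatch(encode_a(text), encode_b(text))
        if idx is not None:
            mismatches.append((text, idx))
    return mismatches


# Toy stand-ins (assumptions for illustration only, not real tokenizers).
def encode_bytes(text: str) -> List[int]:
    return list(text.encode("utf-8"))


def encode_bytes_no_spaces(text: str) -> List[int]:
    # Deliberately drops spaces so that the two encoders disagree.
    return [b for b in text.encode("utf-8") if b != 0x20]


samples = ["hello", "hello world", ""]
report = compare_encoders(encode_bytes, encode_bytes_no_spaces, samples)
print(report)
```

In practice, `encode_a` and `encode_b` would wrap the two real tokenizers' encode calls, with special-token handling (for example BOS) aligned on both sides so that only genuine tokenization differences are reported.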

juliendenize

Thanks a bunch for these scripts, I left a few comments. For general feedback and discussion:

Can you run pre-commit to silence the tests and get rid of the linter/formatter issues?
As we also plan to host chat templates, wdyt about creating a folder for scripts and files dedicated to library integrations?

scripts/compare_tokenizer.py

pandora-s-git · 2025-11-12T15:39:02Z

> As we also plan to host chat templates, wdyt about creating a folder for scripts and files dedicated to library integrations?

I've currently added a scripts folder. Would you like subfolders for each integration?

Also, should I include more test cases in this script, such as basic instruct datasets, function calling, etc.?

juliendenize · 2025-11-13T13:23:03Z

> I've currently added a scripts folder. Would you like subfolders for each integration?

Yeah, actually I was thinking the other way around: a parent folder named integration or external (if you come up with a better name, don't hesitate) that would contain scripts as a subfolder alongside chat_templates.

> Also, should I include more test cases in this script, such as basic instruct datasets, function calling, etc.?

Yes, it would be nice to have multimodal (audio, image), function calling, instruct, and reasoning cases. If you need help, lmk :)

add tokenizer comparison script

68f9bcc

This script compares the basic `.encode` tokenization between Hugging Face and Mistral Common tokenizers across multiple datasets.

juliendenize reviewed Nov 12, 2025

max_chars arg and nits

5d0c69e

formatting

aedfb3d

pandora-s-git added 3 commits November 14, 2025 16:26

minor bos fix

60e6e75

restructure and new full_compare script

83854cb

cleaning and add random option

cec8f37
