Conversation

@pandora-s-git
Contributor

This script compares the basic `.encode` tokenization between Hugging Face and Mistral Common tokenizers across multiple datasets.
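The core comparison loop of such a script can be sketched library-agnostically. The helper below and the toy encoders are illustrative, not the actual script's API; in practice, `encode_a` and `encode_b` would wrap the Hugging Face tokenizer's `.encode` and the Mistral Common tokenizer's `.encode` respectively.

```python
# Hypothetical sketch: compare two encode callables over a dataset.
# In the real script, encode_a / encode_b would be the Hugging Face
# and Mistral Common tokenizers' .encode methods.

def compare_encodings(texts, encode_a, encode_b):
    """Return (index, ids_a, ids_b) for every text where the two
    encode callables produce different token-id sequences."""
    mismatches = []
    for i, text in enumerate(texts):
        ids_a = list(encode_a(text))
        ids_b = list(encode_b(text))
        if ids_a != ids_b:
            mismatches.append((i, ids_a, ids_b))
    return mismatches

# Toy example with stand-in encoders (byte values vs. byte values + 1):
texts = ["hello", "world"]
same = compare_encodings(texts, lambda t: [ord(c) for c in t],
                                lambda t: [ord(c) for c in t])
diff = compare_encodings(texts, lambda t: [ord(c) for c in t],
                                lambda t: [ord(c) + 1 for c in t])
print(len(same), len(diff))  # 0 mismatches vs. 2 mismatches
```

Collecting mismatches rather than failing on the first one makes it easy to report per-dataset disagreement counts.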
Contributor

@juliendenize left a comment


Thanks a bunch for these scripts, I left a few comments. For general feedback and discussion:

  • Can you run pre-commit to silence the tests and get rid of linter/formatter issues?
  • As we plan to also host chat templates, wdyt about creating a folder to host scripts and files dedicated to integrations in libraries?

@pandora-s-git
Contributor Author

pandora-s-git commented Nov 12, 2025

> As we plan to also host chat templates, wdyt about creating a folder to host scripts and files dedicated to integrations in libraries?

I've currently added a `scripts` folder; would you like subfolders for each integration?

Also, should I include more test cases in this script, like basic instruct datasets, function calling, etc.?

@juliendenize
Contributor

> I've currently added a `scripts` folder; would you like subfolders for each integration?

Yeah, actually I was thinking the other way around: a parent folder named `integration` or `external`? (If you come up with better naming, don't hesitate.) It would contain `scripts` as a subfolder alongside `chat_templates`.

> Also, should I include more test cases in this script, like basic instruct datasets, function calling, etc.?

Yes, it would be nice to have multimodal (audio, image), function calling, instruct, and reasoning. If you need help, lmk :)
