Add Tokenizer Comparison Script #150
This script compares the basic `.encode` tokenization between Hugging Face and Mistral Common tokenizers across multiple datasets.
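As a rough illustration of the kind of comparison described above, here is a minimal sketch. It is not the script from this PR: the names `first_mismatch`, `compare_encoders`, `toy_encode_a`, and `toy_encode_b` are hypothetical, and the two toy encoders are stand-ins for the real Hugging Face and Mistral Common `.encode` calls so that the example runs without either library installed.

```python
from typing import Callable, Iterable, List, Optional, Tuple

# Any function mapping a string to a list of token IDs.
Encoder = Callable[[str], List[int]]


def first_mismatch(a: List[int], b: List[int]) -> Optional[int]:
    """Return the index of the first differing token ID, or None if equal."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    # One sequence is a strict prefix of the other.
    if len(a) != len(b):
        return min(len(a), len(b))
    return None


def compare_encoders(
    encode_a: Encoder, encode_b: Encoder, texts: Iterable[str]
) -> List[Tuple[str, int]]:
    """Collect (text, first mismatching index) for every text that differs."""
    mismatches = []
    for text in texts:
        idx = first_mismatch(encode_a(text), encode_b(text))
        if idx is not None:
            mismatches.append((text, idx))
    return mismatches


# Hypothetical stand-ins: replace with wrappers around the real tokenizers.
def toy_encode_a(text: str) -> List[int]:
    return [ord(c) for c in text]


def toy_encode_b(text: str) -> List[int]:
    return [ord(c) for c in text.lower()]


samples = ["hello world", "Hello", "abc"]
report = compare_encoders(toy_encode_a, toy_encode_b, samples)
print(report)
```

In practice each `Encoder` would wrap one library's `.encode` call, with special-token handling configured identically on both sides so that any reported mismatch reflects the tokenization itself.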
juliendenize
left a comment
Thanks a bunch for these scripts, I left a few comments. For general feedback and discussion:
- Can you run pre-commit to silence the tests and get rid of the linter/formatter issues?
- As we also plan to host chat templates, wdyt about creating a folder to host scripts and files dedicated to integrations in libraries?
I've added a scripts folder for now. Would you like subfolders for each integration? Also, should I include more test cases in this script, such as basic instruct datasets, function calling, etc.?
Yeah, actually I was thinking the other way around: having a parent folder named
Yes, it would be nice to have multimodal (audio, image), function calling, instruct, and reasoning cases. If you need help, lmk :)