feat(tokenization): add encode_message to tokenize messages one by one #39507
Conversation
I like this! cc @Rocketknight1 if you can have a look!
Hi @pco111, this is a cool idea, but I'm not sure about some of the details! In particular, the interaction with
But in this case,
Made some comments! Also, check the CI on GitHub - you may need to run `make fixup` to get the style tests to pass.
if conversation_history is None:
    conversation_history = []
In the case where `conversation_history` is None, presumably you just want to return the output of `apply_chat_template()` without changes?
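For illustration, a minimal sketch of the early return being suggested, assuming a signature along the lines of the PR description (`message` plus an optional `conversation_history`) - this is not the PR's actual diff:

```python
# A sketch of the suggested behavior (hypothetical names, not the PR's
# exact code): with no history, encoding a message should behave exactly
# like apply_chat_template on the single message, returned unchanged.
def encode_message(tokenizer, message, conversation_history=None, **kwargs):
    if not conversation_history:
        return tokenizer.apply_chat_template([message], **kwargs)
    ...  # otherwise: render history + message and isolate the new tokens
```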
@@ -1695,6 +1695,89 @@ def apply_chat_template(
        else:
            return rendered_chat

+    def _encode_message(
I'm not sure we need a separate helper function! This can be folded into the main function to keep things simpler.
@@ -3253,7 +3336,7 @@ def pad(
            pad_to_multiple_of (`int`, *optional*):
                If set will pad the sequence to a multiple of the provided value.

-                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
"Tensor Cores" is correct, so we don't want this change.
@@ -375,3 +376,34 @@ def test_training_new_tokenizer_edge_cases(self):
        tokenizer = PreTrainedTokenizerFast(tokenizer_object=_tokenizer)
        toy_text_iterator = ("a" for _ in range(1000))
        tokenizer.train_new_from_iterator(text_iterator=toy_text_iterator, length=1000, vocab_size=50)
+
+
+class ChatTemplateTest(unittest.TestCase):
There are some other chat template tests in existing test classes already, so this should probably go in one of those rather than making a new class!
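As an illustration of the round-trip check such a test could perform from inside one of those existing classes (the helper and method names below are assumptions, not the PR's exact code):

```python
def test_encode_message_matches_chat_template(self):
    # Hypothetical sketch: encoding messages one by one should reproduce
    # the ids produced by applying the template to the full conversation.
    tokenizer = self.get_rust_tokenizer()  # assumed helper on the test class
    history = [{"role": "user", "content": "Hi!"}]
    reply = {"role": "assistant", "content": "Hello! How can I help?"}

    full_ids = tokenizer.apply_chat_template(history + [reply])
    history_ids = tokenizer.apply_chat_template(history)
    reply_ids = tokenizer.encode_message(reply, conversation_history=history)

    self.assertEqual(history_ids + reply_ids, full_ids)
```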
Hi @pco111, I think Copilot (or whatever code agent you're using) is making a lot of unrelated changes to the docstring in
Hi @Rocketknight1, sorry for asking again, but just a gentle reminder 😊
…arameter and add the corresponding error handling. Update the document to reflect this change and verify the error handling in the test.
… the empty dialogue history, and ensure that the chat template can be applied correctly when the dialogue history is empty. Update the document to reflect these changes.
…simplified, and the functional integrity of the `encode_message` method is ensured. Update the document to reflect these changes.
Force-pushed from 3edcc89 to f14a3ee
LGTM now! cc @ArthurZucker if you're still happy with it - it's adding a new method to all tokenizers so it probably needs a core maintainer review.
I might also look at refactoring/changing this after chat schemas are added, since we might be able to use those to isolate tokens from the final message too.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Fine by me!
Co-authored-by: Arthur <[email protected]>
…message_with_chat_template` to support the chat template, and adjust the relevant test cases to reflect this change.
…ge multi-line calls into a single line to improve code readability.
Hi @Rocketknight1 and @ArthurZucker, thank you for your timely reply. I have changed the function name as suggested by @ArthurZucker. Now there are two workflows that need to be approved. Please have a look. Thank you!
Hi @ArthurZucker and @Rocketknight1, thank you for approving the changes. I'm running into an issue where the main branch is updated frequently: by the time the required checks (which need approval) complete, my branch is out of date again, so it seems stuck in a cycle where the merge can never happen.
Hey! Don't worry, you are fine! We can merge now! 🤗 Merging.
What does this PR do?
This PR introduces a new method, `tokenizer.encode_message`, to the base tokenizer class. It allows tokenizing a single chat message at a time while correctly handling the conversational context provided by `conversation_history`. This is particularly useful for streaming applications, where re-tokenizing the entire conversation history for each new message is inefficient.
The new method works by applying the chat template to the full conversation (history + new message) and then programmatically isolating the tokens that correspond to the new message. This ensures that all special tokens, roles, and formatting are applied correctly according to the model's chat template, maintaining consistency with `apply_chat_template`.
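For illustration, the isolation step described above can be approximated by slicing off the rendered history as a prefix. This is a sketch of the approach, not the PR's exact implementation, and the model name is an arbitrary example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

history = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
new_message = {"role": "user", "content": "Tell me a joke."}

# Render and tokenize the history alone, then the history plus the new
# message, and slice off the shared prefix to isolate the new tokens.
history_ids = tokenizer.apply_chat_template(history)
full_ids = tokenizer.apply_chat_template(history + [new_message])
new_message_ids = full_ids[len(history_ids):]

# The consistency property the PR aims to preserve:
assert history_ids + new_message_ids == full_ids
```

Note that this naive slicing assumes the rendered history is a strict prefix of the full rendering and that token boundaries line up at the message seam - exactly the kind of edge cases the review discussion above is probing.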
Fixes #39417
Before submitting
[x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
[x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[x] Did you write any new necessary tests?
Who can review?
@ArthurZucker @Rocketknight1