feat(tokenization): add encode_message to tokenize messages one by one #39507
Conversation
I like this! cc @Rocketknight1 if you can have a look!
Hi @pco111, this is a cool idea, but I'm not sure about some of the details! In particular, the interaction with
But in this case,
Made some comments! Also, check the CI on GitHub - you may need to run `make fixup` to get the style tests to pass.
if conversation_history is None:
    conversation_history = []
In the case where `conversation_history` is None, presumably you just want to return the output of `apply_chat_template()` without changes?
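For illustration, a minimal sketch of the early return being suggested, assuming a signature along the lines of the PR description (`message` plus an optional `conversation_history`) - this is not the PR's actual diff:

```python
# A sketch of the suggested behavior (hypothetical names, not the PR's
# exact code): with no history, encoding a message should behave exactly
# like apply_chat_template on the single message, returned unchanged.
def encode_message(tokenizer, message, conversation_history=None, **kwargs):
    if not conversation_history:
        return tokenizer.apply_chat_template([message], **kwargs)
    ...  # otherwise: render history + message and isolate the new tokens
```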
@@ -1695,6 +1695,89 @@ def apply_chat_template(
        else:
            return rendered_chat

+    def _encode_message(
I'm not sure we need a separate helper function! This can be folded into the main function to keep things simpler.
@@ -3253,7 +3336,7 @@ def pad(
            pad_to_multiple_of (`int`, *optional*):
                If set will pad the sequence to a multiple of the provided value.

-                This is especially useful to enable the use of Tensor Cores on NVIDIA hardware with compute capability
+                This is especially useful to enable the use of Tensor Core on NVIDIA hardware with compute capability
"Tensor Cores" is correct, so we don't want this change.
@@ -375,3 +376,34 @@ def test_training_new_tokenizer_edge_cases(self):
        tokenizer = PreTrainedTokenizerFast(tokenizer_object=_tokenizer)
        toy_text_iterator = ("a" for _ in range(1000))
        tokenizer.train_new_from_iterator(text_iterator=toy_text_iterator, length=1000, vocab_size=50)
+
+
+class ChatTemplateTest(unittest.TestCase):
There are some other chat template tests in existing test classes already, so this should probably go in one of those rather than making a new class!
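As an illustration of the round-trip check such a test could perform from inside one of those existing classes (the helper and method names below are assumptions, not the PR's exact code):

```python
def test_encode_message_matches_chat_template(self):
    # Hypothetical sketch: encoding messages one by one should reproduce
    # the ids produced by applying the template to the full conversation.
    tokenizer = self.get_rust_tokenizer()  # assumed helper on the test class
    history = [{"role": "user", "content": "Hi!"}]
    reply = {"role": "assistant", "content": "Hello! How can I help?"}

    full_ids = tokenizer.apply_chat_template(history + [reply])
    history_ids = tokenizer.apply_chat_template(history)
    reply_ids = tokenizer.encode_message(reply, conversation_history=history)

    self.assertEqual(history_ids + reply_ids, full_ids)
```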
Hi @pco111, I think Copilot (or whatever code agent you're using) is making a lot of unrelated changes to the docstring in
Hi @Rocketknight1, sorry for asking again, but just a gentle reminder 😊
…arameter and add the corresponding error handling. Update the document to reflect this change and verify the error handling in the test.
… the empty dialogue history, and ensure that the chat template can be applied correctly when the dialogue history is empty. Update the document to reflect these changes.
…simplified, and the functional integrity of the `encode_message` method is ensured. Update the document to reflect these changes.
Force-pushed from 3edcc89 to f14a3ee
LGTM now! cc @ArthurZucker if you're still happy with it - it's adding a new method to all tokenizers so it probably needs a core maintainer review.
I might also look at refactoring/changing this after chat schemas are added, since we might be able to use those to isolate tokens from the final message too.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Fine by me!
Co-authored-by: Arthur <[email protected]>
…message_with_chat_template` to support the chat template, and adjust the relevant test cases to reflect this change.
…ge multi-line calls into a single line to improve code readability.
Hi @Rocketknight1 and @ArthurZucker, thank you for your timely reply. I have changed the function name as suggested by @ArthurZucker. Now there are two workflows that need to be approved. Please have a look. Thank you!
Hi @ArthurZucker and @Rocketknight1, thank you for approving the changes. I'm running into an issue where the main branch is updated frequently: by the time the required checks (which need approval) complete, my branch is out of date again, so it seems stuck in a cycle where the merge can never happen.
Hey! Don't worry, you are fine! We can merge now! 🤗 Merging.
What does this PR do?
This PR introduces a new method, `tokenizer.encode_message`, to the base tokenizer class. It allows tokenizing a single chat message at a time while correctly handling the conversational context provided by `conversation_history`. This is particularly useful for streaming applications, where re-tokenizing the entire conversation history for each new message is inefficient.
The new method works by applying the chat template to the full conversation (history + new message) and then programmatically isolating the tokens that correspond to the new message. This ensures that all special tokens, roles, and formatting are applied correctly according to the model's chat template, maintaining consistency with `apply_chat_template`.
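For illustration, the isolation step described above can be approximated by slicing off the rendered history as a prefix. This is a sketch of the approach, not the PR's exact implementation, and the model name is an arbitrary example:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

history = [
    {"role": "user", "content": "Hi!"},
    {"role": "assistant", "content": "Hello! How can I help?"},
]
new_message = {"role": "user", "content": "Tell me a joke."}

# Render and tokenize the history alone, then the history plus the new
# message, and slice off the shared prefix to isolate the new tokens.
history_ids = tokenizer.apply_chat_template(history)
full_ids = tokenizer.apply_chat_template(history + [new_message])
new_message_ids = full_ids[len(history_ids):]

# The consistency property the PR aims to preserve:
assert history_ids + new_message_ids == full_ids
```

Note that this naive slicing assumes the rendered history is a strict prefix of the full rendering and that token boundaries line up at the message seam - exactly the kind of edge cases the review discussion above is probing.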
Fixes #39417
Before submitting
[x] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[x] Did you read the contributor guideline, Pull Request section?
[x] Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
[x] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[x] Did you write any new necessary tests?
Who can review?
@ArthurZucker @Rocketknight1