Bumping tokenizers pin to 1c432479f6b29ce8defc5d2e83375be9238fe1bf #17569
Open
larryliu0820 wants to merge 1 commit into main
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17569
Note: Links to docs will display an error until the docs builds have been completed.
❌ 5 New Failures as of commit c54996a with merge base a398a96.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
@larryliu0820 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93019471.
larryliu0820 added a commit that referenced this pull request on Feb 19, 2026
Summary: Pull Request resolved: #17569

* [FEAT] Add prepend normalizer
  This commit introduces a prepend normalizer, similar to Hugging Face's Rust implementation.
* [FEAT] Add "skip_special_tokens" parameter to the decode function
  Added a decode function parameter to optionally skip decoding special tokens, similar to the HF Rust implementation. This change should have no effect unless the parameter is set to true.
* [FEAT] Add function "piece_to_id"
  This commit introduces a public member function that converts a string to a token id. This function is the reverse of the already existing 'id_to_piece'.
* [FEAT] Add handling of null pretokenizer and bytefallback json fields
  Added:
  - Handling of pretokenizer field explicitly set to null
  - Handling of bytefallback field along with encode logic
* [FIX] Changed decode API to work on vectors instead of singular tokens
* [REFACTOR] Changed tests to reflect new decode API
* [FIX] Change decoders to work on vectors
* [FEAT] Make postprocessing a separate step
  Postprocessing is now a separate, configurable step similar to normalization, pretokenization or decoding.
* Revert "[FIX] Changed decode API to work on vectors instead of singular tokens"
  This reverts commit 08e1b399e4fafcecc78c1941b6331782f7d65469.
* [REFACTOR] Split loading function of HFTokenizer
* Revert tests as there's no longer vectorized decode api
* [FIX] Fix handling of unknown tokens in bpm
* [FIX] Added FuseDecoder implementation
* Fix python bindings
* [FIX] post_processor, remove silent fails
  This commit removes the BertProcessor and RobertaProcessor skeleton classes.
* chore: Add test cases
  This commit adds test cases for:
  - PieceToId logic
  - skip_special_tokens logic
  - PrependNormalizer
* chore: add python binding for batch decode
* chore: add post_processor to BUCK file
* chore: fix formatting in token_decoder.h, remove placeholder code in post_processor.h
* chore: change copyright handle to SWM
* feat: add tests requested in review
* chore: Unify logs in piece_to_id definitions
* chore: fix tests to ensure parity with rust implementation outputs
* chore: add python test for batch_decode

Differential Revision: D93019471
Force-pushed from bd2104d to 239d0bf
mergennachin approved these changes on Feb 19, 2026
Force-pushed from 239d0bf to e48284e
Force-pushed from e48284e to c54996a
So that we can fix a test failure
Summary:
[FEAT] Add prepend normalizer
This commit introduces a prepend normalizer, similar to Hugging Face's
Rust implementation.
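For readers unfamiliar with the idea, here is a minimal Python sketch of what a prepend normalizer does. The class name and method are hypothetical stand-ins, not the C++ class added in this change, and leaving empty input untouched is an assumption based on how the Hugging Face Rust normalizer behaves as far as I know.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PrependNormalizer:
    """Hypothetical illustration only; not the class from this PR."""

    prepend: str

    def normalize(self, text: str) -> str:
        # Assumption: empty input is returned unchanged.
        if not text:
            return text
        return self.prepend + text


# "\u2581" is the SentencePiece-style word-boundary marker often used here.
normalizer = PrependNormalizer(prepend="\u2581")
print(normalizer.normalize("Hello"))
```

A typical use is to make the first word of the input look like every other word that follows a space, so both map to the same vocabulary piece.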
[FEAT] Add "skip_special_tokens" parameter to the decode function
Added a decode function parameter to optionally skip decoding special
tokens, similar to the HF Rust implementation. This change should have
no effect unless the parameter is set to true.
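The effect of such a flag can be sketched in Python as follows. The vocabulary, the set of special ids, and the `decode` signature are all made up for illustration; in particular, this sketch decodes a whole sequence for brevity and does not mirror the signature of the decode function touched by this change.

```python
from typing import Dict, Iterable, Set

# Toy vocabulary; ids and pieces are invented for this example.
ID_TO_PIECE: Dict[int, str] = {0: "<s>", 1: "</s>", 2: "Hello", 3: " world"}
SPECIAL_IDS: Set[int] = {0, 1}


def decode(ids: Iterable[int], skip_special_tokens: bool = False) -> str:
    """Join pieces, optionally dropping ids registered as special tokens."""
    pieces = []
    for token_id in ids:
        if skip_special_tokens and token_id in SPECIAL_IDS:
            continue  # special tokens contribute no text when skipped
        pieces.append(ID_TO_PIECE[token_id])
    return "".join(pieces)


print(decode([0, 2, 3, 1]))
print(decode([0, 2, 3, 1], skip_special_tokens=True))
```

Because the flag defaults to false, callers that never pass it see the same output as before.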
[FEAT] Add function "piece_to_id"
This commit introduces a public member function that converts a string to
a token id. This function is the reverse of the already existing 'id_to_piece'.
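The relationship between the two lookups can be illustrated with a small Python sketch. Only the names `piece_to_id` and `id_to_piece` come from the text above; the toy vocabulary and the choice to return `None` for a missing entry are assumptions, and the real functions may report errors differently.

```python
from typing import Dict, Optional

# Toy vocabulary, invented for illustration.
ID_TO_PIECE: Dict[int, str] = {0: "<unk>", 1: "Hello", 2: "world"}
# The reverse table is derived once from the forward table.
PIECE_TO_ID: Dict[str, int] = {piece: i for i, piece in ID_TO_PIECE.items()}


def id_to_piece(token_id: int) -> Optional[str]:
    return ID_TO_PIECE.get(token_id)


def piece_to_id(piece: str) -> Optional[int]:
    # Assumption: an unknown piece yields None rather than raising.
    return PIECE_TO_ID.get(piece)


print(piece_to_id("world"))
```

For every id in the vocabulary, `piece_to_id(id_to_piece(i))` returns `i` again, which is the round-trip property the description refers to.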
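[FEAT] Add handling of null pretokenizer and bytefallback json fields
Added:
- Handling of pretokenizer field explicitly set to null
- Handling of bytefallback field along with encode logic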
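As background on byte fallback, the general technique is that a piece missing from the vocabulary is encoded as one token per UTF-8 byte, using byte tokens conventionally spelled `<0xNN>`. The Python sketch below shows that idea only; the function names and toy vocabulary are hypothetical, and it makes no claim about how the encode logic in this change is structured.

```python
from typing import Dict, List


def build_toy_vocab() -> Dict[str, int]:
    """Toy vocabulary: 256 byte tokens followed by one regular piece."""
    vocab = {f"<0x{byte:02X}>": byte for byte in range(256)}
    vocab["Hello"] = 256
    return vocab


def encode_piece(piece: str, vocab: Dict[str, int]) -> List[int]:
    # Known pieces map directly to their id.
    if piece in vocab:
        return [vocab[piece]]
    # Unknown pieces fall back to one byte token per UTF-8 byte.
    return [vocab[f"<0x{byte:02X}>"] for byte in piece.encode("utf-8")]


vocab = build_toy_vocab()
print(encode_piece("Hello", vocab))
print(encode_piece("\u00e9", vocab))  # two UTF-8 bytes: 0xC3 0xA9
```

The benefit is that no input ever has to collapse into a single unknown token, since any string can be expressed through its bytes.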
[FIX] Changed decode API to work on vectors instead of singular tokens
[REFACTOR] Changed tests to reflect new decode API
[FIX] Change decoders to work on vectors
[FEAT] Make postprocessing a separate step
Postprocessing is now a separate, configurable step similar to normalization, pretokenization or decoding.
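To show what a separate, configurable post-processing step means in a tokenizer pipeline, here is a small Python sketch. Every name in it (`encode`, `add_bos_eos`, the toy vocabulary) is hypothetical, and the stages are deliberately trivial; it only illustrates the shape of the pipeline, not this library's classes.

```python
from typing import Callable, Dict, List, Optional

# Toy vocabulary, invented for illustration.
VOCAB: Dict[str, int] = {"<unk>": 0, "<s>": 1, "</s>": 2, "hello": 3, "world": 4}

PostProcessor = Callable[[List[int]], List[int]]


def add_bos_eos(ids: List[int]) -> List[int]:
    """Example post-processor: wrap the ids in BOS and EOS tokens."""
    return [VOCAB["<s>"]] + ids + [VOCAB["</s>"]]


def encode(text: str, post_processor: Optional[PostProcessor] = None) -> List[int]:
    normalized = text.strip()                          # normalization
    words = normalized.split()                         # pretokenization
    ids = [VOCAB.get(w, VOCAB["<unk>"]) for w in words]  # model lookup
    # Post-processing runs last and only if one is configured.
    return post_processor(ids) if post_processor else ids


print(encode("  hello world "))
print(encode("  hello world ", post_processor=add_bos_eos))
```

Keeping the step pluggable means the same model and pretokenizer can serve configurations that differ only in which special tokens get added.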
Revert "[FIX] Changed decode API to work on vectors instead of singular tokens"
This reverts commit 08e1b399e4fafcecc78c1941b6331782f7d65469.
[REFACTOR] Split loading function of HFTokenizer
Revert tests as there's no longer vectorized decode api
[FIX] Fix handling of unknown tokens in bpm
[FIX] Added FuseDecoder implementation
Fix python bindings
[FIX] post_processor, remove silent fails
This commit removes the BertProcessor and RobertaProcessor skeleton
classes.
chore: Add test cases
This commit adds test cases for:
- PieceToId logic
- skip_special_tokens logic
- PrependNormalizer
chore: add python binding for batch decode
chore: add post_processor to BUCK file
chore: fix formatting in token_decoder.h, remove placeholder code in post_processor.h
chore: change copyright handle to SWM
feat: add tests requested in review
chore: Unify logs in piece_to_id definitions
chore: fix tests to ensure parity with rust implementation outputs
chore: add python test for batch_decode
Differential Revision: D93019471