Bumping tokenizers pin to 1c432479f6b29ce8defc5d2e83375be9238fe1bf #17569

Open
larryliu0820 wants to merge 1 commit into main from export-D93019471

@larryliu0820 larryliu0820 commented Feb 19, 2026

So that we can fix a test failure

Summary:

  • [FEAT] Add prepend normalizer

This commit introduces a prepend normalizer, similar to Hugging Face's
Rust implementation.
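
As a rough illustration of what a prepend normalizer does, here is a minimal Python sketch. The function name `prepend_normalize`, the default marker, and the choice to leave empty input untouched are assumptions made for this example, not the API added by the commit.

```python
def prepend_normalize(text: str, prepend: str = "\u2581") -> str:
    """Return `text` with `prepend` added to the front.

    Assumption for this sketch: empty input is returned unchanged,
    so normalizing "" never produces a lone marker.
    """
    if not text:
        return text
    return prepend + text


if __name__ == "__main__":
    # "\u2581" is the SentencePiece-style word-boundary marker.
    print(prepend_normalize("Hello"))
```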

  • [FEAT] Add "skip_special_tokens" parameter to the decode function

Adds a parameter to the decode function to optionally skip decoding
special tokens, similar to the HF Rust implementation. Behavior is
unchanged unless the parameter is set to true.
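
The intended behavior can be sketched as follows. The toy vocabulary, the `SPECIAL_IDS` set, and the `decode` signature are hypothetical and exist only for this example; they are not the library's actual interface.

```python
# Toy vocabulary: ids 0 and 1 are treated as special tokens (assumption).
VOCAB = {0: "<s>", 1: "</s>", 2: "Hello", 3: " world"}
SPECIAL_IDS = {0, 1}


def decode(ids, skip_special_tokens=False):
    """Join the pieces for `ids`, optionally dropping special tokens."""
    pieces = []
    for token_id in ids:
        if skip_special_tokens and token_id in SPECIAL_IDS:
            continue  # special token omitted from the output text
        pieces.append(VOCAB[token_id])
    return "".join(pieces)


if __name__ == "__main__":
    print(decode([0, 2, 3, 1]))
    print(decode([0, 2, 3, 1], skip_special_tokens=True))
```

With the default of `False`, the output is the same as before the parameter existed, which is the backward-compatibility property the commit message describes.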

  • [FEAT] Add function "piece_to_id"

This commit introduces a public member function that converts a string
to a token id. It is the reverse of the existing 'id_to_piece'.
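
A minimal sketch of the round trip between the two lookups, assuming a toy vocabulary. Returning `None` for an unknown piece or an out-of-range id is a choice made for this example; the commit message does not say how the real function reports a miss.

```python
from typing import Optional

# Toy vocabulary; the index in the list is the token id (assumption).
ID_TO_PIECE = ["<unk>", "Hello", "world"]
PIECE_TO_ID = {piece: idx for idx, piece in enumerate(ID_TO_PIECE)}


def id_to_piece(token_id: int) -> Optional[str]:
    """Map a token id to its string piece, or None if out of range."""
    if 0 <= token_id < len(ID_TO_PIECE):
        return ID_TO_PIECE[token_id]
    return None


def piece_to_id(piece: str) -> Optional[int]:
    """Map a string piece to its token id, or None if not in the vocab."""
    return PIECE_TO_ID.get(piece)


if __name__ == "__main__":
    print(piece_to_id("world"))
    print(id_to_piece(piece_to_id("world")))
```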

  • [FEAT] Add handling of null pretokenizer and bytefallback json fields

Added:

  • Handling of pretokenizer field explicitly set to null
  • Handling of bytefallback field along with encode logic
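
The two cases above can be sketched roughly as below. The JSON keys follow the Hugging Face `tokenizer.json` layout as I understand it (`pre_tokenizer` at the top level, `byte_fallback` under `model`); the helper names, the toy vocabulary, and the `<0xNN>` id layout are assumptions for illustration only.

```python
import json

# Config with an explicit null pre-tokenizer and byte fallback enabled.
CONFIG = json.loads('{"pre_tokenizer": null, "model": {"byte_fallback": true}}')

# Toy vocabulary: two letters, <unk>, then one token per byte value.
VOCAB = {"a": 0, "b": 1, "<unk>": 2}
VOCAB.update({f"<0x{b:02X}>": 3 + b for b in range(256)})


def pre_tokenize(text, config):
    """Explicit null parses to None: pass the text through as one piece."""
    if config.get("pre_tokenizer") is None:
        return [text]
    raise NotImplementedError("only the null case is sketched here")


def encode_symbol(symbol, byte_fallback):
    """Encode one symbol, falling back to <0xNN> byte tokens if unknown."""
    if symbol in VOCAB:
        return [VOCAB[symbol]]
    if byte_fallback:
        byte_tokens = [f"<0x{b:02X}>" for b in symbol.encode("utf-8")]
        if all(tok in VOCAB for tok in byte_tokens):
            return [VOCAB[tok] for tok in byte_tokens]
    return [VOCAB["<unk>"]]


if __name__ == "__main__":
    fallback = CONFIG["model"]["byte_fallback"]
    print(pre_tokenize("ab", CONFIG))
    print(encode_symbol("\u00e9", fallback))  # two UTF-8 bytes: C3 A9
```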

  • [FIX] Changed decode API to work on vectors instead of singular tokens

  • [REFACTOR] Changed tests to reflect new decode API

  • [FIX] Change decoders to work on vectors

  • [FEAT] Make postprocessing a separate step

Postprocessing is now a separate, configurable step, similar to normalization, pretokenization, or decoding.
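
To make the staging concrete, here is a hedged sketch of an encode pipeline in which postprocessing is its own pluggable stage. All names (`run_pipeline`, `add_bos_eos`, the toy vocabulary and ids) are hypothetical and do not reflect the classes in the actual change.

```python
from typing import Callable, List

TOY_VOCAB = {"hello": 10, "world": 11}
BOS_ID, EOS_ID = 1, 2  # assumed special-token ids for this example


def add_bos_eos(ids: List[int]) -> List[int]:
    """Example post-processor: wrap the ids in BOS and EOS."""
    return [BOS_ID] + ids + [EOS_ID]


def run_pipeline(
    text: str,
    normalizer: Callable[[str], str],
    pre_tokenizer: Callable[[str], List[str]],
    model: Callable[[str], List[int]],
    post_processor: Callable[[List[int]], List[int]],
) -> List[int]:
    """Run the four stages in order; each one can be swapped independently."""
    normalized = normalizer(text)
    ids: List[int] = []
    for piece in pre_tokenizer(normalized):
        ids.extend(model(piece))
    return post_processor(ids)


if __name__ == "__main__":
    ids = run_pipeline(
        "Hello World",
        normalizer=str.lower,
        pre_tokenizer=str.split,
        model=lambda piece: [TOY_VOCAB[piece]],
        post_processor=add_bos_eos,
    )
    print(ids)
```

Keeping the post-processor as a separate callable means a tokenizer without special-token wrapping can pass an identity function instead of changing the model stage.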

  • Revert "[FIX] Changed decode API to work on vectors instead of singular tokens"

This reverts commit 08e1b399e4fafcecc78c1941b6331782f7d65469.

  • [REFACTOR] Split loading function of HFTokenizer

  • Revert tests as there's no longer vectorized decode api

  • [FIX] Fix handling of unknown tokens in bpm

  • [FIX] Added FuseDecoder implementation

  • Fix python bindings

  • [FIX] post_processor, remove silent fails

This commit removes the BertProcessor and RobertaProcessor skeleton
classes.

  • chore: Add test cases

This commit adds test cases for:

  • PieceToId logic
  • skip_special_tokens logic
  • PrependNormalizer

  • chore: add python binding for batch decode

  • chore: add post_processor to BUCK file

  • chore: fix formatting in token_decoder.h, remove placeholder code in post_processor.h

  • chore: change copyright handle to SWM

  • feat: add tests requested in review

  • chore: Unify logs in piece_to_id definitions

  • chore: fix tests to ensure parity with Rust implementation outputs

  • chore: add python test for batch_decode

Differential Revision: D93019471

pytorch-bot bot commented Feb 19, 2026

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17569

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit c54996a with merge base a398a96:

NEW FAILURES - The following jobs have failed:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@meta-cla meta-cla bot added the "CLA Signed" label Feb 19, 2026

meta-codesync bot commented Feb 19, 2026

@larryliu0820 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93019471.

@larryliu0820 larryliu0820 added the "release notes: llm" label Feb 19, 2026
larryliu0820 added a commit that referenced this pull request Feb 19, 2026
@larryliu0820 larryliu0820 changed the title from "Add missing HF functionalities (#170)" to "Bumping tokenizers pin to 1c432479f6b29ce8defc5d2e83375be9238fe1bf" Feb 19, 2026
larryliu0820 added a commit that referenced this pull request Feb 19, 2026
Labels

  • CLA Signed (This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed.)
  • fb-exported
  • meta-exported
  • release notes: llm (Changes to llm utilities)
