Bumping tokenizers pin to 1c432479f6b29ce8defc5d2e83375be9238fe1bf by larryliu0820 · Pull Request #17569 · pytorch/executorch

larryliu0820 · 2026-02-19T20:31:13Z

So that we can fix a test failure

Summary:

[FEAT] Add prepend normalizer

This commit introduces a prepend normalizer, similar to Hugging Face's
Rust implementation.
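
As a rough illustration of what a prepend normalizer does, here is a minimal Python sketch. It is not the code added by this commit; the function name `prepend_normalize` and the default `"▁"` marker are hypothetical, and the empty-input handling is an assumption based on how the Hugging Face normalizer is understood to behave.

```python
def prepend_normalize(text: str, prepend: str = "▁") -> str:
    """Return `text` with `prepend` placed in front of it.

    Assumption: empty input is returned unchanged, so that an empty
    string does not turn into a lone marker.
    """
    if not text:
        return text
    return prepend + text


if __name__ == "__main__":
    print(prepend_normalize("hello"))
```

Such a normalizer is typically paired with a decoder that strips the marker again, so the text round-trips.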

[FEAT] Add "skip_special_tokens" parameter to the decode function

Added a decode function parameter to optionally skip decoding special
tokens, similar to the HF Rust implementation. This change should not
affect existing behavior unless the parameter is set to true.
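
The intended effect of the flag can be sketched with a toy decoder. Everything here (the `decode` signature, the vocabulary, and the set of special ids) is made up for illustration and is not the library's API.

```python
from typing import Dict, Iterable, Set


def decode(
    ids: Iterable[int],
    id_to_piece: Dict[int, str],
    special_ids: Set[int],
    skip_special_tokens: bool = False,
) -> str:
    """Concatenate the pieces for `ids`, optionally dropping special tokens."""
    pieces = []
    for token_id in ids:
        # With the default (False) every token is decoded, as before.
        if skip_special_tokens and token_id in special_ids:
            continue
        pieces.append(id_to_piece[token_id])
    return "".join(pieces)


# Hypothetical vocabulary: ids 0 and 1 are special tokens.
VOCAB = {0: "<s>", 1: "</s>", 2: "Hello", 3: " world"}
SPECIAL = {0, 1}

if __name__ == "__main__":
    print(decode([0, 2, 3, 1], VOCAB, SPECIAL))
    print(decode([0, 2, 3, 1], VOCAB, SPECIAL, skip_special_tokens=True))
```

Defaulting the flag to false is what keeps existing callers unaffected.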

[FEAT] Add function "piece_to_id"

This commit introduces a public member function that converts a string to
a token id. This function is the reverse of the already existing 'id_to_piece'.
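
The relationship between the two lookups can be sketched as a pair of inverse maps. The `ToyVocab` class below is hypothetical; in particular, returning `None` for an unknown piece is an assumption, and the actual function may report errors differently.

```python
from typing import List, Optional


class ToyVocab:
    """A toy vocabulary exposing both lookup directions."""

    def __init__(self, pieces: List[str]) -> None:
        self._id_to_piece = list(pieces)
        # Build the reverse index once so piece_to_id is a dict lookup.
        self._piece_to_id = {p: i for i, p in enumerate(self._id_to_piece)}

    def id_to_piece(self, token_id: int) -> str:
        return self._id_to_piece[token_id]

    def piece_to_id(self, piece: str) -> Optional[int]:
        # Assumption: unknown pieces yield None rather than raising.
        return self._piece_to_id.get(piece)


if __name__ == "__main__":
    vocab = ToyVocab(["<unk>", "hello", "world"])
    print(vocab.piece_to_id("world"), vocab.id_to_piece(2))
```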

[FEAT] Add handling of null pretokenizer and bytefallback json fields

Added:

- Handling of pretokenizer field explicitly set to null
- Handling of bytefallback field along with encode logic
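
For readers unfamiliar with byte fallback, the general idea can be sketched as follows: a piece missing from the vocabulary is split into its UTF-8 bytes, each looked up as a `<0xNN>` token, and the unknown token is used only if those byte tokens are missing too. This is a sketch of the common convention, not the encode logic from this commit; `encode_piece` and the toy vocabulary are hypothetical.

```python
from typing import Dict, List


def byte_fallback_pieces(piece: str) -> List[str]:
    """Spell a piece as one `<0xNN>` token per UTF-8 byte."""
    return [f"<0x{b:02X}>" for b in piece.encode("utf-8")]


def encode_piece(piece: str, vocab: Dict[str, int], unk_id: int) -> List[int]:
    """Look up a piece, falling back to byte tokens, then to unk."""
    if piece in vocab:
        return [vocab[piece]]
    fallback = byte_fallback_pieces(piece)
    if all(p in vocab for p in fallback):
        return [vocab[p] for p in fallback]
    return [unk_id]


# Hypothetical vocabulary containing only the two bytes of "é".
TOY_VOCAB = {"<unk>": 0, "hello": 1, "<0xC3>": 2, "<0xA9>": 3}

if __name__ == "__main__":
    print(encode_piece("é", TOY_VOCAB, unk_id=0))
```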

[FIX] Changed decode API to work on vectors instead of singular tokens
[REFACTOR] Changed tests to reflect new decode API
[FIX] Change decoders to work on vectors
[FEAT] Make postprocessing a separate step

Postprocessing is now a separate, configurable step, similar to normalization, pretokenization, or decoding.
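
One way to picture postprocessing as its own pluggable stage is the toy pipeline below. The stage names, the `encode` helper, and the BOS/EOS post-processor are all hypothetical stand-ins, not the classes introduced by this commit.

```python
from typing import Callable, List, Optional

PostProcessor = Callable[[List[int]], List[int]]


def make_bos_eos_post_processor(bos_id: int, eos_id: int) -> PostProcessor:
    """Build a post-processor that wraps ids with BOS and EOS markers."""
    def post_process(ids: List[int]) -> List[int]:
        return [bos_id] + ids + [eos_id]
    return post_process


def encode(
    text: str,
    normalize: Callable[[str], str],
    pre_tokenize: Callable[[str], List[str]],
    model: Callable[[str], List[int]],
    post_process: Optional[PostProcessor] = None,
) -> List[int]:
    """Run the stages in order; postprocessing is optional and swappable."""
    ids: List[int] = []
    for word in pre_tokenize(normalize(text)):
        ids.extend(model(word))
    return post_process(ids) if post_process is not None else ids


if __name__ == "__main__":
    toy_vocab = {"hello": 2, "world": 3}
    ids = encode(
        "Hello World",
        normalize=str.lower,
        pre_tokenize=str.split,
        model=lambda word: [toy_vocab[word]],
        post_process=make_bos_eos_post_processor(bos_id=0, eos_id=1),
    )
    print(ids)
```

Keeping the stage separate means the same model can be combined with different templates without touching the encode logic.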

Revert "[FIX] Changed decode API to work on vectors instead of singular tokens"

This reverts commit 08e1b399e4fafcecc78c1941b6331782f7d65469.

[REFACTOR] Split loading function of HFTokenizer
Revert tests as there's no longer vectorized decode api
[FIX] Fix handling of unknown tokens in bpm
[FIX] Added FuseDecoder implementation
Fix python bindings
[FIX] post_processor, remove silent fails

This commit removes the BertProcessor and RobertaProcessor skeleton
classes.

chore: Add test cases

This commit adds test cases for:

- PieceToId logic
- skip_special_tokens logic
- PrependNormalizer

chore: add python binding for batch decode
chore: add post_processor to BUCK file
chore: fix formatting in token_decoder.h, remove placeholder code in post_processor.h
chore: change copyright handle to SWM
feat: add tests requested in review
chore: Unify logs in piece_to_id definitions
chore: fix tests to ensure parity with Rust implementation outputs
chore: add python test for batch_decode

Differential Revision: D93019471

pytorch-bot · 2026-02-19T20:31:17Z

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/executorch/17569

📄 Preview Python docs built from this PR

Note: Links to docs will display an error until the docs builds have been completed.

❌ 5 New Failures

As of commit c54996a with merge base a398a96:

NEW FAILURES - The following jobs have failed:

pull / test-mediatek-models-linux / linux-job (gh)
RuntimeError: Command docker exec -t 72c0eca1376ed93fb21fd17373083788c014d0ce94715dbc899af97ba4283247 /exec failed with exit code 2
pull / test-openvino-linux / linux-job (gh)
RuntimeError: Command docker exec -t da632bbb5c0fe0aec96429cea86fca91311e3f6d665819c82d02d23b7b3d3f15 /exec failed with exit code 1
pull / unittest-arm-backend-with-no-deps (test_pytest_ops_tosa) / linux-job (gh)
RuntimeError: Command docker exec -t 60e7546af6271fa381fbc5b3a16ceaa14e7ea5b609e8645e374317d3e381b851 /exec failed with exit code 1
pull / unittest-buck / linux / linux-job (gh)
RuntimeError: Command docker exec -t 8b261674537e937e6053ae0960efb7f6153d4ea4be5249cc23535feaa4a36214 /exec failed with exit code 3
pull / unittest-buck / macos / macos-job (gh)
RuntimeError: Command bash /Users/ec2-user/runner/_work/_temp/exec_script failed with exit code 3

This comment was automatically generated by Dr. CI and updates every 15 minutes.

meta-codesync · 2026-02-19T20:31:22Z

@larryliu0820 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D93019471.


larryliu0820 requested a review from mergennachin as a code owner February 19, 2026 20:31

meta-cla bot added the CLA Signed label Feb 19, 2026

meta-codesync bot added fb-exported meta-exported labels Feb 19, 2026

larryliu0820 added the release notes: llm label Feb 19, 2026

larryliu0820 force-pushed the export-D93019471 branch from bd2104d to 239d0bf Compare February 19, 2026 20:35

larryliu0820 changed the title from "Add missing HF functionalities (#170)" to "Bumping tokenizers pin to 1c432479f6b29ce8defc5d2e83375be9238fe1bf" Feb 19, 2026

mergennachin approved these changes Feb 19, 2026

larryliu0820 force-pushed the export-D93019471 branch from 239d0bf to e48284e Compare February 19, 2026 23:25

larryliu0820 force-pushed the export-D93019471 branch from e48284e to c54996a Compare February 19, 2026 23:37

larryliu0820 wants to merge 1 commit into main from export-D93019471
