Gemma sp tokenizer #583

Open
aarnav-11 wants to merge 5 commits into cactus-compute:main from aarnav-11:gemma-sp-tokenizer

@aarnav-11

Closes: #577

  • Resolved the ambiguity between plain token-text lines and token-tab-score lines by introducing a new vocab format that distinguishes the two
  • Removed exception-driven float parsing
  • Updated the converter to match the new vocab format while remaining backward compatible with plain token text
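
For reviewers, here is a minimal sketch of what non-throwing score parsing can look like. It is not the code in this PR: `parse_score`, `parse_vocab_line`, and `VocabEntry` are hypothetical names, and the line layout (`token<TAB>score`, with a fallback to plain token text) is an assumption made for illustration rather than the new vocab format itself. It uses `std::strtof` with an end-pointer check in place of `std::stof`, so a malformed score is reported as "no score" rather than as an exception.

```cpp
#include <cctype>
#include <cerrno>
#include <cstddef>
#include <cstdlib>
#include <optional>
#include <string>

// Hypothetical helper: parse the whole string as a float without throwing.
// Returns std::nullopt if the text is empty, starts with whitespace,
// has trailing characters, or is out of range.
std::optional<float> parse_score(const std::string& text) {
    if (text.empty() || std::isspace(static_cast<unsigned char>(text[0]))) {
        return std::nullopt;
    }
    errno = 0;
    char* end = nullptr;
    const float value = std::strtof(text.c_str(), &end);
    if (end != text.c_str() + text.size() || errno == ERANGE) {
        return std::nullopt;
    }
    return value;
}

// Hypothetical entry type: a token with an optional score.
struct VocabEntry {
    std::string token;
    std::optional<float> score;
};

// Assumed layout: "token<TAB>score", split on the LAST tab so that a token
// which itself contains a tab keeps it. Anything that does not end in a
// valid score is treated as plain token text (the legacy format).
VocabEntry parse_vocab_line(const std::string& line) {
    const std::size_t tab = line.rfind('\t');
    if (tab != std::string::npos && tab > 0) {
        if (auto score = parse_score(line.substr(tab + 1))) {
            return {line.substr(0, tab), score};
        }
    }
    return {line, std::nullopt};
}
```

Under these assumptions, a line holding only a tab is returned as the plain token `"\t"` with no score, which is the case that made `std::stof` throw in the linked issue. Splitting on the last tab still cannot tell a plain token that happens to end in `<TAB>number` apart from a scored entry, which is why an explicit format marker, as this PR introduces, is the more robust fix.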

@aarnav-11 aarnav-11 marked this pull request as draft April 15, 2026 02:28
@aarnav-11 aarnav-11 marked this pull request as ready for review April 15, 2026 02:29
@aarnav-11 aarnav-11 marked this pull request as draft April 15, 2026 02:29
@aarnav-11 aarnav-11 marked this pull request as ready for review April 15, 2026 02:34
@HenryNdubuaku HenryNdubuaku requested a review from ParkiratS April 15, 2026 03:52
@aarnav-11 aarnav-11 closed this Apr 21, 2026
@aarnav-11 aarnav-11 deleted the gemma-sp-tokenizer branch April 21, 2026 05:14
@aarnav-11 aarnav-11 restored the gemma-sp-tokenizer branch April 21, 2026 05:14
@aarnav-11 aarnav-11 reopened this Apr 21, 2026

Development

Successfully merging this pull request may close these issues.

f8afc46 regresses Gemma tokenizer loading: tab tokens in vocab.txt cause std::stof exception during init

2 participants