fix: add disallowed_special on tiktoken encode #2102

Open: 1 commit to merge into base: main

Conversation

questcollector

When generating a testset, the transforms count tokens using tiktoken.
If a document includes special tokens like '', tiktoken raises an error.
To prevent this, this PR adds the parameter "disallowed_special=()" to each usage of tiktoken encode.

@dosubot dosubot bot added the size:XS This PR changes 0-9 lines, ignoring generated files. label Jul 4, 2025
Contributor

@greptile-apps greptile-apps bot left a comment

PR Summary

Fixes tiktoken encoding errors during testset generation by passing disallowed_special=() to encode, so that text containing special tokens like '' no longer raises an error.

  • Modified ragas/src/ragas/testset/transforms/base.py to handle special tokens in LLMBasedExtractor.split_text_by_token_limit()
  • Updated ragas/src/ragas/utils.py to add special token handling in num_tokens_from_string() utility
  • Both changes prevent tiktoken from raising errors when encountering special tokens in documents
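As a rough illustration of the first change, here is a minimal sketch of splitting text by a token limit while passing disallowed_special=(). It is not the actual LLMBasedExtractor.split_text_by_token_limit() implementation; the function name split_by_token_limit and its parameters are hypothetical, and encoding is assumed to be any object with tiktoken-style encode and decode methods.

```python
from typing import Any, List


def split_by_token_limit(text: str, limit: int, encoding: Any) -> List[str]:
    """Split `text` into chunks of at most `limit` tokens each.

    Hypothetical helper for illustration only; `encoding` is assumed to
    expose tiktoken-style `encode` and `decode` methods.
    """
    if limit <= 0:
        raise ValueError("limit must be positive")
    # tiktoken's encode() defaults to disallowed_special="all" and raises
    # ValueError when the text contains a special-token string. An empty
    # tuple turns that check off, so such strings are encoded as plain text.
    tokens = encoding.encode(text, disallowed_special=())
    # Slice the token list into windows of `limit` tokens and decode each one.
    return [
        encoding.decode(tokens[start:start + limit])
        for start in range(0, len(tokens), limit)
    ]
```

With tiktoken installed, this could be called as split_by_token_limit(document, 512, tiktoken.get_encoding("cl100k_base")).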

2 files reviewed, 1 comment

Comment on lines 225 to 229
 def num_tokens_from_string(string: str, encoding_name: str = "cl100k_base") -> int:
     """Returns the number of tokens in a text string."""
     encoding = tiktoken.get_encoding(encoding_name)
-    num_tokens = len(encoding.encode(string))
+    num_tokens = len(encoding.encode(string, disallowed_special=()))
     return num_tokens

style: Add docstring explaining the disallowed_special parameter and why it's set to empty tuple. This helps future maintainers understand the reasoning behind this configuration.
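One possible shape for such a docstring is sketched below. It is a suggestion, not the code in this PR: the optional encoding parameter is a hypothetical addition (not part of the PR's function) that lets the function be exercised without loading a real tiktoken encoding.

```python
from typing import Any, Optional


def num_tokens_from_string(
    string: str,
    encoding_name: str = "cl100k_base",
    encoding: Optional[Any] = None,  # hypothetical parameter, not in the PR
) -> int:
    """Returns the number of tokens in a text string.

    tiktoken's ``encode`` defaults to ``disallowed_special="all"``, which
    raises ``ValueError`` if the text contains the string form of a special
    token. Documents used for testset generation may contain such strings,
    so ``disallowed_special=()`` is passed to disable the check; the strings
    are then tokenized as ordinary text.
    """
    if encoding is None:
        # Imported lazily so the sketch can run with a stub encoding.
        import tiktoken

        encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(string, disallowed_special=()))
```

The trade-off of this setting is that special-token strings are counted as several ordinary tokens rather than one special token, which is usually what is wanted when the text is untrusted document content.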

Member

@claude add this comment to explain it

2 participants