
Enhance MythosTokenizer with new methods#30

Open
krataratha wants to merge 1 commit into kyegomez:main from krataratha:patch-1

Conversation

@krataratha

Added methods for encoding, decoding, token counting, batch encoding, retrieving special tokens, and checking token limits.

Copilot AI review requested due to automatic review settings April 21, 2026 14:41

Copilot AI left a comment


Pull request overview

This PR aims to expand MythosTokenizer (a HuggingFace AutoTokenizer wrapper) with additional convenience APIs for encoding/decoding, token counting, batch encoding, special-token introspection, and token-limit checks.

Changes:

  • Adds utility methods: token_count, batch_encode, get_special_tokens, and is_within_limit.
  • Introduces new encode/decode wrappers near the top of the class (currently duplicating existing ones later in the file).
  • Refactors/removes portions of the earlier docstring content (but leaves behind malformed text in the implementation).
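
Consolidated, the wrapper the review asks for might look like the sketch below. The `_StubTokenizer` is a hypothetical stand-in for a HuggingFace `AutoTokenizer` so the example is self-contained; the real class would wrap `AutoTokenizer.from_pretrained(...)` instead, and `batch_encode` is left out here because it depends on `return_tensors="pt"` (torch).

```python
class _StubTokenizer:
    """Minimal stand-in for a HuggingFace tokenizer (illustration only)."""
    special_tokens_map = {"bos_token": "<s>", "eos_token": "</s>"}
    vocab_size = 10

    def encode(self, text, add_special_tokens=True):
        ids = [len(w) for w in text.split()]
        return [0] + ids + [1] if add_special_tokens else ids

    def decode(self, token_ids):
        return " ".join(str(i) for i in token_ids)


class MythosTokenizer:
    def __init__(self, tokenizer=None):
        # In the real project this would be AutoTokenizer.from_pretrained(...)
        self.tokenizer = tokenizer or _StubTokenizer()

    # A single encode/decode pair; everything else goes through these.
    def encode(self, text: str, add_special_tokens: bool = True):
        return self.tokenizer.encode(text, add_special_tokens=add_special_tokens)

    def decode(self, token_ids):
        return self.tokenizer.decode(token_ids)

    def token_count(self, text: str) -> int:
        """Count tokens via the same path as encode(), per the review."""
        return len(self.encode(text))

    def get_special_tokens(self):
        return self.tokenizer.special_tokens_map

    def is_within_limit(self, text: str, max_tokens: int) -> bool:
        return self.token_count(text) <= max_tokens

    @property
    def vocab_size(self) -> int:
        return self.tokenizer.vocab_size
```

Routing `token_count` and `is_within_limit` through the single `encode` wrapper means any future change to encoding defaults affects all of them consistently.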


Comment thread open_mythos/tokenizer.py
Comment on lines +18 to +22
def encode(self, text: str):
return self.tokenizer.encode(text)

def decode(self, token_ids):
return self.tokenizer.decode(token_ids)

Copilot AI Apr 21, 2026


This file now defines encode and decode twice; the later definitions override the earlier ones, so the newly-added encode/decode methods here are dead code and can confuse behavior (especially since their defaults differ from the later versions). Keep only one pair of encode/decode wrappers and make the other methods call those wrappers for consistency.

Comment thread open_mythos/tokenizer.py
# ✅ New methods added
def token_count(self, text: str) -> int:
"""Return number of tokens in text."""
return len(self.tokenizer.encode(text))

Copilot AI Apr 21, 2026


token_count calls self.tokenizer.encode(text) directly, which may count tokens differently than your wrapper encode method (e.g., add_special_tokens defaults). To avoid inconsistent behavior across the wrapper API, have token_count compute the length using the same encoding path/options as encode.

Suggested change
return len(self.tokenizer.encode(text))
return len(self.encode(text))
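
To see why this matters, a toy illustration (the defaults are hypothetical: here the wrapper's encode skips special tokens while the raw tokenizer adds them, mirroring how divergent defaults would skew counts):

```python
class RawTok:
    """Toy tokenizer: adds BOS/EOS by default, like many HF tokenizers."""
    def encode(self, text, add_special_tokens=True):
        ids = list(range(len(text.split())))
        return [101] + ids + [102] if add_special_tokens else ids

raw = RawTok()

# Suppose the wrapper's encode() were defined without special tokens:
def wrapper_encode(text):
    return raw.encode(text, add_special_tokens=False)

text = "one two three"
print(len(raw.encode(text)))      # 5: raw count includes the special tokens
print(len(wrapper_encode(text)))  # 3: the count callers of the wrapper see
```

Counting via `self.encode` instead of `self.tokenizer.encode` keeps `token_count` aligned with whatever the wrapper's callers actually get.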

Comment thread open_mythos/tokenizer.py
Comment on lines +25 to +44
def token_count(self, text: str) -> int:
"""Return number of tokens in text."""
return len(self.tokenizer.encode(text))

def batch_encode(self, texts: list[str], padding: bool = True, truncation: bool = True):
"""Encode multiple texts at once."""
return self.tokenizer(
texts,
padding=padding,
truncation=truncation,
return_tensors="pt"
)

def get_special_tokens(self):
"""Return special tokens used by tokenizer."""
return self.tokenizer.special_tokens_map

def is_within_limit(self, text: str, max_tokens: int) -> bool:
"""Check if text fits within a token limit."""
return self.token_count(text) <= max_tokens Returns:

Copilot AI Apr 21, 2026


New public methods (token_count, batch_encode, get_special_tokens, is_within_limit) are added to MythosTokenizer, but the existing tokenizer tests cover none of these new behaviors. Add tests validating token counting and limit checks, and verifying that batch_encode returns the expected shapes/types (and respects padding/truncation), to prevent regressions.
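
A minimal test sketch along those lines (the fixture and names are hypothetical; a real test would instantiate the project's actual MythosTokenizer, and batch_encode tests would additionally check tensor shapes, which requires torch):

```python
def make_tokenizer():
    # Placeholder fixture: the real test would construct MythosTokenizer here.
    class Tok:
        special_tokens_map = {"eos_token": "</s>"}
        def encode(self, text):
            return list(range(len(text.split())))
    class Wrapper:
        def __init__(self):
            self.tokenizer = Tok()
        def token_count(self, text):
            return len(self.tokenizer.encode(text))
        def is_within_limit(self, text, max_tokens):
            return self.token_count(text) <= max_tokens
        def get_special_tokens(self):
            return self.tokenizer.special_tokens_map
    return Wrapper()

def test_token_count_and_limit():
    tok = make_tokenizer()
    assert tok.token_count("a b c") == 3
    assert tok.is_within_limit("a b c", 3)
    assert not tok.is_within_limit("a b c", 2)

def test_special_tokens_is_mapping():
    tok = make_tokenizer()
    assert "eos_token" in tok.get_special_tokens()

test_token_count_and_limit()
test_special_tokens_is_mapping()
```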

Comment thread open_mythos/tokenizer.py
Comment on lines +44 to 47
return self.token_count(text) <= max_tokens Returns:
int: The number of unique tokens in the tokenizer vocabulary.
"""
return self.tokenizer.vocab_size

Copilot AI Apr 21, 2026


is_within_limit currently contains stray docstring text (Returns: …) on the same line as the return statement, which will raise a SyntaxError on import and also leaves a duplicate return self.tokenizer.vocab_size block inside this method. Remove the stray docstring fragment and ensure is_within_limit only returns the boolean check (and that vocab_size logic remains only in the vocab_size property).

Suggested change
return self.token_count(text) <= max_tokens Returns:
int: The number of unique tokens in the tokenizer vocabulary.
"""
return self.tokenizer.vocab_size
return self.token_count(text) <= max_tokens
