Enhance MythosTokenizer with new methods #30
Added methods for encoding, decoding, token counting, batch encoding, retrieving special tokens, and checking token limits.
Pull request overview
This PR aims to expand MythosTokenizer (a HuggingFace AutoTokenizer wrapper) with additional convenience APIs for encoding/decoding, token counting, batch encoding, special-token introspection, and token-limit checks.
Changes:
- Adds utility methods: `token_count`, `batch_encode`, `get_special_tokens`, and `is_within_limit`.
- Introduces new `encode`/`decode` wrappers near the top of the class (currently duplicating existing ones later in the file).
- Refactors/removes portions of the earlier docstring content (but leaves behind malformed text in the implementation).
```python
def encode(self, text: str):
    return self.tokenizer.encode(text)

def decode(self, token_ids):
    return self.tokenizer.decode(token_ids)
```
This file now defines `encode` and `decode` twice; the later definitions override the earlier ones, so the newly added `encode`/`decode` methods here are dead code and can confuse behavior (especially since their defaults differ from the later versions). Keep only one pair of `encode`/`decode` wrappers and make the other methods call those wrappers for consistency.
```python
# ✅ New methods added
def token_count(self, text: str) -> int:
    """Return number of tokens in text."""
    return len(self.tokenizer.encode(text))
```
`token_count` calls `self.tokenizer.encode(text)` directly, which may count tokens differently than your wrapper `encode` method (e.g., `add_special_tokens` defaults). To avoid inconsistent behavior across the wrapper API, have `token_count` compute the length using the same encoding path/options as `encode`.
Suggested change:
```diff
-    return len(self.tokenizer.encode(text))
+    return len(self.encode(text))
```
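To make the mismatch concrete, here is an illustrative example with a stand-in tokenizer (the wrapper's no-special-tokens default is hypothetical, chosen only to show how bypassing the wrapper skews counts):

```python
class _FakeTokenizer:
    """Stand-in tokenizer: adds BOS/EOS only when add_special_tokens=True."""

    def encode(self, text, add_special_tokens=True):
        ids = [ord(c) for c in text]
        return [0] + ids + [1] if add_special_tokens else ids


class Wrapper:
    def __init__(self):
        self.tokenizer = _FakeTokenizer()

    def encode(self, text):
        # Hypothetical wrapper default: no special tokens.
        return self.tokenizer.encode(text, add_special_tokens=False)


w = Wrapper()
# Bypassing the wrapper counts BOS/EOS tokens the wrapper would omit:
direct = len(w.tokenizer.encode("abc"))  # 5 (includes BOS/EOS)
via_wrapper = len(w.encode("abc"))       # 3
```

A caller budgeting against `via_wrapper`-style counts would overflow a model limit if `token_count` silently used the `direct` path, which is exactly why the review asks both to share one encoding path.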
```python
def token_count(self, text: str) -> int:
    """Return number of tokens in text."""
    return len(self.tokenizer.encode(text))

def batch_encode(self, texts: list[str], padding: bool = True, truncation: bool = True):
    """Encode multiple texts at once."""
    return self.tokenizer(
        texts,
        padding=padding,
        truncation=truncation,
        return_tensors="pt"
    )

def get_special_tokens(self):
    """Return special tokens used by tokenizer."""
    return self.tokenizer.special_tokens_map

def is_within_limit(self, text: str, max_tokens: int) -> bool:
    """Check if text fits within a token limit."""
    return self.token_count(text) <= max_tokens Returns:
```
New public methods (`token_count`, `batch_encode`, `get_special_tokens`, `is_within_limit`) are added to MythosTokenizer, but the existing tokenizer tests cover none of these new behaviors. Add tests validating token counting/limits and that `batch_encode` returns the expected shapes/types (and respects padding/truncation), to prevent regressions.
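A sketch of what such tests could look like. A stand-in tokenizer and a minimal `MythosTokenizer` mirroring the PR's public methods are included so the sketch runs without downloading a model; names and signatures are assumptions from the diff, and `return_tensors` handling is simplified to avoid a torch dependency.

```python
class _FakeTokenizer:
    """Stand-in for AutoTokenizer so tests run without model downloads."""

    def encode(self, text):
        return [ord(c) for c in text]

    def __call__(self, texts, padding=True, truncation=True, return_tensors=None):
        ids = [self.encode(t) for t in texts]
        if padding and ids:
            width = max(len(row) for row in ids)
            ids = [row + [0] * (width - len(row)) for row in ids]
        return {"input_ids": ids}


class MythosTokenizer:
    """Minimal stand-in mirroring the PR's public methods (names assumed)."""

    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer or _FakeTokenizer()

    def encode(self, text):
        return self.tokenizer.encode(text)

    def token_count(self, text):
        return len(self.encode(text))

    def batch_encode(self, texts, padding=True, truncation=True):
        return self.tokenizer(texts, padding=padding, truncation=truncation)

    def is_within_limit(self, text, max_tokens):
        return self.token_count(text) <= max_tokens


def test_token_count_and_limit():
    tok = MythosTokenizer()
    assert tok.token_count("abc") == 3
    assert tok.is_within_limit("abc", 3)
    assert not tok.is_within_limit("abcd", 3)


def test_batch_encode_pads_to_longest():
    tok = MythosTokenizer()
    rows = tok.batch_encode(["a", "abc"])["input_ids"]
    # Two rows, both padded to the longest sequence.
    assert len(rows) == 2
    assert len(rows[0]) == len(rows[1]) == 3
```

Against the real wrapper, the batch test would additionally check the tensor dtype/shape of `input_ids` and that `truncation` caps rows at the model's maximum length.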
```python
        return self.token_count(text) <= max_tokens Returns:
            int: The number of unique tokens in the tokenizer vocabulary.
        """
        return self.tokenizer.vocab_size
```
`is_within_limit` currently contains stray docstring text (`Returns: …`) on the same line as the return statement, which will raise a `SyntaxError` on import, and also leaves a duplicate `return self.tokenizer.vocab_size` block inside this method. Remove the stray docstring fragment and ensure `is_within_limit` only returns the boolean check (and that `vocab_size` logic remains only in the `vocab_size` property).
Suggested change:
```diff
-    return self.token_count(text) <= max_tokens Returns:
-        int: The number of unique tokens in the tokenizer vocabulary.
-    """
-    return self.tokenizer.vocab_size
+    return self.token_count(text) <= max_tokens
```
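After the suggested change, the two concerns should live in separate members, roughly as below (a fragment under the assumption, stated in the review comment, that the class exposes a `vocab_size` property; other methods omitted):

```python
class MythosTokenizer:
    """Fragment: vocab_size stays a property; is_within_limit stays boolean."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    @property
    def vocab_size(self) -> int:
        """Number of unique tokens in the tokenizer vocabulary."""
        return self.tokenizer.vocab_size

    def token_count(self, text: str) -> int:
        return len(self.tokenizer.encode(text))

    def is_within_limit(self, text: str, max_tokens: int) -> bool:
        """Check if text fits within a token limit."""
        return self.token_count(text) <= max_tokens
```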