Enhance MythosTokenizer with new methods #30
Added methods for encoding, decoding, token counting, batch encoding, retrieving special tokens, and checking token limits.
Pull request overview
This PR aims to expand MythosTokenizer (a HuggingFace AutoTokenizer wrapper) with additional convenience APIs for encoding/decoding, token counting, batch encoding, special-token introspection, and token-limit checks.
Changes:
- Adds utility methods: `token_count`, `batch_encode`, `get_special_tokens`, and `is_within_limit`.
- Introduces new `encode`/`decode` wrappers near the top of the class (currently duplicating existing ones later in the file).
- Refactors/removes portions of the earlier docstring content (but leaves behind malformed text in the implementation).
```python
def encode(self, text: str):
    return self.tokenizer.encode(text)

def decode(self, token_ids):
    return self.tokenizer.decode(token_ids)
```
This file now defines `encode` and `decode` twice; the later definitions override the earlier ones, so the newly added `encode`/`decode` methods here are dead code and can confuse behavior (especially since their defaults differ from the later versions). Keep only one pair of `encode`/`decode` wrappers and make the other methods call those wrappers for consistency.
```python
# ✅ New methods added
def token_count(self, text: str) -> int:
    """Return number of tokens in text."""
    return len(self.tokenizer.encode(text))
```
`token_count` calls `self.tokenizer.encode(text)` directly, which may count tokens differently than your wrapper `encode` method (e.g., `add_special_tokens` defaults). To avoid inconsistent behavior across the wrapper API, have `token_count` compute the length using the same encoding path/options as `encode`.
Suggested change:
```diff
-    return len(self.tokenizer.encode(text))
+    return len(self.encode(text))
```
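To make the mismatch concrete, here is an illustrative example with a stand-in tokenizer (the wrapper's no-special-tokens default is hypothetical, chosen only to show how bypassing the wrapper skews counts):

```python
class _FakeTokenizer:
    """Stand-in tokenizer: adds BOS/EOS only when add_special_tokens=True."""

    def encode(self, text, add_special_tokens=True):
        ids = [ord(c) for c in text]
        return [0] + ids + [1] if add_special_tokens else ids


class Wrapper:
    def __init__(self):
        self.tokenizer = _FakeTokenizer()

    def encode(self, text):
        # Hypothetical wrapper default: no special tokens.
        return self.tokenizer.encode(text, add_special_tokens=False)


w = Wrapper()
# Bypassing the wrapper counts BOS/EOS tokens the wrapper would omit:
direct = len(w.tokenizer.encode("abc"))  # 5 (includes BOS/EOS)
via_wrapper = len(w.encode("abc"))       # 3
```

A caller budgeting against `via_wrapper`-style counts would overflow a model limit if `token_count` silently used the `direct` path, which is exactly why the review asks both to share one encoding path.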
```python
def token_count(self, text: str) -> int:
    """Return number of tokens in text."""
    return len(self.tokenizer.encode(text))

def batch_encode(self, texts: list[str], padding: bool = True, truncation: bool = True):
    """Encode multiple texts at once."""
    return self.tokenizer(
        texts,
        padding=padding,
        truncation=truncation,
        return_tensors="pt"
    )

def get_special_tokens(self):
    """Return special tokens used by tokenizer."""
    return self.tokenizer.special_tokens_map

def is_within_limit(self, text: str, max_tokens: int) -> bool:
    """Check if text fits within a token limit."""
    return self.token_count(text) <= max_tokens Returns:
```
New public methods (`token_count`, `batch_encode`, `get_special_tokens`, `is_within_limit`) are added to MythosTokenizer, but the existing tokenizer tests cover none of these new behaviors. Add tests validating token counting/limits and that `batch_encode` returns the expected shapes/types (and respects padding/truncation), to prevent regressions.
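A sketch of what such tests could look like. A stand-in tokenizer and a minimal `MythosTokenizer` mirroring the PR's public methods are included so the sketch runs without downloading a model; names and signatures are assumptions from the diff, and `return_tensors` handling is simplified to avoid a torch dependency.

```python
class _FakeTokenizer:
    """Stand-in for AutoTokenizer so tests run without model downloads."""

    def encode(self, text):
        return [ord(c) for c in text]

    def __call__(self, texts, padding=True, truncation=True, return_tensors=None):
        ids = [self.encode(t) for t in texts]
        if padding and ids:
            width = max(len(row) for row in ids)
            ids = [row + [0] * (width - len(row)) for row in ids]
        return {"input_ids": ids}


class MythosTokenizer:
    """Minimal stand-in mirroring the PR's public methods (names assumed)."""

    def __init__(self, tokenizer=None):
        self.tokenizer = tokenizer or _FakeTokenizer()

    def encode(self, text):
        return self.tokenizer.encode(text)

    def token_count(self, text):
        return len(self.encode(text))

    def batch_encode(self, texts, padding=True, truncation=True):
        return self.tokenizer(texts, padding=padding, truncation=truncation)

    def is_within_limit(self, text, max_tokens):
        return self.token_count(text) <= max_tokens


def test_token_count_and_limit():
    tok = MythosTokenizer()
    assert tok.token_count("abc") == 3
    assert tok.is_within_limit("abc", 3)
    assert not tok.is_within_limit("abcd", 3)


def test_batch_encode_pads_to_longest():
    tok = MythosTokenizer()
    rows = tok.batch_encode(["a", "abc"])["input_ids"]
    # Two rows, both padded to the longest sequence.
    assert len(rows) == 2
    assert len(rows[0]) == len(rows[1]) == 3
```

Against the real wrapper, the batch test would additionally check the tensor dtype/shape of `input_ids` and that `truncation` caps rows at the model's maximum length.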
```python
        return self.token_count(text) <= max_tokens Returns:
            int: The number of unique tokens in the tokenizer vocabulary.
        """
        return self.tokenizer.vocab_size
```
`is_within_limit` currently contains stray docstring text (`Returns: …`) on the same line as the return statement, which will raise a `SyntaxError` on import, and also leaves a duplicate `return self.tokenizer.vocab_size` block inside this method. Remove the stray docstring fragment and ensure `is_within_limit` only returns the boolean check (and that `vocab_size` logic remains only in the `vocab_size` property).
Suggested change:
```diff
-    return self.token_count(text) <= max_tokens Returns:
-        int: The number of unique tokens in the tokenizer vocabulary.
-    """
-    return self.tokenizer.vocab_size
+    return self.token_count(text) <= max_tokens
```
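After the suggested change, the two concerns should live in separate members, roughly as below (a fragment under the assumption, stated in the review comment, that the class exposes a `vocab_size` property; other methods omitted):

```python
class MythosTokenizer:
    """Fragment: vocab_size stays a property; is_within_limit stays boolean."""

    def __init__(self, tokenizer):
        self.tokenizer = tokenizer

    @property
    def vocab_size(self) -> int:
        """Number of unique tokens in the tokenizer vocabulary."""
        return self.tokenizer.vocab_size

    def token_count(self, text: str) -> int:
        return len(self.tokenizer.encode(text))

    def is_within_limit(self, text: str, max_tokens: int) -> bool:
        """Check if text fits within a token limit."""
        return self.token_count(text) <= max_tokens
```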