This project implements a Byte Pair Encoding (BPE) tokenizer, a subword tokenization algorithm commonly used in Natural Language Processing (NLP) tasks. The implementation includes both a basic version and an advanced version using regex for improved tokenization.
- `BPE_basic.py`: Basic implementation of the BPE algorithm
- `BPE_regex.py`: Advanced implementation using regex for improved tokenization
- `test.ipynb`: Jupyter notebook for testing and demonstrating the tokenizer
- `data/`: Folder containing sample texts for tokenization
`BPE_basic.py` contains the `BytePairEncoding` class, which implements the core BPE algorithm. It includes methods for:
- Training the tokenizer on a given text
- Encoding text into token IDs
- Decoding token IDs back into text
The basic version operates on UTF-8 encoded bytes of the input text.
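To illustrate the core idea, here is a minimal training-loop sketch (it makes no assumptions about the actual `BytePairEncoding` API): repeatedly find the most frequent adjacent pair of token ids and merge it into a new token, starting from the 256 raw byte values.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))   # start from raw UTF-8 bytes
merges = {}
for step in range(3):              # three merge rounds, for illustration
    pair = most_frequent_pair(ids)
    new_id = 256 + step            # new token ids start after the 256 byte values
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
```

After three merges the 11-byte input is compressed to 5 tokens; replaying `merges` in order on new text is exactly what encoding does, and decoding walks the table in reverse.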
`BPE_regex.py` contains the `BytePairEncodingRegex` class, which extends the basic BPE implementation with regex-based tokenization. Key features include:
- Use of the GPT-4 split pattern for initial tokenization
- Improved handling of subwords and special characters
- Option to customize the regex pattern
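The regex approach first splits the text into chunks and applies BPE within each chunk, so merges never cross word, number, or whitespace boundaries. Below is a sketch using only the standard-library `re` module with a simplified, ASCII-only approximation of the GPT-4 split pattern (the real pattern relies on Unicode property classes and possessive quantifiers available only in the third-party `regex` package):

```python
import re

# Simplified stand-in for the GPT-4 split pattern: contractions,
# letter runs, digit runs, punctuation runs, and whitespace each
# become separate chunks (an optional leading space sticks to the
# chunk that follows it).
SPLIT_PATTERN = re.compile(
    r"'(?:s|d|m|t|ll|ve|re)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"
)

chunks = SPLIT_PATTERN.findall("Tokenizers aren't magic, they're just stats!")
# chunks -> ["Tokenizers", " aren", "'t", " magic", ",",
#            " they", "'re", " just", " stats", "!"]
```

Because BPE then runs inside each chunk independently, the learned merges respect these boundaries, which is what drives the improved compression reported below.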
`test.ipynb` is used for testing and demonstrating the functionality of both BPE implementations. It includes:
- Example usage of both `BytePairEncoding` and `BytePairEncodingRegex` classes
- Comparisons between the basic and regex-based implementations
- Visualization of tokenization results
The `data/` folder contains sample texts for tokenization. These texts can be used to:
- Train the BPE tokenizer
- Test the encoding and decoding processes
- Compare the performance of different tokenization strategies
To use the data:
- Place your text files in the `data/` folder
- Load the texts in your Python script or Jupyter notebook
- Use the loaded texts to train and test the BPE tokenizer
For more detailed examples and usage, refer to the `test.ipynb` notebook.
Applying both tokenizers to the same sample text, we see the following results, showing a noticeable improvement in tokenization with the regex-based approach:
- Basic BPE tokenizer: 9760 tokens vs. 25908 (raw length) -> 2.655x compression rate
- Regex-based BPE tokenizer: 9081 tokens vs. 25908 (raw length) -> 2.853x compression rate
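The compression rate reported above is simply the raw input length divided by the number of tokens produced:

```python
# Compression rate = raw input length / number of tokens produced.
def compression_rate(raw_len, num_tokens):
    return raw_len / num_tokens

basic = compression_rate(25908, 9760)  # basic BPE tokenizer
regex = compression_rate(25908, 9081)  # regex-based BPE tokenizer
print(f"basic: {basic:.3f}x, regex: {regex:.3f}x")
# prints "basic: 2.655x, regex: 2.853x"
```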
This repository is the companion code for the Medium article *Cracking the Code: The Language LLMs speak*.