This project implements a Byte Pair Encoding (BPE) tokenizer, a subword tokenization algorithm commonly used in Natural Language Processing (NLP) tasks. The implementation includes both a basic version and an advanced version using regex for improved tokenization.
- `BPE_basic.py`: Basic implementation of the BPE algorithm
- `BPE_regex.py`: Advanced implementation using regex for improved tokenization
- `test.ipynb`: Jupyter notebook for testing and demonstrating the tokenizer
- `data/`: Folder containing sample texts for tokenization
`BPE_basic.py` contains the `BytePairEncoding` class, which implements the core BPE algorithm. It includes methods for:
- Training the tokenizer on a given text
- Encoding text into token IDs
- Decoding token IDs back into text
The basic version operates on UTF-8 encoded bytes of the input text.
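To illustrate the core idea, here is a minimal training-loop sketch (it makes no assumptions about the actual `BytePairEncoding` API): repeatedly find the most frequent adjacent pair of token ids and merge it into a new token, starting from the 256 raw byte values.

```python
from collections import Counter

def most_frequent_pair(ids):
    """Count adjacent token-id pairs and return the most common one."""
    return Counter(zip(ids, ids[1:])).most_common(1)[0][0]

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

text = "aaabdaaabac"
ids = list(text.encode("utf-8"))   # start from raw UTF-8 bytes
merges = {}
for step in range(3):              # three merge rounds, for illustration
    pair = most_frequent_pair(ids)
    new_id = 256 + step            # new token ids start after the 256 byte values
    ids = merge(ids, pair, new_id)
    merges[pair] = new_id
```

After three merges the 11-byte input is compressed to 5 tokens; replaying `merges` in order on new text is exactly what encoding does, and decoding walks the table in reverse.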
`BPE_regex.py` contains the `BytePairEncodingRegex` class, which extends the basic BPE implementation with regex-based tokenization. Key features include:
- Use of the GPT-4 split pattern for initial tokenization
- Improved handling of subwords and special characters
- Option to customize the regex pattern
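The regex approach first splits the text into chunks and applies BPE within each chunk, so merges never cross word, number, or whitespace boundaries. Below is a sketch using only the standard-library `re` module with a simplified, ASCII-only approximation of the GPT-4 split pattern (the real pattern relies on Unicode property classes and possessive quantifiers available only in the third-party `regex` package):

```python
import re

# Simplified stand-in for the GPT-4 split pattern: contractions,
# letter runs, digit runs, punctuation runs, and whitespace each
# become separate chunks (an optional leading space sticks to the
# chunk that follows it).
SPLIT_PATTERN = re.compile(
    r"'(?:s|d|m|t|ll|ve|re)| ?[A-Za-z]+| ?[0-9]+| ?[^\sA-Za-z0-9]+|\s+"
)

chunks = SPLIT_PATTERN.findall("Tokenizers aren't magic, they're just stats!")
# chunks -> ["Tokenizers", " aren", "'t", " magic", ",",
#            " they", "'re", " just", " stats", "!"]
```

Because BPE then runs inside each chunk independently, the learned merges respect these boundaries, which is what drives the improved compression reported below.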
`test.ipynb` is used for testing and demonstrating the functionality of both BPE implementations. It includes:
- Example usage of both `BytePairEncoding` and `BytePairEncodingRegex` classes
- Comparisons between the basic and regex-based implementations
- Visualization of tokenization results
The `data/` folder contains sample texts for tokenization. These texts can be used to:
- Train the BPE tokenizer
- Test the encoding and decoding processes
- Compare the performance of different tokenization strategies
To use the data:
- Place your text files in the `data/` folder
- Load the texts in your Python script or Jupyter notebook
- Use the loaded texts to train and test the BPE tokenizer
For more detailed examples and usage, refer to the `test.ipynb` notebook.
Applying both tokenizers to the same sample text, we see the following results, showing a noticeable improvement in tokenization with the regex-based approach:
- Basic BPE tokenizer: 9760 tokens vs. 25908 (raw length) -> 2.655x compression rate
- Regex-based BPE tokenizer: 9081 tokens vs. 25908 (raw length) -> 2.853x compression rate
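The compression rate reported above is simply the raw input length divided by the number of tokens produced:

```python
# Compression rate = raw input length / number of tokens produced.
def compression_rate(raw_len, num_tokens):
    return raw_len / num_tokens

basic = compression_rate(25908, 9760)  # basic BPE tokenizer
regex = compression_rate(25908, 9081)  # regex-based BPE tokenizer
print(f"basic: {basic:.3f}x, regex: {regex:.3f}x")
# prints "basic: 2.655x, regex: 2.853x"
```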
This repository is the companion code for the Medium article *Cracking the Code: The Language LLMs speak*.