This repository serves as a personal learning journey through important papers in deep learning, starting with foundational architectures and gradually expanding to more complex models. Each implementation is meant to be a clean, educational reference point with a focus on understanding the core concepts.
| Paper | Implementation | Key Concepts |
|---|---|---|
| Attention Is All You Need | transformer-implementation/ | Multi-Head Attention, Positional Encoding, Layer Normalization, Label Smoothing, Warmup Learning Rate |
| Neural Machine Translation of Rare Words with Subword Units | BPE/ | Byte Pair Encoding, Subword Tokenization, Vocabulary Building, Special Token Handling |
| Language Models are Unsupervised Multitask Learners | gpt-2/ | Transformer Decoder, Autoregressive Language Modeling, Transfer Learning, Advanced Text Generation |
The current implementation includes a complete transformer architecture with:
- Multi-headed self-attention mechanism
- Position-wise feed-forward networks
- Positional encodings (see the sketch after this list)
- Layer normalization
- Encoder and decoder stacks
- Label smoothing
- Learning rate scheduling with warmup
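As a quick reference for one of these components, here is a minimal sketch of a sinusoidal positional encoding module in PyTorch. The class name, argument names, and defaults are illustrative and may differ from the actual code in transformer-implementation/.

```python
import math
import torch
import torch.nn as nn

class PositionalEncoding(nn.Module):
    """Sinusoidal positional encoding as described in "Attention Is All You Need"."""

    def __init__(self, d_model: int, max_len: int = 5000, dropout: float = 0.1):
        super().__init__()
        self.dropout = nn.Dropout(p=dropout)

        # Precompute the encodings once; assumes an even d_model.
        position = torch.arange(max_len).unsqueeze(1)                      # (max_len, 1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)                       # even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)                       # odd dimensions
        self.register_buffer("pe", pe.unsqueeze(0))                        # (1, max_len, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); add the encodings for the first seq_len positions.
        x = x + self.pe[:, : x.size(1)]
        return self.dropout(x)
```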
The BPE (Byte Pair Encoding) tokenizer implementation is inspired by Sebastian Raschka's work and includes:
- Complete training algorithm to learn subword tokens from a corpus
- Efficient encoding and decoding methods with merge prioritization (see the encoding sketch after this list)
- Full support for special tokens and Unicode characters
- Space preprocessing using 'Ġ' character (following GPT tokenizer convention)
- OpenAI-compatible format loader for GPT-2 vocabularies
- Performance optimizations with caching mechanisms
- Regex-based tokenization for faster processing
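To illustrate merge prioritization, here is a minimal, framework-free sketch of encoding a single word by greedily applying the lowest-ranked learned merge first. The function name and the `merge_ranks` structure are illustrative; the tokenizer in BPE/ may organize this differently.

```python
def bpe_encode(word_tokens, merge_ranks):
    """Greedily apply learned merges to a list of symbols.

    word_tokens: initial symbols, e.g. ["Ġ", "l", "o", "w", "e", "r"] for " lower"
    merge_ranks: dict mapping a symbol pair to its priority
                 (lower rank = learned earlier = applied first)
    """
    tokens = list(word_tokens)
    while len(tokens) > 1:
        # Find the adjacent pair with the best (lowest) merge rank.
        pairs = [(tokens[i], tokens[i + 1]) for i in range(len(tokens) - 1)]
        best = min(pairs, key=lambda p: merge_ranks.get(p, float("inf")))
        if best not in merge_ranks:
            break  # no applicable merges remain
        # Merge every occurrence of the best pair, left to right.
        merged, i = [], 0
        while i < len(tokens):
            if i < len(tokens) - 1 and (tokens[i], tokens[i + 1]) == best:
                merged.append(tokens[i] + tokens[i + 1])
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens

# Example with illustrative merge ranks: " lower" -> ["Ġ", "low", "er"]
print(bpe_encode(["Ġ", "l", "o", "w", "e", "r"],
                 {("l", "o"): 0, ("lo", "w"): 1, ("e", "r"): 2}))
```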
The GPT-2 implementation is inspired by Andrej Karpathy's work, with further optimizations still to come. It features:
- Transformer decoder architecture
- Autoregressive language modeling
- Pre-training and fine-tuning capabilities
- Text generation with various sampling strategies (temperature, top-k, top-p); see the sampling sketch after this list
- Efficient attention patterns for improved training
- Educational implementation focusing on clarity and understanding
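As an illustration of the sampling strategies, the sketch below combines temperature scaling, top-k filtering, and top-p (nucleus) filtering for a single generation step in PyTorch. The function name and defaults are assumptions and may not match the code in gpt-2/.

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=1.0, top_k=None, top_p=None):
    """Sample one token id from a (vocab_size,) vector of logits."""
    # Temperature: <1.0 sharpens the distribution, >1.0 flattens it.
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens.
    if top_k is not None:
        kth_value = torch.topk(logits, top_k).values[-1]
        logits = logits.masked_fill(logits < kth_value, float("-inf"))

    # Top-p (nucleus): keep the smallest set of tokens whose
    # cumulative probability exceeds p.
    if top_p is not None:
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        cum_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
        cutoff = cum_probs > top_p
        cutoff[1:] = cutoff[:-1].clone()   # shift so the token crossing p is kept
        cutoff[0] = False                  # always keep the most likely token
        logits[sorted_idx[cutoff]] = float("-inf")

    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()
```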
These implementations are meant for educational purposes and self-reference. While they aim to be correct, they may not be optimized for production use. They serve as a starting point for understanding the underlying concepts and architectures described in the papers.