Demo video: `yapformer_demo.mp4`
YapFormer is a transformer model built entirely from scratch, featuring modern architectural components and efficient training optimizations.
The final model contains ~56 million parameters and was trained for 15,000 steps (~4.5 hours) on the TinyStories dataset.
Despite the small size and short training time, YapFormer produces surprisingly high-quality short stories, demonstrating that well-designed architectures can go a long way even with limited compute.
YapFormer is a from-scratch GPT-style autoregressive transformer that integrates many techniques used in contemporary LLMs:
- Rotary Embeddings (RoPE)
- Grouped Query Attention (GQA)
- KV caching for fast inference
- RMSNorm
- SwiGLU feed-forward layers
- Mixed precision training
- Gradient accumulation
- Cosine learning-rate decay
- Gradient clipping
This project serves as both a learning exercise and a practical lightweight generative model.
- Tokens are mapped to IDs using a custom tokenizer.
- RoPE is applied to attention queries/keys instead of absolute positional embeddings.
- Grouped Query Attention (GQA): multiple query heads share a smaller number of key/value heads → faster and more memory-efficient.
- KV Caching: during inference, keys/values from previous steps are stored, so attention only needs to be computed for the newly generated tokens (see the sketch after this list for how RoPE, GQA, and the cache fit together).
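Below is a minimal sketch of an attention layer combining these three ideas. It is illustrative only: the names (`GQAttention`, `rope`, `n_kv_heads`), the cache layout, and the use of `F.scaled_dot_product_attention` are assumptions, not the exact implementation in this repo.

```python
# Sketch only: names and cache layout are assumptions, not the repo's exact code.
import torch
import torch.nn as nn
import torch.nn.functional as F


def rope(x, offset=0, base=10000.0):
    # Rotate channel pairs of queries/keys by position-dependent angles.
    # x: (batch, heads, seq_len, head_dim); offset = number of cached tokens.
    t, d = x.shape[-2], x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half, device=x.device).float() / half)
    pos = torch.arange(offset, offset + t, device=x.device).float()
    angles = torch.outer(pos, freqs)              # (seq_len, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)


class GQAttention(nn.Module):
    """Grouped Query Attention: n_heads query heads share n_kv_heads key/value heads."""

    def __init__(self, dim, n_heads, n_kv_heads):
        super().__init__()
        self.n_heads, self.n_kv_heads = n_heads, n_kv_heads
        self.head_dim = dim // n_heads
        self.wq = nn.Linear(dim, n_heads * self.head_dim, bias=False)
        self.wk = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wv = nn.Linear(dim, n_kv_heads * self.head_dim, bias=False)
        self.wo = nn.Linear(n_heads * self.head_dim, dim, bias=False)

    def forward(self, x, cache=None):
        b, t, _ = x.shape
        q = self.wq(x).view(b, t, self.n_heads, self.head_dim).transpose(1, 2)
        k = self.wk(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)
        v = self.wv(x).view(b, t, self.n_kv_heads, self.head_dim).transpose(1, 2)

        # RoPE on queries/keys; the offset skips positions already in the cache.
        past = 0 if cache is None else cache[0].shape[2]
        q, k = rope(q, offset=past), rope(k, offset=past)

        # KV cache: append the new keys/values to those from previous steps.
        if cache is not None:
            k = torch.cat([cache[0], k], dim=2)
            v = torch.cat([cache[1], v], dim=2)
        new_cache = (k, v)

        # GQA: each group of query heads reuses the same key/value head.
        rep = self.n_heads // self.n_kv_heads
        k = k.repeat_interleave(rep, dim=1)
        v = v.repeat_interleave(rep, dim=1)

        # Causal mask during prefill; a single cached-decode token attends to everything.
        out = F.scaled_dot_product_attention(q, k, v, is_causal=(cache is None))
        out = out.transpose(1, 2).reshape(b, t, -1)
        return self.wo(out), new_cache
```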
Each block contains (a code sketch of one block follows this list):
- RMSNorm
- Multi-head attention (with RoPE, GQA, KV cache)
- SwiGLU feed-forward network
- Residual connections

After the final block:
- Final RMSNorm
- Linear layer → logits → softmax for next-token prediction
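A sketch of one pre-norm block in this layout. `GQAttention` refers to the attention sketch above; the `RMSNorm`, `SwiGLU`, and `Block` names and shapes are assumptions, not the repo's exact modules.

```python
# Sketch only, assuming the GQAttention module from the earlier example.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    """Root-mean-square normalization: rescales by the RMS instead of mean/variance."""

    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x):
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return self.weight * x * rms


class SwiGLU(nn.Module):
    """SwiGLU feed-forward: SiLU-gated linear unit followed by a down-projection."""

    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden_dim, bias=False)
        self.w_up = nn.Linear(dim, hidden_dim, bias=False)
        self.w_down = nn.Linear(hidden_dim, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class Block(nn.Module):
    """Pre-norm transformer block: norm -> attention -> residual, norm -> FFN -> residual."""

    def __init__(self, dim, n_heads, n_kv_heads, hidden_dim):
        super().__init__()
        self.attn_norm = RMSNorm(dim)
        self.attn = GQAttention(dim, n_heads, n_kv_heads)
        self.ffn_norm = RMSNorm(dim)
        self.ffn = SwiGLU(dim, hidden_dim)

    def forward(self, x, cache=None):
        attn_out, cache = self.attn(self.attn_norm(x), cache)
        x = x + attn_out                      # residual connection
        x = x + self.ffn(self.ffn_norm(x))    # residual connection
        return x, cache
```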
Modern GPU-friendly training techniques (see the training-step sketch after this list):
- AMP mixed precision for speed + memory efficiency
- Gradient accumulation to simulate large batch sizes
- Cosine LR decay for smooth convergence
- Gradient clipping to prevent instability
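A hedged sketch of how these four pieces typically combine in one training step. The hyperparameters (`accum_steps`, learning rates, clip value) are illustrative placeholders, and `model` / `loader` are assumed to be defined elsewhere; none of these are the repo's actual settings.

```python
# Sketch only: hyperparameters and names are placeholders, not the repo's values.
import math
import torch

accum_steps, max_lr, min_lr, total_steps, clip = 4, 3e-4, 3e-5, 15_000, 1.0

optimizer = torch.optim.AdamW(model.parameters(), lr=max_lr)   # `model` assumed defined
scaler = torch.cuda.amp.GradScaler()                           # AMP loss scaling


def cosine_lr(step):
    # Cosine decay from max_lr down to min_lr over total_steps (no warmup shown).
    progress = min(step / total_steps, 1.0)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))


for step, batch in enumerate(loader):                          # `loader` assumed defined
    for group in optimizer.param_groups:
        group["lr"] = cosine_lr(step)

    with torch.autocast(device_type="cuda", dtype=torch.float16):   # mixed precision forward
        logits = model(batch["input_ids"])
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), batch["labels"].view(-1)
        )

    # Gradient accumulation: scale the loss so accumulated gradients average out.
    scaler.scale(loss / accum_steps).backward()

    if (step + 1) % accum_steps == 0:
        scaler.unscale_(optimizer)                              # so clipping sees true grads
        torch.nn.utils.clip_grad_norm_(model.parameters(), clip)  # gradient clipping
        scaler.step(optimizer)
        scaler.update()
        optimizer.zero_grad(set_to_none=True)
```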
Model Structure (Decoder-Only Transformer)
Token Embedding
↓
Rotary Positional Encoding (RoPE)
↓
N × Transformer Blocks
├─ RMSNorm
├─ Grouped Query Attention (GQA + KV Cache)
├─ Residual Connection
├─ RMSNorm
├─ SwiGLU Feed-Forward
└─ Residual Connection
↓
Final RMSNorm
↓
Linear Language Modeling Head
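The diagram maps onto code roughly as follows. This sketch reuses the `Block` and `RMSNorm` modules from the earlier examples; the class name, layer count, and dimensions are assumptions and may differ from the repo.

```python
# Sketch only, assuming the Block and RMSNorm modules sketched earlier.
import torch.nn as nn


class YapFormerSketch(nn.Module):
    """Decoder-only stack: embedding -> N blocks -> final RMSNorm -> LM head."""

    def __init__(self, vocab_size, dim, n_layers, n_heads, n_kv_heads, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.blocks = nn.ModuleList(
            [Block(dim, n_heads, n_kv_heads, hidden_dim) for _ in range(n_layers)]
        )
        self.final_norm = RMSNorm(dim)
        self.lm_head = nn.Linear(dim, vocab_size, bias=False)

    def forward(self, tokens, caches=None):
        x = self.embed(tokens)              # RoPE is applied inside attention, not here
        caches = caches or [None] * len(self.blocks)
        new_caches = []
        for block, cache in zip(self.blocks, caches):
            x, cache = block(x, cache)
            new_caches.append(cache)
        return self.lm_head(self.final_norm(x)), new_caches
```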
- Language: Python
- Framework: PyTorch
- Built With:
  - Custom attention mechanisms
  - Custom embeddings
  - Custom RMSNorm + SwiGLU layers
  - Mixed precision training tools
- Ecosystem Tools:
  - 🤗 Hugging Face (datasets/tokenization)
  - PyTorch (core autograd & tensor ops)
-
```bash
git clone https://github.com/Aravind-808/YapFormer
cd YapFormer
pip install -r requirements.txt
python inference.py
```
Prompt: Once upon a time

Output: Once upon a time there was a tiny mouse who loved reading stories...
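Under the hood, cached autoregressive sampling looks roughly like the sketch below. It is generic: `tokenizer`, the `generate` helper, and the temperature/top-k parameters are placeholders, not the options exposed by `inference.py`, and the model interface assumed is the one from the sketches above.

```python
# Sketch only: assumes model(tokens, caches) -> (logits, caches) as sketched above.
import torch


@torch.no_grad()
def generate(model, tokenizer, prompt, max_new_tokens=200, temperature=0.8, top_k=50):
    # Encode the prompt and run it through the model once to fill the KV cache.
    tokens = torch.tensor([tokenizer.encode(prompt)])
    logits, caches = model(tokens)

    for _ in range(max_new_tokens):
        # Sample the next token from the last position's top-k distribution.
        next_logits = logits[:, -1, :] / temperature
        topk = torch.topk(next_logits, top_k)
        probs = torch.softmax(topk.values, dim=-1)
        next_token = topk.indices.gather(-1, torch.multinomial(probs, 1))

        tokens = torch.cat([tokens, next_token], dim=1)
        # Only the new token is fed back in; cached keys/values cover the rest.
        logits, caches = model(next_token, caches)

    return tokenizer.decode(tokens[0].tolist())
```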