From-Scratch Code for LLM Interviews · Community-Driven Voting · Real-Time Rankings
📖 Master these 100 topics and ace your LLM interview coding challenges
"Asked to implement Multi-Head Attention from scratch?"
"Can you write PPO, DPO, GRPO and explain the differences?"
"How does KV Cache work? What's the core idea of Flash Attention?"
| Feature | Description |
|---|---|
| 🎯 Real Interview Questions | Rankings driven by community votes |
| 📝 Detailed Comments | Every line of code clearly annotated |
| 🔥 Production-Ready | Numerical stability, edge cases handled |
| 🆚 Method Comparisons | Side-by-side comparisons of similar methods |
| ❓ Q&A Sections | Common questions answered proactively |
Legend: 🔥🔥🔥 Must Know | 🔥🔥 High Frequency | 🔥 Occasional | No mark = Good to know
👉 Vote Now to help calibrate real interview frequencies!
📖 LLM Basics → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 1 | Gradient & Backprop | 🔥🔥 | ⭐⭐ | Chain rule, foundation of deep learning |
| 2 | Linear Regression | 🔥 | ⭐ | y = Wx + b, simplest model |
| 3 | Logistic Regression | 🔥🔥 | ⭐⭐ | sigmoid(Wx + b), binary classification |
| 4 | Softmax Regression | 🔥 | ⭐⭐ | Multi-class, LLM output layer |
| 5 | MLP | 🔥🔥 | ⭐⭐ | Universal approximator, FFN basis |
| 6 | Activation Functions | 🔥🔥 | ⭐ | ReLU/GELU/SiLU and gradients |
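
To give a flavor of what these entries expand into, here is a minimal NumPy sketch covering topics #1–#3: one gradient step of logistic regression derived with the chain rule, using a numerically stable sigmoid. Function names and shapes are illustrative, not the repo's actual API.

```python
import numpy as np

def sigmoid(z):
    # Numerically stable: never exponentiates a large positive number.
    out = np.empty_like(z, dtype=float)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def logistic_step(W, b, X, y, lr=0.1):
    """One SGD step on binary cross-entropy. X: (N, D), y: (N,) in {0, 1}."""
    p = sigmoid(X @ W + b)       # forward: p = sigmoid(XW + b)
    dz = (p - y) / len(y)        # chain rule: sigmoid + BCE collapses to p - y
    W = W - lr * (X.T @ dz)      # dL/dW = X^T dz
    b = b - lr * dz.sum()        # dL/db = sum(dz)
    return W, b
```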
🧠 Attention Mechanisms → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 7 | Scaled Dot-Product Attention | 🔥🔥🔥 | ⭐⭐⭐ | softmax(QK^T/√d)V, the foundation |
| 8 | Multi-Head Attention | 🔥🔥🔥 | ⭐⭐⭐⭐ | Parallel heads, different subspaces |
| 9 | Causal Mask | 🔥🔥🔥 | ⭐⭐ | Lower triangular, prevent future peeking |
| 10 | GQA | 🔥🔥🔥 | ⭐⭐⭐⭐ | Q heads > KV heads, LLaMA2 standard |
| 11 | MQA | 🔥🔥 | ⭐⭐⭐ | All Q share one KV |
| 12 | Flash Attention | 🔥🔥 | ⭐⭐⭐⭐⭐ | Tiled computation, IO-aware, O(N) memory |
| 13 | KV Cache | 🔥🔥🔥 | ⭐⭐⭐⭐ | Cache historical KV, avoid recomputation |
| 14 | Cross Attention | 🔥 | ⭐⭐⭐ | Q from decoder, KV from encoder |
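
As a taste of this section, here is a minimal PyTorch sketch of topic #7 (scaled dot-product attention) with topic #9's causal mask folded in; tensor names and shapes are illustrative assumptions.

```python
import math
import torch
import torch.nn.functional as F

def causal_attention(q, k, v):
    """q, k, v: (batch, seq, d). Returns softmax(QK^T / sqrt(d)) V, causally masked."""
    d = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d)        # (B, S, S)
    # Strictly upper-triangular mask: position i may only attend to j <= i.
    mask = torch.triu(torch.ones_like(scores, dtype=torch.bool), diagonal=1)
    scores = scores.masked_fill(mask, float("-inf"))
    return F.softmax(scores, dim=-1) @ v
```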
📏 Normalization → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 15 | Layer Normalization | 🔥🔥🔥 | ⭐⭐ | Normalize across features, Transformer standard |
| 16 | RMS Normalization | 🔥🔥🔥 | ⭐⭐ | No mean, just RMS, LLaMA uses it |
| 17 | Batch Normalization | 🔥 | ⭐⭐ | Normalize across batch, CNN common |
| 18 | Pre-Norm vs Post-Norm | 🔥🔥 | ⭐ | Pre-Norm more stable, modern LLM standard |
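
The contrast between topics #15 and #16 fits in a few lines; a minimal sketch, with the learnable scale/shift parameters omitted for brevity:

```python
import torch

def layer_norm(x, eps=1e-5):
    # Subtract the mean, divide by the standard deviation (per feature vector).
    mu = x.mean(dim=-1, keepdim=True)
    var = x.var(dim=-1, keepdim=True, unbiased=False)
    return (x - mu) / torch.sqrt(var + eps)

def rms_norm(x, eps=1e-6):
    # No mean subtraction: divide by the root mean square only (LLaMA-style).
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)
```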
📍 Position Encoding → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 19 | Sinusoidal PE | 🔥 | ⭐⭐ | sin/cos fixed, original Transformer |
| 20 | Learnable PE | 🔥 | ⭐ | Learnable embeddings, BERT/GPT |
| 21 | RoPE | 🔥🔥🔥 | ⭐⭐⭐⭐ | Complex rotation, relative position, LLM standard |
| 22 | ALiBi | 🔥🔥 | ⭐⭐⭐ | Linear bias in attention, good extrapolation |
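
A minimal sketch of topic #21 (RoPE): each consecutive pair of dimensions is rotated by a position-dependent angle, so attention scores depend only on relative position. Variable names are illustrative.

```python
import torch

def rope(x, base=10000.0):
    """x: (seq, d) with d even. Rotates dimension pairs instead of adding PEs."""
    seq, d = x.shape
    theta = base ** (-torch.arange(0, d, 2).float() / d)        # (d/2,) frequencies
    angles = torch.arange(seq).float()[:, None] * theta[None]   # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin   # standard 2-D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out
```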
🎲 Sampling Strategies → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 23 | Greedy Decoding | 🔥 | ⭐ | Pick argmax each step |
| 24 | Temperature Sampling | 🔥🔥🔥 | ⭐⭐ | logits/T controls randomness |
| 25 | Top-k Sampling | 🔥🔥 | ⭐⭐ | Sample from top-k only |
| 26 | Top-p Sampling | 🔥🔥🔥 | ⭐⭐⭐ | Cumulative probability cutoff |
| 27 | Beam Search | 🔥🔥 | ⭐⭐⭐ | Keep k best sequences |
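
Topics #24 and #26 compose naturally: temperature scaling followed by nucleus (top-p) filtering on a single logits vector. A minimal sketch with illustrative names:

```python
import torch

def sample_top_p(logits, temperature=0.8, p=0.9):
    """logits: (vocab,). Returns one sampled token id."""
    probs = torch.softmax(logits / temperature, dim=-1)   # temperature (#24)
    sorted_p, idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_p, dim=-1)
    # Keep the smallest prefix whose mass reaches p; the top token always survives.
    keep = (cum - sorted_p) < p
    sorted_p = torch.where(keep, sorted_p, torch.zeros_like(sorted_p))
    sorted_p = sorted_p / sorted_p.sum()                  # renormalize (#26)
    return idx[torch.multinomial(sorted_p, 1)].item()
```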
📉 Loss Functions → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 28 | Cross Entropy Loss | 🔥🔥🔥 | ⭐⭐⭐ | -log(p_true), classification standard |
| 29 | LM Loss | 🔥🔥🔥 | ⭐⭐ | Autoregressive CE, next token prediction |
| 30 | KL Divergence | 🔥🔥 | ⭐⭐⭐ | Distribution difference, distillation/RLHF |
| 31 | MSE Loss | 🔥 | ⭐ | (y-ŷ)², regression |
| 32 | Focal Loss | 🔥 | ⭐⭐⭐ | Down-weight easy samples |
| 33 | SFT Loss | 🔥🔥 | ⭐⭐ | Masked CE, response only |
| 34 | Reward Model Loss | 🔥🔥 | ⭐⭐⭐ | -log σ(r_w - r_l), preference learning |
| 35 | Contrastive Loss | 🔥 | ⭐⭐⭐ | Pull positives, push negatives |
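
Topics #28, #29, and #33 chain together: SFT loss is just next-token cross entropy with prompt tokens masked out. A minimal sketch, assuming labels share the input's token positions; shapes are illustrative.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, loss_mask):
    """logits: (B, S, V); labels: (B, S); loss_mask: (B, S), 1 on response tokens."""
    # Shift so position t predicts token t+1 (autoregressive LM loss).
    logits, labels = logits[:, :-1], labels[:, 1:]
    mask = loss_mask[:, 1:].float()
    ce = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                         labels.reshape(-1), reduction="none")
    return (ce * mask.reshape(-1)).sum() / mask.sum()   # average over response only
```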
⚡ Optimizers → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 36 | SGD | 🔥 | ⭐ | Basic w -= lr * grad |
| 37 | SGD + Momentum | 🔥 | ⭐⭐ | Add momentum, faster convergence |
| 38 | Adam | 🔥🔥🔥 | ⭐⭐⭐ | Adaptive LR, 1st & 2nd moments |
| 39 | AdamW | 🔥🔥🔥 | ⭐⭐⭐ | Decoupled weight decay, LLM standard |
| 40 | LR Schedule | 🔥🔥 | ⭐⭐ | Warmup + Cosine/Linear decay |
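
The heart of topic #39 is one update rule. A minimal NumPy sketch, where the weight decay term is applied to the weights directly rather than mixed into the gradient (the "decoupled" part):

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW step. t is the 1-based step count; m and v start at zero."""
    m = b1 * m + (1 - b1) * g          # 1st moment: running mean of gradients
    v = b2 * v + (1 - b2) * g * g      # 2nd moment: running mean of squares
    m_hat = m / (1 - b1 ** t)          # bias correction for the zero init
    v_hat = v / (1 - b2 ** t)
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w)   # decoupled decay
    return w, m, v
```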
🎮 Reinforcement Learning (RLHF) → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 41 | REINFORCE | 🔥 | ⭐⭐⭐ | Policy gradient ∇log π × R |
| 42 | GAE | 🔥🔥🔥 | ⭐⭐⭐⭐ | Advantage estimation, bias-variance tradeoff |
| 43 | PPO | 🔥🔥🔥 | ⭐⭐⭐⭐⭐ | Clip ratio, RLHF core |
| 44 | PPO-Clip | 🔥🔥🔥 | ⭐⭐⭐⭐ | Clipped objective version |
| 45 | DPO | 🔥🔥🔥 | ⭐⭐⭐⭐ | Direct preference optimization, no RM |
| 46 | GRPO | 🔥🔥🔥 | ⭐⭐⭐⭐⭐ | Group relative policy, DeepSeek uses |
| 47 | KL Penalty | 🔥🔥 | ⭐⭐ | Prevent diverging from reference |
| 48 | Reward Shaping | 🔥 | ⭐⭐⭐ | Reward engineering, sparse → dense |
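
Of these, topic #45 (DPO) has the most compact core: push the policy's log-ratio margin on chosen vs. rejected responses above the reference model's. A minimal sketch; inputs are summed per-sequence log-probs, and names are illustrative.

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """All args: (B,) log-probs of full responses under policy / reference model."""
    margin = (pi_chosen - ref_chosen) - (pi_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin).mean()   # -log sigmoid(beta * margin)
```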
🚀 Efficient Training → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 49 | LoRA | 🔥🔥🔥 | ⭐⭐⭐⭐ | Low-rank decomposition W + BA |
| 50 | QLoRA | 🔥🔥 | ⭐⭐⭐⭐ | LoRA + 4bit quantization |
| 51 | Gradient Checkpointing | 🔥🔥 | ⭐⭐⭐ | Trade time for memory |
| 52 | Mixed Precision | 🔥🔥 | ⭐⭐⭐ | FP16/BF16, less memory, faster |
| 53 | Gradient Accumulation | 🔥🔥 | ⭐⭐ | Small batch simulates large batch |
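
Topic #49 in brief: freeze the pretrained weight W and learn a rank-r update BA, scaled by alpha/r. A minimal PyTorch sketch with illustrative names and initialization:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, d_in, d_out, r=8, alpha=16):
        super().__init__()
        self.base = nn.Linear(d_in, d_out, bias=False)
        self.base.weight.requires_grad_(False)               # frozen pretrained W
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)   # down-projection
        self.B = nn.Parameter(torch.zeros(d_out, r))         # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        # y = base(x) + scale * x A^T B^T  (the update has rank at most r)
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```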
⚡ Inference Optimization → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 54 | KV Cache | 🔥🔥🔥 | ⭐⭐⭐⭐ | Cache KV, speed up autoregressive |
| 55 | Paged Attention | 🔥🔥 | ⭐⭐⭐⭐ | Paged KV management, vLLM core |
| 56 | Speculative Decoding | 🔥🔥 | ⭐⭐⭐⭐ | Small model drafts, large verifies |
| 57 | Continuous Batching | 🔥🔥 | ⭐⭐⭐ | Dynamic batching, higher throughput |
| 58 | Quantization | 🔥🔥 | ⭐⭐⭐ | INT8/INT4, halves memory or better |
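
Topic #54 is the one to internalize first: each decode step appends its new key/value to a cache and attends over everything so far, instead of recomputing K and V for the whole prefix. A minimal single-head sketch with illustrative names:

```python
import math
import torch
import torch.nn.functional as F

def decode_step(q_new, k_new, v_new, cache):
    """q_new, k_new, v_new: (B, 1, d). cache: dict holding 'k' and 'v', or empty."""
    if "k" in cache:
        cache["k"] = torch.cat([cache["k"], k_new], dim=1)   # (B, t, d)
        cache["v"] = torch.cat([cache["v"], v_new], dim=1)
    else:
        cache["k"], cache["v"] = k_new, v_new
    scores = q_new @ cache["k"].transpose(-2, -1) / math.sqrt(q_new.size(-1))
    # No causal mask needed: the cache holds only past and current positions.
    return F.softmax(scores, dim=-1) @ cache["v"]
```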
🏗️ Transformer Architecture → View
| # | Topic | Hot | Difficulty | One-liner |
|---|---|---|---|---|
| 59 | Encoder-Only (BERT) | 🔥 | ⭐⭐⭐ | Bidirectional, understanding tasks |
| 60 | Decoder-Only (GPT) | 🔥🔥🔥 | ⭐⭐⭐ | Causal attention, generation, LLM standard |
| 61 | Encoder-Decoder (T5) | 🔥 | ⭐⭐⭐ | Seq2seq, translation/summarization |
| 62 | FFN | 🔥🔥 | ⭐⭐ | 2-layer MLP, 4x expansion |
| 63 | SwiGLU | 🔥🔥 | ⭐⭐⭐ | Gated FFN, LLaMA uses |
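
Closing the list, a minimal sketch of topic #63 (SwiGLU, as in LLaMA): a gated FFN where a SiLU-activated gate multiplies the up-projection elementwise before the down-projection. Dimensions are illustrative.

```python
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    def __init__(self, d_model, d_hidden):
        super().__init__()
        self.w1 = nn.Linear(d_model, d_hidden, bias=False)   # gate projection
        self.w3 = nn.Linear(d_model, d_hidden, bias=False)   # up projection
        self.w2 = nn.Linear(d_hidden, d_model, bias=False)   # down projection

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))      # SiLU(xW1) * xW3, then W2
```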
Community-driven, updated hourly via GitHub Actions
Last updated: 2026-04-18
| Rank | Topic | Category | Votes |
|---|---|---|---|
| 🥇 | Scaled Dot-Product Attention | Attention | 🔥 4 |
| 🥈 | Gradient & Backprop | Basics | 🔥 3 |
| 🥉 | Linear Regression | Basics | 🔥 3 |
| 4 | Multi-Head Attention | Attention | 🔥 3 |
| 5 | Logistic Regression | Basics | 🔥 2 |
| 6 | BatchNorm | Norm | 🔥 2 |
| 7 | Cross Entropy | Loss | 🔥 2 |
| 8 | LayerNorm | Norm | 🔥 2 |
| 9 | GQA | Attention | 🔥 1 |
| 10 | RoPE | Position | 🔥 1 |
| 11 | DPO | RL | 🔥 1 |
| 12 | Causal Mask | Attention | 🔥 1 |
| 13 | Top-k | Sampling | 🔥 1 |
| 14 | Top-p | Sampling | 🔥 1 |
| 15 | Beam Search | Sampling | 🔥 1 |
| 16 | Decoder-Only | Arch | 🔥 1 |
| 17 | FFN | Arch | 🔥 1 |
| 18 | GAE | RL | 🔥 1 |
| 19 | GRPO | RL | 🔥 1 |
| 20 | KL Penalty | RL | 🔥 1 |
Your interview experience matters! Help calibrate real interview frequencies.
- 🗳️ Vote for topics you've seen in interviews
- 🏆 Real-time leaderboard
- 💬 Share your experience
Contributions welcome! New topics, bug fixes, documentation improvements.
- Fork this repo
- Create a feature branch: `git checkout -b feature/new-topic`
- Commit your changes: `git commit -m 'Add: XXX'`
- Push the branch: `git push origin feature/new-topic`
- Submit a Pull Request
- Attention Is All You Need
- RoFormer: Enhanced Transformer with Rotary Position Embedding
- Training language models to follow instructions with human feedback
- Direct Preference Optimization: Your Language Model is Secretly a Reward Model
- DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
If this helps, please give it a star ⭐!
Made with ❤️ for LLM Interview Preparation
The Hot 100 for the LLM Era
#LLMHot100