The official nanochat extension for SAEs and mechanistic interpretability.
Thanks to @karpathy for the encouragement.

Train your own ChatGPT from scratch. Then understand what it learned.
This is a research fork of Andrej Karpathy's nanochat, extended with Sparse Autoencoder (SAE) based interpretability tools created and maintained by Caleb DeLeeuw. You get the full nanochat training pipeline PLUS the ability to peer inside your model and discover the features it learned.
nanochat teaches you to build a ChatGPT-like model for ~$100. nanochat-SAE teaches you to understand what that model actually learned.
Using Sparse Autoencoders, you can:
- 🔍 Discover interpretable features - Find "negation neurons", "math neurons", "sentiment neurons"
- 📊 Visualize activations - See which concepts light up during inference
- 🎛️ Steer behavior - Amplify or suppress specific features to change model outputs
- 🧪 Debug learning - Understand why your model succeeds or fails on tasks
- 🌐 Share discoveries - Export to Neuronpedia for community analysis
Philosophy: Karpathy's nanochat is intentionally minimal (~8,000 lines) for educational clarity. SAE interpretability adds ~4,000 lines of advanced tooling. Rather than compromise nanochat's minimalism, we maintain this as a full-featured research branch for those who want to go deeper.
What you get here:
- ✅ Complete nanochat codebase (train your own LLM)
- ✅ SAE training pipeline (TopK, ReLU, Gated architectures)
- ✅ Activation collection via PyTorch hooks
- ✅ Feature visualization and dashboards
- ✅ Runtime interpretation and steering
- ✅ Neuronpedia integration
- ✅ Comprehensive documentation
- ✅ Google Colab notebook for training SAEs on a free T4 GPU
New! Train SAEs on a pre-trained nanochat model using Google Colab's free T4 GPU in 1-2 hours:
How to run:
- Click the badge above to open in Colab
- Go to Runtime → Change runtime type → Select T4 GPU
- Run all cells (setup takes ~5-10 minutes)
- Upload a pre-trained checkpoint or point to one in Google Drive
- (Optional) Upload your custom reference dataset
- Train SAE and visualize learned features!
Perfect for:
- 🎓 Learning SAE interpretability without expensive hardware
- 🧪 Quick experiments with custom reference datasets
- 📊 Analyzing features from pre-trained models
- 💡 Testing ideas before scaling up
NEW! Train SAEs using Anthropic's public datasets of deceptive LLM behavior to enable automatic feature labeling:
What's different:
- 📊 Uses Anthropic's Alignment Faking, Sleeper Agents, and Agentic Misalignment datasets
- 🏷️ Auto-labels SAE features based on deception-relevant contexts (see the sketch after this list)
- 🔬 Includes deception-specific evaluation metrics
- 🎯 Tests if contextualized labeling makes SAEs more useful for deception detection
- ⚖️ Compares features learned from deceptive vs. honest behavior
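The auto-labeling step above can be sketched roughly as follows: compare how often each SAE feature fires on activations from deception-labeled contexts versus honest contexts, and flag features with a strongly skewed firing rate. This is an illustrative assumption about the approach, not the notebook's exact logic, and the tensor names are placeholders:

import torch

# Illustrative auto-labeling sketch: a feature gets flagged as deception-relevant
# if it fires much more often on deception-labeled contexts than honest ones.
# feats_deceptive / feats_honest are placeholder SAE feature activations.
feats_deceptive = torch.relu(torch.randn(2000, 10240))  # features on deception-labeled prompts
feats_honest = torch.relu(torch.randn(2000, 10240))     # features on honest prompts

rate_deceptive = (feats_deceptive > 0).float().mean(dim=0)
rate_honest = (feats_honest > 0).float().mean(dim=0)

# Flag features that fire at least 3x more often in deceptive contexts.
ratio = rate_deceptive / (rate_honest + 1e-6)
candidate_features = (ratio > 3.0).nonzero(as_tuple=True)[0]
print(f"{len(candidate_features)} candidate deception-related features")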
How to run:
- Click the badge above to open in Colab
- Go to Runtime → Change runtime type → Select T4 GPU
- Run all cells - the notebook will automatically download Anthropic's datasets
- Upload a pre-trained checkpoint or point to one in Google Drive
- Train SAE with deception-labeled activations
- Explore auto-labeled features for deception detection!
Perfect for:
- 🛡️ Deception and misalignment detection research
- 🔬 Studying how models represent deceptive behavior
- 🏷️ Testing automatic feature labeling approaches
- 📊 Comparing interpretability approaches with/without context labels
See COLAB_GUIDE.md for detailed instructions and troubleshooting.
This repo includes the full nanochat training pipeline:
# Clone this repo
git clone https://github.com/SolshineCode/nanochat-sae
cd nanochat-sae
# Run the nanochat speedrun (trains a model in ~4 hours on 8xH100)
bash speedrun.sh

Your trained model checkpoint will be at models/d20/base_final.pt.
Already have a nanochat model? Just point the SAE scripts at your checkpoint.
Train SAEs to decompose your model's learned features:
# Train SAE on layer 10 of your d20 model
python -m scripts.sae_train \
    --checkpoint models/d20/base_final.pt \
    --layer 10 \
    --expansion_factor 8 \
    --activation topk \
    --k 64 \
    --num_activations 1000000

This collects 1M activations from layer 10 and trains a TopK SAE with 8x expansion (10,240 features for a d20 model with 1,280 hidden dims).
Training time: ~2-4 hours on a single A100.
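Under the hood, activation collection uses PyTorch forward hooks (see sae/hooks.py). The sketch below shows the general pattern on a toy model; the module, data loop, and shapes are stand-ins for illustration, not the repo's actual code:

import torch
import torch.nn as nn

# Toy stand-in for a 20-block model with d_model = 1280; in practice you would
# register the hook on the real nanochat block whose residual stream you want.
model = nn.Sequential(*[nn.Linear(1280, 1280) for _ in range(20)])

collected = []

def save_activations(module, inputs, output):
    # Detach and move to CPU so collection doesn't accumulate GPU memory.
    collected.append(output.detach().cpu())

handle = model[10].register_forward_hook(save_activations)  # "layer 10"

with torch.no_grad():
    for _ in range(4):                    # stand-in for batches from a data loader
        tokens = torch.randn(1024, 1280)  # per-token residual-stream vectors
        model(tokens)

handle.remove()
activations = torch.cat(collected, dim=0)  # (num_tokens, d_model) training data for the SAE
print(activations.shape)                   # torch.Size([4096, 1280])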
Check how well your SAE captures the model's representations:
python -m scripts.sae_eval \
    --sae_path sae_outputs/layer_10/best_model.pt \
    --generate_dashboards \
    --top_k 20

Key metrics (see the sketch after this list):
- Reconstruction MSE: How accurately the SAE reconstructs activations
- L0 sparsity: Average number of active features (should be close to k)
- Explained variance: Fraction of activation variance captured
- Dead latents: Percentage of features that never activate
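The helper below shows how these metrics can be computed from a batch of original activations, SAE reconstructions, and feature activations. It is a hypothetical sketch of the underlying math, not the repo's sae/evaluator.py:

import torch

def sae_metrics(acts, recon, feats):
    """acts, recon: (N, d_model); feats: (N, num_features) SAE feature activations."""
    mse = (acts - recon).pow(2).mean()
    # Explained variance: fraction of activation variance the SAE reconstructs.
    explained_var = 1.0 - (acts - recon).var() / acts.var()
    # L0 sparsity: average number of active features per activation (should be near k).
    l0 = (feats > 0).float().sum(dim=-1).mean()
    # Dead latents: fraction of features that never activate on this batch.
    dead = ((feats > 0).sum(dim=0) == 0).float().mean()
    return {"mse": mse.item(), "explained_variance": explained_var.item(),
            "l0": l0.item(), "dead_latent_fraction": dead.item()}

# Toy example with random tensors; real usage passes activations through a trained SAE.
acts = torch.randn(1024, 1280)
print(sae_metrics(acts, acts + 0.1 * torch.randn_like(acts),
                  torch.relu(torch.randn(1024, 10240))))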
Generate interactive dashboards to explore what features your model learned:
python -m scripts.sae_viz \
    --sae_path sae_outputs/layer_10/best_model.pt \
    --all_features \
    --top_k 50 \
    --output_dir feature_explorer

Open feature_explorer/index.html in your browser to see (a sketch of how these are computed follows the list):
- Top activating features
- Activation frequencies
- Example inputs that trigger each feature
- Feature statistics
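Behind the dashboards, finding top-activating examples and activation frequencies is essentially a top-k lookup over stored feature activations. A rough sketch with placeholder data (not the actual sae/feature_viz.py code):

import torch

# Placeholder data: SAE feature activations over a reference dataset plus the
# text snippets they came from. In practice these come from the collection step.
feature_acts = torch.relu(torch.randn(5000, 10240))  # (num_examples, num_features)
texts = [f"example snippet {i}" for i in range(5000)]

feature_id = 4232
top_vals, top_idx = feature_acts[:, feature_id].topk(10)

# Activation frequency: how often this feature fires across the dataset.
freq = (feature_acts[:, feature_id] > 0).float().mean()

print(f"Feature {feature_id} fires on {freq:.1%} of examples")
for val, idx in zip(top_vals.tolist(), top_idx.tolist()):
    print(f"{val:.3f}  {texts[idx]}")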
Track feature activations during inference:
from nanochat.gpt import GPT
from sae.runtime import InterpretableModel, load_saes
# Load your trained model
model = GPT.from_pretrained("models/d20/base_final.pt")
# Load trained SAEs
saes = load_saes("sae_outputs/")
# Wrap with interpretability
interp_model = InterpretableModel(model, saes)
# Track features during generation
with interp_model.interpretation_enabled():
    output = interp_model(input_ids)
    features = interp_model.get_active_features()
# See which features fired in layer 10
layer_10_features = features["blocks.10.hook_resid_post"]
print(f"Active features: {(layer_10_features > 0).sum()} / {layer_10_features.shape[1]}")

Example findings from SAE interpretability on small language models:
- 🚫 Negation features: Activate on "not", "never", "isn't"
- 🔢 Numerical features: Fire on digits, math operations
- 😊 Sentiment features: Distinguish positive/negative language
- 🌍 Entity features: Activate on proper nouns, locations
- 📚 Syntax features: Capture grammatical structures
Using the deception-focused training notebook, you can discover:
- 🎭 Deception features: Activate when model generates misleading content
- 🔒 Alignment faking features: Fire when model pretends to comply
- 🚨 Backdoor behavior features: Detect conditional malicious behavior
- ⚖️ Honest vs. deceptive patterns: Compare activation patterns
- 🧠 Self-awareness features: Track when model discusses its own capabilities
- 🛡️ Safety bypass features: Identify features related to bypassing safety measures
Feature steering example:
# Amplify a "politeness" feature
polite_output = interp_model.steer(
    input_ids,
    feature_id=("blocks.15.hook_resid_post", 4232),
    strength=2.0  # 2x amplification
)
# Suppress a deception-related feature
honest_output = interp_model.steer(
    input_ids,
    feature_id=("blocks.10.hook_resid_post", 1337),
    strength=-3.0  # Strong suppression
)
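Conceptually, steering adds a scaled copy of the feature's decoder direction to the residual stream at the hooked layer. The function below sketches that mechanism under assumed tensor shapes; it is not the actual implementation of interp_model.steer in sae/runtime.py:

import torch

def steer_activations(resid, decoder_weight, feature_idx, strength):
    """Add `strength` times one feature's decoder direction to the residual stream.
    resid: (batch, seq, d_model); decoder_weight: (num_features, d_model) -- assumed layout."""
    direction = decoder_weight[feature_idx]  # the feature's write direction
    return resid + strength * direction      # broadcasts over batch and sequence

# Toy usage with random tensors.
resid = torch.randn(2, 16, 1280)
W_dec = torch.randn(10240, 1280)
steered = steer_activations(resid, W_dec, feature_idx=4232, strength=2.0)
print(steered.shape)  # torch.Size([2, 16, 1280])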
Repository structure:

nanochat-sae/
├── README.md                           # This file
├── colab_sae_training.ipynb            # Standard SAE training notebook
├── colab_sae_deception_training.ipynb  # 🆕 Deception-focused SAE training
├── speedrun.sh                         # Train nanochat model (original)
├── nanochat/                           # Core nanochat implementation
├── scripts/
│   ├── base_train.py                   # Nanochat pretraining
│   ├── mid_train.py                    # Nanochat midtraining
│   ├── chat_sft.py                     # Nanochat supervised fine-tuning
│   ├── sae_train.py                    # 🆕 Train SAEs on activations
│   ├── sae_eval.py                     # 🆕 Evaluate SAE quality
│   └── sae_viz.py                      # 🆕 Visualize features
├── sae/                                # 🆕 SAE implementation
│   ├── config.py                       # SAE configuration
│   ├── models.py                       # TopK, ReLU, Gated SAEs
│   ├── hooks.py                        # Activation collection
│   ├── trainer.py                      # SAE training loop
│   ├── runtime.py                      # Real-time interpretation
│   ├── evaluator.py                    # Evaluation metrics
│   ├── feature_viz.py                  # Visualization tools
│   └── neuronpedia.py                  # Neuronpedia integration
├── tests/
│   └── test_sae.py                     # 🆕 SAE implementation tests
├── examples/                           # 🚧 Coming soon!
└── tutorials/                          # Step-by-step guides
Suggested learning path:
- Train nanochat first - Follow the main nanochat tutorial to understand the base model
- Read SAE basics - Understand what Sparse Autoencoders do (Anthropic's explainer)
- Run simple example - Train a single SAE on one layer
- Explore features - Use visualization tools to see what your model learned
- Multi-layer analysis - Train SAEs on multiple layers, compare features
- Feature steering - Modify model behavior by intervening on features
- Scaling studies - Compare features across d20, d26, d30 models
- Circuit discovery - Find chains of features that implement capabilities
- Publish findings - Share discoveries via Neuronpedia or papers
- Integration - Add SAE hooks to your own training loop
- Custom architectures - Extend with new SAE variants
- Production tools - Build monitoring dashboards for deployed models
- Contribute - Submit PRs for new features and improvements
We're working on comprehensive tutorials:
- 📘 Basic Tutorial: Train your first SAE
- 📗 Feature Analysis: Discover interpretable concepts
- 📙 Feature Steering: Modify model behavior
- 📕 Multi-Layer Analysis: Compare features across depths
- 📓 Neuronpedia Integration: Share your discoveries
- 📔 Case Studies: Real findings from nanochat models
Want to contribute a tutorial? Open an issue or PR!
TopK SAE (see the sketch after this comparison):
- Direct sparsity control: Choose exactly k active features
- Fewer dead latents: More stable training at scale
- Best for: Initial exploration, interpretability research
- Reference: OpenAI's scaling work
ReLU SAE:
- Traditional approach: ReLU activation + L1 penalty
- Requires tuning: Must find good L1 coefficient
- Best for: Understanding SAE fundamentals
Gated SAE:
- Separates magnitude and selection: More expressive
- More complex: Harder to train and interpret
- Best for: Advanced experiments
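For reference, the code below is a minimal sketch of a TopK SAE forward pass matching the description above. It is an illustrative implementation, not a copy of sae/models.py:

import torch
import torch.nn as nn

class TopKSAE(nn.Module):
    """Minimal TopK sparse autoencoder sketch."""

    def __init__(self, d_model: int, expansion_factor: int = 8, k: int = 64):
        super().__init__()
        self.k = k
        self.encoder = nn.Linear(d_model, d_model * expansion_factor)
        self.decoder = nn.Linear(d_model * expansion_factor, d_model)

    def forward(self, x):
        pre = self.encoder(x)
        # Keep only the k largest pre-activations per example; zero the rest.
        topk = torch.topk(pre, self.k, dim=-1)
        feats = torch.zeros_like(pre).scatter_(-1, topk.indices, topk.values.relu())
        recon = self.decoder(feats)
        return recon, feats

# d20 nanochat has d_model = 1280, so an 8x expansion gives 10,240 features.
sae = TopKSAE(d_model=1280, expansion_factor=8, k=64)
recon, feats = sae(torch.randn(4, 1280))
print(feats.count_nonzero(dim=-1))  # at most k active features per row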
Memory requirements:
- Activation collection: ~10-20GB per layer for 10M activations
- SAE training: Requires 40GB+ VRAM for large SAEs
- Runtime inference: +10GB memory for all SAEs loaded
Speed and latency:
- Activation collection: <5% slowdown during training
- SAE inference: 5-10% latency increase
- SAE training: 2-4 hours per layer on A100
Optimization tips:
- Store activations on CPU during collection to save GPU memory
- Train SAEs on subset of layers (e.g., every 5th layer)
- Use smaller expansion factors (4x instead of 16x) for faster training
- Enable lazy loading of SAEs to reduce memory usage
SAEs are evaluated on three key dimensions:
Reconstruction quality:
- MSE Loss: Mean squared error between original and reconstructed activations
- Explained Variance: Fraction of activation variance captured
- Reconstruction Score: 1 - MSE/variance
Sparsity:
- L0: Average number of active features per activation
- L1: Average L1 norm of feature activations
- Dead Latents: Fraction of features that never activate
Interpretability:
- Activation Frequency: How often each feature fires
- Top Activating Examples: Inputs that maximally activate features
- Feature Descriptions: Auto-generated via Neuronpedia (optional)
We welcome contributions! Areas for improvement:
- 🔬 Research: Novel SAE architectures, evaluation metrics
- 🎨 Visualization: Better dashboards, interactive tools
- 📚 Documentation: Tutorials, case studies, explanations
- 🔧 Engineering: Performance optimizations, bug fixes
- 🧪 Experiments: Discover interesting features, share findings
Getting started:
- Open an issue to discuss your idea
- Fork the repo and create a feature branch
- Submit a PR with clear description and tests
- We'll review and provide feedback
If you use nanochat-SAE in your research:
@software{nanochat_sae_2025,
  title = {nanochat-SAE: Mechanistic Interpretability for Nanochat},
  author = {DeLeeuw, Caleb},
  year = {2025},
  url = {https://github.com/SolshineCode/nanochat-SAE},
  note = {Research extension of nanochat by Andrej Karpathy}
}

@software{nanochat_2025,
  title = {nanochat: The best ChatGPT that $100 can buy},
  author = {Karpathy, Andrej},
  year = {2025},
  url = {https://github.com/karpathy/nanochat}
}

Related papers:
- Scaling and Evaluating Sparse Autoencoders (OpenAI, 2024)
- Towards Monosemanticity (Anthropic, 2023)
- Sparse Autoencoders Find Highly Interpretable Features in Language Models (Cunningham et al., 2023)
Related tools:
- SAELens - Comprehensive SAE training library
- Neuronpedia - Platform for sharing and exploring features
- TransformerLens - Mechanistic interpretability toolkit
Community:
- Nanochat Discussions - Main nanochat community
- Alignment Forum - Interpretability research discussions
- EleutherAI Discord - AI research community
Acknowledgments:
- Andrej Karpathy for creating nanochat and inspiring accessible AI education
- OpenAI Superalignment Team for pioneering SAE scaling research
- Anthropic for mechanistic interpretability foundations
- SAELens contributors for open-source SAE tools
- Neuronpedia team for feature sharing infrastructure
MIT License (same as nanochat)
# Clone and train your model
git clone https://github.com/SolshineCode/nanochat-SAE
cd nanochat-SAE
bash speedrun.sh
# Train SAEs and explore
python -m scripts.sae_train --checkpoint models/d20/base_final.pt --layer 10
python -m scripts.sae_viz --sae_path sae_outputs/layer_10/best_model.pt --all_features

Questions? Open an issue or start a discussion!
Found something cool? Tweet at @karpathy and share your discoveries!
nanochat-SAE: Because understanding your $100 ChatGPT is just as important as building it.

