Model Resonance Imaging — visualize LLM internals like a brain MRI
Updated Apr 10, 2026 · Python
We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.
Mechanistic interpretability tool visualizing GPT-2's layer-by-layer predictions using the logit lens technique
MSc Thesis: Bridging mechanistic interpretability circuits to faithful natural language explanations using ERASER evaluation metrics
An independent, from-scratch reproduction of the mechanistic-interpretability findings in Anthropic's "When Models Manipulate Manifolds: The Geometry of a Counting Task"
Mechanistic interpretability study comparing modular addition and subtraction circuits in 1-layer attention-only transformers via activation patching, logit lens, SVD circuit analysis, Fourier feature analysis, and causal scrubbing across three training stages.
Mechanistic interpretability tool to detect induction heads in GPT-2 using TransformerLens
Measuring attention similarity bias in GPT-2 variants via TransformerLens. Replicates Figure 1 of arXiv:2603.09078. Finds a U-shaped trend rather than the monotonically increasing one the paper reports.
Mechanistic interpretability of small transformers: RSK correspondence and Pythia-70m
Replication of 'From Reasoning to Answer' (EMNLP 2025) — Reasoning-Focus Heads + Activation Patching on DeepSeek-R1-Distill-Qwen-7B
Mechanistic interpretability: belief-state geometry in a transformer's residual stream. From-scratch replication of Shai et al. 2024 (arXiv:2405.15943).
Causal intervention framework for mechanistic interpretability research. Implements activation patching methodology for identifying causally important components in transformer language models.
Local agent-driven mechanistic interpretability research platform for Apple Silicon
Inspired by Alvin Lucier's "I Am Sitting in a Room" (1969), this project applies an analogous rendering process to GPT-2 Small: the model's activation tensor is excited through iterative forward-pass feedback, repeated 500 times. As semantic content dissolves, dominant attractor states emerge, revealing the model's naked inner voice.
Forensic suite for Mechanistic Interpretability in Transformers. Implementing 0.0054 Basal Accountability Gradients for auditing model logic using TransformerLens and SAELens
Ask your coding agent WHY a language model made a prediction — mechanistic interpretability (logit lens, activation patching, SAE features, steering) as a drop-in agent skill. Validated against published circuits.
Automated Forensic Discovery of Reasoning Circuits in Transformers
🧠 Unmasking the AI black box: A hands-on experiment in mechanistic interpretability for the AI-curious optimist.
eXplainable AI course