transformer-lens

Here are 26 public repositories matching this topic...

JihoonJeong / Neural-MRI

Model Resonance Imaging — visualize LLM internals like a brain MRI

react visualization transformers pytorch d3js interpretability fastapi llm mechanistic-interpretability transformer-lens

Updated Apr 10, 2026
Python

RishabSA / interp-refusal-tokens

We study whether categorical refusal tokens enable controllable and interpretable safety behavior in language models.

machine-learning research ai deep-learning pytorch artificial-intelligence safety llama steering neurips llm mechanistic-interpretability llm-safety refusal llama3 transformer-lens llm-refusal

Updated Jun 1, 2026
Python

designer-coderajay / logit-lens-explorer

Mechanistic interpretability tool visualizing GPT-2's layer-by-layer predictions using the logit lens technique

nlp deep-learning transformers pytorch gpt-2 streamlit mechanistic-interpretability transformer-lens

Updated Feb 13, 2026
Python

designer-coderajay / Causally-Grounded-Mechanistic-Interpretability-for-LLMs-with-Faithful-Natural-Language-Explanations

MSc Thesis: Bridging mechanistic interpretability circuits to faithful natural language explanations using ERASER evaluation metrics

msc-thesis explainability gpt-2 natural-language-explanations mechanistic-interpretability transformer-lens eraser-metrics

Updated May 30, 2026
Jupyter Notebook

GuanchunLi / manifold-counting-task

An independent, from-scratch reproduction of the mechanistic-interpretability findings in Anthropic's 'When Models Manipulate Manifolds: The Geometry of a Counting Task'

gemma sparse-autoencoders mechanistic-interpretability transformer-lens causal-interventions

Updated Jun 22, 2026
Jupyter Notebook

adaboranyilmaz / mechanistic-circuit-comparison

Mechanistic interpretability study comparing modular addition and subtraction circuits in 1-layer attention-only transformers via activation patching, logit lens, SVD circuit analysis, Fourier feature analysis, and causal scrubbing across three training stages.

transformers circuits pytorch modular-arithmetic attention-mechanism circuit-analysis fourier-analysis grokking interpretable-machine-learning algorithmic-analysis mechanistic-interpretability causal-intervention activation-patching transformer-lens

Updated May 2, 2026
Python

designer-coderajay / induction-head-detector

Mechanistic interpretability tool to detect induction heads in GPT-2 using TransformerLens

nlp machine-learning deep-learning transformers pytorch gpt-2 attention-heads mechanistic-interpretability transformer-lens

Updated Dec 15, 2025
Python

FrancescoPaoloL / xsa_POC

Measures attention similarity bias in GPT-2 variants via TransformerLens. Replicates Figure 1 of arXiv:2603.09078 and finds a U-shaped trend rather than the monotonically increasing one the paper reports.

python transformers attention-mechanism mechanistic-interpretability transformer-lens

Updated May 20, 2026
Python

RaggedR / octopus-streams

Mechanistic interpretability of small transformers: RSK correspondence and Pythia-70m

transformers pythia algebraic-combinatorics mechanistic-interpretability transformer-lens

Updated Jun 10, 2026
Python

atgugu / mechinterp-rfh-replication

Replication of 'From Reasoning to Answer' (EMNLP 2025) — Reasoning-Focus Heads + Activation Patching on DeepSeek-R1-Distill-Qwen-7B

reasoning attention-heads mechanistic-interpretability llm-interpretability deepseek-r1 emnlp-2025 transformer-lens

Updated Mar 27, 2026
Jupyter Notebook

LGOICOUR / belief-state-geometry

Mechanistic interpretability: belief-state geometry in a transformer's residual stream. From-scratch replication of Shai et al. 2024 (arXiv:2405.15943).

transformers pytorch computational-mechanics interpretability mechanistic-interpretability neurips-2024 transformer-lens belief-states

Updated Jun 25, 2026
Jupyter Notebook

himanshuvnm / TransformerLensCausalTracing

neural-network transformer transformer-architecture causal-tracing transformer-lens

Updated Jan 22, 2026
Jupyter Notebook

designer-coderajay / activation-patching-framework

Causal intervention framework for mechanistic interpretability research. Implements activation patching methodology for identifying causally important components in transformer language models.

nlp machine-learning deep-learning pytorch interpretability gpt-2 mechanistic-interpretability causal-tracing activation-patching transformer-lens

Updated Dec 17, 2025
Python

ashlrai / mechanistic-interpretability

Local agent-driven mechanistic interpretability research platform for Apple Silicon

sparse-autoencoders ai-safety acdc interpretability apple-silicon mechanistic-interpretability activation-patching abliteration mech-interp transformer-lens

Updated May 28, 2026
Python

earlyprototype / lucier-gpt2-activ-tensor-reson-experiments

Inspired by Alvin Lucier's 'I Am Sitting in a Room' (1969), this project applies an analogous rendering process to GPT-2 Small: the model's activation tensor is excited through iterative forward-pass feedback, repeated 500 times. As semantic content dissolves, dominant attractor states emerge, revealing the model's naked inner voice.

pytorch ai-research gpt-2 mechanistic-interpretability transformer-lens

Updated Jul 2, 2026
Jupyter Notebook

nekaeve24 / Neural-DNA-Forensics

Forensic suite for Mechanistic Interpretability in Transformers. Implementing 0.0054 Basal Accountability Gradients for auditing model logic using TransformerLens and SAELens

pytorch ai-safety quantitative-research mechanistic-interpretability transformer-lens

Updated Mar 9, 2026
Python

azrabano23 / interp

Ask your coding agent WHY a language model made a prediction — mechanistic interpretability (logit lens, activation patching, SAE features, steering) as a drop-in agent skill. Validated against published circuits.

transformers sparse-autoencoders ai-safety interpretability llm mechanistic-interpretability activation-patching claude-code transformer-lens agent-skill

Updated Jun 8, 2026
Python

sagnikc395 / circuit-surgeon

Automated Forensic Discovery of Reasoning Circuits in Transformers

pytorch llms mech-interp transformer-lens

Updated Apr 28, 2026
Python

AryanDinakaran / mechanistic-interpretability-1-layer-transformer

🧠 Unmasking the AI black box: A hands-on experiment in mechanistic interpretability for the AI-curious optimist.

mechanistic-interpretability transformer-lens

Updated Jun 1, 2026
Python

grishinak / spbu-xai-course

eXplainable AI course

xai torchvision captum umap-learn transformer-lens

Updated May 30, 2026
Jupyter Notebook
