
Ref imp3 mech interp #26

Merged
aravind-3105 merged 18 commits into main from ref-imp3-mech-interp
Mar 18, 2026

Conversation

@ShutingXie
Collaborator

@ShutingXie ShutingXie commented Mar 11, 2026

Summary

This pull request introduces a new mechanistic interpretability module for LLMs and VLMs, providing both documentation and a local Python package setup.

Clickup Ticket(s): Link(s) if applicable.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Add implementations/mechanistic_interpretability/README.md describing the module, prerequisites, notebook goals, dependencies, and resources.
  • Add two tutorial notebooks under implementations/mechanistic_interpretability/src/:
    • Mechanistic_Interpretability_LLM_Tutorial.ipynb (SAE / superposition workflow)
    • Mechanistic_Interpretability_VLM_Tutorial.ipynb (logit lens + activation patching for modality fusion)
  • Add implementations/mechanistic_interpretability/pyproject.toml to make the module installable (e.g., pip install -e .) with pinned dependencies.
  • Add supporting example data implementations/mechanistic_interpretability/data/cat.jpg.

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:

Screenshots/Recordings

Related Issues

Deployment Notes

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@aravind-3105
Member

@ShutingXie I ran it on the coder platform and found some small dependency differences, which I addressed myself with a commit. I also moved the notebooks to the root instead of src/ to keep it consistent with other implementations.


Copilot AI left a comment


Pull request overview

This PR adds a new mechanistic interpretability module (docs + notebooks) and sets it up as an installable local Python package.

Changes:

  • Added a pyproject.toml to install the module with pinned dependencies.
  • Added module-level README documenting goals, prerequisites, and how to run the tutorials.
  • Added an LLM mechanistic-interpretability tutorial notebook (SAE workflow + steering + “dark matter” metrics).

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| implementations/mechanistic_interpretability/pyproject.toml | Defines the installable package and pins runtime dependencies for the tutorials. |
| implementations/mechanistic_interpretability/README.md | Documents the module purpose, prerequisites, notebook goals, and setup instructions. |
| implementations/mechanistic_interpretability/Mechanistic_Interpretability_LLM_Tutorial.ipynb | Provides a hands-on SAE/TransformerLens tutorial, including feature discovery, steering, and reconstruction/behavior metrics. |


Member

@aravind-3105 aravind-3105 left a comment


I ran it on the coder platform and found some small dependency differences, which I addressed myself with a commit. I also moved the notebooks to the root instead of src/ to keep it consistent with other implementations.
I've added a Copilot review to this as well. Check its comments once and it's good to go from my end. Great job on the implementation!

Collaborator

@shainarazavi shainarazavi left a comment


Are the image sources cited, even if it's a single image?

Collaborator

@shainarazavi shainarazavi left a comment


Overall this is a very nice tutorial notebook. The structure and explanations are clear and it works well as a teaching resource. I have a few small suggestions for clarity:

In a few places the code refers to MLP output activations, but for GPT-2 the SAE here is trained on the residual stream (hook_resid_pre). It might be clearer to rename variables like mlp_out to something more general such as activations to avoid confusion.
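To illustrate the rename, here is a minimal pure-PyTorch sketch (a stand-in for the notebook's TransformerLens setup; the toy `block` and shapes are illustrative, and in the notebook the tensor would come from the `blocks.<i>.hook_resid_pre` cache entry):

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in the notebook the hook point
# is TransformerLens's "blocks.<i>.hook_resid_pre" on GPT-2.
block = nn.Sequential(nn.LayerNorm(16), nn.Linear(16, 16))

captured = {}

def grab_resid_pre(module, inputs):
    # The SAE is trained on the residual stream *entering* the block,
    # so name the tensor generically rather than `mlp_out`.
    captured["activations"] = inputs[0].detach()

handle = block.register_forward_pre_hook(grab_resid_pre)
x = torch.randn(2, 5, 16)  # (batch, seq, d_model)
block(x)
handle.remove()

activations = captured["activations"]
print(activations.shape)  # torch.Size([2, 5, 16])
```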

There appears to be a small typo in the KL computation:

```python
kl = kl = torch.nn.functional.kl_div(...)
```

This likely should be:

```python
kl = torch.nn.functional.kl_div(...)
```
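Beyond the duplicated assignment, it may be worth double-checking the call's arguments; a hedged sketch of one correct usage (the logits names and shapes here are hypothetical, not the notebook's):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits_clean = torch.randn(4, 100)  # logits from the clean forward pass
logits_recon = torch.randn(4, 100)  # logits after splicing in the SAE reconstruction

# F.kl_div expects its first argument in log-space; passing the target as
# log-probabilities with log_target=True is numerically safer. This computes
# KL(P_clean || P_recon), averaged over the batch.
kl = F.kl_div(
    F.log_softmax(logits_recon, dim=-1),
    F.log_softmax(logits_clean, dim=-1),
    reduction="batchmean",
    log_target=True,
)
print(kl.item())
```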
The masking removes BOS/PAD tokens, but it may also help to filter the EOS token so it does not appear in the top-activation contexts.
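A sketch of such a filter (the token ids below are illustrative; for GPT-2-style tokenizers the BOS/EOS/PAD roles can all share id 50256, so the mask treats them uniformly):

```python
import torch

# Hypothetical special-token ids; in the notebook these would come from
# the model's tokenizer rather than being hard-coded.
BOS_ID, EOS_ID, PAD_ID = 50256, 50256, 50256

tokens = torch.tensor([[50256, 11, 464, 3290, 50256],
                       [50256, 818, 262, 50256, 50256]])

# Mask out BOS, EOS, and PAD so they never surface in top-activation contexts.
special = torch.tensor([BOS_ID, EOS_ID, PAD_ID])
keep = ~torch.isin(tokens, special)
print(keep)
```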

In the clamping section, it might be helpful to mention that the encode → edit → decode step inside the hook can be computationally expensive during generation, which is fine for a tutorial but worth noting.
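To make the cost concrete, here is a toy version of the encode → edit → decode hook (the SAE weights, feature index, and clamp value below are all illustrative, not the notebook's):

```python
import torch
import torch.nn as nn

d_model, d_sae = 16, 64
W_enc = torch.randn(d_model, d_sae)
W_dec = torch.randn(d_sae, d_model)

FEATURE_ID, CLAMP_VALUE = 3, 8.0  # hypothetical feature index / strength

def clamp_hook(module, inputs, output):
    # encode -> edit -> decode on every forward pass; during autoregressive
    # generation this runs once per generated token, which is the overhead
    # noted above (fine for a tutorial, slow at scale).
    feats = torch.relu(output @ W_enc)    # encode into SAE feature space
    feats[..., FEATURE_ID] = CLAMP_VALUE  # clamp one feature
    return feats @ W_dec                  # decode back to the residual stream

layer = nn.Linear(16, d_model)
handle = layer.register_forward_hook(clamp_hook)
x = torch.randn(2, 5, 16)
out = layer(x)
handle.remove()
print(out.shape)  # torch.Size([2, 5, 16])
```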

These are mostly small clarity improvements.

Collaborator

@shainarazavi shainarazavi left a comment


I think the same small typo in the KL computation (`kl = kl = torch.nn.functional.kl_div(...)`) is still present; you can correct it.
The masking removes BOS and PAD tokens, which is good.
However, it may also be useful to filter the EOS token so that it does not appear in the top-activation contexts.

In a few comments the text still refers to MLP-out activations, even though the GPT-2 configuration uses the residual stream hook. It may help to make the wording consistent with the chosen hook point.

In the feature clamping section, the encode → edit → decode operation happens inside a forward hook during generation. It might be helpful to briefly note that this can be computationally expensive, but is acceptable for a tutorial demonstration.

@shainarazavi
Collaborator

@ShutingXie thanks for the work. @aravind-3105 has addressed most of the comments himself (thanks to him); you can resolve the remaining few, let Aravind know, and then merge.

@ShutingXie
Collaborator Author

@shainarazavi @aravind-3105 Thank you for the comments! I’ve addressed the remaining ones. Reassigning @aravind-3105 as reviewer for a final look before merging.

@ShutingXie ShutingXie requested a review from aravind-3105 March 17, 2026 07:40
@aravind-3105 aravind-3105 force-pushed the ref-imp3-mech-interp branch from 0ea5d79 to 3354f6b on March 18, 2026 15:18
@aravind-3105
Member

> @shainarazavi @aravind-3105 Thank you for the comments! I’ve addressed the remaining ones. Reassigning @aravind-3105 as reviewer for a final look before merging.

Thanks Shuting for the updates. Everything looks good. I've added minor updates based on the coder platform. I will merge it now.

@aravind-3105 aravind-3105 merged commit 39f1aac into main Mar 18, 2026
2 checks passed
@aravind-3105 aravind-3105 deleted the ref-imp3-mech-interp branch March 18, 2026 16:46

Labels

enhancement New feature or request

Projects

None yet

Development


4 participants