
Ref imp3 mech interp #26

Merged
aravind-3105 merged 18 commits into main from ref-imp3-mech-interp
Mar 18, 2026

Conversation

@ShutingXie
Collaborator

@ShutingXie ShutingXie commented Mar 11, 2026

Summary

This pull request introduces a new mechanistic interpretability module for LLMs and VLMs, providing both documentation and a local Python package setup.

Clickup Ticket(s): Link(s) if applicable.

Type of Change

  • 🐛 Bug fix (non-breaking change that fixes an issue)
  • ✨ New feature (non-breaking change that adds functionality)
  • 💥 Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • 📝 Documentation update
  • 🔧 Refactoring (no functional changes)
  • ⚡ Performance improvement
  • 🧪 Test improvements
  • 🔒 Security fix

Changes Made

  • Add implementations/mechanistic_interpretability/README.md describing the module, prerequisites, notebook goals, dependencies, and resources.
  • Add two tutorial notebooks under implementations/mechanistic_interpretability/src/:
    • Mechanistic_Interpretability_LLM_Tutorial.ipynb (SAE / superposition workflow)
    • Mechanistic_Interpretability_VLM_Tutorial.ipynb (logit lens + activation patching for modality fusion)
  • Add implementations/mechanistic_interpretability/pyproject.toml to make the module installable (e.g., pip install -e .) with pinned dependencies.
  • Add supporting example data implementations/mechanistic_interpretability/data/cat.jpg.

Testing

  • Tests pass locally (uv run pytest tests/)
  • Type checking passes (uv run mypy <src_dir>)
  • Linting passes (uv run ruff check src_dir/)
  • Manual testing performed (describe below)

Manual testing details:

Screenshots/Recordings

Related Issues

Deployment Notes

Checklist

  • Code follows the project's style guidelines
  • Self-review of code completed
  • Documentation updated (if applicable)
  • No sensitive information (API keys, credentials) exposed

@aravind-3105
Member

@ShutingXie I ran it on the coder platform and found some small dependency differences, which I addressed myself with a commit. I also moved the notebooks to the root instead of src/ to keep it consistent with other implementations.


Copilot AI left a comment


Pull request overview

This PR adds a new mechanistic interpretability module (docs + notebooks) and sets it up as an installable local Python package.

Changes:

  • Added a pyproject.toml to install the module with pinned dependencies.
  • Added module-level README documenting goals, prerequisites, and how to run the tutorials.
  • Added an LLM mechanistic-interpretability tutorial notebook (SAE workflow + steering + “dark matter” metrics).

Reviewed changes

Copilot reviewed 2 out of 5 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| implementations/mechanistic_interpretability/pyproject.toml | Defines the installable package and pins runtime dependencies for the tutorials. |
| implementations/mechanistic_interpretability/README.md | Documents the module purpose, prerequisites, notebook goals, and setup instructions. |
| implementations/mechanistic_interpretability/Mechanistic_Interpretability_LLM_Tutorial.ipynb | Provides a hands-on SAE/TransformerLens tutorial, including feature discovery, steering, and reconstruction/behavior metrics. |


Member

@aravind-3105 aravind-3105 left a comment


I ran it on the coder platform and found some small dependency differences, which I addressed myself with a commit. I also moved the notebooks to the root instead of src/ to keep it consistent with other implementations.
I've added a Copilot review to this as well. Check its comments once and it's good to go from my end. Great job on the implementation!

Collaborator

@shainarazavi shainarazavi left a comment


Are the image sources cited, even if it's a single image?

Collaborator

@shainarazavi shainarazavi left a comment


Overall this is a very nice tutorial notebook. The structure and explanations are clear and it works well as a teaching resource. I have a few small suggestions for clarity:

In a few places the code refers to MLP output activations, but for GPT-2 the SAE here is trained on the residual stream (hook_resid_pre). It might be clearer to rename variables like mlp_out to something more general such as activations to avoid confusion.
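To illustrate the rename, here is a minimal pure-PyTorch sketch (a stand-in for the notebook's TransformerLens setup; the toy `block` and shapes are illustrative, and in the notebook the tensor would come from the `blocks.<i>.hook_resid_pre` cache entry):

```python
import torch
import torch.nn as nn

# Toy stand-in for one transformer block; in the notebook the hook point
# is TransformerLens's "blocks.<i>.hook_resid_pre" on GPT-2.
block = nn.Sequential(nn.LayerNorm(16), nn.Linear(16, 16))

captured = {}

def grab_resid_pre(module, inputs):
    # The SAE is trained on the residual stream *entering* the block,
    # so name the tensor generically rather than `mlp_out`.
    captured["activations"] = inputs[0].detach()

handle = block.register_forward_pre_hook(grab_resid_pre)
x = torch.randn(2, 5, 16)  # (batch, seq, d_model)
block(x)
handle.remove()

activations = captured["activations"]
print(activations.shape)  # torch.Size([2, 5, 16])
```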

There appears to be a small typo in the KL computation:

```python
kl = kl = torch.nn.functional.kl_div(...)
```

This likely should be:

```python
kl = torch.nn.functional.kl_div(...)
```
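Beyond the duplicated assignment, it may be worth double-checking the call's arguments; a hedged sketch of one correct usage (the logits names and shapes here are hypothetical, not the notebook's):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
logits_clean = torch.randn(4, 100)  # logits from the clean forward pass
logits_recon = torch.randn(4, 100)  # logits after splicing in the SAE reconstruction

# F.kl_div expects its first argument in log-space; passing the target as
# log-probabilities with log_target=True is numerically safer. This computes
# KL(P_clean || P_recon), averaged over the batch.
kl = F.kl_div(
    F.log_softmax(logits_recon, dim=-1),
    F.log_softmax(logits_clean, dim=-1),
    reduction="batchmean",
    log_target=True,
)
print(kl.item())
```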
The masking removes BOS/PAD tokens, but it may also help to filter the EOS token so it does not appear in the top-activation contexts.
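A sketch of such a filter (the token ids below are illustrative; for GPT-2-style tokenizers the BOS/EOS/PAD roles can all share id 50256, so the mask treats them uniformly):

```python
import torch

# Hypothetical special-token ids; in the notebook these would come from
# the model's tokenizer rather than being hard-coded.
BOS_ID, EOS_ID, PAD_ID = 50256, 50256, 50256

tokens = torch.tensor([[50256, 11, 464, 3290, 50256],
                       [50256, 818, 262, 50256, 50256]])

# Mask out BOS, EOS, and PAD so they never surface in top-activation contexts.
special = torch.tensor([BOS_ID, EOS_ID, PAD_ID])
keep = ~torch.isin(tokens, special)
print(keep)
```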

In the clamping section, it might be helpful to mention that the encode → edit → decode step inside the hook can be computationally expensive during generation, which is fine for a tutorial but worth noting.
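To make the cost concrete, here is a toy version of the encode → edit → decode hook (the SAE weights, feature index, and clamp value below are all illustrative, not the notebook's):

```python
import torch
import torch.nn as nn

d_model, d_sae = 16, 64
W_enc = torch.randn(d_model, d_sae)
W_dec = torch.randn(d_sae, d_model)

FEATURE_ID, CLAMP_VALUE = 3, 8.0  # hypothetical feature index / strength

def clamp_hook(module, inputs, output):
    # encode -> edit -> decode on every forward pass; during autoregressive
    # generation this runs once per generated token, which is the overhead
    # noted above (fine for a tutorial, slow at scale).
    feats = torch.relu(output @ W_enc)    # encode into SAE feature space
    feats[..., FEATURE_ID] = CLAMP_VALUE  # clamp one feature
    return feats @ W_dec                  # decode back to the residual stream

layer = nn.Linear(16, d_model)
handle = layer.register_forward_hook(clamp_hook)
x = torch.randn(2, 5, 16)
out = layer(x)
handle.remove()
print(out.shape)  # torch.Size([2, 5, 16])
```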

These are mostly small clarity improvements.

Collaborator

@shainarazavi shainarazavi left a comment


I think the same small typo in the KL computation (`kl = kl = torch.nn.functional.kl_div(...)`) is still present; you can correct it.
The masking removes BOS and PAD tokens, which is good.
However, it may also be useful to filter the EOS token so that it does not appear in the top-activation contexts.

In a few comments the text still refers to MLP-out activations, even though the GPT-2 configuration uses the residual stream hook. It may help to make the wording consistent with the chosen hook point.

In the feature clamping section, the encode → edit → decode operation happens inside a forward hook during generation. It might be helpful to briefly note that this can be computationally expensive, but is acceptable for a tutorial demonstration.

@shainarazavi
Collaborator

@ShutingXie thanks for the work. @aravind-3105 has addressed most of the comments himself (thanks to him); you can resolve the remaining few, let Aravind know, and then merge.

@ShutingXie
Collaborator Author

@shainarazavi @aravind-3105 Thank you for the comments! I’ve addressed the remaining ones. Reassigning @aravind-3105 as reviewer for a final look before merging.

@ShutingXie ShutingXie requested a review from aravind-3105 March 17, 2026 07:40
@aravind-3105 aravind-3105 force-pushed the ref-imp3-mech-interp branch from 0ea5d79 to 3354f6b on March 18, 2026 15:18
@aravind-3105
Member

> @shainarazavi @aravind-3105 Thank you for the comments! I’ve addressed the remaining ones. Reassigning @aravind-3105 as reviewer for a final look before merging.

Thanks Shuting for the updates. Everything looks good. I've added minor updates based on the coder platform. I will merge it now.

@aravind-3105 aravind-3105 merged commit 39f1aac into main Mar 18, 2026
2 checks passed
@aravind-3105 aravind-3105 deleted the ref-imp3-mech-interp branch March 18, 2026 16:46

Labels

enhancement New feature or request

Projects

None yet

Development


4 participants