Conversation
Pull request overview
This PR adds a new mechanistic interpretability module (docs + notebooks) and sets it up as an installable local Python package.
Changes:
- Added a `pyproject.toml` to install the module with pinned dependencies.
- Added a module-level README documenting goals, prerequisites, and how to run the tutorials.
- Added an LLM mechanistic-interpretability tutorial notebook (SAE workflow + steering + “dark matter” metrics).
Reviewed changes
Copilot reviewed 2 out of 5 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| implementations/mechanistic_interpretability/pyproject.toml | Defines the installable package and pins runtime dependencies for the tutorials. |
| implementations/mechanistic_interpretability/README.md | Documents the module purpose, prerequisites, notebook goals, and setup instructions. |
| implementations/mechanistic_interpretability/Mechanistic_Interpretability_LLM_Tutorial.ipynb | Provides a hands-on SAE/TransformerLens tutorial, including feature discovery, steering, and reconstruction/behavior metrics. |
aravind-3105
left a comment
I ran it on the coder platform and found some small dependency differences, which I addressed myself with a commit. I also moved the notebooks to the root instead of `src/` to keep it consistent with other implementations.
I've added a Copilot review as well. Check its comments once and it's good to go from my end. Great job on the implementation!
shainarazavi
left a comment
Are the image sources cited, even if it's a single image?
shainarazavi
left a comment
Overall this is a very nice tutorial notebook. The structure and explanations are clear and it works well as a teaching resource. I have a few small suggestions for clarity:
In a few places the code refers to MLP output activations, but for GPT-2 the SAE here is trained on the residual stream (`hook_resid_pre`). It might be clearer to rename variables like `mlp_out` to something more general such as `activations` to avoid confusion.
There appears to be a small typo in the KL computation:
`kl = kl = torch.nn.functional.kl_div(...)`
This likely should be:
`kl = torch.nn.functional.kl_div(...)`
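For reference, a minimal sketch of the intended computation. The variable names (`clean_logits`, `recon_logits`) are illustrative placeholders, not the notebook's actual names; the key point is that PyTorch's `kl_div` expects log-probabilities as `input` and probabilities as `target`, and computes KL(target || input):

```python
import torch
import torch.nn.functional as F

# Stand-in logits for the clean forward pass and the SAE-reconstructed
# pass (shapes and values are illustrative only).
clean_logits = torch.randn(4, 50257)   # [batch, vocab]
recon_logits = torch.randn(4, 50257)

log_p_recon = F.log_softmax(recon_logits, dim=-1)  # input: log-probs
p_clean = F.softmax(clean_logits, dim=-1)          # target: probs

# reduction="batchmean" divides the summed KL by the batch size,
# which the PyTorch docs note is the mathematically correct reduction.
kl = F.kl_div(log_p_recon, p_clean, reduction="batchmean")
```

A result of zero would mean the reconstructed distribution matches the clean one exactly; larger values indicate more behavioral drift from the SAE reconstruction.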
The masking removes BOS/PAD tokens, but it may also help to filter the EOS token so it does not appear in the top-activation contexts.
In the clamping section, it might be helpful to mention that the encode → edit → decode step inside the hook can be computationally expensive during generation, which is fine for a tutorial but worth noting.
These are mostly small clarity improvements.
shainarazavi
left a comment
I think the same small typo is in the KL computation (`kl = kl = torch.nn.functional.kl_div`); you can correct it.
The masking removes BOS and PAD tokens, which is good.
However, it may also be useful to filter the EOS token so that it does not appear in the top-activation contexts.
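The suggested masking can be sketched as follows. The token ids and the BOS/EOS/PAD values are illustrative GPT-2-style stand-ins, not taken from the notebook; the idea is simply to zero out special-token positions so they can never win the top-k selection:

```python
import torch

torch.manual_seed(0)  # for reproducibility of this sketch

# Stand-in token ids: GPT-2 uses 50256 for both BOS and EOS; 0 here
# plays the role of a PAD id (all values illustrative).
tokens = torch.tensor([[50256, 312, 1045, 50256, 0]])  # [batch, seq]
bos_id, eos_id, pad_id = 50256, 50256, 0

special = (tokens == bos_id) | (tokens == eos_id) | (tokens == pad_id)

acts = torch.rand(tokens.shape)        # stand-in feature activations
acts = acts.masked_fill(special, 0.0)  # masked positions can't be selected

topk = acts.flatten().topk(2)          # only real content tokens survive
```

Zeroing (rather than dropping) the masked positions keeps tensor shapes intact, which makes it easy to map top-k indices back to `(batch, position)` context windows.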
In a few comments the text still refers to MLP-out activations, even though the GPT-2 configuration uses the residual stream hook. It may help to make the wording consistent with the chosen hook point.
In the feature clamping section, the encode → edit → decode operation happens inside a forward hook during generation. It might be helpful to briefly note that this can be computationally expensive, but is acceptable for a tutorial demonstration.
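The encode → edit → decode pattern described above can be sketched with plain linear maps standing in for the SAE (the weights, feature index, and clamp value below are all illustrative, not the notebook's actual SAE or hook):

```python
import torch

d_model, d_sae = 16, 64          # stand-in dimensions
feat_idx, clamp_val = 3, 5.0     # illustrative feature and clamp value
W_enc = torch.randn(d_model, d_sae)
W_dec = torch.randn(d_sae, d_model)

def clamp_hook(resid: torch.Tensor) -> torch.Tensor:
    """Run inside a forward hook: resid is [batch, seq, d_model]."""
    feats = torch.relu(resid @ W_enc)  # encode into SAE feature space
    feats[..., feat_idx] = clamp_val   # clamp one feature to a fixed value
    return feats @ W_dec               # decode back to the residual stream

out = clamp_hook(torch.randn(2, 4, d_model))
```

Because this full encode/decode round-trip runs at every forward pass of every generated token, generation cost grows with `d_sae`; acceptable for a tutorial, but worth flagging for readers who scale it up.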
@ShutingXie thanks for the work. @aravind-3105 has addressed most of the comments himself, thanks to him; the remaining few you can resolve and let Aravind know, and then merge.
@shainarazavi @aravind-3105 Thank you for the comments! I've addressed the remaining ones. Reassigning @aravind-3105 as reviewer for a final look before merging.
…ility implementation
… skip zero-activation positions in top-k selection, and omit BOS from context windows
…ping during generation
0ea5d79 to 3354f6b
…encies into root pyproject.toml and remove obsolete implementation-specific file
Thanks Shuting for the updates. Everything looks good. I've added minor updates based on the coder platform. I will merge it now.
Summary
This pull request introduces a new mechanistic interpretability module for LLMs and VLMs, providing both documentation and a local Python package setup.
Clickup Ticket(s): Link(s) if applicable.
Type of Change
Changes Made
- Added `implementations/mechanistic_interpretability/README.md` describing the module, prerequisites, notebook goals, dependencies, and resources.
- Added notebooks under `implementations/mechanistic_interpretability/src/`:
  - `Mechanistic_Interpretability_LLM_Tutorial.ipynb` (SAE / superposition workflow)
  - `Mechanistic_Interpretability_VLM_Tutorial.ipynb` (logit lens + activation patching for modality fusion)
- Added `implementations/mechanistic_interpretability/pyproject.toml` to make the module installable (e.g., `pip install -e .`) with pinned dependencies.
- Added `implementations/mechanistic_interpretability/data/cat.jpg`.
- `uv run pytest tests/`
- `uv run mypy <src_dir>`
- `uv run ruff check src_dir/`

Manual testing details:
Screenshots/Recordings
Related Issues
Deployment Notes
Checklist