
Conversation

@kylesayrs (Collaborator) commented Aug 19, 2025

Coauthored with @dichn!

Purpose

  • Add support for the calibrate_all_experts option, which sends all tokens to all experts during calibration while still producing the same outputs as if the tokens had been gated normally

Changes

  • Modify model definitions so that, when calibrate_all_experts=True, token gating is applied after tokens are passed to the experts rather than before (see the sketch after the snippet below)
```python
# `calibrate_all_experts=True` by default
model = replace_modules_for_calibration(model, calibrate_all_experts=True)
```
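For reference, a minimal sketch of the idea (the ToyMoEBlock below is hypothetical, not the actual llm-compressor model definitions): every expert runs on every token, but the router's top-k weights still decide which expert outputs contribute, so the result matches the normally gated forward.

```python
# Hypothetical toy module, not the actual llm-compressor model definitions.
import torch
import torch.nn as nn


class ToyMoEBlock(nn.Module):
    def __init__(self, hidden_size, num_experts, top_k, calibrate_all_experts=False):
        super().__init__()
        self.router = nn.Linear(hidden_size, num_experts, bias=False)
        self.experts = nn.ModuleList(
            nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)
        )
        self.top_k = top_k
        self.calibrate_all_experts = calibrate_all_experts

    def forward(self, hidden_states):
        # hidden_states: (num_tokens, hidden_size)
        routing_weights = self.router(hidden_states).softmax(dim=-1)
        topk_weights, topk_ids = routing_weights.topk(self.top_k, dim=-1)

        output = torch.zeros_like(hidden_states)
        for expert_id, expert in enumerate(self.experts):
            # per-token gate weight for this expert; exactly 0 for tokens
            # that were not routed to it
            gate = (topk_weights * (topk_ids == expert_id)).sum(dim=-1, keepdim=True)

            if self.calibrate_all_experts:
                # every expert sees every token, but un-routed tokens have
                # gate == 0, so the final output is unchanged
                output += gate * expert(hidden_states)
            else:
                # default path: only routed tokens are sent to the expert
                token_mask = (topk_ids == expert_id).any(dim=-1)
                output[token_mask] += gate[token_mask] * expert(hidden_states[token_mask])

        return output
```

With this arrangement, calibration observes activations for every expert even when the calibration data would otherwise route few or no tokens to some of them.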

Testing

  • Added correctness tests for the new model definitions which check that outputs exactly match the original definitions
  • Added hook tests to verify that all experts are sent tokens (a rough sketch of this kind of check follows below)
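As a rough sketch of such a hook-based check, reusing the hypothetical ToyMoEBlock from the sketch above (the real tests exercise the actual model definitions):

```python
# Hypothetical test; the real tests exercise the actual model definitions.
import torch


def test_all_experts_receive_all_tokens():
    torch.manual_seed(0)
    calibrated = ToyMoEBlock(hidden_size=16, num_experts=4, top_k=2, calibrate_all_experts=True)
    gated = ToyMoEBlock(hidden_size=16, num_experts=4, top_k=2, calibrate_all_experts=False)
    gated.load_state_dict(calibrated.state_dict())

    # count how many tokens each expert receives via forward hooks
    tokens_seen = [0] * len(calibrated.experts)

    def make_hook(expert_id):
        def hook(module, inputs, output):
            tokens_seen[expert_id] += inputs[0].shape[0]
        return hook

    handles = [
        expert.register_forward_hook(make_hook(i))
        for i, expert in enumerate(calibrated.experts)
    ]

    hidden_states = torch.randn(8, 16)
    with torch.no_grad():
        out_calibrated = calibrated(hidden_states)
        out_gated = gated(hidden_states)

    # outputs match the normally gated forward ...
    assert torch.allclose(out_calibrated, out_gated, atol=1e-6)
    # ... and every expert saw every token
    assert all(count == hidden_states.shape[0] for count in tokens_seen)

    for handle in handles:
        handle.remove()
```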

dichn and others added 2 commits August 17, 2025 17:17
Change Purpose:
- Add calibrate_all_experts option to improve MoE calibration

Change Details:
- Add `calibrate_all_experts` flag to MoE layers
- Update `replace_modules_for_calibration` and `moe_calibration_context`
  to propagate the flag into modules
- Modify expert forward passes:
  * Normal mode (default): compute output only for tokens routed to
    top-k experts, and combine their weighted results in the final
    output
  * Calibration mode (`calibrate_all_experts=True`): compute output for
    all tokens on every expert, but still apply the top-k gating to
    decide which token outputs contribute to the final result.

Testing:
- Add unit test to verify all experts are triggered during MoE calibration

👋 Hi! Thank you for contributing to llm-compressor. Please add the ready label when the PR is ready for review.

Note: This is required to complete the testing suite, please only add the label once the PR is code complete and local testing has been performed.

@kylesayrs kylesayrs changed the title [Calibrat] Llama4 and More tests [MoE] Llama4 and More tests Aug 19, 2025
@kylesayrs kylesayrs changed the title [MoE] Llama4 and More tests [MoE] MoE Calibration with calibrate_all_experts Aug 28, 2025
@kylesayrs kylesayrs marked this pull request as ready for review August 28, 2025 21:00
@kylesayrs kylesayrs requested review from dsikka and shanjiaz August 28, 2025 21:05
@fynnsu (Collaborator) left a comment:

Left a comment below. I also agree with @brian-dellabetta's point that this could maybe be simplified by patching self.top_k temporarily.
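For context, a minimal sketch of what the suggested top_k patching could look like (illustrative names; the real MoE modules expose top_k and num_experts differently per architecture):

```python
# Illustrative only; the real MoE modules expose top_k/num_experts
# differently per architecture.
import contextlib


@contextlib.contextmanager
def patch_top_k(moe_block):
    """Temporarily route every token to every expert by raising top_k."""
    original_top_k = moe_block.top_k
    try:
        moe_block.top_k = len(moe_block.experts)
        yield moe_block
    finally:
        moe_block.top_k = original_top_k
```

One trade-off worth noting: with top_k raised, every expert's weighted output is summed into the result, so the block no longer produces the same outputs as the normally gated forward, whereas the gate-after-experts approach in this PR preserves them.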

@dsikka (Collaborator) left a comment:

I did not get a chance to run through these yet, but it would be good to run nvfp4 for Llama4 and Qwen3 and validate performance on the B200 before landing this, if anybody has bandwidth to run these.

@kylesayrs kylesayrs marked this pull request as draft September 9, 2025 11:49
@kylesayrs (Collaborator, Author) commented:

Running those examples now

```diff
@@ -974,7 +974,8 @@ def getattr_chain(obj: Any, chain_str: str, *args, **kwargs) -> Any:
     return res


-class DisableKVCache:
+@contextlib.contextmanager
+def disable_cache(module: torch.nn.Module):
```
A collaborator left a comment:

Definitely agree with these changes, but they might be better in a separate PR or at least mentioned in the PR summary. They seem orthogonal to calibrate_all_experts.
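For reference, a rough sketch of what such a context manager could look like (illustrative only; the actual helper in this diff may differ):

```python
# Illustrative sketch only: one way a disable_cache context manager could be
# written, temporarily turning off the KV cache on a module's config and
# restoring the original value afterwards.
import contextlib

import torch


@contextlib.contextmanager
def disable_cache(module: torch.nn.Module):
    config = getattr(module, "config", None)
    if config is None or not hasattr(config, "use_cache"):
        # nothing to toggle; run the body unchanged
        yield
        return

    original_use_cache = config.use_cache
    try:
        config.use_cache = False
        yield
    finally:
        config.use_cache = original_use_cache
```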
