[Bugfix] Ensure calculated KV scales are applied in attention. #27232
base: main
Changes from 2 commits
Commits: 4740f4d, ad1d2e4, 34c41a2, c1e72a1, 2fe1d53, 8a0f110
```diff
@@ -249,6 +249,8 @@ def __init__(
     # This will be overridden in load_model()
     self.is_multimodal_pruning_enabled = False
     self.max_model_len = model_config.max_model_len
 
+    self.kv_scales_calculated = False
+
```

Suggested change:

```diff
-self.kv_scales_calculated = False
+# Always set to false after the first forward pass
+self.calculate_kv_scales = self.cache_config.calculate_kv_scales
```
Suggested change:

```python
# Determine if we need to calculate KV scales on this forward pass.
# Only True on the first pass when calculate_kv_scales is enabled.
enable_kv_scales_calculation = (
    self.cache_config.calculate_kv_scales
    and not self.kv_scales_calculated)
```
Suggested change:

```diff
-enable_kv_scales_calculation=enable_kv_scales_calculation,
+enable_kv_scales_calculation=self.calculate_kv_scales,
```
**Propagate KV scale flag to per-layer metadata**

The new `enable_kv_scales_calculation` flag is set only on `CommonAttentionMetadata` here, but the per-layer metadata objects that `Attention.forward` actually receives are produced later by the builders (`split_attn_metadata` and the various `*AttentionMetadataBuilder.build` methods), which do not copy this attribute. As a result, `getattr(attn_metadata, "enable_kv_scales_calculation", False)` in `vllm/attention/layer.py` always evaluates to `False`, the KV-scale calculation path never runs, and the bug this change is meant to fix still occurs whenever KV scales are enabled. The flag needs to be attached to the per-layer metadata returned by the builders (and preserved when splitting) so that the layers can see it.
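For concreteness, here is a minimal, self-contained sketch of the propagation being requested. The dataclasses and `build_per_layer` below are simplified stand-ins invented for illustration, not vLLM's actual metadata types or builder API:

```python
from dataclasses import dataclass


@dataclass
class CommonAttentionMetadata:
    num_actual_tokens: int
    enable_kv_scales_calculation: bool = False


@dataclass
class PerLayerMetadata:
    num_actual_tokens: int
    enable_kv_scales_calculation: bool = False


def build_per_layer(common: CommonAttentionMetadata) -> PerLayerMetadata:
    # The builder must copy the flag; otherwise the
    # getattr(attn_metadata, "enable_kv_scales_calculation", False)
    # check in vllm/attention/layer.py always sees the default False.
    return PerLayerMetadata(
        num_actual_tokens=common.num_actual_tokens,
        enable_kv_scales_calculation=common.enable_kv_scales_calculation,
    )


common = CommonAttentionMetadata(num_actual_tokens=8,
                                 enable_kv_scales_calculation=True)
per_layer = build_per_layer(common)
assert getattr(per_layer, "enable_kv_scales_calculation", False)
```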
To improve maintainability and avoid re-evaluating the same condition, you can pass the `enable_kv_scales_calculation` flag from `_prepare_inputs` to `execute_model`. This makes the logic clearer and reduces redundancy.

Here's how you can do it:
1. Update `_prepare_inputs` to return `enable_kv_scales_calculation`:

   ```python
   # In the _prepare_inputs function signature
   ) -> tuple[
       ...,
       bool,  # use_cascade_attn
       bool,  # enable_kv_scales_calculation
   ]:

   # In the _prepare_inputs return statement
   return (
       ...,
       use_cascade_attn,
       enable_kv_scales_calculation,
   )
   ```

2. Update the call to `_prepare_inputs` in `execute_model`:

   ```python
   # In execute_model
   (
       ...,
       use_cascade_attn,
       enable_kv_scales_calculation,
   ) = self._prepare_inputs(scheduler_output)
   ```

3. Then simplify the logic for updating `self.kv_scales_calculated`:

   ```python
   # Mark KV scales as calculated if they were computed in this pass.
   if enable_kv_scales_calculation:
       self.kv_scales_calculated = True
   ```
Even easier:

```diff
-if (self.cache_config.calculate_kv_scales
-        and not self.kv_scales_calculated):
-    self.kv_scales_calculated = True
+self.calculate_kv_scales = False
```
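To show how this flag-flipping approach behaves across forward passes, here is a toy, runnable sketch; `CacheConfig` and `RunnerSketch` are stand-ins invented for illustration, not vLLM classes:

```python
from dataclasses import dataclass


@dataclass
class CacheConfig:
    calculate_kv_scales: bool = True


class RunnerSketch:
    def __init__(self, cache_config: CacheConfig) -> None:
        self.cache_config = cache_config
        # Seeded from config; always set to False after the first pass.
        self.calculate_kv_scales = cache_config.calculate_kv_scales

    def execute_model(self) -> bool:
        # Pass the current flag into the attention metadata (elided
        # here), then switch it off so all later passes reuse the
        # scales computed on the first pass.
        enable = self.calculate_kv_scales
        self.calculate_kv_scales = False
        return enable


runner = RunnerSketch(CacheConfig(calculate_kv_scales=True))
assert runner.execute_model() is True   # first pass: scales calculated
assert runner.execute_model() is False  # later passes: scales reused
```

The appeal of this variant is that it needs no second `kv_scales_calculated` bookkeeping attribute: one flag starts as the config value and is simply turned off after first use.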
Can you make sure this is removed from backend-specific attention metadata classes, if anywhere?
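As an illustration of the requested cleanup (the class below is hypothetical, not an actual vLLM backend), any backend-specific metadata class that still declares its own copy of the flag should drop it, so the value flows only from the common metadata; grepping for `enable_kv_scales_calculation` under the attention backends directory would locate any stragglers:

```python
from dataclasses import dataclass


@dataclass
class ExampleBackendAttentionMetadata:
    # Hypothetical backend metadata class, for illustration only.
    num_actual_tokens: int
    max_query_len: int
    # A leftover per-backend copy of the flag (defaulting to False)
    # should be removed so there is a single source of truth:
    # enable_kv_scales_calculation: bool = False   # <- remove
```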