
[Transformations] Apply SDPA scale to K^T when query is pre-scaled#34177

Open
evkotov wants to merge 8 commits into openvinotoolkit:master from evkotov:CVS-181409

Conversation


@evkotov evkotov commented Feb 18, 2026

Summary

When the query is pre-scaled (Multiply(Q, scalar_constant)), apply the SDPA scale to K^T instead of Q during decomposition. This restores the computation order the graph had before SDPAFusion.

Details

When SDPAFusion matches a symmetric pre-scaling pattern (both Q and K multiplied by the same constant), it absorbs the K-side scale into the SDPA scale parameter and leaves Q pre-scaled. The decomposition was applying scale back to Q, which changes the FP32 operation order compared to the original graph. While mathematically equivalent, the rounding differences accumulate through residual connections across transformer layers (up to 0.91 max_diff on RFDetr with 14 blocks).

The fix detects that Q is pre-scaled (Multiply with a scalar constant) and applies scale to K^T instead, restoring the original computation order. The existing can_move_scale_after_matmul and default Q-scaling paths remain unchanged for other cases.

Tickets:

  • 181409
  • 180477
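The rounding-order effect behind this fix can be reproduced outside OpenVINO. The sketch below is illustrative Python (not the transformation code): it emulates IEEE-754 float32 rounding with the stdlib struct module and compares, elementwise, the original order (Q*s)·(K^T*s) against the old decomposition order ((Q*s)*s)·K^T. The scale constant and inputs are made-up values.

```python
import struct

def f32(x: float) -> float:
    """Round a Python float to the nearest IEEE-754 binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

s = f32(0.35355339059327373)  # 1/sqrt(8), a typical SDPA scale (illustrative)

def original_order(q: float, k: float) -> float:
    # Order in the source graph: (Q * s) @ (K^T * s), shown elementwise
    return f32(f32(q * s) * f32(k * s))

def old_decomposition_order(q: float, k: float) -> float:
    # Order the old decomposition produced: ((Q * s) * s) @ K^T
    return f32(f32(f32(q * s) * s) * k)

# Both compute q * k * s^2; only the rounding points differ, and that
# per-element difference is what accumulates through residual connections.
for q, k in [(0.1, 0.3), (1.7, 2.9), (3.14159, 2.71828)]:
    a = original_order(q, k)
    b = old_decomposition_order(q, k)
    assert abs(a - b) <= 1e-5 * max(abs(a), abs(b), 1.0)
```

The two functions agree to within a few float32 ulps on any single element; the PR's point is that even ulp-level differences compound across 14 attention blocks into the observed 0.91 max_diff.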

@evkotov evkotov self-assigned this Feb 18, 2026
@evkotov evkotov requested a review from a team as a code owner February 18, 2026 12:14
@evkotov evkotov added the "category: transformations" label (OpenVINO Runtime library - Transformations) Feb 18, 2026
@CuriousPanCake commented:
  1. If I read the description of your PR correctly:

> This change moves SDPAFusion (and SDPAScaleFusion) from MOCTransformations to CommonOptimizations so they only run during compile_model(), where each plugin can control whether SDPA nodes are created and kept.

there might be a case when SDPA is not fused and is not going to be fused at the conversion stage, which is not acceptable for the SDPAToPA case, since that case requires the fused SDPA. There may be other problematic cases as well.

> but the decomposed graph has a different FP32 computation order, causing accuracy loss that amplifies through transformer layers.

What do you mean? Is the order of operations between the fused and unfused SDPA implementations different? How so? This is a strict formula with matrix multiplication, where order is crucial.

@evkotov evkotov changed the title from "Move SDPAFusion from MOCTransformations to CommonOptimizations" to "Fix SDPA decomposition to preserve original scale application order" Feb 26, 2026
@evkotov evkotov requested a review from v-Golubev February 26, 2026 12:46
@v-Golubev v-Golubev self-assigned this Feb 26, 2026
@evkotov evkotov changed the title from "Fix SDPA decomposition to preserve original scale application order" to "[Transformations] Apply SDPA scale to K^T when query is pre-scaled" Mar 5, 2026

evkotov commented Mar 5, 2026

> 1. If I read the description of your PR correctly:
>
> > This change moves SDPAFusion (and SDPAScaleFusion) from MOCTransformations to CommonOptimizations so they only run during compile_model(), where each plugin can control whether SDPA nodes are created and kept.
>
> there might be a case when SDPA is not fused and is not going to be fused at the conversion stage, which is not acceptable for the SDPAToPA case, since that case requires the fused SDPA. There may be other problematic cases as well.
>
> > but the decomposed graph has a different FP32 computation order, causing accuracy loss that amplifies through transformer layers.
>
> What do you mean? Is the order of operations between the fused and unfused SDPA implementations different? How so? This is a strict formula with matrix multiplication, where order is crucial.

I have updated the approach; see the new description.

@CuriousPanCake left a comment:
LGTM

Copilot AI left a comment:

Pull request overview

Updates the ScaledDotProductAttentionDecomposition transformation to preserve the original scaling order in cases where the SDPA query input is already pre-scaled, aiming to reduce FP32 rounding divergence after decomposition (notably for CPU paths that decompose SDPA to MatMul+Softmax+MatMul).

Changes:

  • Add is_query_prescaled() heuristic and, when it matches, apply scale to K^T instead of applying it on Q or after the first MatMul.
  • Extend the decomposition unit test suite with a regression-style graph-structure test for the pre-scaled query case.
  • Update the test helper to optionally build a reference graph that applies scale on K^T.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files changed:

  • src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp: Adds pre-scaled-query detection and switches scale placement to K^T in that case.
  • src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp: Adds a new test validating the expected decomposition structure for pre-scaled query + explicit scale.

evkotov added 8 commits March 16, 2026 13:30
SDPAFusion was registered inside MOCTransformations, which runs both
during ov.convert_model() and during compile_model(). The CPU plugin
explicitly disables SDPAFusion because it does not use SDPA nodes,
but when ov.convert_model() is used, SDPA nodes are already created
before compile_model() is called. The CPU plugin then decomposes them
back via ScaledDotProductAttentionDecomposition, but the decomposed
graph has a different FP32 computation order, causing accuracy loss
that amplifies through transformer layers (0.91 max_diff on RFDetr
with 14 attention blocks).

Move SDPAFusion and SDPAScaleFusion to CommonOptimizations so they
only run during compile_model(), where each plugin controls whether
SDPA nodes are created and kept.

Tickets: CVS-180477

When SDPAFusion absorbs a K-side scale into the SDPA node, the query
input may already be pre-scaled (e.g. Q * 0.353). During decomposition,
the scale was applied to Q again or moved after the MatMul, changing the
FP32 computation order from the original (Q * s) @ (K^T * s) to
((Q * s) * s) @ K^T. While mathematically equivalent, this produces
different intermediate rounding and accumulates ~0.91 max_diff over 14
transformer layers in models like RFDetr.

Add is_query_prescaled() check in ScaledDotProductAttentionDecomposition:
if Q is already a Multiply(input, scalar_constant), apply the scale to
K^T instead, restoring the original computation order.

Fixes CVS-180477 (Bug 2a).
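The detection this commit describes can be sketched on a toy node model. The code below is illustrative Python, not the OpenVINO C++ API; Node, is_scalar_constant, and the op-name strings are made-up stand-ins for the pattern check Multiply(input, scalar_constant).

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    op: str                       # e.g. "Multiply", "Constant", "Parameter"
    inputs: list = field(default_factory=list)
    value: object = None          # constant payload, if any

def is_scalar_constant(n: Node) -> bool:
    # A scalar Constant node carrying a single numeric value
    return n.op == "Constant" and isinstance(n.value, (int, float))

def is_query_prescaled(q: Node) -> bool:
    # Q counts as pre-scaled when it is Multiply(x, scalar_const)
    if q.op != "Multiply" or len(q.inputs) != 2:
        return False
    return any(is_scalar_constant(i) for i in q.inputs)

q = Node("Multiply", [Node("Parameter"), Node("Constant", value=0.353)])
assert is_query_prescaled(q)
assert not is_query_prescaled(Node("Parameter"))
```

When this predicate fires, the decomposition multiplies K^T by the scale instead of scaling Q again, so the element order of the original (Q * s) @ (K^T * s) graph is preserved.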

Remove is_query_prescaled() heuristic and can_move_scale_after_matmul()
optimization. Always apply SDPA scale to K^T during decomposition, since
the scale logically belongs to K^T (absorbed from K-side by SDPAFusion).

This is mathematically equivalent for all cases and preserves the original
computation order for models with symmetric Q/K pre-scaling (e.g. PyTorch
scaled_dot_product_attention export), fixing the FP32 rounding divergence
that accumulated through transformer layers (CVS-180477, Bug 2a).

Tickets: CVS-181409, CVS-180477

…fallback

Address PR openvinotoolkit#34177 review comments:
- [HIGH] Restore can_move_scale_after_matmul() size-based heuristic as
  performance fallback for non-prescaled query cases (e.g. decode S_q=1)
- [LOW] Reword comments to not imply SDPAFusion is always involved

Three-way scale placement logic:
1. Q pre-scaled (Multiply(Q, scalar_const)) -> scale K^T (precision fix)
2. can_move_scale_after_matmul -> scale after MatMul (perf optimization)
3. Default -> scale Q
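The three-way placement above can be condensed into a small decision function. This is an editorial Python sketch of the logic described in the commit message, not the C++ implementation; the return strings are made-up labels.

```python
def choose_scale_placement(query_prescaled: bool,
                           can_move_scale_after_matmul: bool) -> str:
    """Mirror of the three-way scale-placement logic (illustrative)."""
    if query_prescaled:
        return "scale_kt"            # 1. precision fix: keep original order
    if can_move_scale_after_matmul:
        return "scale_after_matmul"  # 2. perf: fewer elements to scale
    return "scale_q"                 # 3. default

assert choose_scale_placement(True, True) == "scale_kt"
assert choose_scale_placement(False, True) == "scale_after_matmul"
assert choose_scale_placement(False, False) == "scale_q"
```

Note that the pre-scaled-query branch takes priority even when the size heuristic would also apply, since correctness of the FP32 order outweighs the performance optimization there.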
@evkotov evkotov force-pushed the CVS-181409 branch 2 times, most recently from 5e8c833 to fa89e73 (March 16, 2026 13:10)
@evkotov evkotov requested a review from Copilot March 16, 2026 13:10
Copilot AI left a comment:

Pull request overview

Updates the ScaledDotProductAttention decomposition to apply the SDPA scale factor on the K^T operand unconditionally, aligning the decomposition’s computation order with graphs produced by SDPAFusion (notably for PyTorch-exported pre-scaled Q/K) to reduce FP32 rounding drift.

Changes:

  • Changed SDPA decomposition to always multiply K^T by scale before the Q·K^T MatMul (removing the previous heuristic that sometimes scaled Q or post-MatMul output).
  • Simplified the unit-test reference decomposition helper to match the new behavior.
  • Renamed/updated scale-related tests and added coverage for the “pre-scaled query” case to validate the intended computation order.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

Files changed:

  • src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp: Applies scale on K^T unconditionally in the decomposition (removes size-based conditional scaling).
  • src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp: Updates reference graph builder and test expectations; adds a regression test for pre-scaled query to ensure scaling is applied on K^T.

Copilot AI left a comment:

Pull request overview

This PR updates the SDPA (ScaledDotProductAttention) decomposition to preserve computation order when the query input is already pre-scaled (typical after SDPAFusion absorbs the K-side scale), reducing FP32 rounding divergence across transformer layers.

Changes:

  • Adjust SDPA decomposition logic to apply the SDPA scale to K^T when Q is detected as pre-scaled.
  • Update existing decomposition reference construction in unit tests to reflect the new scaling placement.
  • Add a new unit test covering the “pre-scaled query” scenario to ensure scale is applied on K^T.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Files changed:

  • src/common/transformations/src/transformations/op_conversions/scaled_dot_product_attention_decomposition.cpp: Adds detection of pre-scaled Q and applies scale to K^T in that case before MatMul.
  • src/common/transformations/tests/op_conversions/scaled_dot_product_decomposition_test.cpp: Updates reference decomposition helper/signature and adds a regression test for pre-scaled Q.

Comment on lines +176 to +180:

    if (is_query_prescaled(query)) {
        // Q is already pre-scaled (e.g., Multiply(Q, scalar_constant)).
        // Apply scale to K^T to preserve the original computation order
        // and minimize FP rounding divergence across transformer layers.
        auto k_scaled = register_new_node<v1::Multiply>(k_transposed, scale);
