Conversation

@vaibverm (Author):

Objective:

This PR introduces the KV blocking technique for CausalLM models, where the K/V cache is read and processed block by block in the attention computation. The number of desired KV blocks is defined at model initialization in the "from_pretrained" call, so that the ONNX graph is exported with the required number of KV blocks. As a result, the following changes are introduced:

Changes:

  1. The SoftMax is changed from a regular SoftMax to an online SoftMax, where a running maximum and cumulative denominator are tracked and updated as each block is processed, so that the result retains mathematical accuracy relative to the regular SoftMax (a sketch follows this list).
  2. Changes to the CTXGather and CTXGatherCB custom ops to read only one block's worth of data in each cache gather/read.
  3. Changes to the read_only function in QEffDynamicCache to allow reading the cache block by block rather than the full K/V cache.
  4. Generation of the attention mask per block.
  5. Changes to the eager_attention_forward implementation in the Llama model to support BlockedKV attention and the online SoftMax.
  6. Wrapping the num_kv_blocks variable inside qaic_config to keep a consistent calling style (see the usage sketch at the end of this description).
  7. A new PyTorch transform to pass the num_kv_blocks variable to the QEffLlamaAttention block.
  8. A new constant added for num_kv_blocks.
  9. Added tests covering the BlockedKV feature both switched on and off.
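
A minimal sketch of the online SoftMax in item 1 is shown below. The function name, shapes, and blocking loop are illustrative assumptions, not the PR's actual code:

import torch

def blocked_attention_online_softmax(query, key_blocks, value_blocks, scaling):
    # query: [batch, heads, q_len, head_dim]; each K/V block: [batch, heads, block_len, head_dim].
    # m is the running row maximum and l the running softmax denominator; both are
    # corrected after every block so the final result matches a regular softmax.
    out = torch.zeros_like(query)
    m = torch.full(query.shape[:-1] + (1,), float("-inf"), dtype=query.dtype, device=query.device)
    l = torch.zeros(query.shape[:-1] + (1,), dtype=query.dtype, device=query.device)
    for k_blk, v_blk in zip(key_blocks, value_blocks):
        scores = torch.matmul(query, k_blk.transpose(-2, -1)) * scaling
        m_new = torch.maximum(m, scores.max(dim=-1, keepdim=True).values)
        p = torch.exp(scores - m_new)       # block probabilities shifted by the new running max
        correction = torch.exp(m - m_new)   # rescales everything accumulated so far
        l = l * correction + p.sum(dim=-1, keepdim=True)
        out = out * correction + torch.matmul(p, v_blk)
        m = m_new
    return out / l

Running all blocks through this loop yields the same output as a regular softmax over the full score matrix, which is what item 1 means by retaining mathematical accuracy.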

Please review and feel free to suggest changes and tests.
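
For reference, here is a hedged usage sketch of item 6, passing num_kv_blocks through qaic_config in the "from_pretrained" call; the model id and exact key spelling are assumptions based on the description above:

from QEfficient import QEFFAutoModelForCausalLM

# The num_kv_blocks entry controls how many K/V blocks the exported ONNX
# attention iterates over (key name taken from this PR's description).
model = QEFFAutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",          # placeholder model id
    qaic_config={"num_kv_blocks": 4},
)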

@vbaddi (Contributor) commented Nov 14, 2025:

Thanks @vaibverm
Could you please address the conflicts and run the lint/format?

@vaibverm force-pushed the main branch 3 times, most recently from 5997515 to 4e817c2 on November 14, 2025 08:05

@vaibverm (Author):

Hi @vbaddi,
I have addressed the conflicts, but some workflows need approval. Would you be able to approve those?

K_block_states = repeat_kv(K_block, module.num_key_value_groups)
V_block_states = repeat_kv(V_block, module.num_key_value_groups)
past_seen_tokens_start = start_index
past_seen_tokens_end = torch.where(

Contributor:

If we are comparing ints, do we need torch.where? Can we use min()?

@vaibverm (Author), Nov 21, 2025:

torch.min() requires both inputs to be tensors. I tried it earlier, but using torch.min() here leads to an ONNX export-time error.
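
For context, a small, purely illustrative sketch of the trade-off (the variable names are placeholders, not the PR's code): torch.where stays entirely tensor-based and exports to the ONNX Where op, whereas the two-input torch.min() needs both arguments to be tensors.

import torch

start_index = 4             # python int for the current block
block_size = 8
seq_len = torch.tensor(12)  # tensor coming from the traced graph

candidate = torch.tensor(start_index + block_size)

# torch.min(start_index + block_size, seq_len) mixes a python int with a tensor
# and is rejected; per the comment above, the min() route also failed at ONNX
# export time. torch.where keeps the clamp as pure tensor ops:
past_seen_tokens_end = torch.where(candidate < seq_len, candidate, seq_len)
print(past_seen_tokens_end)  # tensor(12)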


# Compute attention scores for the block
attn_weights_block = torch.matmul(query, K_block_states.transpose(2, 3)) * scaling
if attention_mask is not None:

Contributor:

If we are not using the attention_mask, do we need this condition?

@vaibverm (Author):

This is a causal model, so we do use masking. I kept attention_mask rather than causal_mask_block in the condition because I was originally sharing one method between eager and blockedKV attention. I would suggest keeping the original condition test, for compatibility with regular eager attention, as long as the overhead is not too high.
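
For illustration, a hedged sketch of the per-block masking being discussed, mirroring the usual Hugging Face eager-attention pattern; start_index, end_index, and the mask shape are assumptions:

# attn_weights_block: [batch, heads, q_len, block_len];
# attention_mask: full additive causal mask, e.g. [batch, 1, q_len, kv_len].
if attention_mask is not None:
    # Slice the causal mask down to the key columns covered by this block.
    causal_mask_block = attention_mask[:, :, :, start_index:end_index]
    attn_weights_block = attn_weights_block + causal_mask_block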

repl_module = type(module)
module.__class__ = repl_module
module.forward = MethodType(partial(repl_module.forward, num_kv_blocks=num_kv_blocks), module)
transformed = True # Set to True if at least one transformation occurs

Contributor:

Can we add a warning if the architecture doesn't support blocked KV?

@vaibverm (Author):

I wanted to have a broader discussion on this. I implemented blockedKV attention for the Qwen2.5_VL model as well, and the norm there was to use an environment variable to switch between blocking techniques. Is that the norm we want to keep across QEff? If so, we would not really need this transform any more, although I think the PyTorch transform is a cleaner way to switch between blocking techniques and is consistent with current transform usage in QEff.
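
To make the two options concrete, a hypothetical sketch; the environment-variable name, config key, and transform class below are invented for illustration only:

import os

# (a) Environment-variable switch, in the style used for the Qwen2.5_VL work:
use_blocked_kv = os.environ.get("QEFF_BLOCKED_KV_ATTN", "0") == "1"

# (b) Transform-based switch, driven by qaic_config at from_pretrained time:
# if num_kv_blocks := qaic_config.get("num_kv_blocks"):
#     model, transformed = BlockedKVAttentionTransform.apply(model, num_kv_blocks=num_kv_blocks)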

@vaibverm (Author):

@quic-rishinr - Would you suggest we go the route of environment variables, or would you prefer PyTorch transforms like the one above to implement the blocking?

)


@pytest.mark.parametrize("model_name", test_models_blockedKV)

Contributor:

Please add @pytest.mark.on_qaic to the test, as both tests would be using the QAIC cards.
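
A sketch of the suggested markers; the test name and body are placeholders, and only the decorators reflect this comment:

import pytest

@pytest.mark.on_qaic
@pytest.mark.parametrize("model_name", test_models_blockedKV)
def test_blocked_kv_causal_lm(model_name):
    ...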
