Conversation

@R3hankhan123 (Contributor) commented on Feb 12, 2026

Purpose

Accelerate paged attention GEMMs (QK, PV) on s390x with vector intrinsics
This PR accelerates cpu_attention_with_kv_cache on s390x by introducing VXE (Vector Extension Facility)-optimized GEMM kernels for both the QK and PV attention phases. The vectorized implementation significantly improves token generation throughput, enabling s390x to make effective use of the chunked prefill and prefix caching features.
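
For illustration only, the core pattern such kernels build on is a VXE fused multiply-add loop over the head dimension. The sketch below is not the PR's cpu_attn_vxe.hpp code; the function name vxe_dot_f32 and its assumptions (float inputs, a head size that is a multiple of 4, and a toolchain whose vec_xl accepts const-qualified pointers) are illustrative only.

#include <vecintrin.h>  // z/Architecture vector intrinsics (compile with -mzvector)

// Minimal sketch of a QK-style dot product using VXE intrinsics.
static inline float vxe_dot_f32(const float* q, const float* k, int head_size) {
  __vector float acc = vec_splats(0.0f);
  for (int d = 0; d < head_size; d += 4) {
    __vector float vq = vec_xl(0, q + d);  // load 4 query elements
    __vector float vk = vec_xl(0, k + d);  // load 4 key elements
    acc = vec_madd(vq, vk, acc);           // acc += vq * vk (fused multiply-add)
  }
  // Horizontal reduction of the 4 partial sums.
  return vec_extract(acc, 0) + vec_extract(acc, 1) +
         vec_extract(acc, 2) + vec_extract(acc, 3);
}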

Test Plan

  1. Run vllm bench with and without the accelerated paged attention:
vllm bench throughput \
  --num-prompts 32 \
  --input-len 512 \
  --output-len 1 \
  --max-model-len 1024 \
  --model Qwen/Qwen2-0.5B-Instruct \
  --load-format dummy

Test Result

With VXE enabled

Throughput: 0.03 requests/s, 14.83 total tokens/s, 1.65 output tokens/s
Total num prompt tokens:  32768
Total num output tokens:  4096

Without VXE

Throughput: 0.03 requests/s, 16.19 total tokens/s, 0.03 output tokens/s
Total num prompt tokens:  16384
Total num output tokens:  32

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify bot added the cpu (Related to CPU backends) and v1 labels on Feb 12, 2026

@gemini-code-assist bot left a comment

Code Review

This pull request introduces significant performance improvements for the s390x architecture by leveraging vector intrinsics to accelerate attention mechanisms. The changes to enable the new 'vxe' ISA are well-integrated across the codebase. However, I have identified a critical bug in the new csrc/cpu/cpu_attn_vxe.hpp file concerning the handling of c10::Half data types, which could lead to incorrect computations or crashes. Additionally, I've noted several instances of unnecessary const_cast usage that should be addressed to enhance code quality and maintainability.
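
As context for the const_cast remark, the flagged pattern in this kind of intrinsics code typically looks like the hypothetical lines below (not quoted from the PR): constness is cast away purely to satisfy an older vec_xl prototype, where a const-correct call suffices on current toolchains.

#include <vecintrin.h>

// Hypothetical illustration (not quoted from the PR) of the flagged pattern
// and a const-correct alternative.
__vector float load_with_cast(const float* src, int d) {
  // Flagged pattern: constness cast away only to satisfy an older
  // vec_xl prototype.
  return vec_xl(0, const_cast<float*>(src) + d);
}

__vector float load_const_correct(const float* src, int d) {
  // Const-correct call, assuming a toolchain whose vec_xl overload
  // accepts const float*.
  return vec_xl(0, src + d);
}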

Comment on lines +317 to +326
} else {
__vector float v0 = vec_xl((long long)0, (float*)curr_src + d);
__vector float v1 = vec_xl((long long)0, (float*)curr_src + d + 4);

v0 = vec_mul(v0, scale_vec);
v1 = vec_mul(v1, scale_vec);

vec_xst(v0, 0, curr_dst + d);
vec_xst(v1, 0, curr_dst + d + 4);
}

Severity: critical

This else block incorrectly handles the c10::Half data type. The expression (float*)curr_src + d performs pointer arithmetic on a float*, which is incorrect when scalar_t is c10::Half because d is an index for scalar_t elements. This will lead to incorrect memory accesses and produce wrong results.

To fix this, you should add an if constexpr block to handle c10::Half separately, similar to how c10::BFloat16 is handled. Inside this block, you should manually convert c10::Half elements to float before loading them into vector registers. The else branch can then be assumed to handle only float.

For example:

} else if constexpr (std::is_same_v<scalar_t, c10::Half>) {
  // Manual conversion for c10::Half
  alignas(16) float tmp[8];
  for (int j = 0; j < 8; ++j) {
    tmp[j] = static_cast<float>(curr_src[d + j]);
  }
  __vector float v0 = vec_xl(0LL, tmp);
  __vector float v1 = vec_xl(0LL, tmp + 4);
  // ... apply scale and store
} else { // float
  const auto* float_src = reinterpret_cast<const float*>(curr_src);
  __vector float v0 = vec_xl(0LL, float_src + d);
  __vector float v1 = vec_xl(0LL, float_src + d + 4);
  // ... apply scale and store
}
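
A quick sanity check of the offsets makes the problem concrete (illustrative numbers, not taken from the PR): with scalar_t == c10::Half (2 bytes per element) and d == 8, the intended address is curr_src advanced by 8 half elements, i.e. base + 16 bytes, whereas (float*)curr_src + 8 advances by 8 float elements, i.e. base + 32 bytes, so the load reads past the intended data.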
