Conversation

@R3hankhan123 (Contributor) commented on Feb 12, 2026

Purpose

Accelerate paged attention GEMMs (QK, PV) on s390x with vector intrinsics
This PR accelerates cpu_attention_with_kv_cache on s390x by introducing VXE (Vector Extension Facility)-optimized GEMM kernels for both the QK and PV attention phases. The vectorized implementation significantly improves token generation throughput, enabling s390x to make effective use of the chunked prefill and prefix caching features.
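
For illustration only, the core pattern such kernels build on is a VXE fused multiply-add loop over the head dimension. The sketch below is not the PR's cpu_attn_vxe.hpp code; the function name vxe_dot_f32 and its assumptions (float inputs, a head size that is a multiple of 4, and a toolchain whose vec_xl accepts const-qualified pointers) are illustrative only.

#include <vecintrin.h>  // z/Architecture vector intrinsics (compile with -mzvector)

// Minimal sketch of a QK-style dot product using VXE intrinsics.
static inline float vxe_dot_f32(const float* q, const float* k, int head_size) {
  __vector float acc = vec_splats(0.0f);
  for (int d = 0; d < head_size; d += 4) {
    __vector float vq = vec_xl(0, q + d);  // load 4 query elements
    __vector float vk = vec_xl(0, k + d);  // load 4 key elements
    acc = vec_madd(vq, vk, acc);           // acc += vq * vk (fused multiply-add)
  }
  // Horizontal reduction of the 4 partial sums.
  return vec_extract(acc, 0) + vec_extract(acc, 1) +
         vec_extract(acc, 2) + vec_extract(acc, 3);
}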

Test Plan

  1. Run vllm bench with and without the accelerated paged attention:
vllm bench throughput \
  --num-prompts 32 \
  --input-len 512 \
  --output-len 1 \
  --max-model-len 1024 \
  --model Qwen/Qwen2-0.5B-Instruct \
  --load-format dummy

Test Result

With VXE enabled

Throughput: 0.03 requests/s, 14.83 total tokens/s, 1.65 output tokens/s
Total num prompt tokens:  32768
Total num output tokens:  4096

Without VXE

Throughput: 0.03 requests/s, 16.19 total tokens/s, 0.03 output tokens/s
Total num prompt tokens:  16384
Total num output tokens:  32

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

@mergify bot added the cpu (Related to CPU backends) and v1 labels on Feb 12, 2026

@gemini-code-assist bot left a comment

Code Review

This pull request introduces significant performance improvements for the s390x architecture by leveraging vector intrinsics to accelerate attention mechanisms. The changes to enable the new 'vxe' ISA are well-integrated across the codebase. However, I have identified a critical bug in the new csrc/cpu/cpu_attn_vxe.hpp file concerning the handling of c10::Half data types, which could lead to incorrect computations or crashes. Additionally, I've noted several instances of unnecessary const_cast usage that should be addressed to enhance code quality and maintainability.
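
As context for the const_cast remark, the flagged pattern in this kind of intrinsics code typically looks like the hypothetical lines below (not quoted from the PR): constness is cast away purely to satisfy an older vec_xl prototype, where a const-correct call suffices on current toolchains.

#include <vecintrin.h>

// Hypothetical illustration (not quoted from the PR) of the flagged pattern
// and a const-correct alternative.
__vector float load_with_cast(const float* src, int d) {
  // Flagged pattern: constness cast away only to satisfy an older
  // vec_xl prototype.
  return vec_xl(0, const_cast<float*>(src) + d);
}

__vector float load_const_correct(const float* src, int d) {
  // Const-correct call, assuming a toolchain whose vec_xl overload
  // accepts const float*.
  return vec_xl(0, src + d);
}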

Comment on lines +317 to +326
} else {
__vector float v0 = vec_xl((long long)0, (float*)curr_src + d);
__vector float v1 = vec_xl((long long)0, (float*)curr_src + d + 4);

v0 = vec_mul(v0, scale_vec);
v1 = vec_mul(v1, scale_vec);

vec_xst(v0, 0, curr_dst + d);
vec_xst(v1, 0, curr_dst + d + 4);
}

Severity: critical

This else block incorrectly handles the c10::Half data type. The expression (float*)curr_src + d performs pointer arithmetic on a float*, which is incorrect when scalar_t is c10::Half because d is an index for scalar_t elements. This will lead to incorrect memory accesses and produce wrong results.

To fix this, you should add an if constexpr block to handle c10::Half separately, similar to how c10::BFloat16 is handled. Inside this block, you should manually convert c10::Half elements to float before loading them into vector registers. The else branch can then be assumed to handle only float.

For example:

} else if constexpr (std::is_same_v<scalar_t, c10::Half>) {
  // Manual conversion for c10::Half
  alignas(16) float tmp[8];
  for (int j = 0; j < 8; ++j) {
    tmp[j] = static_cast<float>(curr_src[d + j]);
  }
  __vector float v0 = vec_xl(0LL, tmp);
  __vector float v1 = vec_xl(0LL, tmp + 4);
  // ... apply scale and store
} else { // float
  const auto* float_src = reinterpret_cast<const float*>(curr_src);
  __vector float v0 = vec_xl(0LL, float_src + d);
  __vector float v1 = vec_xl(0LL, float_src + d + 4);
  // ... apply scale and store
}
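
A quick sanity check of the offsets makes the problem concrete (illustrative numbers, not taken from the PR): with scalar_t == c10::Half (2 bytes per element) and d == 8, the intended address is curr_src advanced by 8 half elements, i.e. base + 16 bytes, whereas (float*)curr_src + 8 advances by 8 float elements, i.e. base + 32 bytes, so the load reads past the intended data.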
