Conversation

@PerkzZheng (Contributor) commented Nov 7, 2025

📌 Description

This PR fixes:

  1. unspecified CUDA launch errors with the 2CTA MLA kernels
  2. a masking bug in the SWA decode kernels (a minimal sketch of the intended masking rule follows this list)
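
The SWA masking fix itself is not shown in this thread; purely as orientation, below is a minimal sketch of the sliding-window visibility rule a decode kernel must enforce. All names (isKvTokenVisible, qIdx, kvIdx, windowSize) are illustrative assumptions, not FlashInfer identifiers.

```cpp
// Minimal sketch of sliding-window decode masking, assuming hypothetical
// names that are NOT taken from the FlashInfer sources. A KV token is
// visible iff it is causal (kvIdx <= qIdx) and lies within the last
// `windowSize` positions, i.e. kvIdx in (qIdx - windowSize, qIdx].
#include <cstdint>

bool isKvTokenVisible(int64_t qIdx, int64_t kvIdx, int64_t windowSize) {
  const bool causal = (kvIdx <= qIdx);
  const bool inWindow = (kvIdx > qIdx - windowSize);
  return causal && inWindow;
}
```

A decode-path bug in this rule typically shows up as attention over tokens that have already slid out of the window, or as dropped tokens at the window boundary.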

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or using my preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Added Sparse MLA support and propagated its flag through kernel selection and dispatch.
  • Bug Fixes / Improvements

    • Enforced power-of-two page sizing for paged KV caches and tightened head-dimension limits for broader hardware compatibility.
    • Updated kernel trait encoding and hash construction to include the sparse MLA flag and revised bit-field layout.
  • Chores

    • Updated runtime kernel artifact identifiers and checksums.
    • Extended kernel parameter fields, zero-initialized params on setup, and populated tokens-per-page log2 for paged KV.

@gemini-code-assist (Contributor)

Summary of Changes

Hello @PerkzZheng, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers crucial fixes for trtllm-gen attention kernels, resolving critical CUDA launch errors in 2CTA MLA kernels and a masking issue affecting SWA decode kernels. Beyond these immediate bug resolutions, the changes also update artifact references and introduce several new parameters to the kernel configuration, improving the robustness and future extensibility of the attention mechanisms, particularly for sparse attention and memory management.

Highlights

  • Kernel Bug Fixes: Addressed unspecified CUDA launch errors in 2CTA MLA kernels and a masking bug in SWA decode kernels.
  • Artifact Updates: Updated paths and checksums for trtllm-gen FMHA and BMM artifacts to reflect the latest versions.
  • Kernel Parameter Refinements: Modified TllmGenFmhaKernel::hashID to include a sparseMla parameter, adjusted head dimension limits from 2048 to 1024, and enforced numTokensPerPage to be a power of 2 (an illustrative sketch of the hash packing follows this list).
  • New Kernel Parameters: Introduced ptrReservedMem, mNumTokensPerPageLog2, mReservedParam, mSparseMlaTopK, and mUseBlockSparseAttention in KernelParams for enhanced functionality and explicit zero-initialization.
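
To make the bit-field changes above concrete, here is an illustrative sketch of packing a sparse-MLA flag and log2(numTokensPerPage) into a 64-bit selection hash; the function name, field widths, and offsets are assumptions, not the actual layout in fmhaKernels.cuh.

```cpp
// Illustrative 64-bit hash packing with assumed field offsets; the real
// layout in fmhaKernels.cuh differs and packs many more trait fields.
#include <cassert>
#include <cstdint>

uint64_t makeKernelHash(uint64_t headDim, uint64_t numTokensPerPage,
                        bool sparseMla) {
  assert(headDim <= 1024);  // tightened upper bound (was 2048)
  assert(numTokensPerPage > 0 &&
         (numTokensPerPage & (numTokensPerPage - 1)) == 0);  // power of two
  uint64_t log2Tokens = 0;
  while ((uint64_t{1} << log2Tokens) < numTokensPerPage) ++log2Tokens;

  uint64_t hash = 0;
  hash |= headDim << 0;                              // e.g. bits [0, 11)
  hash |= log2Tokens << 11;                          // e.g. bits [11, 16)
  hash |= static_cast<uint64_t>(sparseMla) << 16;    // single flag bit
  return hash;
}
```

Storing log2(numTokensPerPage) rather than the raw page size is what makes the power-of-two requirement necessary: a non-power-of-two value has no exact log2 and would collide or mis-select kernels.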

@coderabbitai bot commented Nov 7, 2025

Walkthrough

Updates TRTLLM FMHA artifact path and checksum constants; extends FMHA kernel hash encoding to include a new sparseMla flag with adjusted bit-field layout and stricter head-dimension checks; and adds paged-KV / sparse-related fields to KernelParams with zero-initialization and log2 computation for numTokensPerPage.

Changes

Cohort / File(s) — Change Summary

Artifact constants — flashinfer/artifacts.py
  Replaced the ArtifactPath.TRTLLM_GEN_FMHA value and updated the CheckSumHash.TRTLLM_GEN_FMHA checksum string.

FMHA kernel hash & encoding — include/flashinfer/trtllm/fmha/fmhaKernels.cuh
  Added a sparseMla bool to hashID and threaded it through call sites; remapped bit-field shifts/offsets (tileSizeKv, log2(numTokensPerPage), etc.); added a sparseMla bit; tightened the head-dim upper bound (2048 → 1024); enforced power-of-two numTokensPerPage.

Kernel parameters struct & init — include/flashinfer/trtllm/fmha/kernelParams.h
  Added ptrReservedMem (int32_t*), mNumTokensPerPageLog2 (int32_t), mReservedParam (float), mSparseMlaTopK (int32_t); setKernelParams() now zero-initializes KernelParams and validates/computes mNumTokensPerPageLog2 for paged KV (a minimal sketch follows this table).
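
As noted in the table, here is a minimal sketch of the zero-initialization and log2 computation described for kernelParams.h; only the newly added fields are shown, and setPagedKvParams() is a hypothetical stand-in for the real setKernelParams().

```cpp
// Minimal sketch of the zero-init + log2(numTokensPerPage) behavior
// described above; field names match the summary, everything else is
// an assumption for illustration.
#include <cassert>
#include <cstdint>
#include <cstring>

struct KernelParams {
  int32_t* ptrReservedMem;
  int32_t mNumTokensPerPageLog2;
  float mReservedParam;
  int32_t mSparseMlaTopK;
  // ... the many pre-existing fields are elided ...
};

KernelParams setPagedKvParams(int32_t numTokensPerPage) {
  KernelParams params;
  std::memset(&params, 0, sizeof(params));  // explicit zero-init for safety
  // Paged KV requires a power-of-two page size so the log2 below is exact.
  assert(numTokensPerPage > 0 &&
         (numTokensPerPage & (numTokensPerPage - 1)) == 0);
  int32_t log2Val = 0;
  while ((int32_t{1} << log2Val) < numTokensPerPage) ++log2Val;
  params.mNumTokensPerPageLog2 = log2Val;
  return params;
}
```

Zero-initializing the whole struct up front means newly added or reserved fields (ptrReservedMem, mReservedParam) can never reach the kernel as stale garbage, which is one common source of "unspecified launch failure" errors.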

Sequence Diagram(s)

sequenceDiagram
    autonumber
    participant Runner as Runner / Dispatch
    participant Selector as Kernel Selector
    participant Meta as KernelMeta
    participant Loader as Kernel Loader

    Note over Runner,Selector: Build selection key from runtime params
    Runner->>Selector: hashFromRunnerParams(params, /* sparseMla */ false)
    Selector->>Meta: select candidate KernelMeta
    Note right of Meta: KernelMeta includes mSparseMla
    Selector->>Loader: hashID(kernelMeta, sparseMla=Meta.mSparseMla)
    Loader->>Loader: assemble 64-bit hash (includes sparseMla bit, log2(numTokensPerPage))
    Loader->>Runner: return selected kernel / load artifacts (uses updated artifact checksum)
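
The diagram's two hashID calls differ in where the boolean comes from: the runner-side lookup key passes a placeholder false, while the per-kernel key uses the metadata's real flag. A hedged sketch of that propagation pattern, with assumed types:

```cpp
// Sketch of how the sparse-MLA flag is propagated at the two hashID call
// sites shown in the diagram; KernelMeta and the hashID signature here
// are assumptions, and the hash body is a stand-in.
#include <cstdint>

struct KernelMeta {
  bool mSparseMla;
  // ... other kernel trait fields elided ...
};

uint64_t hashID(const KernelMeta& meta, bool sparseMla) {
  (void)meta;  // the real hash also packs head dims, tile sizes, etc.
  return static_cast<uint64_t>(sparseMla);  // stand-in body
}

// Runner-side lookup key: the flag is a placeholder at selection time.
uint64_t keyFromRunnerParams(const KernelMeta& probe) {
  return hashID(probe, /*sparseMla=*/false);
}

// Per-kernel key: the flag comes from the kernel's own metadata.
uint64_t keyFromKernelMeta(const KernelMeta& meta) {
  return hashID(meta, meta.mSparseMla);
}
```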

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Areas needing careful review:
    • include/flashinfer/trtllm/fmha/fmhaKernels.cuh — bit offsets/packing, inclusion of sparseMla bit, and head-dim limit changes.
    • All hashID call sites — ensure consistent propagation of the new boolean (real value vs placeholder).
    • include/flashinfer/trtllm/fmha/kernelParams.h — zero-initialization safety (memset) and correct power-of-two/log2 handling.
    • flashinfer/artifacts.py — verify artifact path/checksum strings against authoritative source.

Suggested reviewers

  • aleozlx
  • joker-eph
  • cyx-6
  • nvmbreughe

Poem

🐇
I nudged the bits and tucked a flag inside,
Pages now counted, tiles neatly spied,
Hashes hum truer, artifacts align,
Kernels hop ready — sparse, swift, and fine,
A carrot-coded patch, all snugly tied. 🥕

Pre-merge checks

✅ Passed checks (3 passed)
  • Title check — Passed: the title "Fix: several bugs/issues with trtllm-gen attention kernels" is specific and directly related to the changeset, which updates trtllm-gen artifacts and adds sparse-MLA-related parameters to the kernel files.
  • Docstring Coverage — Passed: docstring coverage is 100.00%, above the required 80.00% threshold.
  • Description check — Passed: the description follows the template structure but has incomplete sections; Related Issues and Reviewer Notes are empty.


@gemini-code-assist bot left a comment

Code Review

This pull request updates artifact hashes and refines the kernel selection logic for trtllm-gen attention kernels. Key changes include adding a sparseMla parameter to the hashID function, adjusting bit shifts for head dimensions, and enforcing that numTokensPerPage must be a power of 2. New members have been added to the KernelParams struct to support these changes, and the struct is now explicitly zero-initialized using memset for improved safety. These modifications appear to address the reported CUDA launch errors and masking bugs, enhancing the robustness and correctness of the attention kernels.

@yzh119 (Collaborator) commented Nov 7, 2025

@PerkzZheng would you mind rebasing to main branch? Seems there are some merge conflicts.

@PerkzZheng force-pushed the user/perkzz/update-trtllm-gen-1107 branch from 8dc0a1b to e4d7f46 on November 7, 2025, 10:39
@PerkzZheng (Contributor, Author) replied:

It was rebased onto the wrong remote. It should be good now. Thanks.

@pavanimajety (Contributor) left a comment

Thanks for the PR

@nvmbreughe (Contributor) left a comment

LGTM. Just wondering: for what config did we get failures without this fix? I think it would be good to have a test. I can add it after this PR.

Review comment on flashinfer/artifacts.py (diff context for the TRTLLM_GEN_BMM checksum update; by diff convention the first hash appears to be the old value and the second the new):

    TRTLLM_GEN_BMM: str = (
    -   "46ccf0492e3ed10135c2861a4f4ef9bb45846610f9a9d2ccaf2d5bf01d2006fd"
    +   "1ebace613389a4f2e10b14315da5d522642c5dcaae23f01213d56c59068f148b"
    )

A collaborator asked:

Why do we need to update the BMM hash in this PR?

@yzh119 (Collaborator) commented Nov 8, 2025

/bot run

@flashinfer-bot (Collaborator):

GitLab MR !122 has been created, and the CI pipeline #38107936 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator):

[FAILED] Pipeline #38107936: 7/17 passed

@yzh119 (Collaborator) commented Nov 8, 2025

/bot run

@flashinfer-bot (Collaborator):

GitLab MR !122 has been updated with latest changes, and the CI pipeline #38135771 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator):

[FAILED] Pipeline #38135771: 14/17 passed

@yzh119 merged commit d56748f into flashinfer-ai:main on Nov 9, 2025 (4 checks passed).