@lezcano lezcano commented Sep 19, 2025

Following the corrected bank-conflict computation discussed in #8200,
we use the fact that anything added to vbasis that stays within the
same bank does not create conflicts. This lets us expose asymmetric
vectorisation whenever it would not create additional bank conflicts.

The new heuristic avoids PRMTs whenever possible on one of the
directions by choosing registers within bank 0 that are already
contiguous in the register file.

I still need to benchmark and write comprehensive tests. Will do that on
Monday.

Jokeren commented Sep 19, 2025

The new heuristic avoids PRMTs whenever possible on one of the directions by choosing registers within bank 0

Can you add an MLIR test?

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.

Comment on lines +583 to +585
// Append the extra vectorisation bases
vbasis.append(vecSrc ? intersectAfterBank0(regSrc, vbasis, laneDstSet)
                     : intersectAfterBank0(regDst, vbasis, laneSrcSet));

[P1] Prevent vector basis from exceeding 128‑bit limit

The new getVbasis caps the initial intersection at log2(128/bitwidth) bases but then unconditionally appends more entries from intersectAfterBank0. When the intersection already fills the 128-bit budget (e.g. three bases for fp16 or two for fp32), this append can grow vbasis beyond maxVecBases (e.g. to five bases for fp16 or four for fp32), which corresponds to 256–512 bit vector widths. Downstream lowering assumes loads/stores are at most 128 bits wide and uses vbasis.size() to pick instruction widths, so a longer basis will make the swizzling code attempt to emit vector instructions that do not exist. The previous implementation always truncated vbasis after filling it. Consider re-clamping vbasis after the append, or skipping the append once vbasis.size() has reached the maximum.
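
The clamping suggested in this comment could look roughly like the sketch below. It is not the PR's code: `maxVecBasesFor` and `appendClamped` are hypothetical names, bases are modelled as plain integers rather than the real basis types, and the element bitwidth is assumed to be a power of two no larger than 128.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: number of vector bases that fit in one 128-bit
// access, i.e. log2(128 / bitwidth). Assumes bitwidth is a power of two
// and at most 128.
inline std::size_t maxVecBasesFor(unsigned bitwidth) {
  std::size_t n = 0;
  for (unsigned elems = 128 / bitwidth; elems > 1; elems >>= 1)
    ++n;
  return n;
}

// Hypothetical stand-in for "append the extra bases, but never exceed the
// 128-bit budget". Bases are plain integers here for illustration only.
inline std::vector<int32_t> appendClamped(std::vector<int32_t> vbasis,
                                          const std::vector<int32_t> &extra,
                                          unsigned bitwidth) {
  const std::size_t maxBases = maxVecBasesFor(bitwidth);
  for (int32_t b : extra) {
    if (vbasis.size() >= maxBases)
      break; // budget already full: skip the remaining extra bases
    vbasis.push_back(b);
  }
  if (vbasis.size() > maxBases)
    vbasis.resize(maxBases); // also clamp an input that was already too long
  return vbasis;
}
```

With these assumptions, fp16 (16-bit) elements allow three bases and fp32 (32-bit) elements allow two, matching the figures in the comment. Whether clamping is actually desirable here depends on how the extra bases are consumed downstream, which the comment does not settle.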

lezcano commented Sep 19, 2025

An MLIR test would be difficult, as the PRMTs are created at the PTX level. I plan to add plenty of tests with layouts that exercise this path and check that the number of PRMTs computed from the shared memory layout decreases compared with the previous implementation.

Jokeren commented Sep 19, 2025

Maybe we can use gluon and check the SASS codegen

lezcano commented Sep 19, 2025

I could 100% write those tests in gluon, good point.
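
The check discussed above, that the number of PRMTs goes down, reduces to counting one mnemonic in the generated assembly text. The sketch below shows only that counting step. `countMnemonic` is a hypothetical helper, not part of Triton or gluon, and it assumes the assembly is available as a string with one instruction per line and that the mnemonic appears literally (for example `PRMT` in SASS).

```cpp
#include <cstddef>
#include <sstream>
#include <string>

// Hypothetical helper: count the lines of an assembly dump that contain the
// given mnemonic as a substring. This is a plain substring match, so it would
// also count comments or labels that happen to contain the mnemonic.
inline std::size_t countMnemonic(const std::string &asmText,
                                 const std::string &mnemonic) {
  std::istringstream in(asmText);
  std::string line;
  std::size_t count = 0;
  while (std::getline(in, line)) {
    if (line.find(mnemonic) != std::string::npos)
      ++count;
  }
  return count;
}
```

A test along these lines would compile the same layout conversion with the old and the new heuristic and assert that `countMnemonic(newAsm, "PRMT") <= countMnemonic(oldAsm, "PRMT")`. How the two dumps are obtained is left open here.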

3 participants