-
Notifications
You must be signed in to change notification settings - Fork 2.4k
[BACKEND] Improved swizzling when there is not enough vectorisation #8240
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
In view of the corrected computation for bank conflicts discussed in #8200, we use the point that anything that goes into vbasis in the same bank does not create conflicts to expose asymmetric vectorisation whenever it would not create more bank conflicts. The new heuristic avoids PRMTs whenever possible on one of the directions by choosing registers within bank 0 that are already contiguous in the register file. I still need to benchmark and write comprehensive tests. Will do that on Monday.
Can you add a mlir test? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Codex Review: Here are some suggestions.
Reply with @codex fix comments to fix any unresolved comments.
About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you open a pull request for review, mark a draft as ready, or comment "@codex review". If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex fix this CI failure" or "@codex address that feedback".
| // Append the extra vectorisation bases | ||
| vbasis.append(vecSrc ? intersectAfterBank0(regSrc, vbasis, laneDstSet) | ||
| : intersectAfterBank0(regDst, vbasis, laneSrcSet)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[P1] Prevent vector basis from exceeding 128‑bit limit
The new getVbasis caps the initial intersection to log2(128/bitwidth) but then unconditionally appends more entries from intersectAfterBank0. When the intersection already fills the 128‑bit budget (e.g. three bases for fp16 or two for fp32), this append can grow vbasis beyond maxVecBases (five bases for fp16, four for fp32), which corresponds to 256–512 bit vector widths. Downstream lowering assumes loads/stores are at most 128 bits and uses vbasis.size() to pick instruction widths; returning a longer basis will make the swizzling code attempt to emit vector instructions that do not exist. The previous implementation always truncated vbasis after filling. Consider re-clamping vbasis after the append or skipping the append once vbasis.size() has reached the maximum.
Useful? React with 👍 / 👎.
|
an MLIR test would be difficult, as the PRMTs are created at a PTX level. I plan to add plenty of tests from layouts that exercise this path and make sure that the PRMTs as computed from the shared memory layout decrease from the previous state of things to this one. |
|
Maybe we can use gluon and check the SASS codegen |
|
I could 100% write those test in gluon, good point. |
In view of the corrected computation for bank conflicts discussed in
#8200, we use the point that
anything that goes into vbasis in the same bank does not create
conflicts to expose asymmetric vectorisation whenever it would not
create more bank conflicts.
The new heuristic avoids PRMTs whenever possible on one of the
directions by choosing registers within bank 0 that are already
contiguous in the register file.
I still need to benchmark and write comprehensive tests. Will do that on
Monday.