[REJECT?] Daily Perf Improver - Optimize matrix transpose with loop unrolling and adaptive block sizing #32
Draft: github-actions wants to merge 4 commits into main from perf/optimize-transpose-with-simd-e50d581c0aea48e5-667c1aec05363204
adcd821 Optimize matrix transpose with loop unrolling and adaptive block sizing (github-actions[bot])
01e2a74 Merge branch 'main' into perf/optimize-transpose-with-simd-e50d581c0a… (dsyme)
d927749 Delete .claude/hooks/network_permissions.py (dsyme)
1c8d556 Merge branch 'main' into perf/optimize-transpose-with-simd-e50d581c0a… (dsyme)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -141,43 +141,72 @@ type Matrix<'T when 'T :> Numerics.INumber<'T> | |
| this.GetSlice(rowStart, rowEnd, colStart, colEnd) | ||
|
|
||
|
|
||
| /// Creates a new matrix by initializing each element with a function `f(row, col)`. | ||
| /// <summary> | ||
| /// Transposes a matrix using cache-friendly blocked algorithm with loop unrolling. | ||
| /// The block size is chosen adaptively based on element type for optimal cache utilization. | ||
| /// Within each block, uses loop unrolling to reduce loop overhead and improve instruction-level parallelism. | ||
| /// </summary> | ||
| static member inline private transposeByBlock<'T when 'T :> Numerics.INumber<'T> | ||
| and 'T : (new: unit -> 'T) | ||
| and 'T : struct | ||
| and 'T :> ValueType> | ||
| (rows : int) | ||
| (rows : int) | ||
| (cols : int) | ||
| (data: 'T[]) | ||
| (data: 'T[]) | ||
| (blockSize: int) = | ||
|
|
||
| //let blockSize = defaultArg blockSize 16 | ||
|
|
||
| let src = data | ||
| let dst = Array.zeroCreate<'T> (rows * cols) | ||
|
|
||
| let vectorSize = Numerics.Vector<'T>.Count | ||
|
|
||
| // Process the matrix in blocks | ||
| // Process the matrix in blocks for cache efficiency | ||
| for i0 in 0 .. blockSize .. rows - 1 do | ||
| for j0 in 0 .. blockSize .. cols - 1 do | ||
|
|
||
| let iMax = min (i0 + blockSize) rows | ||
| let jMax = min (j0 + blockSize) cols | ||
|
|
||
| // Within each block, unroll the innermost loop by 4 | ||
| for i in i0 .. iMax - 1 do | ||
| let srcOffset = i * cols | ||
| for j in j0 .. jMax - 1 do | ||
| let v = src.[srcOffset + j] | ||
| let mutable j = j0 | ||
| let srcRowOffset = i * cols | ||
|
|
||
| // Unrolled loop: process 4 columns at a time | ||
| while j + 3 < jMax do | ||
| let v0 = src.[srcRowOffset + j] | ||
|
Review comment: I guess maybe the point is that this becomes a vectorized read and a vectorized write. |
||
| let v1 = src.[srcRowOffset + j + 1] | ||
| let v2 = src.[srcRowOffset + j + 2] | ||
| let v3 = src.[srcRowOffset + j + 3] | ||
|
|
||
| dst.[j * rows + i] <- v0 | ||
| dst.[(j + 1) * rows + i] <- v1 | ||
| dst.[(j + 2) * rows + i] <- v2 | ||
| dst.[(j + 3) * rows + i] <- v3 | ||
|
|
||
| j <- j + 4 | ||
|
|
||
| // Handle remaining columns | ||
| while j < jMax do | ||
| let v = src.[srcRowOffset + j] | ||
| dst.[j * rows + i] <- v | ||
| j <- j + 1 | ||
|
|
||
| dst | ||
|
|
||
| static member inline transpose (m:Matrix<'T>) : Matrix<'T> = | ||
| m.Transpose() | ||
|
|
||
| /// <summary> | ||
| /// Transposes this matrix (rows become columns, columns become rows). | ||
| /// Uses an adaptive block size based on element type for optimal cache performance. | ||
| /// </summary> | ||
| member this.Transpose() = | ||
| let blocksize = 16 | ||
| // Adaptive block size based on element type | ||
| // Larger elements (float64) benefit from smaller blocks to fit in L1 cache | ||
| // Smaller elements (float32, int) can use larger blocks | ||
| let blocksize = | ||
| match sizeof<'T> with | ||
| | 4 -> 32 // float32 or int32: 32x32 block = 4KB fits in L1 | ||
| | 8 -> 16 // float64: 16x16 block = 2KB fits in L1 | ||
| | _ -> 16 // fallback for other types | ||
| Matrix(this.NumCols, this.NumRows, Matrix.transposeByBlock this.NumRows this.NumCols this.Data blocksize) | ||
|
|
||
| static member init<'T when 'T :> Numerics.INumber<'T> | ||
|
|
||
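
The diff view above drops indentation, which matters in F#. As a reading aid, here is a minimal standalone sketch of the same blocked, 4-way unrolled transpose on a flat row-major array. The names `transposeBlocked`, `transposeNaive` and `tileBytes` are hypothetical and not part of the PR; the sketch assumes `blockSize` is positive and leaves out the `Matrix` type and its generic constraints.

```fsharp
// Hypothetical standalone sketch; names are illustrative, not the library's API.

/// Transposes a rows x cols row-major array into a cols x rows row-major array.
let transposeBlocked (rows: int) (cols: int) (blockSize: int) (src: 'T[]) : 'T[] =
    let dst = Array.zeroCreate<'T> (rows * cols)
    // Walk the matrix one blockSize x blockSize tile at a time.
    for i0 in 0 .. blockSize .. rows - 1 do
        for j0 in 0 .. blockSize .. cols - 1 do
            let iMax = min (i0 + blockSize) rows
            let jMax = min (j0 + blockSize) cols
            for i in i0 .. iMax - 1 do
                let srcRowOffset = i * cols
                let mutable j = j0
                // Unrolled body: four contiguous reads, four writes strided by `rows`.
                while j + 3 < jMax do
                    let v0 = src.[srcRowOffset + j]
                    let v1 = src.[srcRowOffset + j + 1]
                    let v2 = src.[srcRowOffset + j + 2]
                    let v3 = src.[srcRowOffset + j + 3]
                    dst.[j * rows + i] <- v0
                    dst.[(j + 1) * rows + i] <- v1
                    dst.[(j + 2) * rows + i] <- v2
                    dst.[(j + 3) * rows + i] <- v3
                    j <- j + 4
                // Remainder columns when the tile width is not a multiple of 4.
                while j < jMax do
                    dst.[j * rows + i] <- src.[srcRowOffset + j]
                    j <- j + 1
    dst

/// Reference implementation, used only to check the blocked version.
let transposeNaive (rows: int) (cols: int) (src: 'T[]) : 'T[] =
    Array.init (rows * cols) (fun k -> src.[(k % rows) * cols + (k / rows)])

/// Bytes occupied by one square tile, matching the arithmetic in the diff comments.
let tileBytes (blockSize: int) (elemBytes: int) = blockSize * blockSize * elemBytes

// Usage: a 5 x 7 matrix with block size 4 exercises both partial tiles and the remainder loop.
let rows, cols = 5, 7
let m = Array.init (rows * cols) float
printfn "matches naive: %b" (transposeBlocked rows cols 4 m = transposeNaive rows cols m)
printfn "32x32 float32 tile: %d bytes" (tileBytes 32 sizeof<float32>)
printfn "16x16 float64 tile: %d bytes" (tileBytes 16 sizeof<float>)
```

The tile sizes in the diff comments count a single tile. A transpose touches one source tile and one destination tile at a time, so the working set per tile pair is roughly twice those figures.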
Review comment: It's a real shame the .NET JIT doesn't seem to do this. It would be good to validate whether it has this capability in some scenarios (and it just isn't being used here). It's not the sort of code we really want to have lying around.
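
One rough way to start on that validation, assuming the comment refers to the manual 4-way unrolling, is to time a plain inner loop against a hand-unrolled one on the same data. The sketch below uses only `System.Diagnostics.Stopwatch`; the function names are hypothetical, blocking is omitted to isolate the unrolling, and the numbers it prints are indicative at best. A single warm-up call may not reach fully optimized code under tiered compilation, so a harness such as BenchmarkDotNet on a Release build would be more trustworthy.

```fsharp
open System.Diagnostics

// Hypothetical helpers for a rough timing comparison; not part of the PR.

/// Plain row-by-row transpose with no manual unrolling.
let transposeSimple (rows: int) (cols: int) (src: float[]) : float[] =
    let dst = Array.zeroCreate<float> (rows * cols)
    for i in 0 .. rows - 1 do
        let off = i * cols
        for j in 0 .. cols - 1 do
            dst.[j * rows + i] <- src.[off + j]
    dst

/// Same traversal, with the inner loop unrolled by 4 as in the PR.
let transposeUnrolled4 (rows: int) (cols: int) (src: float[]) : float[] =
    let dst = Array.zeroCreate<float> (rows * cols)
    for i in 0 .. rows - 1 do
        let off = i * cols
        let mutable j = 0
        while j + 3 < cols do
            dst.[j * rows + i] <- src.[off + j]
            dst.[(j + 1) * rows + i] <- src.[off + j + 1]
            dst.[(j + 2) * rows + i] <- src.[off + j + 2]
            dst.[(j + 3) * rows + i] <- src.[off + j + 3]
            j <- j + 4
        // Remainder columns when cols is not a multiple of 4.
        while j < cols do
            dst.[j * rows + i] <- src.[off + j]
            j <- j + 1
    dst

/// Average wall-clock milliseconds per call over `reps` calls, after one warm-up call.
let timeMs (reps: int) (f: unit -> float[]) : float =
    f () |> ignore // warm-up so the first JIT compilation is not timed
    let sw = Stopwatch.StartNew()
    for _ in 1 .. reps do
        f () |> ignore
    sw.Stop()
    sw.Elapsed.TotalMilliseconds / float reps

let n = 512
let data = Array.init (n * n) float
printfn "simple:   %.3f ms" (timeMs 20 (fun () -> transposeSimple n n data))
printfn "unrolled: %.3f ms" (timeMs 20 (fun () -> transposeUnrolled4 n n data))
```

If the two timings are indistinguishable on a Release build, that would suggest the manual unrolling is not buying much for this access pattern; inspecting the JIT's generated code would be needed to say whether it unrolls the plain loop itself.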