
Conversation

@christiangnrd
Member

Opened to run benchmarks.

Todo:

  • Add a compat bound once the new GPUArrays version is released (see the sketch below)
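
A minimal sketch of how that compat bound could be recorded once the release is tagged, using Pkg's compat API (the version number is a placeholder, not the actual GPUArrays release):

using Pkg
# Replace the temporary Pkg.add(url = ..., rev = ...) workaround with the
# released version, then record a compat bound in Project.toml.
# "11" below is a placeholder for whatever version ships the accumulate changes.
Pkg.compat("GPUArrays", "11")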

Contributor

@github-actions github-actions bot left a comment

Metal Benchmarks

Benchmark suite Current: 84f519a Previous: de3fd23 Ratio
latency/precompile 25018084416 ns 25055876416 ns 1.00
latency/ttfp 2129990500 ns 2125052000 ns 1.00
latency/import 1225508166 ns 1219352833 ns 1.01
integration/metaldevrt 956625 ns 968354.5 ns 0.99
integration/byval/slices=1 1644625 ns 1660375 ns 0.99
integration/byval/slices=3 10295687.5 ns 8945875 ns 1.15
integration/byval/reference 1633875 ns 1638208 ns 1.00
integration/byval/slices=2 2747625 ns 2721062.5 ns 1.01
kernel/indexing 692437.5 ns 703875 ns 0.98
kernel/indexing_checked 681375 ns 694208 ns 0.98
kernel/launch 13020.5 ns 12875 ns 1.01
array/construct 6292 ns 6083 ns 1.03
array/broadcast 660542 ns 670666.5 ns 0.98
array/random/randn/Float32 849938 ns 879916 ns 0.97
array/random/randn!/Float32 621917 ns 639812.5 ns 0.97
array/random/rand!/Int64 554792 ns 567000 ns 0.98
array/random/rand!/Float32 589083 ns 602916 ns 0.98
array/random/rand/Int64 752104.5 ns 754292 ns 1.00
array/random/rand/Float32 545291 ns 574541 ns 0.95
array/accumulate/Int64/1d 2378188 ns 1336875 ns 1.78
array/accumulate/Int64/dims=1 2295312.5 ns 1912291.5 ns 1.20
array/accumulate/Int64/dims=2 2555417 ns 2256916.5 ns 1.13
array/accumulate/Int64/dims=1L 6595145.5 ns 11644666.5 ns 0.57
array/accumulate/Int64/dims=2L 18580062.5 ns 9900979.5 ns 1.88
array/accumulate/Float32/1d 1685084 ns 1245625 ns 1.35
array/accumulate/Float32/dims=1 2124459 ns 1630541.5 ns 1.30
array/accumulate/Float32/dims=2 2386125 ns 1968750 ns 1.21
array/accumulate/Float32/dims=1L 5082146 ns 9898709 ns 0.51
array/accumulate/Float32/dims=2L 14983750 ns 7337354 ns 2.04
array/reductions/reduce/Int64/1d 1349687.5 ns 1381500.5 ns 0.98
array/reductions/reduce/Int64/dims=1 1177333 ns 1154562.5 ns 1.02
array/reductions/reduce/Int64/dims=2 1291041 ns 1287541 ns 1.00
array/reductions/reduce/Int64/dims=1L 2127500 ns 2078000 ns 1.02
array/reductions/reduce/Int64/dims=2L 3575749.5 ns 3569083 ns 1.00
array/reductions/reduce/Float32/1d 1015125 ns 1047333.5 ns 0.97
array/reductions/reduce/Float32/dims=1 885375 ns 899875 ns 0.98
array/reductions/reduce/Float32/dims=2 800416 ns 801708.5 ns 1.00
array/reductions/reduce/Float32/dims=1L 1386084 ns 1393042 ns 1.00
array/reductions/reduce/Float32/dims=2L 1909291 ns 1903875 ns 1.00
array/reductions/mapreduce/Int64/1d 1338875 ns 1353375 ns 0.99
array/reductions/mapreduce/Int64/dims=1 1142416 ns 1160042 ns 0.98
array/reductions/mapreduce/Int64/dims=2 1287270.5 ns 1282979 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 2103667 ns 2111146 ns 1.00
array/reductions/mapreduce/Int64/dims=2L 3442541.5 ns 3466062 ns 0.99
array/reductions/mapreduce/Float32/1d 975541 ns 1083604 ns 0.90
array/reductions/mapreduce/Float32/dims=1 890208.5 ns 902542 ns 0.99
array/reductions/mapreduce/Float32/dims=2 788042 ns 819041.5 ns 0.96
array/reductions/mapreduce/Float32/dims=1L 1385458.5 ns 1404791.5 ns 0.99
array/reductions/mapreduce/Float32/dims=2L 1918042 ns 1904375 ns 1.01
array/private/copyto!/gpu_to_gpu 642271 ns 661417 ns 0.97
array/private/copyto!/cpu_to_gpu 819417 ns 827708 ns 0.99
array/private/copyto!/gpu_to_cpu 818854.5 ns 823833 ns 0.99
array/private/iteration/findall/int 1746687.5 ns 1654645.5 ns 1.06
array/private/iteration/findall/bool 1575458 ns 1502750 ns 1.05
array/private/iteration/findfirst/int 1933875 ns 2023208 ns 0.96
array/private/iteration/findfirst/bool 1745458 ns 1852750 ns 0.94
array/private/iteration/scalar 4033375 ns 5040709 ns 0.80
array/private/iteration/logical 2604125 ns 2707041 ns 0.96
array/private/iteration/findmin/1d 1990291 ns 2059979 ns 0.97
array/private/iteration/findmin/2d 1640084 ns 1638750 ns 1.00
array/private/copy 580229.5 ns 566958.5 ns 1.02
array/shared/copyto!/gpu_to_gpu 80708 ns 79375 ns 1.02
array/shared/copyto!/cpu_to_gpu 79708 ns 81333 ns 0.98
array/shared/copyto!/gpu_to_cpu 80000 ns 78750 ns 1.02
array/shared/iteration/findall/int 1761916.5 ns 1657354 ns 1.06
array/shared/iteration/findall/bool 1683500 ns 1507000 ns 1.12
array/shared/iteration/findfirst/int 1539875 ns 1648125 ns 0.93
array/shared/iteration/findfirst/bool 1427729.5 ns 1429542 ns 1.00
array/shared/iteration/scalar 161459 ns 159083 ns 1.01
array/shared/iteration/logical 2442375 ns 2359208 ns 1.04
array/shared/iteration/findmin/1d 1511833 ns 1598729.5 ns 0.95
array/shared/iteration/findmin/2d 1630792 ns 1642520.5 ns 0.99
array/shared/copy 250604 ns 253958 ns 0.99
array/permutedims/4d 2465500 ns 2460792 ns 1.00
array/permutedims/2d 1249208.5 ns 1249583.5 ns 1.00
array/permutedims/3d 1743167 ns 1743375 ns 1.00
metal/synchronization/stream 14875 ns 14875 ns 1
metal/synchronization/context 15708 ns 15500 ns 1.01

This comment was automatically generated by a workflow using github-action-benchmark.

@github-actions
Contributor

github-actions bot commented Jul 20, 2025

Your PR requires formatting changes to meet the project's style guidelines.
Please consider running Runic (git runic main) to apply these changes.

Suggested changes:
diff --git a/test/runtests.jl b/test/runtests.jl
index 9b6b0c3d..6d16c110 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -11,7 +11,7 @@ if parse(Bool, get(ENV, "BUILDKITE", "false"))
 end
 
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
 
 # Quit without erroring if Metal loaded without issues on unsupported platforms
 if !Sys.isapple()

@christiangnrd
Member Author

christiangnrd commented Jul 20, 2025

As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.

The performance improvement for column-wise accumulation on 3x1000000 matrices comes from Metal missing an easy optimization (see #626). Edit: I was confused; that optimization is only present for reductions.
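
For context, the case under discussion can be reproduced with something along these lines (a sketch, not the benchmark suite's exact harness; the dims = 2 call on a 3x1000000 MtlArray presumably corresponds to the dims=2L rows in the table above):

using Metal

A = MtlArray(rand(Float32, 3, 1_000_000))  # wide matrix: 3 rows, one million columns

# Row-wise accumulation (dims = 2): the case showing the ~2x regression
Metal.@sync accumulate(+, A; dims = 2)

# Column-wise accumulation (dims = 1): the case that got faster
Metal.@sync accumulate(+, A; dims = 1)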

@christiangnrd changed the title to "Switch to GPUArrays.jl accumulate implementation" on Jul 20, 2025
@codecov

codecov bot commented Jul 20, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 80.35%. Comparing base (1942968) to head (b296d15).

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
- Coverage   80.63%   80.35%   -0.29%     
==========================================
  Files          61       60       -1     
  Lines        2722     2678      -44     
==========================================
- Hits         2195     2152      -43     
+ Misses        527      526       -1     


@christiangnrd changed the title from "Switch to GPUArrays.jl accumulate implementation" to "[Do not merge] Switch to GPUArrays.jl accumulate implementation" on Jul 23, 2025
@maleadt
Member

maleadt commented Jul 29, 2025

As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.

I don't see a massive slowdown?

@christiangnrd
Member Author

@maleadt The accumulate dims=2L benchmarks show a 2x slowdown. Did I get my rows/columns mixed up in my comment?

@maleadt
Member

maleadt commented Jul 30, 2025

Oh OK, I didn't consider 2x a "massive slowdown" :-) Still something to look at, of course, but much less dramatic than the 7x regressions we saw against CUDA.jl's reduction, for example.
