[Do not merge] Switch to GPUArrays.jl accumulate implementation
#625
Metal Benchmarks
| Benchmark suite | Current: 84f519a | Previous: de3fd23 | Ratio |
|---|---|---|---|
| latency/precompile | 25018084416 ns | 25055876416 ns | 1.00 |
| latency/ttfp | 2129990500 ns | 2125052000 ns | 1.00 |
| latency/import | 1225508166 ns | 1219352833 ns | 1.01 |
| integration/metaldevrt | 956625 ns | 968354.5 ns | 0.99 |
| integration/byval/slices=1 | 1644625 ns | 1660375 ns | 0.99 |
| integration/byval/slices=3 | 10295687.5 ns | 8945875 ns | 1.15 |
| integration/byval/reference | 1633875 ns | 1638208 ns | 1.00 |
| integration/byval/slices=2 | 2747625 ns | 2721062.5 ns | 1.01 |
| kernel/indexing | 692437.5 ns | 703875 ns | 0.98 |
| kernel/indexing_checked | 681375 ns | 694208 ns | 0.98 |
| kernel/launch | 13020.5 ns | 12875 ns | 1.01 |
| array/construct | 6292 ns | 6083 ns | 1.03 |
| array/broadcast | 660542 ns | 670666.5 ns | 0.98 |
| array/random/randn/Float32 | 849938 ns | 879916 ns | 0.97 |
| array/random/randn!/Float32 | 621917 ns | 639812.5 ns | 0.97 |
| array/random/rand!/Int64 | 554792 ns | 567000 ns | 0.98 |
| array/random/rand!/Float32 | 589083 ns | 602916 ns | 0.98 |
| array/random/rand/Int64 | 752104.5 ns | 754292 ns | 1.00 |
| array/random/rand/Float32 | 545291 ns | 574541 ns | 0.95 |
| array/accumulate/Int64/1d | 2378188 ns | 1336875 ns | 1.78 |
| array/accumulate/Int64/dims=1 | 2295312.5 ns | 1912291.5 ns | 1.20 |
| array/accumulate/Int64/dims=2 | 2555417 ns | 2256916.5 ns | 1.13 |
| array/accumulate/Int64/dims=1L | 6595145.5 ns | 11644666.5 ns | 0.57 |
| array/accumulate/Int64/dims=2L | 18580062.5 ns | 9900979.5 ns | 1.88 |
| array/accumulate/Float32/1d | 1685084 ns | 1245625 ns | 1.35 |
| array/accumulate/Float32/dims=1 | 2124459 ns | 1630541.5 ns | 1.30 |
| array/accumulate/Float32/dims=2 | 2386125 ns | 1968750 ns | 1.21 |
| array/accumulate/Float32/dims=1L | 5082146 ns | 9898709 ns | 0.51 |
| array/accumulate/Float32/dims=2L | 14983750 ns | 7337354 ns | 2.04 |
| array/reductions/reduce/Int64/1d | 1349687.5 ns | 1381500.5 ns | 0.98 |
| array/reductions/reduce/Int64/dims=1 | 1177333 ns | 1154562.5 ns | 1.02 |
| array/reductions/reduce/Int64/dims=2 | 1291041 ns | 1287541 ns | 1.00 |
| array/reductions/reduce/Int64/dims=1L | 2127500 ns | 2078000 ns | 1.02 |
| array/reductions/reduce/Int64/dims=2L | 3575749.5 ns | 3569083 ns | 1.00 |
| array/reductions/reduce/Float32/1d | 1015125 ns | 1047333.5 ns | 0.97 |
| array/reductions/reduce/Float32/dims=1 | 885375 ns | 899875 ns | 0.98 |
| array/reductions/reduce/Float32/dims=2 | 800416 ns | 801708.5 ns | 1.00 |
| array/reductions/reduce/Float32/dims=1L | 1386084 ns | 1393042 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 1909291 ns | 1903875 ns | 1.00 |
| array/reductions/mapreduce/Int64/1d | 1338875 ns | 1353375 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=1 | 1142416 ns | 1160042 ns | 0.98 |
| array/reductions/mapreduce/Int64/dims=2 | 1287270.5 ns | 1282979 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 2103667 ns | 2111146 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2L | 3442541.5 ns | 3466062 ns | 0.99 |
| array/reductions/mapreduce/Float32/1d | 975541 ns | 1083604 ns | 0.90 |
| array/reductions/mapreduce/Float32/dims=1 | 890208.5 ns | 902542 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2 | 788042 ns | 819041.5 ns | 0.96 |
| array/reductions/mapreduce/Float32/dims=1L | 1385458.5 ns | 1404791.5 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 1918042 ns | 1904375 ns | 1.01 |
| array/private/copyto!/gpu_to_gpu | 642271 ns | 661417 ns | 0.97 |
| array/private/copyto!/cpu_to_gpu | 819417 ns | 827708 ns | 0.99 |
| array/private/copyto!/gpu_to_cpu | 818854.5 ns | 823833 ns | 0.99 |
| array/private/iteration/findall/int | 1746687.5 ns | 1654645.5 ns | 1.06 |
| array/private/iteration/findall/bool | 1575458 ns | 1502750 ns | 1.05 |
| array/private/iteration/findfirst/int | 1933875 ns | 2023208 ns | 0.96 |
| array/private/iteration/findfirst/bool | 1745458 ns | 1852750 ns | 0.94 |
| array/private/iteration/scalar | 4033375 ns | 5040709 ns | 0.80 |
| array/private/iteration/logical | 2604125 ns | 2707041 ns | 0.96 |
| array/private/iteration/findmin/1d | 1990291 ns | 2059979 ns | 0.97 |
| array/private/iteration/findmin/2d | 1640084 ns | 1638750 ns | 1.00 |
| array/private/copy | 580229.5 ns | 566958.5 ns | 1.02 |
| array/shared/copyto!/gpu_to_gpu | 80708 ns | 79375 ns | 1.02 |
| array/shared/copyto!/cpu_to_gpu | 79708 ns | 81333 ns | 0.98 |
| array/shared/copyto!/gpu_to_cpu | 80000 ns | 78750 ns | 1.02 |
| array/shared/iteration/findall/int | 1761916.5 ns | 1657354 ns | 1.06 |
| array/shared/iteration/findall/bool | 1683500 ns | 1507000 ns | 1.12 |
| array/shared/iteration/findfirst/int | 1539875 ns | 1648125 ns | 0.93 |
| array/shared/iteration/findfirst/bool | 1427729.5 ns | 1429542 ns | 1.00 |
| array/shared/iteration/scalar | 161459 ns | 159083 ns | 1.01 |
| array/shared/iteration/logical | 2442375 ns | 2359208 ns | 1.04 |
| array/shared/iteration/findmin/1d | 1511833 ns | 1598729.5 ns | 0.95 |
| array/shared/iteration/findmin/2d | 1630792 ns | 1642520.5 ns | 0.99 |
| array/shared/copy | 250604 ns | 253958 ns | 0.99 |
| array/permutedims/4d | 2465500 ns | 2460792 ns | 1.00 |
| array/permutedims/2d | 1249208.5 ns | 1249583.5 ns | 1.00 |
| array/permutedims/3d | 1743167 ns | 1743375 ns | 1.00 |
| metal/synchronization/stream | 14875 ns | 14875 ns | 1 |
| metal/synchronization/context | 15708 ns | 15500 ns | 1.01 |

This comment was automatically generated by workflow using github-action-benchmark.
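
For reading the Ratio column, here is a minimal sketch that recomputes it for a few of the accumulate rows. It assumes Ratio is current time divided by previous time, which matches the printed values, so anything above 1 means this PR is slower. The `classify` helper and its 10% tolerance are hypothetical and not part of github-action-benchmark.

```julia
using Printf

# (name, current ns, previous ns), copied from the accumulate rows of the benchmark table.
rows = [
    ("array/accumulate/Int64/1d",        2378188.0,  1336875.0),
    ("array/accumulate/Int64/dims=1L",   6595145.5,  11644666.5),
    ("array/accumulate/Int64/dims=2L",   18580062.5, 9900979.5),
    ("array/accumulate/Float32/dims=1L", 5082146.0,  9898709.0),
    ("array/accumulate/Float32/dims=2L", 14983750.0, 7337354.0),
]

# Assumption: Ratio = current / previous, rounded to two decimals.
ratio(current, previous) = round(current / previous; digits = 2)

# Hypothetical tolerance band; the thread does not state a threshold.
function classify(r; tol = 0.10)
    r > 1 + tol ? "slower" : r < 1 - tol ? "faster" : "unchanged"
end

for (name, cur, prev) in rows
    r = ratio(cur, prev)
    @printf("%-34s %.2f  %s\n", name, r, classify(r))
end
```

Under that reading, the `dims=1L` cases get roughly twice as fast with this PR while the `dims=2L` cases get roughly twice as slow.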

Your PR requires formatting changes to meet the project's style guidelines. Suggested changes:
diff --git a/test/runtests.jl b/test/runtests.jl
index 9b6b0c3d..6d16c110 100644
--- a/test/runtests.jl
+++ b/test/runtests.jl
@@ -11,7 +11,7 @@ if parse(Bool, get(ENV, "BUILDKITE", "false"))
 end
 using Pkg
-Pkg.add(url="https://github.com/christiangnrd/GPUArrays.jl", rev="accumulatetests")
+Pkg.add(url = "https://github.com/christiangnrd/GPUArrays.jl", rev = "accumulatetests")
 # Quit without erroring if Metal loaded without issues on unsupported platforms
 if !Sys.isapple()

As expected, some small regressions for most accumulate benchmarks, with a massive regression when accumulating along rows of a 3x1000000 matrix.
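
To make the matrix case concrete, here is a CPU-only sketch using `Base.accumulate`. It assumes the `dims=2L` benchmark scans along the second dimension of a wide 3x1000000 matrix; the actual benchmark definitions are not shown in this thread, and the PR itself runs on Metal arrays through GPUArrays.jl.

```julia
# CPU-only illustration with Base.accumulate; no GPU package is needed to run this.
# Assumption: the "dims=2L" benchmark scans each row of a wide matrix like this one.
A = ones(Int64, 3, 1_000_000)

# Scan along dimension 2: each of the 3 rows becomes a running sum over 1_000_000 elements.
along_rows = accumulate(+, A; dims = 2)

# Scan along dimension 1: each of the 1_000_000 columns becomes a running sum over 3 elements.
along_cols = accumulate(+, A; dims = 1)

@assert along_rows[:, end] == fill(1_000_000, 3)
@assert along_cols[end, :] == fill(3, 1_000_000)
```

Scanning along rows here means only three independent scans, each very long, and since Julia arrays are column-major the elements of one row are not contiguous in memory. That may be part of why this shape is sensitive to the implementation, though the thread does not confirm the cause.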

Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main     #625      +/-   ##
==========================================
- Coverage   80.63%   80.35%   -0.29%
==========================================
  Files          61       60       -1
  Lines        2722     2678      -44
==========================================
- Hits         2195     2152      -43
+ Misses        527      526       -1

I don't see a massive slowdown?

@maleadt The accumulate

Oh OK, I didn't consider 2x a "massive slowdown" :-) Still something to look at of course, but much less dramatic than the 7x regressions we e.g. saw against CUDA.jl's reduction.

b296d15 to 84f519a
Opened to run benchmarks.
Todo: