-
Notifications
You must be signed in to change notification settings - Fork 1k
perf: add optimized zip implementation for scalars #8653
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
This is useful for `IF <expr> THEN <scalar> ELSE <scalar> END` TODO: - [ ] Need to add comments if missing - [ ] Add benchmark
| let scalars: Vec<T::Native> = predicate | ||
| .iter() | ||
| .map(|b| if b { then_val } else { else_val }) | ||
| .collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will probably use conditional move
| fn combine_nulls_and_false(predicate: &BooleanArray) -> BooleanBuffer { | ||
| if let Some(nulls) = predicate.nulls().filter(|n| n.null_count() > 0) { | ||
| predicate.values().bitand( | ||
| // nulls are represented as 0 (false) in the values buffer | ||
| nulls.inner(), | ||
| ) | ||
| } else { | ||
| predicate.values().clone() | ||
| } | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm pretty sure there is already a helper function in arrow for this
# Which issue does this PR close? N/A # Rationale for this change I have a PR to improve zip perf for scalar but I don't see any benchmarks for it: - #8653 # What changes are included in this PR? created zip benchmarks for scalar and non scalar with different masks # Are these changes tested? N/A # Are there any user-facing changes? Nope
|
@alamb If you wanna run the benchmarks for zip, there are no more optimization left for this PR, only cleanups, tests and comments I saw for scalars major improvements while in array and scalar regression for some reason (maybe the extra check? even though it is a simple comparison. I run it on bare metal to reduce noise as much as possible) I tests it on: $ neofetch
.-/+oossssoo+/-. ubuntu@ip-
`:+ssssssssssssssssss+:` -----------------------
-+ssssssssssssssssssyyssss+- OS: Ubuntu 24.04.3 LTS x86_64
.ossssssssssssssssssdMMMNysssso. Host: c5.metal 1.0
/ssssssssssshdmmNNmmyNMMMMhssssss/ Kernel: 6.14.0-1011-aws
+ssssssssshmydMMMMMMMNddddyssssssss+ Uptime: 3 hours, 46 mins
/sssssssshNMMMyhhyyyyhmNMMMNhssssssss/ Packages: 921 (dpkg), 5 (snap)
.ssssssssdMMMNhsssssssssshNMMMdssssssss. Shell: bash 5.2.21
+sssshhhyNMMNyssssssssssssyNMMMysssssss+ Terminal: /dev/pts/0
ossyNMMMNyMMhsssssssssssssshmmmhssssssso CPU: Intel Xeon Platinum 8275CL (96) @ 3.900GHz
ossyNMMMNyMMhsssssssssssssshmmmhssssssso Memory: 2144MiB / 193025MiB
+sssshhhyNMMNyssssssssssssyNMMMysssssss+
.ssssssssdMMMNhsssssssssshNMMMdssssssss.
/sssssssshNMMMyhhyyyyhdNMMMNhssssssss/
+sssssssssdmydMMMMMMMMddddyssssssss+
/ssssssssssshdmNNNNmyNMMMMhssssss/
.ossssssssssssssssssdMMMNysssso.
-+sssssssssssssssssyyyssss+-
`:+ssssssssssssssssss+:`
.-/+oossssoo+/-.
|
this will be used in: - apache#8653
# Conflicts: # arrow-buffer/src/buffer/mutable.rs
…e-zip-for-scalars
|
Thank you @rluvaton -- I have scheduled benchmarks for this PR and reviewed the dependent ones. Exciting stuff |
|
🤖 |
|
The zip benchmarks are still running... Maybe we should trim them back a bit |
|
🤖: Benchmark completed Details
|
|
Good, it looks like we have massive speedups |
# Which issue does this PR close? N/A # Rationale for this change doing `OffsetBuffer::from_lengths(std::iter::repeat_n(size, value.len()));` does not utilize SIMD (I explain further if you want) See [GodBolt Link](https://godbolt.org/z/PTsfvfjqx) Extracted from: - #8653 After this and the pr below is merged will improve the datafusion scalar to array to use this and make it really really fast: - #8658 # What changes are included in this PR? added new function # Are these changes tested? yes # Are there any user-facing changes? yes
yes, nice work! |
# Which issue does this PR close? N/A # Rationale for this change I want to repeat the same value multiple times in a very fast way which will be used in: - #8653 After this and the pr below is merged will improve the datafusion scalar to array to use this and make it really really fast: - #8656 # What changes are included in this PR? Created a function in `MutableBuffer` to repeat a slice a number of times in a logarithmic way to reduce memcopy calls # Are these changes tested? Yes # Are there any user-facing changes? Yes, and added docs ------- Extracted from: - #8653 Benchmark results on local machine | Slice Length | Repetitions (n) | repeat_slice_n_times | extend_from_slice loop | Speedup | |--------------|-----------------|----------------------|------------------------|---------| | 3 | 3 | 47.092 ns | 41.910 ns | 0.89x | | 3 | 64 | 63.548 ns | 222.29 ns | 3.50x | | 3 | 1024 | 105.57 ns | 3.031 µs | 28.7x | | 3 | 8192 | 405.71 ns | 24.170 µs | 59.6x | | 20 | 3 | 48.437 ns | 46.437 ns | 0.96x | | 20 | 64 | 74.993 ns | 319.04 ns | 4.25x | | 20 | 1024 | 350.94 ns | 4.437 µs | 12.6x | | 20 | 8192 | 2.440 µs | 35.524 µs | 14.6x | | 100 | 3 | 50.369 ns | 47.568 ns | 0.94x | | 100 | 64 | 119.70 ns | 165.37 ns | 1.38x | | 100 | 1024 | 1.734 µs | 2.623 µs | 1.51x | | 100 | 8192 | 10.615 µs | 19.750 µs | 1.86x | these are the results: <details> <summary>Result</summary> ``` MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=3 time: [46.719 ns 47.092 ns 47.453 ns] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=3 time: [41.833 ns 41.910 ns 41.996 ns] Found 11 outliers among 100 measurements (11.00%) 9 (9.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=64 time: [62.935 ns 63.548 ns 64.183 ns] Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=64 time: [221.75 ns 222.29 ns 222.86 ns] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=1024 time: [105.15 ns 105.57 ns 106.01 ns] Found 1 outliers among 100 measurements (1.00%) 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=1024 time: [3.0240 µs 3.0308 µs 3.0395 µs] Found 11 outliers among 100 measurements (11.00%) 2 (2.00%) low mild 5 (5.00%) high mild 4 (4.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=8192 time: [401.57 ns 405.71 ns 409.94 ns] Found 6 outliers among 100 measurements (6.00%) 6 (6.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=8192 time: [24.124 µs 24.170 µs 24.222 µs] Found 5 outliers among 100 measurements (5.00%) 3 (3.00%) high mild 2 (2.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=3 time: [48.287 ns 48.437 ns 48.606 ns] Found 8 outliers among 100 measurements (8.00%) 5 (5.00%) high mild 3 (3.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=3 time: [46.289 ns 46.437 ns 46.611 ns] Found 6 outliers among 100 measurements (6.00%) 3 (3.00%) high mild 3 (3.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=64 time: [74.625 ns 74.993 ns 75.395 ns] Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=64 time: [318.20 ns 319.04 ns 319.98 ns] Found 8 outliers among 100 measurements (8.00%) 3 (3.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=1024 time: [346.66 ns 350.94 ns 355.17 ns] Found 3 outliers among 100 measurements (3.00%) 1 (1.00%) low mild 2 (2.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=1024 time: [4.4251 µs 4.4369 µs 4.4506 µs] Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low mild 2 (2.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=8192 time: [2.4336 µs 2.4401 µs 2.4465 µs] Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=8192 time: [35.466 µs 35.524 µs 35.589 µs] Found 4 outliers among 100 measurements (4.00%) 1 (1.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=3 time: [50.209 ns 50.369 ns 50.530 ns] Found 5 outliers among 100 measurements (5.00%) 5 (5.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=3 time: [47.439 ns 47.568 ns 47.701 ns] Found 2 outliers among 100 measurements (2.00%) 2 (2.00%) high mild MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=64 time: [117.77 ns 119.70 ns 122.00 ns] Found 12 outliers among 100 measurements (12.00%) 7 (7.00%) high mild 5 (5.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=64 time: [164.88 ns 165.37 ns 166.07 ns] Found 6 outliers among 100 measurements (6.00%) 5 (5.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=1024 time: [1.7278 µs 1.7335 µs 1.7398 µs] Found 7 outliers among 100 measurements (7.00%) 1 (1.00%) low mild 5 (5.00%) high mild 1 (1.00%) high severe MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=1024 time: [2.6176 µs 2.6232 µs 2.6305 µs] Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) high mild 4 (4.00%) high severe MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=8192 time: [10.583 µs 10.615 µs 10.649 µs] Found 3 outliers among 100 measurements (3.00%) 3 (3.00%) high mild MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=8192 time: [19.471 µs 19.750 µs 20.185 µs] Found 9 outliers among 100 measurements (9.00%) 2 (2.00%) high mild 7 (7.00%) high severe ``` </details>
# Conflicts: # arrow-buffer/src/buffer/mutable.rs
Waiting for the PRs below to be merged first:
zipkernel benchmarks #8654 - zip benchmarksThis PR include the following other PRs (unless merged) to make the review easier, so please make sure to review them first
repeat_slice_n_timestoMutableBuffer#8658 - extracted from thisWhich issue does this PR close?
N/A
Rationale for this change
Making zip really fast for scalars
This is useful for
IF <expr> THEN <literal> ELSE <literal> ENDWhat changes are included in this PR?
Created couple of implementation for zipping scalar, for primitive, bytes and fallback
Are these changes tested?
existing tests
Are there any user-facing changes?
new struct
ScalarZipperTODO: