perf: add `repeat_slice_n_times` to `MutableBuffer` #8658

rluvaton · 2025-10-20T11:00:59Z

Which issue does this PR close?

N/A

Rationale for this change

I want a fast way to repeat the same value many times.
It will be used in:

perf: add optimized zip implementation for scalars #8653

Once this PR and the one below are merged, DataFusion's scalar-to-array conversion can be updated to use them, which should make it really fast:

perf: add optimized function to create offset with same length #8656

What changes are included in this PR?

Added a function to `MutableBuffer` that repeats a slice a given number of times. It uses a logarithmic number of copies, which reduces the number of memcpy calls.
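To make the "logarithmic" idea concrete, here is a minimal sketch that is not the PR's code. It assumes a plain `Vec<u8>` in place of `MutableBuffer`, uses only safe standard-library calls, and the name `repeat_slice_doubling` is hypothetical. After one seed copy, every pass copies everything written so far, so the number of bulk copies grows roughly with log2(n) instead of n.

```rust
/// Appends `slice` to `buf` `n` times using a doubling strategy.
///
/// Hypothetical stand-in for the PR's function: it works on a `Vec<u8>`
/// (an assumption, not `MutableBuffer`) and uses only safe std calls.
fn repeat_slice_doubling(buf: &mut Vec<u8>, slice: &[u8], n: usize) {
    if slice.is_empty() || n == 0 {
        return;
    }
    let start = buf.len();
    let total = slice.len() * n;
    buf.reserve(total);

    // Seed the output with a single copy of the slice.
    buf.extend_from_slice(slice);
    let mut written = slice.len();

    // Each pass copies what has been written so far, capped at what is
    // still missing. `written` and `total` are both multiples of
    // `slice.len()`, so every chunk holds whole repetitions.
    while written < total {
        let chunk = written.min(total - written);
        buf.extend_from_within(start..start + chunk);
        written += chunk;
    }
}
```

Existing bytes in `buf` are left untouched; only the region after the original length is filled with repetitions.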

Are these changes tested?

Yes

Are there any user-facing changes?

Yes, and added docs

Extracted from:

perf: add optimized zip implementation for scalars #8653

Benchmark results on local machine

| Slice Length | Repetitions (n) | repeat_slice_n_times | extend_from_slice loop | Speedup |
|---|---|---|---|---|
| 3 | 3 | 47.092 ns | 41.910 ns | 0.89x |
| 3 | 64 | 63.548 ns | 222.29 ns | 3.50x |
| 3 | 1024 | 105.57 ns | 3.031 µs | 28.7x |
| 3 | 8192 | 405.71 ns | 24.170 µs | 59.6x |
| 20 | 3 | 48.437 ns | 46.437 ns | 0.96x |
| 20 | 64 | 74.993 ns | 319.04 ns | 4.25x |
| 20 | 1024 | 350.94 ns | 4.437 µs | 12.6x |
| 20 | 8192 | 2.440 µs | 35.524 µs | 14.6x |
| 100 | 3 | 50.369 ns | 47.568 ns | 0.94x |
| 100 | 64 | 119.70 ns | 165.37 ns | 1.38x |
| 100 | 1024 | 1.734 µs | 2.623 µs | 1.51x |
| 100 | 8192 | 10.615 µs | 19.750 µs | 1.86x |

These are the results:

Result

MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=3
                        time:   [46.719 ns 47.092 ns 47.453 ns]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=3
                        time:   [41.833 ns 41.910 ns 41.996 ns]
Found 11 outliers among 100 measurements (11.00%)
  9 (9.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=64
                        time:   [62.935 ns 63.548 ns 64.183 ns]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=64
                        time:   [221.75 ns 222.29 ns 222.86 ns]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=1024
                        time:   [105.15 ns 105.57 ns 106.01 ns]
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=1024
                        time:   [3.0240 µs 3.0308 µs 3.0395 µs]
Found 11 outliers among 100 measurements (11.00%)
  2 (2.00%) low mild
  5 (5.00%) high mild
  4 (4.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=3 n=8192
                        time:   [401.57 ns 405.71 ns 409.94 ns]
Found 6 outliers among 100 measurements (6.00%)
  6 (6.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=3 n=8192
                        time:   [24.124 µs 24.170 µs 24.222 µs]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=3
                        time:   [48.287 ns 48.437 ns 48.606 ns]
Found 8 outliers among 100 measurements (8.00%)
  5 (5.00%) high mild
  3 (3.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=3
                        time:   [46.289 ns 46.437 ns 46.611 ns]
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=64
                        time:   [74.625 ns 74.993 ns 75.395 ns]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=64
                        time:   [318.20 ns 319.04 ns 319.98 ns]
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=1024
                        time:   [346.66 ns 350.94 ns 355.17 ns]
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) low mild
  2 (2.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=1024
                        time:   [4.4251 µs 4.4369 µs 4.4506 µs]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=20 n=8192
                        time:   [2.4336 µs 2.4401 µs 2.4465 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=20 n=8192
                        time:   [35.466 µs 35.524 µs 35.589 µs]
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=3
                        time:   [50.209 ns 50.369 ns 50.530 ns]
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=3
                        time:   [47.439 ns 47.568 ns 47.701 ns]
Found 2 outliers among 100 measurements (2.00%)
  2 (2.00%) high mild
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=64
                        time:   [117.77 ns 119.70 ns 122.00 ns]
Found 12 outliers among 100 measurements (12.00%)
  7 (7.00%) high mild
  5 (5.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=64
                        time:   [164.88 ns 165.37 ns 166.07 ns]
Found 6 outliers among 100 measurements (6.00%)
  5 (5.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=1024
                        time:   [1.7278 µs 1.7335 µs 1.7398 µs]
Found 7 outliers among 100 measurements (7.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  1 (1.00%) high severe
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=1024
                        time:   [2.6176 µs 2.6232 µs 2.6305 µs]
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) high mild
  4 (4.00%) high severe
MutableBuffer repeat slice/repeat_slice_n_times/slice_len=100 n=8192
                        time:   [10.583 µs 10.615 µs 10.649 µs]
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
MutableBuffer repeat slice/extend_from_slice loop/slice_len=100 n=8192
                        time:   [19.471 µs 19.750 µs 20.185 µs]
Found 9 outliers among 100 measurements (9.00%)
  2 (2.00%) high mild
  7 (7.00%) high severe

This will be used in: apache#8653

alamb

Thank you @rluvaton

My only concern with this approach is that it might not be necessary (that is, did you compare it to the simpler strategy?).

alamb · 2025-10-20T17:44:51Z

arrow-buffer/src/buffer/mutable.rs

        }
    }

+    /// Adding to this mutable buffer `slice_to_repeat` repeated `repeat_count` times.


I am wondering how much difference the unsafe logarithmic copying here makes, compared with just ensuring `reserve` is called correctly.

Did you measure against code like this?

    buf.reserve(slice_to_repeat.len() * repeat_count);
    for _ in 0..repeat_count {
        buf.extend_from_slice(slice_to_repeat)
    }
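For reference, the simpler strategy can be written out as a runnable sketch. It assumes a `Vec<u8>` rather than `MutableBuffer`, and both function names are hypothetical. The second function only models how many bulk copies a doubling scheme would issue, which gives a sense of why the two approaches can diverge for large `n`; it is a count of calls, not a performance measurement.

```rust
/// The simple strategy: reserve once, then append the slice `n` times.
/// Issues `n` separate copies. Hypothetical helper on a plain `Vec<u8>`.
fn repeat_slice_simple(buf: &mut Vec<u8>, slice: &[u8], n: usize) {
    buf.reserve(slice.len() * n);
    for _ in 0..n {
        buf.extend_from_slice(slice);
    }
}

/// Models the number of bulk copies a doubling scheme issues for `n`
/// repetitions: one seed copy, then each copy at most doubles the
/// number of repetitions already written.
fn doubling_copy_calls(n: usize) -> usize {
    if n == 0 {
        return 0;
    }
    let mut written = 1; // repetitions written after the seed copy
    let mut calls = 1;
    while written < n {
        written += written.min(n - written);
        calls += 1;
    }
    calls
}
```

Under this model, `n = 8192` needs 14 bulk copies with doubling versus 8192 calls in the loop, while `n = 3` needs 3 either way.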

alamb · 2025-10-20T17:46:43Z

arrow-buffer/src/buffer/mutable.rs

+
+            unsafe {
+                // Get to the start of the data before we started copying anything
+                let src = self.data.as_ptr().add(length_before) as *const u8;


rustc can probably figure it out, but `src` is the same for all loop iterations, so I think it could be pulled out of the loop.

Yeah, I thought about it, but decided not to so that the code for `src` and `dst` stays close together.
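To illustrate the suggestion, here is a sketch of the hoisted form. It is not the code under review: it assumes a `Vec<u8>` instead of `MutableBuffer`, the function name is hypothetical, and only the parameter names are borrowed from the diff. `src` is computed once before the loop, and only `dst` changes per iteration.

```rust
/// Repeats `slice_to_repeat` `repeat_count` times at the end of `buf`,
/// with the loop-invariant `src` pointer hoisted out of the loop.
/// Hypothetical sketch on a `Vec<u8>`, not the PR's implementation.
fn repeat_slice_hoisted(buf: &mut Vec<u8>, slice_to_repeat: &[u8], repeat_count: usize) {
    if slice_to_repeat.is_empty() || repeat_count == 0 {
        return;
    }
    let length_before = buf.len();
    let total = slice_to_repeat.len() * repeat_count;
    buf.reserve(total);
    buf.extend_from_slice(slice_to_repeat);
    let mut written = slice_to_repeat.len();

    // SAFETY: `reserve` guarantees capacity for `length_before + total`
    // bytes and nothing reallocates below. Each copy reads from
    // [0, chunk) and writes to [written, written + chunk) relative to
    // `base`; since chunk <= written the ranges never overlap. `set_len`
    // runs only after all `total` bytes are initialized.
    unsafe {
        let base = buf.as_mut_ptr().add(length_before);
        let src = base as *const u8; // same for every iteration
        while written < total {
            let chunk = written.min(total - written);
            let dst = base.add(written);
            std::ptr::copy_nonoverlapping(src, dst, chunk);
            written += chunk;
        }
        buf.set_len(length_before + total);
    }
}
```

Whether hoisting changes the generated code is a separate question; as the comment notes, the compiler may well do it on its own.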

rluvaton · 2025-10-20T20:49:59Z

Yes, I committed the benchmark I tested with

alamb · 2025-10-20T20:51:46Z

Yes, I committed the benchmark I tested with

👍

# Which issue does this PR close?

N/A

# Rationale for this change

Doing `OffsetBuffer::from_lengths(std::iter::repeat_n(size, value.len()));` does not utilize SIMD (I can explain further if you want).

See [GodBolt Link](https://godbolt.org/z/PTsfvfjqx)

Extracted from:

- #8653

Once this and the PR below are merged, DataFusion's scalar-to-array conversion can be updated to use them, which should make it really fast:

- #8658

# What changes are included in this PR?

Added a new function.

# Are these changes tested?

Yes

# Are there any user-facing changes?

Yes
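The rationale above concerns building an offsets list in which every element has the same length. The sketch below contrasts two ways to compute such offsets using only the standard library. Both function names are hypothetical, `i32` offsets are an assumption, and neither function is the API added in that PR. The running-sum version mirrors the general shape of accumulating lengths one by one; the closed-form version validates the largest offset once and then computes each offset independently as `i * length`, a form that compilers can often vectorize (not verified here).

```rust
/// Running sum with a per-element overflow check. Every iteration
/// depends on the previous one. Hypothetical helper, std only.
fn offsets_running_sum(length: usize, count: usize) -> Vec<i32> {
    let mut out = Vec::with_capacity(count + 1);
    let mut acc = 0usize;
    out.push(0);
    for _ in 0..count {
        acc = acc.checked_add(length).expect("usize overflow");
        out.push(i32::try_from(acc).expect("offset overflow"));
    }
    out
}

/// Closed form: check the largest offset once, then compute every
/// offset independently as `i * length`. Hypothetical helper, std only.
fn offsets_closed_form(length: usize, count: usize) -> Vec<i32> {
    let last = length.checked_mul(count).expect("usize overflow");
    i32::try_from(last).expect("offset overflow");
    // `i * length <= last`, so the multiplication and cast cannot overflow.
    (0..=count).map(|i| (i * length) as i32).collect()
}
```

Both functions return `count + 1` offsets starting at 0, so the same input yields identical vectors.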

rluvaton · 2025-10-20T21:51:08Z

@alamb can you please merge?

alamb · 2025-10-21T18:19:37Z

Done

perf: add repeat_slice_n_times to MutableBuffer

5f01d05

this will be used in: - apache#8653

github-actions bot added the arrow (Changes to the arrow crate) label Oct 20, 2025

rluvaton mentioned this pull request Oct 20, 2025

perf: add optimized zip implementation for scalars #8653

Open

5 tasks

rluvaton added a commit to rluvaton/arrow-rs that referenced this pull request Oct 20, 2025

updated with apache#8658 changes

694eb72

simplify implementation

72070dc

This was referenced Oct 20, 2025

perf: add optimized function to create offset with same length #8656

Merged

feat: add new_repeated to ByteArray #8659

Open

rluvaton added 2 commits October 20, 2025 15:49

trigger ci

e3bc9ac

cleanup

8025f6b

rluvaton mentioned this pull request Oct 20, 2025

[EPIC] A collection of items to improve CASE performance apache/datafusion#18075

Open

10 tasks

alamb approved these changes Oct 20, 2025

View reviewed changes

rluvaton added 2 commits October 20, 2025 23:40

add benchmark

4b45127

add benchmark

0adfd3b

alamb merged commit 94d51f4 into apache:main Oct 21, 2025
28 checks passed

rluvaton deleted the add-push-repeated-slice branch October 21, 2025 18:24

pepijnve mentioned this pull request Oct 23, 2025

Optimize merging of partial case expression results apache/datafusion#18152

Open
