Improve StreamingResult iteration performance with parallel prefetching #157
Open
douglas wants to merge 4 commits into rinsed-org:master from douglas:dsa/improve-streaming-result-iteration-performance
Commits (4):
- 8fec67d (douglas): Add configurable thread pool for parallel partition prefetching in St…
- 08e403b (douglas): Add ensure block for thread pool cleanup on exceptions
- 3f18852 (douglas): [Passing-by] Updating the Gemfile for the 1.5.0 release
- 80498ec (douglas): Address PR review feedback
Diff for `RubySnowflake::StreamingResult`:

```diff
@@ -6,40 +6,52 @@
 module RubySnowflake
   class StreamingResult < Result
-    def initialize(partition_count, row_type_data, retreive_proc)
+    def initialize(partition_count, row_type_data, retreive_proc, prefetch_threads: 1)
+      raise ArgumentError, "prefetch_threads must be a positive integer, got: #{prefetch_threads}" unless prefetch_threads.is_a?(Integer) && prefetch_threads > 0
+
       super(partition_count, row_type_data)
       @retreive_proc = retreive_proc
+      @prefetch_threads = prefetch_threads
     end

     def each
       return to_enum(:each) unless block_given?

-      thread_pool = Concurrent::FixedThreadPool.new 1
+      thread_pool = Concurrent::FixedThreadPool.new(@prefetch_threads)

-      data.each_with_index do |_partition, index|
-        next_index = [index+1, data.size-1].min
-        if data[next_index].nil? # prefetch
-          data[next_index] = Concurrent::Future.execute(executor: thread_pool) do
-            @retreive_proc.call(next_index)
+      begin
+        data.each_with_index do |_partition, index|
+          @prefetch_threads.times do |offset|
+            next_index = index + offset + 1
+            break if next_index >= data.size
+
+            if data[next_index].nil?
+              data[next_index] = Concurrent::Future.execute(executor: thread_pool) do
+                @retreive_proc.call(next_index)
+              end
+            end
           end
-        end

-        if data[index].is_a? Concurrent::Future
-          data[index] = data[index].value # wait for it to finish
-        end
+          if data[index].is_a? Concurrent::Future
+            data[index] = data[index].value # wait for it to finish
+          end

-        data[index].each do |row|
-          yield wrap_row(row)
-        end
+          data[index].each do |row|
+            yield wrap_row(row)
+          end

-        # After iterating over the current partition, clear the data to release memory
-        data[index].clear
+          # After iterating over the current partition, clear the data to release memory
+          data[index].clear

-        # Reassign to a symbol so:
-        # - When looking at the list of partitions in `data` it is easier to detect
-        # - Will raise an exception if `data.each` is attempted to be called again
-        # - It won't trigger prefetch detection as `next_index`
-        data[index] = :finished
+          # Reassign to a symbol so:
+          # - When looking at the list of partitions in `data` it is easier to detect
+          # - Will raise an exception if `data.each` is attempted to be called again
+          # - It won't trigger prefetch detection as `next_index`
+          data[index] = :finished
+        end
+      ensure
+        thread_pool.shutdown
+        thread_pool.wait_for_termination(5)
       end
     end
```

Review thread on the `ensure` block:

Contributor: ooh, this is a good catch - we should have been doing that already

Author: 👍
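The sliding-window prefetch this PR implements can be sketched with only the Ruby standard library. This is an illustration, not the gem's API: `each_row`, `fetch`, and `slots` are hypothetical names, and plain `Thread`s stand in for concurrent-ruby's `FixedThreadPool` and `Future`s. The idea is the same: up to `prefetch_threads` partitions ahead of the cursor are fetched concurrently, and iteration blocks only when the partition it needs has not finished.

```ruby
# Illustrative sketch only: plain Threads stand in for
# Concurrent::FixedThreadPool + Concurrent::Future.
def each_row(partition_count, fetch, prefetch_threads: 1)
  unless prefetch_threads.is_a?(Integer) && prefetch_threads.positive?
    raise ArgumentError, "prefetch_threads must be a positive integer, got: #{prefetch_threads}"
  end

  # Each slot is nil (not started), a Thread (fetch in flight),
  # an Array (fetched rows), or :finished (already consumed).
  slots = Array.new(partition_count)

  partition_count.times do |index|
    # Keep a window of up to `prefetch_threads` fetches running ahead.
    prefetch_threads.times do |offset|
      next_index = index + offset + 1
      break if next_index >= partition_count
      slots[next_index] ||= Thread.new(next_index) { |i| fetch.call(i) }
    end

    # Partition 0 is never prefetched by an earlier iteration; fetch it inline.
    slots[index] ||= fetch.call(index)
    # Thread#value blocks until the fetch finishes, like Future#value.
    rows = slots[index].is_a?(Thread) ? slots[index].value : slots[index]
    rows.each { |row| yield row }
    slots[index] = :finished # drop the reference, mirroring the PR's marker
  end
end

# Usage: five "partitions" of two rows each, fetched three ahead.
partitions = Array.new(5) { |i| [i * 2, i * 2 + 1] }
out = []
each_row(5, ->(i) { partitions[i] }, prefetch_threads: 3) { |row| out << row }
# out is [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]: row order is preserved
# even though the fetches overlap.
```

Rows are still yielded strictly in partition order; only the I/O overlaps, which is why the change is safe for callers that depend on ordering.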
New spec file for `RubySnowflake::StreamingResult` (`@@ -0,0 +1,193 @@`):

```ruby
# frozen_string_literal: true

require 'spec_helper'

RSpec.describe RubySnowflake::StreamingResult do
  let(:partition_count) { 5 }
  let(:row_type_data) { [{ "name" => "id", "type" => "fixed" }, { "name" => "value", "type" => "text" }] }
  let(:partitions_data) do
    [
      [[1, "first"], [2, "second"]],
      [[3, "third"], [4, "fourth"]],
      [[5, "fifth"], [6, "sixth"]],
      [[7, "seventh"], [8, "eighth"]],
      [[9, "ninth"], [10, "tenth"]]
    ]
  end
  let(:retrieve_proc) { ->(index) { partitions_data[index] } }

  describe '#initialize' do
    context 'with default prefetch_threads' do
      subject { described_class.new(partition_count, row_type_data, retrieve_proc) }

      it 'initializes with 1 prefetch thread by default' do
        expect(subject.instance_variable_get(:@prefetch_threads)).to eq(1)
      end
    end

    context 'with custom prefetch_threads' do
      subject { described_class.new(partition_count, row_type_data, retrieve_proc, prefetch_threads: 4) }

      it 'initializes with specified prefetch threads' do
        expect(subject.instance_variable_get(:@prefetch_threads)).to eq(4)
      end
    end

    context 'with invalid prefetch_threads' do
      it 'raises ArgumentError for zero' do
        expect { described_class.new(partition_count, row_type_data, retrieve_proc, prefetch_threads: 0) }
          .to raise_error(ArgumentError, /prefetch_threads must be a positive integer/)
      end

      it 'raises ArgumentError for negative values' do
        expect { described_class.new(partition_count, row_type_data, retrieve_proc, prefetch_threads: -1) }
          .to raise_error(ArgumentError, /prefetch_threads must be a positive integer/)
      end
    end
  end

  describe '#each' do
    subject { described_class.new(partition_count, row_type_data, retrieve_proc, prefetch_threads: prefetch_threads) }

    before do
      # Populate first partition (as done in StreamingResultStrategy)
      subject[0] = partitions_data[0]
    end

    context 'with single thread (backward compatible behavior)' do
      let(:prefetch_threads) { 1 }

      it 'iterates through all rows correctly' do
        rows = []
        subject.each { |row| rows << [row["id"], row["value"]] }

        expect(rows).to eq([
          [1, "first"], [2, "second"],
          [3, "third"], [4, "fourth"],
          [5, "fifth"], [6, "sixth"],
          [7, "seventh"], [8, "eighth"],
          [9, "ninth"], [10, "tenth"]
        ])
      end

      it 'clears processed partitions to save memory' do
        rows = []
        subject.each { |row| rows << row }

        # Check that partitions were cleared (marked as :finished)
        expect(subject.instance_variable_get(:@data)[0]).to eq(:finished)
        expect(subject.instance_variable_get(:@data)[1]).to eq(:finished)
      end

      it 'calls retrieve_proc for each partition' do
        call_count = 0
        instrumented_proc = lambda do |index|
          call_count += 1
          partitions_data[index]
        end

        result = described_class.new(partition_count, row_type_data, instrumented_proc, prefetch_threads: 1)
        result[0] = partitions_data[0]

        result.each { |row| row }

        # Should call for partitions 1-4 (partition 0 was pre-populated)
        expect(call_count).to eq(4)
      end
    end

    context 'with multiple threads' do
      let(:prefetch_threads) { 3 }

      it 'iterates through all rows correctly' do
        rows = []
        subject.each { |row| rows << [row["id"], row["value"]] }

        expect(rows).to eq([
          [1, "first"], [2, "second"],
          [3, "third"], [4, "fourth"],
          [5, "fifth"], [6, "sixth"],
          [7, "seventh"], [8, "eighth"],
          [9, "ninth"], [10, "tenth"]
        ])
      end

      it 'prefetches multiple partitions in parallel' do
        # Track concurrent fetches
        concurrent_fetches = []
        mutex = Mutex.new
        instrumented_proc = lambda do |index|
          mutex.synchronize { concurrent_fetches << index }
          sleep 0.01 # Simulate network latency
          partitions_data[index]
        end

        result = described_class.new(partition_count, row_type_data, instrumented_proc, prefetch_threads: 3)
        result[0] = partitions_data[0]

        result.each { |row| row }

        # With 3 threads, should prefetch indices 1, 2, 3 before processing them
        expect(concurrent_fetches).to include(1, 2, 3)
      end

      it 'properly shuts down thread pool' do
        thread_pool = nil
        allow(Concurrent::FixedThreadPool).to receive(:new).and_wrap_original do |method, *args|
          thread_pool = method.call(*args)
          thread_pool
        end

        subject.each { |row| row }

        expect(thread_pool).to be_shutdown
      end
    end

    context 'with more threads than partitions' do
      let(:prefetch_threads) { 10 }

      it 'handles gracefully without errors' do
        rows = []
        expect { subject.each { |row| rows << row } }.not_to raise_error

        expect(rows.length).to eq(10)
      end
    end

    context 'when returning an enumerator' do
      let(:prefetch_threads) { 2 }

      it 'returns an enumerator when no block given' do
        enumerator = subject.each

        expect(enumerator).to be_a(Enumerator)
        expect(enumerator.to_a.length).to eq(10)
      end
    end

    context 'when an exception occurs during iteration' do
      let(:prefetch_threads) { 3 }

      it 'properly shuts down thread pool even on exception' do
        thread_pool = nil
        allow(Concurrent::FixedThreadPool).to receive(:new).and_wrap_original do |method, *args|
          thread_pool = method.call(*args)
          thread_pool
        end

        # Raise exception after processing 2 rows
        count = 0
        expect do
          subject.each do |row|
            count += 1
            raise StandardError, "Test error" if count == 2
          end
        end.to raise_error(StandardError, "Test error")

        # Thread pool should still be shut down
        expect(thread_pool).to be_shutdown
      end
    end
  end
end
```
Review comment (Contributor): I think we need to raise an error if you try to pass a non-positive number for prefetch_threads (I could reasonably see someone trying 0), which I think would cause us to never fetch data.
Reply (Author): Good call. Added an ArgumentError in initialize if prefetch_threads isn't a positive integer; it covers 0, negatives, and non-integer values. Added specs for it too.
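The guard being discussed can be written as a standalone check. This is a hypothetical extraction for illustration (`validate_prefetch_threads!` is not a method in the gem; the PR inlines the same condition in `initialize`):

```ruby
# Hypothetical standalone version of the guard added in this PR.
# Rejects 0, negatives, and non-integers (e.g. 1.5, "2") in one condition.
def validate_prefetch_threads!(value)
  unless value.is_a?(Integer) && value > 0
    raise ArgumentError, "prefetch_threads must be a positive integer, got: #{value}"
  end
  value
end

validate_prefetch_threads!(4)  # returns 4
# validate_prefetch_threads!(0), (-1), and (1.5) each raise ArgumentError
```

Checking `is_a?(Integer)` before `> 0` matters: it short-circuits before comparing, so a non-numeric value raises the intended `ArgumentError` rather than a comparison error.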