Add a SpillingPool to manage collections of spill files #18207
Conversation
Marking as draft for now. Open to input, but it needs a bit more work. I'm still familiarizing myself with the spilling infrastructure.
This PR sets a size limit on spill files: when the size exceeds the threshold, the spiller rotates to a new file. I'm wondering why this design? The spill writer and reader can now do streaming reads/writes, so a large spill file usually isn't an issue, unless more parallelism is needed somewhere.
The issue with using a single FIFO file is that you accumulate dead data, bloating disk usage considerably. The idea is to cap that at, say, 100MB and then start a new file, so that once all of the original file has been consumed we can garbage collect it.
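To make the intent concrete, here is a minimal, self-contained sketch of the rotation-and-garbage-collection policy being discussed. It is not the PR's actual `SpillPool` code; names such as `RotatingSpillQueue`, `SpillFileHandle`, and `max_file_size` are hypothetical stand-ins.

```rust
use std::collections::VecDeque;

/// Hypothetical handle for one spill file: how many bytes were written to it
/// and whether the reader has already consumed all of them.
struct SpillFileHandle {
    bytes_written: usize,
    fully_consumed: bool,
}

/// Sketch of the rotation policy: append to the newest file until it reaches
/// `max_file_size`, then start a new one; files whose contents have all been
/// read are dropped, reclaiming disk space.
struct RotatingSpillQueue {
    files: VecDeque<SpillFileHandle>,
    max_file_size: usize,
}

impl RotatingSpillQueue {
    fn new(max_file_size: usize) -> Self {
        Self { files: VecDeque::new(), max_file_size }
    }

    /// Record a write of `nbytes`, rotating to a fresh file when the current
    /// one would exceed the size cap.
    fn write(&mut self, nbytes: usize) {
        let needs_new_file = match self.files.back() {
            Some(f) => f.bytes_written + nbytes > self.max_file_size,
            None => true,
        };
        if needs_new_file {
            self.files.push_back(SpillFileHandle {
                bytes_written: 0,
                fully_consumed: false,
            });
        }
        self.files.back_mut().unwrap().bytes_written += nbytes;
    }

    /// Garbage-collect files the reader has finished with (FIFO order), so a
    /// long-lived spill never accumulates unbounded dead data on disk.
    fn gc_consumed(&mut self) {
        while matches!(self.files.front(), Some(f) if f.fully_consumed) {
            // In a real implementation, dropping the temp-file handle here
            // would delete the underlying file.
            self.files.pop_front();
        }
    }
}

fn main() {
    let mut queue = RotatingSpillQueue::new(100 * 1024 * 1024); // ~100MB cap per file
    queue.write(60 * 1024 * 1024);
    queue.write(60 * 1024 * 1024); // would exceed the cap, so a second file starts
    assert_eq!(queue.files.len(), 2);

    queue.files.front_mut().unwrap().fully_consumed = true;
    queue.gc_consumed(); // the first file can now be reclaimed
    assert_eq!(queue.files.len(), 1);
}
```

The point of the cap is that a fully consumed file can be dropped as soon as the reader moves past it, so disk usage stays bounded by what is still unread plus at most one partially written file, rather than growing with everything ever spilled.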
@2010YOUY01 let me know if that makes sense; there's an example of this issue in #18011.
Pull Request Overview
This PR introduces a `SpillPool` abstraction to centralize the management of spill files with FIFO semantics. The pool handles file rotation, batches multiple record batches into single files up to a configurable size limit, and provides streaming read access to spilled data (a minimal model of these read semantics is sketched after the list below).
Key changes:
- Adds a new `SpillPool` module with FIFO queue semantics for managing spill files
- Integrates `SpillPool` into `RepartitionExec` to replace the previous one-file-per-batch approach
- Adds a new configuration option `max_spill_file_size_bytes` (default 100MB) to control when spill files rotate
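The streaming-read behavior described above can be summarized with a small in-memory model. This is not the PR's API: `SpillPoolModel`, `poll_read`, and `finalized` are illustrative stand-ins, and `Batch` stands in for an Arrow `RecordBatch`. The idea is that a reader yields batches in write order, signals `Pending` while the writer may still append, and only ends the stream after the pool is finalized.

```rust
use std::collections::VecDeque;
use std::task::Poll;

/// Stand-in for an Arrow RecordBatch.
type Batch = Vec<u8>;

/// Minimal model of the reader side of a FIFO spill pool: batches come back
/// in the order they were written, and the stream only ends once the writer
/// has finalized the pool and everything has been consumed.
struct SpillPoolModel {
    queued: VecDeque<Batch>, // batches spilled but not yet read
    finalized: bool,         // writer promised not to append more
}

impl SpillPoolModel {
    fn poll_read(&mut self) -> Poll<Option<Batch>> {
        if let Some(batch) = self.queued.pop_front() {
            Poll::Ready(Some(batch)) // data is available right now
        } else if self.finalized {
            Poll::Ready(None) // writer is done: clean end of stream
        } else {
            Poll::Pending // writer may still append: wait
        }
    }
}

fn main() {
    let mut pool = SpillPoolModel {
        queued: VecDeque::from([vec![1u8], vec![2u8]]),
        finalized: false,
    };
    assert_eq!(pool.poll_read(), Poll::Ready(Some(vec![1u8])));
    assert_eq!(pool.poll_read(), Poll::Ready(Some(vec![2u8])));
    assert_eq!(pool.poll_read(), Poll::Pending); // writer not finalized yet
    pool.finalized = true;
    assert_eq!(pool.poll_read(), Poll::Ready(None)); // end of stream
}
```

This three-way distinction (data ready, finalized and empty, or still waiting) is exactly what several of the review comments below probe in the `RepartitionExec` integration.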
Reviewed Changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| datafusion/physical-plan/src/spill/spill_pool.rs | New module implementing `SpillPool` and `SpillPoolStream` with comprehensive tests |
| datafusion/physical-plan/src/spill/mod.rs | Exports the new `spill_pool` module |
| datafusion/physical-plan/src/repartition/mod.rs | Refactored to use `SpillPool` instead of one-file-per-batch spilling |
| datafusion/common/src/config.rs | Adds the `max_spill_file_size_bytes` configuration option |
| docs/source/user-guide/configs.md | Documents the new `max_spill_file_size_bytes` configuration |
| datafusion/sqllogictest/test_files/information_schema.slt | Updates test expectations to include the new configuration option |
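As a hedged illustration of how the new knob might be adjusted once merged: the exact key path is an assumption here (most execution options live under `datafusion.execution.*`), and the updated `docs/source/user-guide/configs.md` is the authoritative reference for the real name and default.

```rust
use datafusion::prelude::{SessionConfig, SessionContext};

fn main() {
    // Assumed key: `datafusion.execution.max_spill_file_size_bytes`; the PR's
    // configs.md entry defines the actual name (default 100MB per the PR).
    let config = SessionConfig::new()
        .set_u64("datafusion.execution.max_spill_file_size_bytes", 64 * 1024 * 1024);

    // Queries run through this context would then rotate spill files at ~64MB.
    let _ctx = SessionContext::new_with_config(config);
}
```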
```rust
}
// If input is finished, don't poll channel anymore
if self.input_finished {
    continue;
```
Copilot AI · Oct 22, 2025:
This `continue` statement creates an infinite loop when `input_finished` is true and `spill_stream` returns `Pending`. Since the channel won't be polled and `spill_stream` already returned `Pending`, the loop will continue indefinitely without waiting. The logic at lines 1404-1406 should have returned `Poll::Ready(None)` when input is finished and no spilled data is available, but this code path bypasses that check.
Suggested change:
```diff
- continue;
+ return Poll::Ready(None);
```
```rust
}

impl SpillFile {
    fn new(file: RefCountedTempFile, _total_batches: usize, _total_size: usize) -> Self {
```
Copilot AI · Oct 22, 2025:
The parameters `_total_batches` and `_total_size` are unused. If these are intended for future use or debugging, consider adding a comment explaining their purpose. Otherwise, remove them to simplify the API.
Suggested change:
```diff
- fn new(file: RefCountedTempFile, _total_batches: usize, _total_size: usize) -> Self {
+ fn new(file: RefCountedTempFile) -> Self {
```
```rust
// All inputs finished, wait for spill stream to have more data or finish
return Poll::Pending;
```
Copilot AI · Oct 22, 2025:
When `all_inputs_finished` is true and `spill_stream` returns `Pending`, this returns `Pending` without checking whether the spill stream will ever produce more data. According to the `SpillPool` implementation, once finalized, the stream should return `None` when there's no more data. However, if `finalize()` fails or wasn't called, this could hang indefinitely. Consider adding a check to ensure the spill pool was properly finalized before returning `Pending`.
Suggested change:
```diff
- // All inputs finished, wait for spill stream to have more data or finish
- return Poll::Pending;
+ // All inputs finished, check if the spill pool is finalized.
+ let is_finalized = {
+     let pool = self.spill_pool.lock();
+     pool.is_finalized()
+ };
+ if is_finalized {
+     // If finalized and still pending, treat as end of stream to avoid hanging.
+     return Poll::Ready(None);
+ } else {
+     // Not finalized yet, wait for more data or finalization.
+     return Poll::Pending;
+ }
```
```rust
// Flush and finalize the SpillPool
{
    let mut pool = self.spill_pool.lock();
    pool.flush().ok();
```
Copilot AI · Oct 22, 2025:
Silently ignoring flush errors with `.ok()` could lead to data loss if the flush fails. Consider logging the error or returning it to the caller for proper error handling.
Suggested change:
```diff
- pool.flush().ok();
+ if let Err(e) = pool.flush() {
+     return Poll::Ready(Some(Err(DataFusionError::External(Box::new(e)))));
+ }
```
```rust
// Flush and finalize the SpillPool
{
    let mut pool = self.spill_pool.lock();
    pool.flush().ok();
```
Copilot AI · Oct 22, 2025:
Silently ignoring flush errors with `.ok()` could lead to data loss if the flush fails. Consider logging the error or returning it to the caller for proper error handling.
Suggested change:
```diff
- pool.flush().ok();
+ if let Err(e) = pool.flush() {
+     return Poll::Ready(Some(Err(e)));
+ }
```
Addresses #18014 (comment); potentially paves the way to solving #18011 for other operators as well.