Skip to content

Conversation

adriangb
Copy link
Contributor

@adriangb adriangb commented Aug 27, 2025

A piece of #17273 (comment).

This refactor will allow us to evolve the scan() API while minimizing API churn, e.g. to:

  1. Incorporate sorting information from Push down preferred sorts into TableScan logical plan node #17337
  2. Allow scan() to more easily evaluate some filters (Add TableProvider::scan_with_args to support pushdown sorting #17273 (comment))

@github-actions github-actions bot added core Core DataFusion crate catalog Related to the catalog crate labels Aug 27, 2025
@adriangb adriangb force-pushed the refactor-scan-to-scan-with-args branch from 2818549 to f3273b7 Compare August 27, 2025 19:48
@adriangb adriangb force-pushed the refactor-scan-to-scan-with-args branch 2 times, most recently from 96ea7e8 to a523803 Compare September 4, 2025 17:45
@adriangb adriangb requested review from xudong963 and alamb September 5, 2025 16:21
Comment on lines 1169 to 1179
let options = ScanArgs::default()
.with_projection(projection.cloned())
.with_filters(Some(filters.to_vec()))
.with_limit(limit);
Ok(self.scan_with_args(state, options).await?.plan())
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the biggest question here is how to handle backwards compatibility. I implemented both methods in terms of the other which is nice because:

  1. Existing users are not forced to implement scan_with_args
  2. We can start using scan_with_args in the physical planner.

But if you implement neither I think you'll get a recursion error.

I also chose not to deprecate scan() to (1) avoid a bunch of code churn mostly for us and (2) until we are confident that scan_with_args is a workable evolution

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the recursion error is unfortunate -- maybe we could avoid problems with good enough documentation but it seems like a rough edge on one of the core experiences (TableProvider)

Here is an alternate proposal:

  1. leave scan required (and leave scan_with_args in terms of scan)

Once we are happy with scan_with_args we can then properly deprecate (maybe even remove scan) in favor of scan_with_args as a follow on PR (in some future PR)

This approach has the downside that it is more complicated to implement scan_with_args as you will be forced to also provide a not implemented error or something.

The upside is that there is no change needed to existing code and those who aren't yet interested in scan_with_args won't need a change

I also chose not to deprecate scan() to (1) avoid a bunch of code churn mostly for us and (2) until we are confident that scan_with_args is a workable evolution

This seems like a good idea to me

@adriangb
Copy link
Contributor Author

@alamb could you give this a look? I don't think this is blocked by anything and we already discussed it a bit offline.

@alamb
Copy link
Contributor

alamb commented Sep 11, 2025

@alamb could you give this a look? I don't think this is blocked by anything and we already discussed it a bit offline.

I will try and review it either later today or tomorrow

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @adriangb and @xudong963 -- in general I think this PR is a nice improvement and step in the right direction.

I had some small API suggestions (mostly to avoid cloning)

my only blocking concern the infinite recursion problem -- I left a suggestion for how to address it. Let me know what you think

Comment on lines 1169 to 1179
let options = ScanArgs::default()
.with_projection(projection.cloned())
.with_filters(Some(filters.to_vec()))
.with_limit(limit);
Ok(self.scan_with_args(state, options).await?.plan())
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the recursion error is unfortunate -- maybe we could avoid problems with good enough documentation but it seems like a rough edge on one of the core experiences (TableProvider)

Here is an alternate proposal:

  1. leave scan required (and leave scan_with_args in terms of scan)

Once we are happy with scan_with_args we can then properly deprecate (maybe even remove scan) in favor of scan_with_args as a follow on PR (in some future PR)

This approach has the downside that it is more complicated to implement scan_with_args as you will be forced to also provide a not implemented error or something.

The upside is that there is no change needed to existing code and those who aren't yet interested in scan_with_args won't need a change

I also chose not to deprecate scan() to (1) avoid a bunch of code churn mostly for us and (2) until we are confident that scan_with_args is a workable evolution

This seems like a good idea to me

///
/// Returns a cloned copy of the projection column indices, or `None` if
/// no projection was specified (meaning all columns should be included).
pub fn projection(&self) -> Option<Vec<usize>> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should return a reference (Option<&Vec<usize>>) to avoid forcing clones even if the caller doesn't need a copy

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
pub fn projection(&self) -> Option<Vec<usize>> {
pub fn projection(&self) -> Option<&Vec<usize>> {

/// Get the filter expressions for the scan.
///
/// Returns a reference to the filter expressions, or `None` if no filters were specified.
pub fn filters(&self) -> Option<&[Expr]> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) I recommend returing the raw Vec reference so the caller can use Vec specific functions

Suggested change
pub fn filters(&self) -> Option<&[Expr]> {
pub fn filters(&self) -> Option<&Vec<Expr>> {

Comment on lines 422 to 425
/// `ScanResult` encapsulates the [`ExecutionPlan`] produced by a table scan,
/// providing a typed return value instead of returning the plan directly.
/// This allows for future extensibility of scan results without breaking
/// the API.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How much value does this comment add? While accurate I don't think it provides much value and just makes the code harder to read (sounds like an LLM wrote it lol)

///
/// Returns a cloned reference to the [`ExecutionPlan`] that will perform
/// the actual table scanning and data retrieval.
pub fn plan(&self) -> Arc<dyn ExecutionPlan> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I also recommend here providing this as a reference as well as providing a way to get the plan without a clone

Suggested change
pub fn plan(&self) -> Arc<dyn ExecutionPlan> {
pub fn plan(&self) -> &Arc<dyn ExecutionPlan> {

And then also add a function like ScanResult::into_inner to return the inner plan

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another nice usability thing would be to provide a From impl so we can make thse like ScanResult::from(plan)

.scan(session_state, projection.as_ref(), &filters, *fetch)
.await?
let opts = ScanArgs::default()
.with_projection(projection.clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a reason we need to cone the projection here? I think it would be better and avoid the clone and pass in the arguments via reference (so something like ScanArgs<'a>)

@alamb alamb changed the title Refactor TableProvider::scan into TableProvider::scan_with_args Add TableProvider::scan_with_args Sep 12, 2025
@adriangb adriangb force-pushed the refactor-scan-to-scan-with-args branch from 58845c0 to 1d5bad1 Compare September 12, 2025 16:16
@adriangb
Copy link
Contributor Author

@alamb I believe I've addressed your feedback, thank you for the review

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @adriangb

I left some small suggestions, but nothing we can't fix after merge (or never)

@adriangb adriangb merged commit 7b45934 into apache:main Sep 15, 2025
28 checks passed
@adriangb adriangb deleted the refactor-scan-to-scan-with-args branch September 15, 2025 19:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
catalog Related to the catalog crate core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants