-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Add TableProvider::scan_with_args
#17336
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add TableProvider::scan_with_args
#17336
Conversation
2818549
to
f3273b7
Compare
96ea7e8
to
a523803
Compare
let options = ScanArgs::default() | ||
.with_projection(projection.cloned()) | ||
.with_filters(Some(filters.to_vec())) | ||
.with_limit(limit); | ||
Ok(self.scan_with_args(state, options).await?.plan()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the biggest question here is how to handle backwards compatibility. I implemented both methods in terms of the other which is nice because:
- Existing users are not forced to implement
scan_with_args
- We can start using
scan_with_args
in the physical planner.
But if you implement neither I think you'll get a recursion error.
I also chose not to deprecate scan()
to (1) avoid a bunch of code churn mostly for us and (2) until we are confident that scan_with_args
is a workable evolution
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the recursion error is unfortunate -- maybe we could avoid problems with good enough documentation but it seems like a rough edge on one of the core experiences (TableProvider)
Here is an alternate proposal:
- leave
scan
required (and leavescan_with_args
in terms ofscan
)
Once we are happy with scan_with_args
we can then properly deprecate (maybe even remove scan
) in favor of scan_with_args
as a follow on PR (in some future PR)
This approach has the downside that it is more complicated to implement scan_with_args
as you will be forced to also provide a not implemented error
or something.
The upside is that there is no change needed to existing code and those who aren't yet interested in scan_with_args won't need a change
I also chose not to deprecate scan() to (1) avoid a bunch of code churn mostly for us and (2) until we are confident that scan_with_args is a workable evolution
This seems like a good idea to me
@alamb could you give this a look? I don't think this is blocked by anything and we already discussed it a bit offline. |
I will try and review it either later today or tomorrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @adriangb and @xudong963 -- in general I think this PR is a nice improvement and step in the right direction.
I had some small API suggestions (mostly to avoid cloning)
my only blocking concern the infinite recursion problem -- I left a suggestion for how to address it. Let me know what you think
let options = ScanArgs::default() | ||
.with_projection(projection.cloned()) | ||
.with_filters(Some(filters.to_vec())) | ||
.with_limit(limit); | ||
Ok(self.scan_with_args(state, options).await?.plan()) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the recursion error is unfortunate -- maybe we could avoid problems with good enough documentation but it seems like a rough edge on one of the core experiences (TableProvider)
Here is an alternate proposal:
- leave
scan
required (and leavescan_with_args
in terms ofscan
)
Once we are happy with scan_with_args
we can then properly deprecate (maybe even remove scan
) in favor of scan_with_args
as a follow on PR (in some future PR)
This approach has the downside that it is more complicated to implement scan_with_args
as you will be forced to also provide a not implemented error
or something.
The upside is that there is no change needed to existing code and those who aren't yet interested in scan_with_args won't need a change
I also chose not to deprecate scan() to (1) avoid a bunch of code churn mostly for us and (2) until we are confident that scan_with_args is a workable evolution
This seems like a good idea to me
datafusion/catalog/src/table.rs
Outdated
/// | ||
/// Returns a cloned copy of the projection column indices, or `None` if | ||
/// no projection was specified (meaning all columns should be included). | ||
pub fn projection(&self) -> Option<Vec<usize>> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should return a reference (Option<&Vec<usize>>
) to avoid forcing clones even if the caller doesn't need a copy
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pub fn projection(&self) -> Option<Vec<usize>> { | |
pub fn projection(&self) -> Option<&Vec<usize>> { |
datafusion/catalog/src/table.rs
Outdated
/// Get the filter expressions for the scan. | ||
/// | ||
/// Returns a reference to the filter expressions, or `None` if no filters were specified. | ||
pub fn filters(&self) -> Option<&[Expr]> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(minor) I recommend returing the raw Vec reference so the caller can use Vec specific functions
pub fn filters(&self) -> Option<&[Expr]> { | |
pub fn filters(&self) -> Option<&Vec<Expr>> { |
datafusion/catalog/src/table.rs
Outdated
/// `ScanResult` encapsulates the [`ExecutionPlan`] produced by a table scan, | ||
/// providing a typed return value instead of returning the plan directly. | ||
/// This allows for future extensibility of scan results without breaking | ||
/// the API. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How much value does this comment add? While accurate I don't think it provides much value and just makes the code harder to read (sounds like an LLM wrote it lol)
datafusion/catalog/src/table.rs
Outdated
/// | ||
/// Returns a cloned reference to the [`ExecutionPlan`] that will perform | ||
/// the actual table scanning and data retrieval. | ||
pub fn plan(&self) -> Arc<dyn ExecutionPlan> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also recommend here providing this as a reference as well as providing a way to get the plan without a clone
pub fn plan(&self) -> Arc<dyn ExecutionPlan> { | |
pub fn plan(&self) -> &Arc<dyn ExecutionPlan> { |
And then also add a function like ScanResult::into_inner
to return the inner plan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another nice usability thing would be to provide a From
impl so we can make thse like ScanResult::from(plan)
.scan(session_state, projection.as_ref(), &filters, *fetch) | ||
.await? | ||
let opts = ScanArgs::default() | ||
.with_projection(projection.clone()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a reason we need to cone the projection here? I think it would be better and avoid the clone and pass in the arguments via reference (so something like ScanArgs<'a>
)
TableProvider::scan_with_args
58845c0
to
1d5bad1
Compare
@alamb I believe I've addressed your feedback, thank you for the review |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me -- thank you @adriangb
I left some small suggestions, but nothing we can't fix after merge (or never)
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
A piece of #17273 (comment).
This refactor will allow us to evolve the
scan()
API while minimizing API churn, e.g. to:TableScan
logical plan node #17337scan()
to more easily evaluate some filters (AddTableProvider::scan_with_args
to support pushdown sorting #17273 (comment))