Redesign ownership model between FileScanConfig and FileSources #17242
Conversation
I was able to identify the root cause behind several test failures related to roundtripping physical plans in the impacted test cases.

tl;dr, we expect … On … For example, in …

Thoughts on fixes

I see 2 possible directions for resolving the test failures: …

I'm leaning towards 1 since it reflects the current lossless boundary and avoids overcomplicating serialization. But I'm curious if people have any opinions. cc @adriangb @berkaysynnada @mbrubeck @xudong963 @comphead @blaginin @alamb
I agree that there were always losses in serialization / deserialization, e.g. loss of custom …
Hi, this PR is ready for review -- it's quite sizable and touches 55+ files.

I've split up the PR into 2 commits:

- Move all fields shared across FileSources into FileScanConfig
- Have FileSources own FileScanConfig and directly impl DataSource

I would recommend starting from datafusion/datasource/src/file_scan_config.rs to see which fields were added to FileScanConfig, then looking at all concrete types that impl FileSource, like ParquetSource. datafusion/datasource/src/file.rs is the main change, as it implements DataSource for all T: FileSource.
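To make that last point concrete, here is a minimal, self-contained sketch of the blanket-impl pattern described above. The ToyDataSource / ToyFileSource names are stand-ins invented for illustration, not the actual DataFusion traits.

```rust
use std::sync::Arc;

// Stand-in for DataSource: the generic, execution-facing trait.
trait ToyDataSource {
    fn describe(&self) -> String;
}

// Stand-in for FileSource: format-specific behavior plus access to the
// shared scan configuration it now owns.
trait ToyFileSource {
    fn config(&self) -> &str; // stands in for &FileScanConfig
    fn format_name(&self) -> &str;
}

// Shared behavior lives in one place: every file source automatically gets a
// ToyDataSource implementation derived from its config and format hooks.
impl<T: ToyFileSource> ToyDataSource for T {
    fn describe(&self) -> String {
        format!("{} scan over {}", self.format_name(), self.config())
    }
}

struct ToyParquetSource {
    config: String,
}

impl ToyFileSource for ToyParquetSource {
    fn config(&self) -> &str {
        &self.config
    }
    fn format_name(&self) -> &str {
        "parquet"
    }
}

fn main() {
    let source = ToyParquetSource { config: "3 file groups".into() };
    // ToyParquetSource only implemented ToyFileSource, but the blanket impl
    // lets it be used anywhere a ToyDataSource is expected.
    let ds: Arc<dyn ToyDataSource> = Arc::new(source);
    println!("{}", ds.describe());
}
```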
-        let file_source = Arc::new(
-            ParquetSource::default()
+        let file_scan_config = FileScanConfigBuilder::new(object_store_url, schema)
+            .with_limit(limit)
+            .with_projection(projection.cloned())
+            .with_file(partitioned_file)
+            .build();
+
+        let file_source =
+            ParquetSource::new(TableParquetOptions::default(), file_scan_config.clone())
                 // provide the predicate so the DataSourceExec can try and prune
                 // row groups internally
                 .with_predicate(predicate)
                 // provide the factory to create parquet reader without re-reading metadata
-                .with_parquet_file_reader_factory(Arc::new(reader_factory)),
-        );
-        let file_scan_config =
-            FileScanConfigBuilder::new(object_store_url, schema, file_source)
-                .with_limit(limit)
-                .with_projection(projection.cloned())
-                .with_file(partitioned_file)
-                .build();
+                .with_parquet_file_reader_factory(Arc::new(reader_factory));

         // Finally, put it all together into a DataSourceExec
         Ok(DataSourceExec::from_data_source(file_scan_config))
You'll see this pattern a lot in this PR.
The old flow goes something like:
- Define the file source
- Define the file scan config and move the file source inside
- Derive a data source plan from the file scan config
Inside this flow, there's a circular dependency (call file source from config, create file opener from file source but also pass in the config).
The new flow goes something like:
- Define the config
- Define the file source which now owns the config
- Derive a data source plan from the file source (see the sketch below)
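A compact, purely illustrative sketch of that ownership change, using trimmed-down hypothetical types rather than the real FileScanConfig / FileSource APIs:

```rust
// Old shape: the config owns the source, but opening files still needs the
// config handed back in, which is the circular dependency described above.
struct OldFileScanConfig {
    source: Box<OldFileSource>,
}
struct OldFileSource;
impl OldFileSource {
    fn create_file_opener(&self, _config: &OldFileScanConfig) { /* ... */ }
}

// New shape: the source owns the config, so it already has everything it needs.
struct NewFileScanConfig;
struct NewFileSource {
    config: NewFileScanConfig,
}
impl NewFileSource {
    fn create_file_opener(&self) {
        let _config = &self.config; // no external config parameter required
    }
}

fn main() {
    // Old flow: fetch the source from the config, then pass the config back in.
    let old_config = OldFileScanConfig { source: Box::new(OldFileSource) };
    old_config.source.create_file_opener(&old_config);

    // New flow: the source is self-sufficient.
    let new_source = NewFileSource { config: NewFileScanConfig };
    new_source.create_file_opener();
}
```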
The biggest point of the refactor
just small comments for now. will review more tomorrow
-        let file_source = Arc::new(
-            ParquetSource::default()
+        let file_scan_config = FileScanConfigBuilder::new(object_store_url, schema)
+            .with_limit(limit)
+            .with_projection(projection.cloned())
+            .with_file(partitioned_file)
+            .build();
+
+        let file_source =
+            ParquetSource::new(TableParquetOptions::default(), file_scan_config.clone())
                 // provide the predicate so the DataSourceExec can try and prune
                 // row groups internally
                 .with_predicate(predicate)
                 // provide the factory to create parquet reader without re-reading metadata
-                .with_parquet_file_reader_factory(Arc::new(reader_factory)),
-        );
-        let file_scan_config =
-            FileScanConfigBuilder::new(object_store_url, schema, file_source)
-                .with_limit(limit)
-                .with_projection(projection.cloned())
-                .with_file(partitioned_file)
-                .build();
+                .with_parquet_file_reader_factory(Arc::new(reader_factory));
This is just moving code around to initialize the FileScanConfig before the ParquetSource ✅
    .with_schema(schema)
    .with_batch_size(8192)
    .with_projection(&scan_config);
It makes sense that the schema, batch size and projection are not properties inherent to CSVs and thus should be part of FileScanConfig. In fact they are currently duplicated!
I'll start reviewing in the next two days.
Overall this looks really nice!

Big picture, the idea of having the shared behavior via impl <T: FileSource> DataSource for &T is really smart: it gives us code sharing / common behavior while allowing each FileSource to override the implementations to specialize (e.g. Parquet for projection pushdown).

I have left some comments and questions and would like to see some other reviews as well before approving.

But again, overall, I commend you on the attention to detail in this PR. It is unfortunately massive and hard to review, but I think the organization into stacked commits you did is good and I don't see any way to split this up better.
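As a rough illustration of how a blanket impl can still allow per-format specialization, here is a minimal sketch with invented toy traits (ToyFileSource, ToyDataSource); it is not the actual DataFusion code, just the shape of the pattern: the shared DataSource-level logic delegates to an overridable hook with a sensible default.

```rust
trait ToyFileSource {
    // Default hook: most formats cannot push projections down.
    fn try_pushdown_projection(&self, _columns: &[usize]) -> bool {
        false
    }
}

trait ToyDataSource {
    fn pushdown_projection(&self, columns: &[usize]) -> bool;
}

// Shared behavior for every file source, specialized via the hook above.
impl<T: ToyFileSource> ToyDataSource for T {
    fn pushdown_projection(&self, columns: &[usize]) -> bool {
        self.try_pushdown_projection(columns)
    }
}

struct ToyCsvSource;
impl ToyFileSource for ToyCsvSource {} // keeps the default: no pushdown

struct ToyParquetSource;
impl ToyFileSource for ToyParquetSource {
    // A Parquet-like source can override the hook to accept the pushdown.
    fn try_pushdown_projection(&self, columns: &[usize]) -> bool {
        !columns.is_empty()
    }
}

fn main() {
    assert!(!ToyCsvSource.pushdown_projection(&[0, 1]));
    assert!(ToyParquetSource.pushdown_projection(&[0, 1]));
    println!("csv keeps the default, parquet specializes");
}
```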
-#[derive(Clone, Default)]
+#[derive(Clone)]
 pub struct ArrowSource {
     metrics: ExecutionPlanMetricsSet,
     schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
+    config: FileScanConfig,
 }
+
+impl ArrowSource {
+    /// Returns a [`ArrowSource`]
+    pub fn new(config: FileScanConfig) -> Self {
+        Self {
+            metrics: Default::default(),
+            schema_adapter_factory: None,
+            config,
+        }
+    }
+}
This makes sense - all of the FileSources now need a FileScanConfig -> they can't impl Default anymore
fn config(&self) -> &FileScanConfig {
    &self.config
}

fn with_config(&self, config: FileScanConfig) -> Arc<dyn FileSource> {
    let mut this = self.clone();
    this.config = config;

    Arc::new(this)
}
These are the APIs I'm not so sure about
     _partition: usize,
 ) -> Arc<dyn FileOpener> {
     Arc::new(ArrowOpener {
         object_store,
-        projection: base_config.file_column_projection_indices(),
+        projection: self.config().file_column_projection_indices(),
Use self.config instead; there seems to be no point in calling the public method?
Both approaches work in this case because:

- The trait method config() just returns &self.config
- So self.config() and self.config access the same data

Use self.config() for consistency with the trait interface, especially since:

- It maintains abstraction - using the trait method rather than direct field access
- It would work correctly even if the trait implementation changed, though that hasn't happened here
-fn metrics(&self) -> &ExecutionPlanMetricsSet {
+fn metrics_inner(&self) -> &ExecutionPlanMetricsSet {
Hmm, is metrics_inner now a method on FileSource?
fn with_fetch(&self, limit: Option<usize>) -> Option<Arc<dyn DataSource>> {
    let config = FileScanConfigBuilder::from(self.config().clone())
        .with_limit(limit)
        .build();

    Some(self.with_config(config).as_data_source())
}
I wonder if we should invert this:

fn with_fetch(&self, limit: Option<usize>) -> Option<Arc<dyn DataSource>> {
    Some(Arc::new(FileSource::with_fetch(self, limit)) as Arc<dyn DataSource>)
}

Where FileSource::with_fetch is something along the lines of:

self.config = self.config.with_fetch(limit)

Basically, can we make FileScanConfig an implementation detail one level lower?
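A minimal, self-contained sketch of that inversion, using hypothetical ToyScanConfig / ToyCsvSource types rather than the real APIs: the limit update happens inside a file-source-level with_fetch that rewrites the config it owns, so callers never touch the config directly.

```rust
#[derive(Clone)]
struct ToyScanConfig {
    limit: Option<usize>,
}

#[derive(Clone)]
struct ToyCsvSource {
    config: ToyScanConfig,
}

impl ToyCsvSource {
    // File-source-level operation: return a copy whose owned config carries
    // the new limit; the config never escapes this method.
    fn with_fetch(&self, limit: Option<usize>) -> Self {
        let mut updated = self.clone();
        updated.config.limit = limit;
        updated
    }
}

fn main() {
    let source = ToyCsvSource { config: ToyScanConfig { limit: None } };
    let limited = source.with_fetch(Some(10));
    println!("{:?}", limited.config.limit); // prints Some(10)
}
```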
fn as_file_source(&self) -> Option<Arc<dyn FileSource>> {
    None
}
Wondering / questioning this method
MemorySourceConfig impls DataSource but not FileSource
Right but I wonder if this is the right way to go about that or if whatever is needing this sort of down casting should be a part of the trait
if let Some(file_source) = self.data_source.as_file_source() {
    if let Some(file_group) = file_source.config().file_groups.get(partition)
    {
I guess the point is that not all data sources are file sources, and for the ones that aren't, partitioning doesn't necessarily apply.

Can't we make this a method on the DataSource trait, so that we can implement it as Ok(Statistics::new_unknown()) for MemoryDataSource and copy this code into impl <T: FileScan> DataSource for &T?
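A small illustrative sketch of that alternative, with invented names (ToyDataSource, partition_row_count) standing in for the real trait and statistics types: per-partition statistics become a trait method with a conservative default, so no downcasting to a file source is needed.

```rust
trait ToyDataSource {
    // Conservative default: statistics for a partition are unknown.
    fn partition_row_count(&self, _partition: usize) -> Option<usize> {
        None
    }
}

struct ToyMemorySource;
impl ToyDataSource for ToyMemorySource {} // keeps the "unknown" default

struct ToyFileScan {
    rows_per_partition: Vec<usize>,
}
impl ToyDataSource for ToyFileScan {
    // File-backed sources override with real per-partition numbers.
    fn partition_row_count(&self, partition: usize) -> Option<usize> {
        self.rows_per_partition.get(partition).copied()
    }
}

fn main() {
    let mem = ToyMemorySource;
    let files = ToyFileScan { rows_per_partition: vec![100, 250] };
    assert_eq!(mem.partition_row_count(0), None);
    assert_eq!(files.partition_row_count(1), Some(250));
}
```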
pub fn downcast_to_file_source<T: FileSource + 'static>(&self) -> Option<&T> {
    self.data_source().as_any().downcast_ref::<T>()
}
Same about this method; I wonder where it is used. It's a nice convenience, but to me this sort of downcasting is always a smell of missing design on the trait / a leaky abstraction.
I don't think this is necessarily a bad code smell. It's used whenever we need to downcast to a FileSource. It just reduces the # LOC.
I mean that this sort of downcasting being necessary is a code smell.
fn parquet_exec_multiple_sorted(
    output_ordering: Vec<LexOrdering>,
) -> Arc<DataSourceExec> {
Not a problem for today but again the fact that optimizer rules have to special case execution node implementations via downcasting is IMO a smell
-fn data_source_statistics(&self) -> Result<Statistics> {
-    Ok(self.config().projected_stats(self.file_source_statistics()))
+fn data_source_statistics(&self) -> Statistics {
+    let file_source_statistics = self.file_source_statistics();
I don't love data_source_statistics and file_source_statistics; curious if people had thoughts on better names here.
Both statistics()s are good to me.
It would be better to have a high-level diagram to describe the current relationship. Before the PR, it looks like …; after the PR, … -> ? If we have this, it'll definitely be helpful for review, especially for our users to understand the changes in the PR. Also, if we want more eyes on this redesign, we can send an email to https://lists.apache.org/[email protected].
FWIW, this change involves removing the … Here's a diagram that captures the redesign:

[diagram: redesigned relationship between FileSource, FileScanConfig, and DataSource]

You can check it out here.
fn with_config(&self, config: FileScanConfig) -> Arc<dyn FileSource> {
    let mut this = self.clone();
    this.config = config;

    Arc::new(this)
}
I don't see a reason why we don't give a default implementation for this in FileSource
fn config(&self) -> &FileScanConfig {
    &self.config
ditto
fn as_data_source(&self) -> Arc<dyn DataSource> {
    Arc::new(self.clone())
}
ditto
 /// Return execution plan metrics
-fn metrics(&self) -> &ExecutionPlanMetricsSet;
+fn metrics_inner(&self) -> &ExecutionPlanMetricsSet;
Can we avoid the method here? It seems that we can add metrics to FileScanConfig
let mut source = source
    .as_any()
    .downcast_ref::<ParquetSource>()
    .unwrap()
It would be better to replace this with proper error handling.
// Apply schema adapter factory before building the new config
let file_source = source.apply_schema_adapter(&conf)?;

let conf = FileScanConfigBuilder::from(conf)
    .with_source(file_source)
    .build();
Note for later reviewers: these steps have been pulled up into create_file_source_with_schema_adapter
-fn file_source_statistics(&self, config: &FileScanConfig) -> Statistics {
-    let statistics = config.file_source_projected_statistics.clone();
+fn file_source_statistics(&self) -> Statistics {
+    let statistics = self.config.file_source_projected_statistics.clone();
Seems like just a naming issue.
/// Batch size while creating new batches
/// Defaults to [`datafusion_common::config::ExecutionOptions`] batch_size.
pub batch_size: Option<usize>,
/// Expression adapter used to adapt filters and projections that are pushed down into the scan
/// from the logical schema to the physical schema of the file.
pub expr_adapter_factory: Option<Arc<dyn PhysicalExprAdapterFactory>>,

pub file_source_projected_statistics: Statistics,
projected_statistics is good to me
Thanks @friendlymatthew @xudong963 for the diagrams, that's incredibly helpful for high-level review. A couple of questions: …
Thank you for this innovative discussion! Learned a lot from it 💯 After digging a bit into related code, I have a few different opinions. First, I was wondering how …

[diagram: current implementation]

Here is a brief walk-through of the current implementation: …

Some points that seem unreasonable to me are: …

And the final one: …

I'd like to propose a few changes based on the current PR (as a discussion in this refactor topic, not as a review comment, and it might be too large for one PR): …

A diagram covering the above proposed changes (drawio xml embedded):

[diagram: proposed changes]
@waynexia thanks so much for the input! I agree with most of your points, let's see what @friendlymatthew thinks
Don't we need a …?

FWIW I also think we should get rid of the partition value handling in …
Partition is closer to runtime config (passed from the execution plan's exec), so it should be OK to have only one FileFormat for all partitions. The partition number is used to get a set of file segments from FileScanConfig and to mark metrics.
Thanks @waynexia for the diagram and explanation. For example, if a user would like to onboard the ORC file format, … For the diagram, it might still be confusing to have the memory data source under file source configs. Making some changes to the proposal: …

However, as you correctly mentioned, FileFormat, FileOpener, and FileStream can probably be encapsulated into some facade object taking a config and providing …
Which issue does this PR close?

- … FileSources are applying projections? #17095
- … FileScanConfigBuilder, FileScanConfig and FileSource #15952

Rationale for this change
This PR simplifies the relationship between FileSources <--> FileScanConfig <--> DataSource.

Currently, FileScanConfig is a struct used to group common parameters shared across different file sources. However, the existing design also makes FileScanConfig impl DataSource. This means that to construct a data source execution plan, you must derive it from a configuration struct.

This PR removes that indirection. Instead, each FileSource struct holds a FileScanConfig field, and all types that impl FileSource also impl DataSource.

This redesign removes a lot of redundant code. For instance, AvroSource previously duplicated fields from FileScanConfig, which required additional boilerplate to manually get/set values:
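The original snippet illustrating that boilerplate is collapsed in this thread; the following is an invented stand-in (OldStyleAvroSource is not the real AvroSource code) showing the general shape of the duplication: fields mirrored from the scan configuration plus setter boilerplate on the source itself.

```rust
// Illustrative only: the kind of field duplication the old design encouraged.
#[derive(Clone, Default, Debug)]
struct OldStyleAvroSource {
    // Mirrored from the scan configuration:
    batch_size: Option<usize>,
    projection: Option<Vec<usize>>,
}

impl OldStyleAvroSource {
    // Each mirrored field needs its own builder-style setter.
    fn with_batch_size(mut self, batch_size: usize) -> Self {
        self.batch_size = Some(batch_size);
        self
    }
    fn with_projection(mut self, projection: Vec<usize>) -> Self {
        self.projection = Some(projection);
        self
    }
}

fn main() {
    let src = OldStyleAvroSource::default()
        .with_batch_size(8192)
        .with_projection(vec![0, 2]);
    println!("{:?}", src);
}
```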
We still maintain an abstraction boundary between FileSource and DataSource. The DataSource impl remains generic over any T: FileSource.
Are there any user-facing changes?
Yes -- and they are substantial.
Note: the current diff does not yet include deprecation strategies for existing methods, to keep the review process clearer.