-
Notifications
You must be signed in to change notification settings - Fork 1.6k
feat: add configurable cache mode (local_cache) with LogicalPlan::Cache (#17297) #17314
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
is this PR work in progress @MrGranday ? |
datafusion/core/src/dataframe/mod.rs
Outdated
let mem_table = MemTable::try_new(schema, partitions)?; | ||
context.read_table(Arc::new(mem_table)) | ||
} else { | ||
// Lazy caching: return LogicalPlan::Cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
self.into_parts() can split the df into state and plan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks I’ll use self.into_parts()
to split the state and plan instead of manual cloning.
@@ -946,6 +947,18 @@ impl DefaultPhysicalPlanner { | |||
)) | |||
} | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think that default planner should support LogicalPlan::Cache
as it is implementation implementation specific
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense I’ll remove the cache handling from DefaultPhysicalPlanner
and keep it implementation-specific.
datafusion/core/src/dataframe/mod.rs
Outdated
projection_requires_validation: true, | ||
} | ||
pub async fn cache(self) -> Result<DataFrame> { | ||
if self.session_state.config.local_cache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should come from configuration option
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it I’ll make sure this comes directly from a configuration option instead of hardcoding here.
datafusion/core/src/dataframe/mod.rs
Outdated
let lineage = self.to_logical_plan(); // get the logical plan so far | ||
Ok(DataFrame::new( | ||
(*self.session_state).clone(), | ||
LogicalPlan::Cache { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LogicalPlan::Cache does not exist, so it should be created, also proto should be created as well.
I believe that apart from id and lineage we may need session_id as parameter as well, as caches are tied to session
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood I’ll introduce a proper LogicalPlan::Cache
variant with id, lineage, and also include session_id
as you suggested. I’ll add the corresponding proto support too.
Updates in this PR:
This should fully resolve issue #17297 |
Sorry @MrGranday i see no changes in cache logic, just comment update. |
…n in parquet tests
…anday/datafusion into fix-17297-cache-distributed
…anday/datafusion into fix-17297-cache-distributed
@milenkovicm can you review it again have edited the |
@MrGranday one question regarding your code, do you use any of AI tools to generate this PR? |
Which issue does this PR close?
DataFrame.cache()
does not work in distributed environments #17297Rationale for this change
Currently,
DataFrame.cache()
always performs eager caching using an in-memoryMemTable
.This works fine in local mode but causes problems in distributed environments (e.g., Ballista),
where caching should be deferred until distributed execution.
This PR introduces a configuration flag (
local_cache
) to support both eager and lazy caching.What changes are included in this PR?
LogicalPlan::Cache { id, lineage }
for lazy caching.DataFrame.cache()
:local_cache = true
→ uses current eager caching withMemTable
.local_cache = false
→ returns aLogicalPlan::Cache
node for lazy evaluation.LogicalPlan::Cache
.Are these changes tested?
Are there any user-facing changes?
datafusion.execution.local_cache = true
(default) → eager caching.datafusion.execution.local_cache = false
→ lazy caching withLogicalPlan::Cache
.This is a non-breaking change because the default behavior remains unchanged.