
Conversation

MrGranday

Which issue does this PR close?

Closes #17297 (DataFrame.cache() does not work in distributed environments).

Rationale for this change

Currently, DataFrame.cache() always performs eager caching using an in-memory MemTable.
This works fine in local mode but causes problems in distributed environments (e.g., Ballista),
where caching should be deferred until distributed execution.

This PR introduces a configuration flag (local_cache) to support both eager and lazy caching.

What changes are included in this PR?

  • Added LogicalPlan::Cache { id, lineage } for lazy caching.
  • Updated DataFrame.cache() (see the sketch after this list):
    • If local_cache = true → uses the current eager caching with MemTable.
    • If local_cache = false → returns a LogicalPlan::Cache node for lazy evaluation.
  • Extended physical planner to handle LogicalPlan::Cache.
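
A minimal sketch of the proposed dispatch, assuming the local_cache flag lands as a typed field on ExecutionOptions and the LogicalPlan::Cache variant exists as proposed; the uuid-based id is illustrative, and the sketch uses self.into_parts() as suggested in the review below:

use std::sync::Arc;

use datafusion::dataframe::DataFrame;
use datafusion::datasource::MemTable;
use datafusion::error::Result;
use datafusion::logical_expr::LogicalPlan;
use datafusion::prelude::SessionContext;

// Inside impl DataFrame:
pub async fn cache(self) -> Result<DataFrame> {
    let (state, plan) = self.into_parts();
    // `local_cache` is the ExecutionOptions field this PR proposes.
    if state.config().options().execution.local_cache {
        // Eager path: execute now and materialize into an in-memory table,
        // matching today's behavior.
        let ctx = SessionContext::new_with_state(state);
        let df = ctx.execute_logical_plan(plan).await?;
        let schema = Arc::new(df.schema().as_arrow().clone());
        let batches = df.collect().await?;
        let mem_table = MemTable::try_new(schema, vec![batches])?;
        ctx.read_table(Arc::new(mem_table))
    } else {
        // Lazy path: wrap the plan so a distributed planner decides how
        // and where to cache.
        Ok(DataFrame::new(
            state,
            LogicalPlan::Cache {
                id: uuid::Uuid::new_v4().to_string(),
                lineage: Arc::new(plan),
            },
        ))
    }
}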

Are these changes tested?

  • Builds on existing logical/physical plan tests.
  • No new dedicated distributed caching tests are included in this PR.

Are there any user-facing changes?

  • Yes, a new configuration option:
    • datafusion.execution.local_cache = true (default) → eager caching.
    • datafusion.execution.local_cache = false → lazy caching with LogicalPlan::Cache.

This is a non-breaking change because the default behavior remains unchanged.
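
For illustration, the option could be toggled like any other boolean setting; the key is this PR's proposal, while set_bool is DataFusion's existing typed setter:

use datafusion::prelude::*;

// The key `datafusion.execution.local_cache` is this PR's proposal and does
// not exist upstream yet.
let config = SessionConfig::new().set_bool("datafusion.execution.local_cache", false);
let ctx = SessionContext::new_with_config(config);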

github-actions bot added the core (Core DataFusion crate) label on Aug 25, 2025
@milenkovicm
Contributor

Is this PR a work in progress, @MrGranday?

    let mem_table = MemTable::try_new(schema, partitions)?;
    context.read_table(Arc::new(mem_table))
} else {
    // Lazy caching: return LogicalPlan::Cache
Contributor

self.into_parts() can split the df into state and plan
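
For reference, a one-line sketch of the suggestion (DataFrame::into_parts is an existing API):

// Consumes the DataFrame and returns its pieces by value, avoiding the
// manual clone of self.session_state.
let (state, plan) = df.into_parts();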

Author

Thanks, I'll use self.into_parts() to split the state and plan instead of manually cloning.

@@ -946,6 +947,18 @@ impl DefaultPhysicalPlanner {
))
}

Contributor

I don't think the default planner should support LogicalPlan::Cache, as it is implementation specific.

Author

Makes sense. I'll remove the cache handling from DefaultPhysicalPlanner and keep it implementation-specific.
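
As one illustration of keeping this implementation-specific, a distributed engine could plan a user-defined cache node through DataFusion's ExtensionPlanner, registered only in that engine; the CacheNode type below is hypothetical (it echoes the node mentioned later in this thread):

use std::sync::Arc;

use async_trait::async_trait;
use datafusion::error::Result;
use datafusion::execution::session_state::SessionState;
use datafusion::logical_expr::{LogicalPlan, UserDefinedLogicalNode};
use datafusion::physical_plan::ExecutionPlan;
use datafusion::physical_planner::{ExtensionPlanner, PhysicalPlanner};

/// Hypothetical user-defined node marking a cache boundary
/// (UserDefinedLogicalNode impl omitted for brevity).
#[derive(Debug)]
struct CacheNode {
    id: String,
    lineage: LogicalPlan,
}

struct CachePlanner;

#[async_trait]
impl ExtensionPlanner for CachePlanner {
    async fn plan_extension(
        &self,
        _planner: &dyn PhysicalPlanner,
        node: &dyn UserDefinedLogicalNode,
        _logical_inputs: &[&LogicalPlan],
        physical_inputs: &[Arc<dyn ExecutionPlan>],
        _session_state: &SessionState,
    ) -> Result<Option<Arc<dyn ExecutionPlan>>> {
        if node.as_any().downcast_ref::<CacheNode>().is_some() {
            // A distributed engine would substitute its cache-aware operator
            // here; passing the input through is only a placeholder.
            Ok(Some(physical_inputs[0].clone()))
        } else {
            Ok(None) // not our node; let other planners handle it
        }
    }
}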

    projection_requires_validation: true,
}

pub async fn cache(self) -> Result<DataFrame> {
    if self.session_state.config.local_cache {
Contributor

This should come from a configuration option.
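
A sketch of reading it through DataFusion's typed options tree, assuming this PR adds a local_cache field to ExecutionOptions (the field is hypothetical until the PR lands):

// Reads the proposed flag through ConfigOptions instead of a hand-rolled
// field on SessionConfig; `execution.local_cache` is the PR's proposal.
let local_cache = self.session_state.config().options().execution.local_cache;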

Author

Got it, I'll make sure this comes directly from a configuration option instead of hardcoding it here.

let lineage = self.to_logical_plan(); // get the logical plan so far
Ok(DataFrame::new(
    (*self.session_state).clone(),
    LogicalPlan::Cache {
@milenkovicm (Contributor) Aug 26, 2025

LogicalPlan::Cache does not exist, so it needs to be created, and the corresponding proto definition as well.
I believe that apart from id and lineage we may also need session_id as a parameter, since caches are tied to a session.
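
A hedged sketch of the variant under discussion; the field names follow this thread's proposal (id, lineage) plus the suggested session_id, and none of this exists in DataFusion today:

use std::sync::Arc;

// Sketch only: shown as a standalone enum so the example is self-contained;
// in the PR this would be a new variant on DataFusion's LogicalPlan, plus a
// matching protobuf message for serialization.
#[derive(Debug, Clone)]
pub enum LogicalPlan {
    // ... existing variants elided ...
    Cache {
        /// Unique identifier for the cached result.
        id: String,
        /// The plan whose output is to be cached.
        lineage: Arc<LogicalPlan>,
        /// Session the cache entry is tied to, per the review suggestion.
        session_id: String,
    },
}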

Author

Understood, I'll introduce a proper LogicalPlan::Cache variant with id and lineage, and also include session_id as you suggested. I'll add the corresponding proto support too.

@MrGranday
Author

Updates in this PR:

  • Added DataFrame.cache() method (eager materialization into MemTable)
  • Reverted physical_planner.rs back to its default (no unnecessary changes)
  • Fixed type annotation issue in tests/parquet/mod.rs (make_uint_batches range casts)

This should fully resolve issue #17297
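
For context, the eager behavior kept by this update is the existing one: cache() executes the plan and returns a new DataFrame backed by a MemTable. A usage sketch (the table name t is illustrative):

use datafusion::error::Result;
use datafusion::prelude::*;

async fn example(ctx: &SessionContext) -> Result<()> {
    let df = ctx.sql("SELECT * FROM t").await?;
    // Executes the plan once and stores the result in an in-memory table;
    // subsequent operations on `cached` read from memory.
    let cached = df.cache().await?;
    cached.show().await?;
    Ok(())
}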

MrGranday requested a review from milenkovicm on August 27, 2025, 08:15
@milenkovicm
Contributor

Sorry @MrGranday, I see no changes in the cache logic, just a comment update.

@MrGranday
Author

@milenkovicm can you review it again? I have edited the cache fn and added the CacheNode.

@milenkovicm
Contributor

@MrGranday one question regarding your code: did you use any AI tools to generate this PR?

Successfully merging this pull request may close these issues:

  • DataFrame.cache() does not work in distributed environments