Skip to content

Conversation

@LiaCastaneda
Copy link
Contributor

@LiaCastaneda LiaCastaneda commented Oct 31, 2025

Which issue does this PR close?

Rationale for this change

Right now filters cannot pass through AggregateExec nodes, preventing filter pushdown optimization in queries with GROUP BY/DISTINCT operations.

What changes are included in this PR?

  • Implemented gather_filters_for_pushdown() for AggregateExec that allows filters on grouping columns to pass through to children
  • Supports both Pre phase (static filters) and Post phase (dynamic filters from joins)

Essentially, filter will pass through in the scenarios @asolimando mentioned here

Are these changes tested?

Yes, added three tests:

  • test_aggregate_filter_pushdown: Positive case with aggregate functions
  • test_no_pushdown_aggregate_filter_on_non_grouping_column: Negative case ensuring filters on aggregate results are not pushed

Are there any user-facing changes?

@github-actions github-actions bot added core Core DataFusion crate physical-plan Changes to the physical-plan crate labels Oct 31, 2025
@LiaCastaneda LiaCastaneda force-pushed the lia/pushdown-filters-through-aggregateexec branch from ea3e169 to 9b732ed Compare October 31, 2025 11:00
@LiaCastaneda LiaCastaneda force-pushed the lia/pushdown-filters-through-aggregateexec branch from a46f7bc to b977271 Compare October 31, 2025 14:21
@LiaCastaneda LiaCastaneda force-pushed the lia/pushdown-filters-through-aggregateexec branch from b977271 to f93a65a Compare October 31, 2025 14:39
@LiaCastaneda LiaCastaneda marked this pull request as ready for review October 31, 2025 14:45
Copy link
Member

@asolimando asolimando left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

Copy link
Contributor

@adriangb adriangb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall this looks great! I think it just needs maybe 1 more test and is g2g. Thank you @LiaCastaneda !

@LiaCastaneda LiaCastaneda force-pushed the lia/pushdown-filters-through-aggregateexec branch 2 times, most recently from 3ebf259 to d1ecceb Compare November 3, 2025 18:49
@LiaCastaneda LiaCastaneda force-pushed the lia/pushdown-filters-through-aggregateexec branch from d1ecceb to a858b21 Compare November 3, 2025 18:50

#[tokio::test]
async fn test_no_pushdown_filter_on_aggregate_result() {
// Test that filters on aggregate results (not grouping columns) are NOT pushed through
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❤️

@alamb
Copy link
Contributor

alamb commented Nov 3, 2025

Thanks @LiaCastaneda @Tpt @adriangb @xudong963 and @asolimando -- it is a who's who of DataFusion contributors

@alamb alamb added this pull request to the merge queue Nov 3, 2025
@alamb alamb added the performance Make DataFusion faster label Nov 3, 2025
Merged via the queue into apache:main with commit 076b091 Nov 3, 2025
32 checks passed
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Nov 3, 2025
## Which issue does this PR close?

- Closes apache#18399

## Rationale for this change

Right now filters cannot pass through `AggregateExec` nodes, preventing
filter pushdown optimization in queries with GROUP BY/DISTINCT
operations.

## What changes are included in this PR?

- Implemented `gather_filters_for_pushdown()` for `AggregateExec` that
allows filters on grouping columns to pass through to children
- Supports both Pre phase (static filters) and Post phase (dynamic
filters from joins)

Essentially, filter will pass through in the scenarios @asolimando
mentioned
[here](apache#18399 (comment))

## Are these changes tested?

Yes, added three tests:
- `test_aggregate_filter_pushdown`: Positive case with aggregate
functions
- `test_no_pushdown_aggregate_filter_on_non_grouping_column`: Negative
case ensuring filters on aggregate results are not pushed

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

(cherry picked from commit 076b091)
LiaCastaneda added a commit to DataDog/datafusion that referenced this pull request Nov 4, 2025
* Allow filter pushdown through AggregateExec (apache#18404)

## Which issue does this PR close?

- Closes apache#18399

## Rationale for this change

Right now filters cannot pass through `AggregateExec` nodes, preventing
filter pushdown optimization in queries with GROUP BY/DISTINCT
operations.

## What changes are included in this PR?

- Implemented `gather_filters_for_pushdown()` for `AggregateExec` that
allows filters on grouping columns to pass through to children
- Supports both Pre phase (static filters) and Post phase (dynamic
filters from joins)

Essentially, filter will pass through in the scenarios @asolimando
mentioned
[here](apache#18399 (comment))

## Are these changes tested?

Yes, added three tests:
- `test_aggregate_filter_pushdown`: Positive case with aggregate
functions
- `test_no_pushdown_aggregate_filter_on_non_grouping_column`: Negative
case ensuring filters on aggregate results are not pushed

## Are there any user-facing changes?

<!--
If there are user-facing changes then we may require documentation to be
updated before approving the PR.
-->

<!--
If there are any breaking changes to public APIs, please add the `api
change` label.
-->

(cherry picked from commit 076b091)

* physical-plan: push filters down to UnionExec children (apache#18054)

Filters are safe to be pushed down, so we can override the default behavior
here.

Signed-off-by: Alfonso Subiotto Marques <[email protected]>
(cherry picked from commit 0ecd59b)

* fix: prevent UnionExec panic with empty inputs (apache#17449)

* fix: prevent UnionExec panic with empty inputs

This commit fixes a panic in UnionExec when constructed with empty inputs.
Previously, UnionExec::new(vec![]) would cause an index out of bounds panic
at union.rs:542 when trying to access inputs[0].

Changes:
- Made UnionExec::new() return Result<Self> with proper validation
- Made union_schema() return Result<SchemaRef> with empty input checks
- Added descriptive error messages for empty input cases
- Updated all call sites to handle the new Result return type
- Added comprehensive tests for edge cases

Error messages:
- "UnionExec requires at least one input"
- "Cannot create union schema from empty inputs"

The fix maintains backward compatibility for valid inputs while preventing
crashes and providing clear error messages for invalid usage.

Fixes apache#17052

* refactor: address PR review comments for UnionExec empty inputs fix

- Add new try_new method that returns Result<Arc<dyn ExecutionPlan>>
- Deprecate existing new method in favor of try_new
- Optimize single-input case: try_new returns the input directly
- Remove redundant assert!(result.is_err()) from tests
- Rename test_union_multiple_inputs_still_works to test_union_schema_multiple_inputs
- Update all call sites to use appropriate API (try_new for new code, deprecated new for tests)

This maintains backward compatibility while providing better error handling
and optimization for single-input cases.

* Fix cargo fmt and clippy warnings

- Add proper feature gates for parquet_encryption in datasource-parquet
- Format code to pass cargo fmt checks
- All tests passing

* Fix clippy

---------

Co-authored-by: Eeshan <[email protected]>
Co-authored-by: ebembi-crdb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
(cherry picked from commit b122a16)

---------

Signed-off-by: Alfonso Subiotto Marques <[email protected]>
Co-authored-by: Alfonso Subiotto Marqués <[email protected]>
Co-authored-by: EeshanBembi <[email protected]>
Co-authored-by: Eeshan <[email protected]>
Co-authored-by: ebembi-crdb <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Core DataFusion crate performance Make DataFusion faster physical-plan Changes to the physical-plan crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Allow Dynamic Filters Filters to pass through AggregateExec

7 participants