
Conversation

Omega359

Wire up config options using your POC. Tests pass locally.

Two main points:

  • AsyncScalarUDF.invoke_with_args breaking change - removes the config_options arg (sketched below).
  • ffi/udf/mod.rs - two TODOs in there related to adding ConfigOptions to ScalarFunctionArgs. I have no clue how this should work, but I expect what is stubbed in there isn't it.
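For illustration, a rough sketch of the shape of that signature change (hedged: simplified stand-in types, not the exact DataFusion API):

```rust
// Hedged sketch only: stand-ins for the real DataFusion types.
use std::sync::Arc;

struct ConfigOptions;
struct ColumnarValue;
struct ScalarFunctionArgs {
    // ...other fields elided...
    config_options: Arc<ConfigOptions>, // now carried inside the args
}

trait AsyncScalarUDF {
    // Before (removed): the options were a separate parameter.
    // fn invoke_with_args(&self, args: ScalarFunctionArgs,
    //                     config_options: &ConfigOptions) -> ColumnarValue;

    // After: the options ride along inside `args`.
    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> ColumnarValue;
}
```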

zhuqi-lucas-001 and others added 30 commits July 3, 2025 14:32
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.1 to 1.46.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.45.1...tokio-1.46.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.46.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…16615)

* Convert Option<Vec<sort expression>> to Vec<sort expression>

* clippy

* fix comment

* fix doc

* change back to Expr

* remove redundant check

* Improve error message when ScalarValue fails to cast array

The `as_*_array` functions and the `downcast_value!` have the benefit of
reporting the array type when there is a mismatch. This makes the error
message more actionable.
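A small illustration of the difference (hedged: the error wording is approximate, and `as_int32_array` is assumed to live in `datafusion_common::cast`):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array, StringArray};

fn main() {
    let array: ArrayRef = Arc::new(StringArray::from(vec!["a", "b"]));

    // A raw downcast only says "no" on a type mismatch -- no context:
    assert!(array.as_any().downcast_ref::<Int32Array>().is_none());

    // The `as_*_array` helpers / `downcast_value!` instead return an Err
    // that names the actual type, roughly "Expected an Int32Array, got
    // StringArray", which is what makes the cast failure actionable:
    // let ints = datafusion_common::cast::as_int32_array(&array)?;
}
```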

* test
* Add an example of embedding indexes inside a parquet file

* Add page image

* Add prune file example

* Fix clippy

* polish code

* Fmt

* address comments

* Add debug

* Add new example, but it will fail with page index

* add debug

* add debug

* polish

* debug

* Using low level API to support

* polish

* fix

* merge

* fix

* complete solution

* polish comments

* adjust image

* add comments part 1

* pin to new arrow-rs

* pin to new arrow-rs

* add comments part 2

* merge upstream

* merge upstream

* polish code

* Rename example and add it to the list

* Work on comments

* More documentation

* Documentation obsession, encapsulate example

* Update datafusion-examples/examples/parquet_embedded_index.rs

Co-authored-by: Sherin Jacob <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Sherin Jacob <[email protected]>
* Implementation for regex_instr

* linting and typo addressed in bench

* prettier formatting

* scalar_functions_formatting

* linting format macros

* formatting

* address comments to PR

* formatting

* clippy

* fmt

* address docs typo

* remove unnecessary struct and comment

* delete redundant lines
add tests for subexp
correct function signature for benches

* refactor get_index

* comments addressed

* update doc

* clippy upgrade

---------

Co-authored-by: Nirnay Roy <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Dmitrii Blaginin <[email protected]>
…nts (apache#16672)

- Refactored the `DataFusionError` enum to use `Box<T>` for:
  - `ArrowError`
  - `ParquetError`
  - `AvroError`
  - `object_store::Error`
  - `ParserError`
  - `SchemaError`
  - `JoinError`
- Updated all relevant match arms and constructors to handle boxed errors.
- Refactored error-related macros (`arrow_datafusion_err!`, `sql_datafusion_err!`, etc.) to use `Box<T>`.
- Adjusted test cases and error assertions for boxed variants.
- Documentation update to the upgrade guide to explain the required changes and rationale.
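A minimal sketch of the shape of this change (hedged: stand-in types, and the real enum has many more variants):

```rust
#[derive(Debug)]
enum ArrowError { Compute(String) } // stand-in for arrow's real error type

#[derive(Debug)]
enum DataFusionError {
    // A Rust enum is as large as its largest variant, so an unboxed
    // ArrowError(ArrowError) inflates every DataFusionError that gets
    // moved or returned. Boxing keeps the enum itself small.
    ArrowError(Box<ArrowError>),
    Plan(String),
}

impl From<ArrowError> for DataFusionError {
    fn from(e: ArrowError) -> Self {
        DataFusionError::ArrowError(Box::new(e))
    }
}
```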
…on and Mapping (apache#16583)

- Introduced a new `schema_adapter_factory` field in `ListingTableConfig` and `ListingTable`
- Added `with_schema_adapter_factory` and `schema_adapter_factory()` methods to both structs
- Modified execution planning logic to apply schema adapters during scanning
- Updated statistics collection to use mapped schemas
- Implemented detailed documentation and example usage in doc comments
- Added new unit and integration tests validating schema adapter behavior and error cases
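A self-contained sketch of the builder pattern this describes (hedged: stand-in types, not the real DataFusion definitions):

```rust
use std::sync::Arc;

trait SchemaAdapterFactory {} // stand-in: creates per-file schema adapters

#[derive(Default)]
struct ListingTableConfig {
    // ...existing config fields elided...
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl ListingTableConfig {
    fn with_schema_adapter_factory(
        mut self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Self {
        self.schema_adapter_factory = Some(factory);
        self
    }

    fn schema_adapter_factory(&self) -> Option<&Arc<dyn SchemaAdapterFactory>> {
        self.schema_adapter_factory.as_ref()
    }
}
```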
* Reuse Rows in RowCursorStream

* WIP

* Fmt

* Add comment, make it backwards compatible

* Add comment, make it backwards compatible

* Add comment, make it backwards compatible

* Clippy

* Clippy

* Return error on non-unique reference

* Comment

* Update datafusion/physical-plan/src/sorts/stream.rs

Co-authored-by: Oleks V <[email protected]>

* Fix

* Extract logic

* Doc fix

---------

Co-authored-by: Oleks V <[email protected]>
apache#16630)

* Perf: fast CursorValues compare for StringViewArray using inline_key_fast

* fix

* polish

* polish

* add test

---------

Co-authored-by: Daniël Heres <[email protected]>
One step towards apache#16652.

Co-authored-by: Oleks V <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Daniël Heres <[email protected]>
* Refactor StreamJoinMetrics to reuse BaselineMetrics

Signed-off-by: Alan Tang <[email protected]>

* use the record_poll method to update output rows

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
* Remove unused AggregateUDF struct

* Fix docs

---------

Co-authored-by: Andrew Lamb <[email protected]>
…che#16500)

* chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics`

Closes apache#16495

Here's an example of an `explain analyze` of a hash join showing these metrics:
```
[(WatchID@0, WatchID@0)], metrics=[output_rows=100, elapsed_compute=2.313624ms, build_input_batches=1, build_input_rows=100, input_batches=1, input_rows=100, output_batches=1, build_mem_used=3688, build_time=865.832µs, join_time=1.369875ms]
```

Notice `output_rows=100, elapsed_compute=2.313624ms` in the above.

* test: add checks for join metrics in tests

* fix: add record_poll to ExhaustedProbeSide for nested_loop_join

This was needed because ExhaustedProbeSide state can also return output
rows - in certain types of joins. Without this, the output_rows metric
for nested loop join was wrong!
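A hedged sketch of why routing all output through one metrics choke point fixes this (signatures simplified; the real `BaselineMetrics::record_poll` wraps `Poll<Option<Result<RecordBatch>>>`):

```rust
struct BaselineMetrics {
    output_rows: usize,
    // elapsed_compute timer etc. elided
}

impl BaselineMetrics {
    // Every batch the operator emits funnels through here, so
    // output_rows stays correct regardless of which internal state
    // produced it -- including states like ExhaustedProbeSide that can
    // also emit rows for certain join types.
    fn record_output(&mut self, num_rows: usize) {
        self.output_rows += num_rows;
    }
}
```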
* Use compression type in file suffixes

- Add FileFormat::compression_type method
- Specify meaningful values for CSV only
- Use compression type as part of the extension for files (see the sketch below)
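Roughly, the compression type contributes a suffix to the file extension (hedged sketch with stand-in names; the real `FileCompressionType` has more variants):

```rust
enum FileCompressionType { Uncompressed, Gzip, Zstd } // stand-in variants

fn file_extension(base: &str, compression: &FileCompressionType) -> String {
    match compression {
        FileCompressionType::Uncompressed => base.to_string(),
        FileCompressionType::Gzip => format!("{base}.gz"),
        FileCompressionType::Zstd => format!("{base}.zst"),
    }
}

fn main() {
    // Written CSV files now pick up the compression suffix:
    assert_eq!(file_extension("csv", &FileCompressionType::Gzip), "csv.gz");
}
```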

* Add CSV tests

* Add glob dep, use env logging

* Use a glob pattern with compression suffix for TableProviderFactory

* Conform to clippy standards

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Refactor SortMergeJoinMetrics to reuse BaselineMetrics

Signed-off-by: Alan Tang <[email protected]>

* use record_poll method to update output_rows

Signed-off-by: Alan Tang <[email protected]>

* chore: Replace replace_poll with replace_output

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
* Add support for Arrow Dictionary type in Substrait

This commit adds support for the Arrow Dictionary type in Substrait
plans.

Resolves apache#16273

* Add more specific type variation consts
* fix sqllogictest condition mismatch

* Update test file condition

* revert changes in sqllogictests

---------

Co-authored-by: Leon Lin <[email protected]>
…ring physical planning (apache#16454)

* Fix duplicates on Join creation during physical planning

* Add Substrait reproducer

* Better error message & more doc

* Handle case for right/left/full joins as well
---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.46.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add reproducer for tpch Q16 deserialization bug

* Add small Parquet file with 20 rows from part table for testing

* Apply suggestions from code review

Co-authored-by: Andrew Lamb <[email protected]>

* Make the test fail and ignore it until the bug is fixed per review comments

* fix clippy

---------

Co-authored-by: Andrew Lamb <[email protected]>
This commit refactors the filter pushdown infrastructure to improve flexibility, readability, and maintainability:

### Major Changes:
- **Eliminated `PredicateSupports`** wrapper in favor of directly using `Vec<PredicateSupport>`, simplifying APIs.
- Introduced **`ChildFilterDescription::from_child`** to encapsulate logic for determining filter pushdown eligibility per child.
- Added **`FilterDescription::from_children`** to generate pushdown plans based on column references across all children.
- Replaced legacy methods (`all_parent_filters_supported`, etc.) with more flexible, composable APIs using builder-style chaining.
- Updated all relevant nodes (`FilterExec`, `SortExec`, `RepartitionExec`, etc.) to use the new pushdown planning structure.

### Functional Adjustments:
- Ensured filter column indices are reassigned properly when filters are pushed to projected inputs (e.g., in `FilterExec`).
- Standardized handling of supported vs. unsupported filters throughout the propagation pipeline.
- Improved handling of self-filters in nodes such as `FilterExec` and `SortExec`.

### Optimizer Improvements:
- Clarified pushdown phases (`Pre`, `Post`) and respected them across execution plans.
- Documented the full pushdown lifecycle within `filter_pushdown.rs`, improving discoverability for future contributors.

These changes lay the groundwork for more precise and flexible filter pushdown optimizations and improve the robustness of the optimizer infrastructure.
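For orientation, a hedged sketch of the shapes described above (field and variant names are illustrative, not the exact DataFusion definitions):

```rust
type PhysicalExprRef = String; // stand-in for Arc<dyn PhysicalExpr>

// Replaces the old PredicateSupports wrapper: plans now pass
// Vec<PredicateSupport> around directly.
enum PredicateSupport {
    Supported(PhysicalExprRef),
    Unsupported(PhysicalExprRef),
}

struct ChildFilterDescription {
    // For each parent filter: can this child absorb it?
    parent_filters: Vec<PredicateSupport>,
    // Filters this node itself wants to push into the child.
    self_filters: Vec<PhysicalExprRef>,
}

struct FilterDescription {
    // One entry per child, built by checking which filter columns each
    // child actually provides (the role of FilterDescription::from_children).
    child_filters: Vec<ChildFilterDescription>,
}
```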
adriangb and others added 22 commits July 24, 2025 15:21
…6848)

* feat(spark): Implement Spark luhn_check function

Signed-off-by: Alan Tang <[email protected]>

* test(spark): add more tests

Signed-off-by: Alan Tang <[email protected]>

* feat(spark): set the signature to be Utf8 type

Signed-off-by: Alan Tang <[email protected]>

* chore: add more types for luhn_check function

Signed-off-by: Alan Tang <[email protected]>

* test: add test for array input

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
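For reference, the Luhn check these commits implement is the standard credit-card checksum. A minimal plain-Rust sketch of the algorithm itself (not the DataFusion UDF wiring):

```rust
// Double every second digit from the right, subtract 9 from doubles
// above 9, and accept when the sum is divisible by 10.
fn luhn_check(s: &str) -> bool {
    let Some(digits) = s.chars().map(|c| c.to_digit(10)).collect::<Option<Vec<_>>>()
    else {
        return false; // non-digit input fails the check
    };
    if digits.is_empty() {
        return false;
    }
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

fn main() {
    assert!(luhn_check("79927398713")); // classic valid example
    assert!(!luhn_check("79927398710"));
}
```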
Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.8.2 to 1.8.3.
- [Release notes](https://github.com/smithy-lang/smithy-rs/releases)
- [Changelog](https://github.com/smithy-lang/smithy-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/smithy-lang/smithy-rs/commits)

---
updated-dependencies:
- dependency-name: aws-config
  dependency-version: 1.8.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Derive UDF equality from PartialEq, Hash

Reduce boilerplate in cases where implementation of
`{ScalarUDFImpl,AggregateUDFImpl,WindowUDFImpl}::{equals,hash_code}` can
be derived using standard `PartialEq` and `Hash` traits.

This is code complexity reduction.

While valuable on its own, this also prepares for more automatic
derivation of UDF equals/hash and/or removal of default implementations
(which currently are error-prone).
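A hedged sketch of the boilerplate this removes (simplified from the real `equals`/`hash_value` trait methods, which downcast `&dyn Any` before comparing):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// With derived PartialEq/Hash, the field-by-field comparison falls out
// of the standard traits instead of being hand-written per UDF.
#[derive(Debug, PartialEq, Hash)]
struct MyUdf {
    aliases: Vec<String>,
}

fn equals(a: &MyUdf, b: &MyUdf) -> bool {
    a == b // derived PartialEq compares all fields
}

fn hash_value(udf: &MyUdf) -> u64 {
    let mut hasher = DefaultHasher::new();
    udf.hash(&mut hasher); // derived Hash covers all fields
    hasher.finish()
}
```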

* udf_equals_hash example

* test udf_equals_hash

* empty: roll the dice 🎲
…che#16857)

* Handle expression and value elements in Substrait VirtualTable

* Added test

* Modified test plan, changed conditional check for clarity

* Consume expressions with empty input schema
* Update common.rs

* Update common.rs
* feat(spark): implement Spark datetime function last_day

Signed-off-by: Alan Tang <[email protected]>

* chore: fix the export function name

Signed-off-by: Alan Tang <[email protected]>

* chore: Fix Cargo.toml formatting

Signed-off-by: Alan Tang <[email protected]>

* test: add more tests for spark function last_day

Signed-off-by: Alan Tang <[email protected]>

* feat(spark): set the signature to be taking exactly one Date32 type

Signed-off-by: Alan Tang <[email protected]>

* test(spark): add more bad cases

Signed-off-by: Alan Tang <[email protected]>

* chore: clean up redundant package

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
* def-min-max

* Update mod.rs
* Update interval_arithmetic.rs

* Update interval_arithmetic.rs

* Update interval_arithmetic.rs
…for `Decimal128` and `Decimal256` (apache#16831)

* Add missing ScalarValue impls for large decimals

Add methods distance, new_zero, new_one, new_ten for Decimal128,
Decimal256

* Support expr simplification for Decimal256

* Replace lookup table with i128::pow

* Support different scales for Decimal constructors

- Allow constructing one and ten with different scales
- Add tests for new_one, new_ten
- Add test for distance

* Revert "Replace lookup table with i128::pow"

This reverts commit ba23e8c.

* Use Arrow error directly
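For context, a decimal's integer representation stores value × 10^scale, which is why "one" and "ten" depend on the target scale. A hedged sketch in plain i128 arithmetic (per the revert above, the actual code favors a lookup table over `pow`):

```rust
// Decimal128 stores value * 10^scale as an i128.
fn new_one(scale: u32) -> i128 {
    10i128.pow(scale)
}

fn new_ten(scale: u32) -> i128 {
    10i128.pow(scale + 1)
}

fn main() {
    assert_eq!(new_one(2), 100); // represents 1.00
    assert_eq!(new_ten(2), 1000); // represents 10.00
}
```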
* Update value.rs

* Update value.rs

* Update value.rs

* Update datafusion/physical-plan/src/metrics/value.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/physical-plan/src/metrics/value.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/physical-plan/src/metrics/value.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Update partial_sort.rs

* Update partial_sort.rs

* Update partial_sort.rs

* add sql test
* fix: skip predicates on struct unnest in FilterPushdown

* doc comments

* fix
…ke_async_with_args to remove ConfigOptions arg.
alamb pushed a commit that referenced this pull request Jul 28, 2025
* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern
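A hedged sketch of the filter / remap pattern such a helper consolidates (illustrative only, not the actual DataFusion type):

```rust
// Keep surviving items together with their original positions so
// results computed on the filtered set can be mapped back.
struct FilteredVec<T> {
    items: Vec<T>,
    original_indices: Vec<usize>,
}

impl<T> FilteredVec<T> {
    fn new(items: Vec<T>, keep: impl Fn(&T) -> bool) -> Self {
        let mut kept = Vec::new();
        let mut original_indices = Vec::new();
        for (i, item) in items.into_iter().enumerate() {
            if keep(&item) {
                kept.push(item);
                original_indices.push(i);
            }
        }
        Self { items: kept, original_indices }
    }

    fn items(&self) -> &[T] {
        &self.items
    }

    // Map a position in the filtered set back to its original index.
    fn original_index(&self, filtered_pos: usize) -> usize {
        self.original_indices[filtered_pos]
    }
}
```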
@Omega359
Author

I am not sure why GitHub decided that merging in main should be in the commit list, but the changes I made were in 547bcac

Note that I realized today that there are no tests covering the actual new functionality, which I'll add this week

@alamb
Owner

alamb commented Jul 28, 2025

(note this appears to be a PR onto my fork -- we should probably rebase / retarget to main in the apache repo)

@Omega359
Author

> (note this appears to be a PR onto my fork -- we should probably rebase / retarget to main in the apache repo)

Yup. I don't know how to add to an existing PR in GitHub (since I am not a committer, I doubt I could anyway). The best I know how to do is either submit a PR with both of our commits in it, open another PR that requires yours to be approved first, or have you rebase your branch and pull in my commit. If there is another option you would prefer, let me know

@alamb
Owner

alamb commented Jul 28, 2025

> (note this appears to be a PR onto my fork -- we should probably rebase / retarget to main in the apache repo)

> Yup. I don't know how to add to an existing PR in GitHub (since I am not a committer, I doubt I could anyway).

If you are a committer to the repo in question, you can just push commits to the remote fork (if the PR author has clicked the box "allow maintainers to edit PR")

> The best I know how to do is either submit a PR with both of our commits in it, open another PR that requires yours to be approved first, or have you rebase your branch and pull in my commit. If there is another option you would prefer, let me know

I personally suggest you make a new PR on the apache repo that has whatever commits are needed, and we'll close up these POCs

@Omega359
Author

Submitted PR to datafusion - apache#16970

Omega359 closed this Jul 29, 2025
alamb added a commit that referenced this pull request Jul 31, 2025
* disallow pushdown of volatile PhysicalExprs

* fix

* add FilteredVec helper to handle filter / remap pattern (#34)

* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern

* lint

* Add slt test for pushing volatile predicates down (apache#35)

---------

Co-authored-by: Andrew Lamb <[email protected]>