
Conversation

Omega359

Wire up config options using your POC. Tests pass locally.

Two main points:

  • AsyncScalarUDF.invoke_with_args breaking change - removes the config_options arg (sketched below).
  • ffi/udf/mod.rs - two TODOs in there related to adding ConfigOptions to ScalarFunctionArgs. I have no clue how this should work, but I expect what is stubbed in there isn't it.
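For illustration, a rough sketch of the shape of that signature change (hedged: simplified stand-in types, not the exact DataFusion API):

```rust
// Hedged sketch only: stand-ins for the real DataFusion types.
use std::sync::Arc;

struct ConfigOptions;
struct ColumnarValue;
struct ScalarFunctionArgs {
    // ...other fields elided...
    config_options: Arc<ConfigOptions>, // now carried inside the args
}

trait AsyncScalarUDF {
    // Before (removed): the options were a separate parameter.
    // fn invoke_with_args(&self, args: ScalarFunctionArgs,
    //                     config_options: &ConfigOptions) -> ColumnarValue;

    // After: the options ride along inside `args`.
    fn invoke_with_args(&self, args: ScalarFunctionArgs) -> ColumnarValue;
}
```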

zhuqi-lucas-001 and others added 30 commits July 3, 2025 14:32
Bumps [tokio](https://github.com/tokio-rs/tokio) from 1.45.1 to 1.46.0.
- [Release notes](https://github.com/tokio-rs/tokio/releases)
- [Commits](tokio-rs/tokio@tokio-1.45.1...tokio-1.46.0)

---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.46.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
…16615)

* Convert Option<Vec<sort expression>> to Vec<sort expression>

* clippy

* fix comment

* fix doc

* change back to Expr

* remove redundant check

* Improve error message when ScalarValue fails to cast array

The `as_*_array` functions and the `downcast_value!` have the benefit of
reporting the array type when there is a mismatch. This makes the error
message more actionable.
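A small illustration of the difference (hedged: the error wording is approximate, and `as_int32_array` is assumed to live in `datafusion_common::cast`):

```rust
use std::sync::Arc;
use arrow::array::{Array, ArrayRef, Int32Array, StringArray};

fn main() {
    let array: ArrayRef = Arc::new(StringArray::from(vec!["a", "b"]));

    // A raw downcast only says "no" on a type mismatch -- no context:
    assert!(array.as_any().downcast_ref::<Int32Array>().is_none());

    // The `as_*_array` helpers / `downcast_value!` instead return an Err
    // that names the actual type, roughly "Expected an Int32Array, got
    // StringArray", which is what makes the cast failure actionable:
    // let ints = datafusion_common::cast::as_int32_array(&array)?;
}
```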

* test
* Add an example of embedding indexes inside a parquet file

* Add page image

* Add prune file example

* Fix clippy

* polish code

* Fmt

* address comments

* Add debug

* Add new example, but it will fail with page index

* add debug

* add debug

* polish

* debug

* Using low level API to support

* polish

* fix

* merge

* fix

* complete solution

* polish comments

* adjust image

* add comments part 1

* pin to new arrow-rs

* pin to new arrow-rs

* add comments part 2

* merge upstream

* merge upstream

* polish code

* Rename example and add it to the list

* Work on comments

* More documentation

* Documentation obsession, encapsulate example

* Update datafusion-examples/examples/parquet_embedded_index.rs

Co-authored-by: Sherin Jacob <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Sherin Jacob <[email protected]>
* Implementation for regex_instr

* linting and typo addressed in bench

* prettier formatting

* scalar_functions_formatting

* linting format macros

* formatting

* address comments to PR

* formatting

* clippy

* fmt

* address docs typo

* remove unnecessary struct and comment

* delete redundant lines
add tests for subexp
correct function signature for benches

* refactor get_index

* comments addressed

* update doc

* clippy upgrade

---------

Co-authored-by: Nirnay Roy <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Dmitrii Blaginin <[email protected]>
…nts (apache#16672)

- Refactored the `DataFusionError` enum to use `Box<T>` for:
  - `ArrowError`
  - `ParquetError`
  - `AvroError`
  - `object_store::Error`
  - `ParserError`
  - `SchemaError`
  - `JoinError`
- Updated all relevant match arms and constructors to handle boxed errors.
- Refactored error-related macros (`arrow_datafusion_err!`, `sql_datafusion_err!`, etc.) to use `Box<T>`.
- Adjusted test cases and error assertions for boxed variants.
- Documentation update to the upgrade guide to explain the required changes and rationale.
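A minimal sketch of the shape of this change (hedged: stand-in types, and the real enum has many more variants):

```rust
#[derive(Debug)]
enum ArrowError { Compute(String) } // stand-in for arrow's real error type

#[derive(Debug)]
enum DataFusionError {
    // A Rust enum is as large as its largest variant, so an unboxed
    // ArrowError(ArrowError) inflates every DataFusionError that gets
    // moved or returned. Boxing keeps the enum itself small.
    ArrowError(Box<ArrowError>),
    Plan(String),
}

impl From<ArrowError> for DataFusionError {
    fn from(e: ArrowError) -> Self {
        DataFusionError::ArrowError(Box::new(e))
    }
}
```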
…on and Mapping (apache#16583)

- Introduced a new `schema_adapter_factory` field in `ListingTableConfig` and `ListingTable`
- Added `with_schema_adapter_factory` and `schema_adapter_factory()` methods to both structs
- Modified execution planning logic to apply schema adapters during scanning
- Updated statistics collection to use mapped schemas
- Implemented detailed documentation and example usage in doc comments
- Added new unit and integration tests validating schema adapter behavior and error cases
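A self-contained sketch of the builder pattern this describes (hedged: stand-in types, not the real DataFusion definitions):

```rust
use std::sync::Arc;

trait SchemaAdapterFactory {} // stand-in: creates per-file schema adapters

#[derive(Default)]
struct ListingTableConfig {
    // ...existing config fields elided...
    schema_adapter_factory: Option<Arc<dyn SchemaAdapterFactory>>,
}

impl ListingTableConfig {
    fn with_schema_adapter_factory(
        mut self,
        factory: Arc<dyn SchemaAdapterFactory>,
    ) -> Self {
        self.schema_adapter_factory = Some(factory);
        self
    }

    fn schema_adapter_factory(&self) -> Option<&Arc<dyn SchemaAdapterFactory>> {
        self.schema_adapter_factory.as_ref()
    }
}
```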
* Reuse Rows in RowCursorStream

* WIP

* Fmt

* Add comment, make it backwards compatible

* Add comment, make it backwards compatible

* Add comment, make it backwards compatible

* Clippy

* Clippy

* Return error on non-unique reference

* Comment

* Update datafusion/physical-plan/src/sorts/stream.rs

Co-authored-by: Oleks V <[email protected]>

* Fix

* Extract logic

* Doc fix

---------

Co-authored-by: Oleks V <[email protected]>
apache#16630)

* Perf: fast CursorValues compare for StringViewArray using inline_key_fast

* fix

* polish

* polish

* add test

---------

Co-authored-by: Daniël Heres <[email protected]>
One step towards apache#16652.

Co-authored-by: Oleks V <[email protected]>
Co-authored-by: Andrew Lamb <[email protected]>
Co-authored-by: Daniël Heres <[email protected]>
* Refactor StreamJoinMetrics to reuse BaselineMetrics

Signed-off-by: Alan Tang <[email protected]>

* use the record_poll method to update output rows

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
* Remove unused AggregateUDF struct

* Fix docs

---------

Co-authored-by: Andrew Lamb <[email protected]>
…che#16500)

* chore: refactor `BuildProbeJoinMetrics` to use `BaselineMetrics`

Closes apache#16495

Here's an example of an `explain analyze` of a hash join showing these metrics:
```
[(WatchID@0, WatchID@0)], metrics=[output_rows=100, elapsed_compute=2.313624ms, build_input_batches=1, build_input_rows=100, input_batches=1, input_rows=100, output_batches=1, build_mem_used=3688, build_time=865.832µs, join_time=1.369875ms]
```

Notice `output_rows=100, elapsed_compute=2.313624ms` in the above.

* test: add checks for join metrics in tests

* fix: add record_poll to ExhaustedProbeSide for nested_loop_join

This was needed because ExhaustedProbeSide state can also return output
rows - in certain types of joins. Without this, the output_rows metric
for nested loop join was wrong!
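A hedged sketch of why routing all output through one metrics choke point fixes this (signatures simplified; the real `BaselineMetrics::record_poll` wraps `Poll<Option<Result<RecordBatch>>>`):

```rust
struct BaselineMetrics {
    output_rows: usize,
    // elapsed_compute timer etc. elided
}

impl BaselineMetrics {
    // Every batch the operator emits funnels through here, so
    // output_rows stays correct regardless of which internal state
    // produced it -- including states like ExhaustedProbeSide that can
    // also emit rows for certain join types.
    fn record_output(&mut self, num_rows: usize) {
        self.output_rows += num_rows;
    }
}
```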
* Use compression type in file suffixes

- Add FileFormat::compression_type method
- Specify meaningful values for CSV only
- Use compression type as part of the extension for files (see the sketch below)
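Roughly, the compression type contributes a suffix to the file extension (hedged sketch with stand-in names; the real `FileCompressionType` has more variants):

```rust
enum FileCompressionType { Uncompressed, Gzip, Zstd } // stand-in variants

fn file_extension(base: &str, compression: &FileCompressionType) -> String {
    match compression {
        FileCompressionType::Uncompressed => base.to_string(),
        FileCompressionType::Gzip => format!("{base}.gz"),
        FileCompressionType::Zstd => format!("{base}.zst"),
    }
}

fn main() {
    // Written CSV files now pick up the compression suffix:
    assert_eq!(file_extension("csv", &FileCompressionType::Gzip), "csv.gz");
}
```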

* Add CSV tests

* Add glob dep, use env logging

* Use a glob pattern with compression suffix for TableProviderFactory

* Conform to clippy standards

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Refactor SortMergeJoinMetrics to reuse BaselineMetrics

Signed-off-by: Alan Tang <[email protected]>

* use record_poll method to update output_rows

Signed-off-by: Alan Tang <[email protected]>

* chore: Replace replace_poll with replace_output

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
* Add support for Arrow Dictionary type in Substrait

This commit adds support for the Arrow Dictionary type in Substrait
plans.

Resolves apache#16273

* Add more specific type variation consts
* fix sqllogictest condition mismatch

* Update test file condition

* revert changes in sqllogictests

---------

Co-authored-by: Leon Lin <[email protected]>
…ring physical planning (apache#16454)

* Fix duplicates on Join creation during physical planning

* Add Substrait reproducer

* Better error message & more doc

* Handle case for right/left/full joins as well
---
updated-dependencies:
- dependency-name: tokio
  dependency-version: 1.46.1
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Add reproducer for tpch Q16 deserialization bug

* Add small Parquet file with 20 rows from part table for testing

* Apply suggestions from code review

Co-authored-by: Andrew Lamb <[email protected]>

* Make the test fail and ignore it until the bug is fixed per review comments

* fix clippy

---------

Co-authored-by: Andrew Lamb <[email protected]>
This commit refactors the filter pushdown infrastructure to improve flexibility, readability, and maintainability:

### Major Changes:
- **Eliminated `PredicateSupports`** wrapper in favor of directly using `Vec<PredicateSupport>`, simplifying APIs.
- Introduced **`ChildFilterDescription::from_child`** to encapsulate logic for determining filter pushdown eligibility per child.
- Added **`FilterDescription::from_children`** to generate pushdown plans based on column references across all children.
- Replaced legacy methods (`all_parent_filters_supported`, etc.) with more flexible, composable APIs using builder-style chaining.
- Updated all relevant nodes (`FilterExec`, `SortExec`, `RepartitionExec`, etc.) to use the new pushdown planning structure.

### Functional Adjustments:
- Ensured filter column indices are reassigned properly when filters are pushed to projected inputs (e.g., in `FilterExec`).
- Standardized handling of supported vs. unsupported filters throughout the propagation pipeline.
- Improved handling of self-filters in nodes such as `FilterExec` and `SortExec`.

### Optimizer Improvements:
- Clarified pushdown phases (`Pre`, `Post`) and respected them across execution plans.
- Documented the full pushdown lifecycle within `filter_pushdown.rs`, improving discoverability for future contributors.

These changes lay the groundwork for more precise and flexible filter pushdown optimizations and improve the robustness of the optimizer infrastructure.
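For orientation, a hedged sketch of the shapes described above (field and variant names are illustrative, not the exact DataFusion definitions):

```rust
type PhysicalExprRef = String; // stand-in for Arc<dyn PhysicalExpr>

// Replaces the old PredicateSupports wrapper: plans now pass
// Vec<PredicateSupport> around directly.
enum PredicateSupport {
    Supported(PhysicalExprRef),
    Unsupported(PhysicalExprRef),
}

struct ChildFilterDescription {
    // For each parent filter: can this child absorb it?
    parent_filters: Vec<PredicateSupport>,
    // Filters this node itself wants to push into the child.
    self_filters: Vec<PhysicalExprRef>,
}

struct FilterDescription {
    // One entry per child, built by checking which filter columns each
    // child actually provides (the role of FilterDescription::from_children).
    child_filters: Vec<ChildFilterDescription>,
}
```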
adriangb and others added 22 commits July 24, 2025 15:21
…6848)

* feat(spark): Implement Spark luhn_check function

Signed-off-by: Alan Tang <[email protected]>

* test(spark): add more tests

Signed-off-by: Alan Tang <[email protected]>

* feat(spark): set the signature to be Utf8 type

Signed-off-by: Alan Tang <[email protected]>

* chore: add more types for luhn_check function

Signed-off-by: Alan Tang <[email protected]>

* test: add test for array input

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
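For reference, the Luhn check these commits implement is the standard credit-card checksum. A minimal plain-Rust sketch of the algorithm itself (not the DataFusion UDF wiring):

```rust
// Double every second digit from the right, subtract 9 from doubles
// above 9, and accept when the sum is divisible by 10.
fn luhn_check(s: &str) -> bool {
    let Some(digits) = s.chars().map(|c| c.to_digit(10)).collect::<Option<Vec<_>>>()
    else {
        return false; // non-digit input fails the check
    };
    if digits.is_empty() {
        return false;
    }
    let sum: u32 = digits
        .iter()
        .rev()
        .enumerate()
        .map(|(i, &d)| {
            if i % 2 == 1 {
                let doubled = d * 2;
                if doubled > 9 { doubled - 9 } else { doubled }
            } else {
                d
            }
        })
        .sum();
    sum % 10 == 0
}

fn main() {
    assert!(luhn_check("79927398713")); // classic valid example
    assert!(!luhn_check("79927398710"));
}
```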
Bumps [aws-config](https://github.com/smithy-lang/smithy-rs) from 1.8.2 to 1.8.3.
- [Release notes](https://github.com/smithy-lang/smithy-rs/releases)
- [Changelog](https://github.com/smithy-lang/smithy-rs/blob/main/CHANGELOG.md)
- [Commits](https://github.com/smithy-lang/smithy-rs/commits)

---
updated-dependencies:
- dependency-name: aws-config
  dependency-version: 1.8.3
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <[email protected]>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
* Derive UDF equality from PartialEq, Hash

Reduce boilerplate in cases where implementation of
`{ScalarUDFImpl,AggregateUDFImpl,WindowUDFImpl}::{equals,hash_code}` can
be derived using standard `PartialEq` and `Hash` traits.

This is code complexity reduction.

While valuable on its own, this also prepares for more automatic
derivation of UDF equals/hash and/or removal of default implementations
(which currently are error-prone).
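A hedged sketch of the boilerplate this removes (simplified from the real `equals`/`hash_value` trait methods, which downcast `&dyn Any` before comparing):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// With derived PartialEq/Hash, the field-by-field comparison falls out
// of the standard traits instead of being hand-written per UDF.
#[derive(Debug, PartialEq, Hash)]
struct MyUdf {
    aliases: Vec<String>,
}

fn equals(a: &MyUdf, b: &MyUdf) -> bool {
    a == b // derived PartialEq compares all fields
}

fn hash_value(udf: &MyUdf) -> u64 {
    let mut hasher = DefaultHasher::new();
    udf.hash(&mut hasher); // derived Hash covers all fields
    hasher.finish()
}
```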

* udf_equals_hash example

* test udf_equals_hash

* empty: roll the dice 🎲
…che#16857)

* Handle expression and value elements in Substrait VirtualTable

* Added test

* Modified test plan, changed conditional check for clarity

* Consume expressions with empty input schema
* Update common.rs

* Update common.rs
* feat(spark): implement Spark datetime function last_day

Signed-off-by: Alan Tang <[email protected]>

* chore: fix the export function name

Signed-off-by: Alan Tang <[email protected]>

* chore: Fix Cargo.toml formatting

Signed-off-by: Alan Tang <[email protected]>

* test: add more tests for spark function last_day

Signed-off-by: Alan Tang <[email protected]>

* feat(spark): set the signature to be taking exactly one Date32 type

Signed-off-by: Alan Tang <[email protected]>

* test(spark): add more bad cases

Signed-off-by: Alan Tang <[email protected]>

* chore: clean up redundant package

Signed-off-by: Alan Tang <[email protected]>

---------

Signed-off-by: Alan Tang <[email protected]>
* def-min-max

* Update mod.rs
* Update interval_arithmetic.rs

* Update interval_arithmetic.rs

* Update interval_arithmetic.rs
…for `Decimal128` and `Decimal256` (apache#16831)

* Add missing ScalarValue impls for large decimals

Add methods distance, new_zero, new_one, new_ten for Decimal128,
Decimal256

* Support expr simplification for Decimal256

* Replace lookup table with i128::pow

* Support different scales for Decimal constructors

- Allow constructing one and ten with different scales
- Add tests for new_one, new_ten
- Add test for distance

* Revert "Replace lookup table with i128::pow"

This reverts commit ba23e8c.

* Use Arrow error directly
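For context, a decimal's integer representation stores value × 10^scale, which is why "one" and "ten" depend on the target scale. A hedged sketch in plain i128 arithmetic (per the revert above, the actual code favors a lookup table over `pow`):

```rust
// Decimal128 stores value * 10^scale as an i128.
fn new_one(scale: u32) -> i128 {
    10i128.pow(scale)
}

fn new_ten(scale: u32) -> i128 {
    10i128.pow(scale + 1)
}

fn main() {
    assert_eq!(new_one(2), 100); // represents 1.00
    assert_eq!(new_ten(2), 1000); // represents 10.00
}
```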
* Update value.rs

* Update value.rs

* Update value.rs

* Update datafusion/physical-plan/src/metrics/value.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/physical-plan/src/metrics/value.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update datafusion/physical-plan/src/metrics/value.rs

Co-authored-by: Andrew Lamb <[email protected]>

---------

Co-authored-by: Andrew Lamb <[email protected]>
* Update partial_sort.rs

* Update partial_sort.rs

* Update partial_sort.rs

* add sql test
* fix: skip predicates on struct unnest in FilterPushdown

* doc comments

* fix
…ke_async_with_args to remove ConfigOptions arg.
alamb pushed a commit that referenced this pull request Jul 28, 2025
* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern
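A hedged sketch of the filter / remap pattern such a helper consolidates (illustrative only, not the actual DataFusion type):

```rust
// Keep surviving items together with their original positions so
// results computed on the filtered set can be mapped back.
struct FilteredVec<T> {
    items: Vec<T>,
    original_indices: Vec<usize>,
}

impl<T> FilteredVec<T> {
    fn new(items: Vec<T>, keep: impl Fn(&T) -> bool) -> Self {
        let mut kept = Vec::new();
        let mut original_indices = Vec::new();
        for (i, item) in items.into_iter().enumerate() {
            if keep(&item) {
                kept.push(item);
                original_indices.push(i);
            }
        }
        Self { items: kept, original_indices }
    }

    fn items(&self) -> &[T] {
        &self.items
    }

    // Map a position in the filtered set back to its original index.
    fn original_index(&self, filtered_pos: usize) -> usize {
        self.original_indices[filtered_pos]
    }
}
```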
@Omega359
Author

I am not sure why GitHub decided that merging in main should be in the commit list, but the changes I made were in 547bcac

Note that I realized today that there are no tests covering the actual new functionality, which I'll add this week

@alamb
Owner

alamb commented Jul 28, 2025

(note this appears to be a PR onto my fork -- we should probably rebase / retarget to main in the apache repo)

@Omega359
Author

> (note this appears to be a PR onto my fork -- we should probably rebase / retarget to main in the apache repo)

Yup. I don't know how to add to an existing PR in GitHub (since I am not a committer, I doubt I could anyway). The best I know how to do is either submit a PR with both of our commits in it, open another PR that requires yours to be approved first, or have you rebase your branch and pull in my commit. If there is another option you would prefer, let me know

@alamb
Owner

alamb commented Jul 28, 2025

> (note this appears to be a PR onto my fork -- we should probably rebase / retarget to main in the apache repo)

> Yup. I don't know how to add to an existing PR in GitHub (since I am not a committer, I doubt I could anyway).

If you are a committer to the repo in question, you can just push commits to the remote fork (if the PR author has clicked the box "allow maintainers to edit PR")

> The best I know how to do is either submit a PR with both of our commits in it, open another PR that requires yours to be approved first, or have you rebase your branch and pull in my commit. If there is another option you would prefer, let me know

I personally suggest you make a new PR on the apache repo that has whatever commits are needed, and we'll close up these POCs

@Omega359
Author

Submitted PR to datafusion - apache#16970

Omega359 closed this Jul 29, 2025
alamb added a commit that referenced this pull request Jul 31, 2025
* disallow pushdown of volatile PhysicalExprs

* fix

* add FilteredVec helper to handle filter / remap pattern (#34)

* checkpoint: Address PR feedback in https://github.com/apach...

* add FilteredVec to consolidate handling of filter / remap pattern

* lint

* Add slt test for pushing volatile predicates down (apache#35)

---------

Co-authored-by: Andrew Lamb <[email protected]>