@Omega359 Omega359 commented Jul 29, 2025

Which issue does this PR close?

Closes

This is a continuation of @alamb's PR

Relates to:

Alternative to:

Rationale for this change

Allow UDFs to access the DataFusion config so that their behaviour can change based on configuration. For example, this allows date and timestamp UDFs to use a timezone other than UTC, or allows date/timestamp parsing to follow ANSI behaviour when parsing fails.

What changes are included in this PR?

  • Add an Arc<ConfigOptions> to ExecutionProps when the execution is started
  • SessionState::config_options() now returns &Arc<ConfigOptions> instead of &ConfigOptions
  • Add ConfigOptions to ScalarFunctionArgs
  • Remove the config_options argument from AsyncScalarUdf.invoke_async_with_args(..)
  • Change the return value of OptimizerConfig.options() from &ConfigOptions to Arc<ConfigOptions>
  • Change the argument registry: &dyn FunctionRegistry to ctx: &SessionContext in a number of functions in the proto code
  • Add a new test
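
To illustrate the idea behind the list above, here is a minimal sketch of a function reading a shared config from its arguments. The types below are hypothetical stand-ins written for this example, not the real DataFusion `ConfigOptions` or `ScalarFunctionArgs`, and the `time_zone` field name is an assumption.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the session configuration (NOT the real DataFusion type).
#[derive(Debug, Default)]
struct ConfigOptions {
    // Assumed field for this sketch; the real option name and location may differ.
    time_zone: Option<String>,
}

// Hypothetical stand-in for the arguments handed to a scalar function.
struct ScalarFunctionArgs {
    // Simplified: the real struct carries columnar values, not plain integers.
    args: Vec<i64>,
    // Shared, read-only view of the session configuration.
    config_options: Arc<ConfigOptions>,
}

// A toy "UDF" that labels each input with the configured time zone,
// falling back to UTC when none is configured.
fn label_with_time_zone(args: &ScalarFunctionArgs) -> Vec<String> {
    let tz = args.config_options.time_zone.as_deref().unwrap_or("UTC");
    args.args.iter().map(|ts| format!("{ts} {tz}")).collect()
}

fn main() {
    let config = Arc::new(ConfigOptions {
        time_zone: Some("AEST".to_string()),
    });
    let args = ScalarFunctionArgs {
        args: vec![0, 60],
        // Sharing the config bumps a reference count; it does not copy the options.
        config_options: Arc::clone(&config),
    };
    println!("{:?}", label_with_time_zone(&args));
}
```

Passing the config as an `Arc` is what lets every function invocation see the same options without a per-call copy.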

Note that there are two TODOs in ffi/src/udf/mod.rs related to the serialization and deserialization of config options in ScalarFunctionArgs. This PR doesn't do any serialization or deserialization for that.

Are these changes tested?

Existing tests + a new test 'test_config_options_work_for_scalar_func' in user_defined_scalar_functions.rs

Are there any user-facing changes?

Yes, breaking changes as listed above.

alamb and others added 3 commits July 29, 2025 13:41
…ke_async_with_args to remove ConfigOptions arg.
…clone() calls. Fixed an issue with SimplifyExpressions.rewrite(..) not adding config options to execution_props. Added test to verify it works
@github-actions github-actions bot added documentation Improvements or additions to documentation logical-expr Logical plan and expressions physical-expr Changes to the physical-expr crates optimizer Optimizer rules core Core DataFusion crate common Related to common crate execution Related to the execution crate proto Related to proto crate functions Changes to functions implementation ffi Changes to the ffi crate physical-plan Changes to the physical-plan crate spark labels Jul 29, 2025
@Omega359 Omega359 changed the title Add ConfigOptions to ScalarFunctionArgs feat: Add ConfigOptions to ScalarFunctionArgs Jul 29, 2025

@comphead comphead left a comment

Thanks @Omega359, that looks very helpful and addresses the important problem that built-in functions could not be parametrized by DF runtime properties. I'm planning to review it today.

@comphead comphead left a comment

I tested this PR on a couple of functions, and the runtime config is passed properly to the internal invocation function.

However, it occurs to me that passing runtime properties to built-in SQL functions can be powerful but introduces several trade-offs worth considering.

Good:

It allows function behavior to be dynamically adapted based on the execution context (e.g., user locale, timezone, session settings, environment specifics).
This would be especially helpful for downstream projects, which currently have to copy DF functions with minor changes and keep that code in sync.

Not so good:
DF can end up with bloated code serving the purpose of each downstream project.

Reviewers will evidently need to verify that this mechanism is used properly.

     }
 }

 /// Specify whether to enable the filter_null_keys rule
 pub fn filter_null_keys(mut self, filter_null_keys: bool) -> Self {
-    self.options.optimizer.filter_null_join_keys = filter_null_keys;
+    Arc::make_mut(&mut self.options)

Using make_mut can cause an extra clone of the inner value if other Arcs already point to the same object: https://doc.rust-lang.org/std/sync/struct.Arc.html#method.make_mut

Yeah -- I think this is unavoidable if someone wants to modify the config options

I think it is better than what is on main as today every OptimizerConfig requires a clone of the ConfigOptions -- using make_mut means the code will now only clone the config when it actually needs to be modified
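
The clone-on-write behaviour discussed in this thread can be sketched with the standard library alone. `Options` below is a hypothetical stand-in for the config struct, and `set_flag_in_place` is a helper invented for this example; only `Arc::make_mut`, `Arc::as_ptr`, and `Arc::clone` are real std APIs.

```rust
use std::sync::Arc;

// Hypothetical stand-in for a config struct; make_mut requires Clone.
#[derive(Clone, Debug)]
struct Options {
    filter_null_join_keys: bool,
}

/// Sets the flag through `Arc::make_mut` and reports whether the Arc still
/// points at the same allocation afterwards (true means nothing was cloned).
fn set_flag_in_place(options: &mut Arc<Options>, value: bool) -> bool {
    let before = Arc::as_ptr(options);
    Arc::make_mut(options).filter_null_join_keys = value;
    before == Arc::as_ptr(options)
}

fn main() {
    // Uniquely owned Arc: make_mut mutates the existing value, no clone.
    let mut unique = Arc::new(Options { filter_null_join_keys: false });
    assert!(set_flag_in_place(&mut unique, true));

    // Shared Arc: make_mut first clones the inner value into a new allocation,
    // so the other handle keeps seeing the old, unmodified options.
    let mut shared = Arc::new(Options { filter_null_join_keys: false });
    let other = Arc::clone(&shared);
    assert!(!set_flag_in_place(&mut shared, true));
    assert!(shared.filter_null_join_keys);
    assert!(!other.filter_null_join_keys);
}
```

So the clone is paid only on modification, and only when the options are actually shared at that moment.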

@alamb alamb left a comment

Thank you @Omega359

I have a few API suggestions. I sketched the major one out here

If we go with that PR I'll add another note to the upgrade guide

I also pushed a commit about the change to async UDFs to the upgrade guide

let expected_schema = Schema::new(vec![Field::new("a", DataType::Utf8, false)]);
let expected = RecordBatch::try_new(
SchemaRef::from(expected_schema),
vec![create_array!(Utf8, ["AEST"])],

nice

alamb commented Jul 31, 2025

FYI @findepi (who I believe is away this week but may be interested in this PR)

Omega359 commented Aug 1, 2025

Good:

It allows function behavior to be dynamically adapted based on the execution context (e.g., user locale, timezone, session settings, environment specifics). Especially this would be increasingly helpful for downstream projects which have to copy DF function with minor changes and maintain that the code is in sync

Not so good: DF can end up with bloating code serving the purpose of each downstream project.

Apparently reviewers need to verify this mechanism is used properly

Indeed. Beyond runtime switching of behaviour for custom UDF's I foresee two main use cases for the core udf's - timezone and spark-like 'ansi' mode.

alamb commented Aug 4, 2025

Not so good: DF can end up with bloating code serving the purpose of each downstream project.
Apparently reviewers need to verify this mechanism is used properly

Indeed. Beyond runtime switching of behaviour for custom UDF's I foresee two main use cases for the core udf's - timezone and spark-like 'ansi' mode.

While I agree that this PR makes it easier to add more functionality (and thus bloat) to the core functions, I don't think it fundamentally changes the need for reviewers to help make that tradeoff when evaluating new features for inclusion

@alamb alamb changed the title feat: Add ConfigOptions to ScalarFunctionArgs feat: Add Arc<ConfigOptions> to ScalarFunctionArgs, Aug 4, 2025

comphead commented Aug 4, 2025

Are we good to merge the PR?

@alamb alamb changed the title feat: Add Arc<ConfigOptions> to ScalarFunctionArgs, feat: Add Arc<ConfigOptions> to ScalarFunctionArgs, don't copy ConfigOptions on each query Aug 4, 2025
@@ -68,9 +72,10 @@ impl ExecutionProps {

     /// Marks the execution of query started timestamp.
     /// This also instantiates a new alias generator.
-    pub fn start_execution(&mut self) -> &Self {
+    pub fn mark_start_execution(&mut self, config_options: Arc<ConfigOptions>) -> &Self {

I think this is technically a breaking API change. It would be nicer for downstream users if we kept the old function and deprecated it, so that the compiler tells them what is happening.

I pushed a commit to do that
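
A minimal sketch of that deprecation pattern, using hypothetical stand-in types rather than the real DataFusion ones. How the deprecated function obtains its config (here, a default value) is an assumption of this example, not necessarily what the pushed commit does.

```rust
use std::sync::Arc;

// Hypothetical stand-in for the real config type.
#[derive(Debug, Default)]
struct ConfigOptions;

// Hypothetical, simplified stand-in for the real ExecutionProps.
#[derive(Debug, Default)]
struct ExecutionProps {
    started: bool,
    config_options: Option<Arc<ConfigOptions>>,
}

impl ExecutionProps {
    /// Old entry point, kept so existing callers still compile but get a
    /// compiler warning that points them at the replacement.
    #[deprecated(note = "use mark_start_execution and pass the config options")]
    fn start_execution(&mut self) -> &Self {
        // Assumption for this sketch: fall back to default options.
        self.mark_start_execution(Arc::new(ConfigOptions::default()))
    }

    /// New entry point that also records the config options.
    fn mark_start_execution(&mut self, config_options: Arc<ConfigOptions>) -> &Self {
        self.started = true;
        self.config_options = Some(config_options);
        self
    }
}

// Simulates a downstream caller that has not migrated yet. Without the
// `allow`, this function would compile with a deprecation warning.
#[allow(deprecated)]
fn legacy_caller() -> bool {
    let mut props = ExecutionProps::default();
    let props = props.start_execution();
    props.started && props.config_options.is_some()
}

fn main() {
    let mut props = ExecutionProps::default();
    props.mark_start_execution(Arc::new(ConfigOptions));
    assert!(props.started);
    assert!(legacy_caller());
}
```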

alamb commented Aug 4, 2025

I touched up some more docs, added another note to the upgrade guide, and filed a ticket for the TODO items

I think this PR is ready to go and I plan to merge it tomorrow unless anyone else would like additional time to comment or review

alamb commented Aug 4, 2025

Are we good to merge the PR?

Sorry I was working on a few last touchups - I think we are good to merge it now, but was going to leave it open for one more day to gather any potential additional feedback. Perhaps this is not necessary

alamb commented Aug 5, 2025

🚀

@alamb alamb merged commit e4f16dd into apache:main Aug 5, 2025
28 checks passed

alamb commented Aug 5, 2025

Thanks again @Omega359

.config_options
.entries()
.iter()
.sorted_by(|&l, &r| l.key.cmp(&r.key))

I found this code while reviewing #17078

It seems to me that adding this sort of all config options to each scalar function when looking for equality is going to be quite expensive at planning time.

Can we potentially do a pointer comparison first before deciding we need to do a deep compare by value?

Within eq impl it's easy to do pointer comparison first, but what about Hash impl?
Is the assumption that eq is called during planning much more often than the hash?

@alamb alamb Aug 8, 2025

Within eq impl it's easy to do pointer comparison first, but what about Hash impl? Is the assumption that eq is called during planning much more often than the hash?

I think for the hash implementation we can just not hash the config_options -- while this might result in theoretically more hash collisions it will be way faster and still correct

I think for the hash implementation we can just not hash the config_options --

sgtm
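
The two ideas from this thread (a pointer comparison before the deep comparison, and leaving the config out of the hash) could look roughly like the sketch below. All types are hypothetical stand-ins, `deep_eq` and `hash_of` are helpers invented for this example, and this is not the code that was eventually merged.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};
use std::sync::Arc;

// Hypothetical stand-in for the config: a bag of key/value entries.
#[derive(Debug)]
struct ConfigOptions {
    entries: Vec<(String, String)>,
}

impl ConfigOptions {
    /// Expensive comparison: sorts both entry lists before comparing them.
    fn deep_eq(&self, other: &Self) -> bool {
        let mut left: Vec<_> = self.entries.iter().collect();
        let mut right: Vec<_> = other.entries.iter().collect();
        left.sort();
        right.sort();
        left == right
    }
}

// Hypothetical stand-in for an expression that carries the shared config.
#[derive(Debug)]
struct ScalarFunctionExpr {
    name: String,
    config_options: Arc<ConfigOptions>,
}

impl PartialEq for ScalarFunctionExpr {
    fn eq(&self, other: &Self) -> bool {
        self.name == other.name
            // Fast path: the same allocation is trivially equal, so the
            // deep comparison only runs when the pointers differ.
            && (Arc::ptr_eq(&self.config_options, &other.config_options)
                || self.config_options.deep_eq(&other.config_options))
    }
}

impl Eq for ScalarFunctionExpr {}

impl Hash for ScalarFunctionExpr {
    fn hash<H: Hasher>(&self, state: &mut H) {
        // config_options is deliberately not hashed. Equal values still hash
        // equally, so this is correct; it can only cause extra collisions.
        self.name.hash(state);
    }
}

fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

fn main() {
    let shared = Arc::new(ConfigOptions {
        entries: vec![("time_zone".to_string(), "AEST".to_string())],
    });
    let a = ScalarFunctionExpr { name: "f".to_string(), config_options: Arc::clone(&shared) };
    let b = ScalarFunctionExpr { name: "f".to_string(), config_options: shared };
    assert!(a == b); // decided by the pointer comparison alone
    assert_eq!(hash_of(&a), hash_of(&b));
}
```

The only requirement on `Hash` is that equal values produce equal hashes, which still holds when a field that participates in `eq` is left out of `hash`.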

hknlof pushed a commit to hknlof/datafusion that referenced this pull request Aug 20, 2025
…onfigOptions` on each query (apache#16970)

* Add `ConfigOptions` to ExecutionProps when execution is started

* Add ConfigOptions to ScalarFunctionArgs, refactor AsyncScalarUDF.invoke_async_with_args to remove ConfigOptions arg.

* Updated OptimizerConfig.options() -> Arc<ConfigOptions> to eliminate clone() calls. Fixed an issue with SimplifyExpressions.rewrite(..) not adding config options to execution_props. Added test to verify it works

* Test update.

* Add note in upgrade guide

* Use Arc all the way down

* start_execution -> mark_start_execution

* Update datafusion/expr/src/execution_props.rs

Co-authored-by: Andrew Lamb <[email protected]>

* Update comments

* Avoid API breakage via #deprecated

* Update upgrade guide for Arc<ConfigOptions> change

* Apply suggestions from code review

* fmt

---------

Co-authored-by: Andrew Lamb <[email protected]>