-
Notifications
You must be signed in to change notification settings - Fork 1.6k
feat: Add Arc<ConfigOptions>
to ScalarFunctionArgs
, don't copy ConfigOptions
on each query
#16970
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ke_async_with_args to remove ConfigOptions arg.
…clone() calls. Fixed an issue with SimplifyExpressions.rewrite(..) not adding config options to execution_props. Added test to verify it works
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @Omega359 that looks very helpful and addresses the important problem that builtin functions could not be parametrized by DF runtime properties. I'm planning to review it today
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested this PR on couple of functions and runtime config passes properly to the internal invocation function.
However what comes to my mind is passing runtime properties to built-in SQL functions can be powerful but introduces several trade-offs worth considering.
Good:
It allows function behavior to be dynamically adapted based on the execution context (e.g., user locale, timezone, session settings, environment specifics).
Especially this would be increasingly helpful for downstream projects which have to copy DF function with minor changes and maintain that the code is in sync
Not so good:
DF can end up with bloating code serving the purpose of each downstream project.
Apparently reviewers need to verify this mechanism is used properly
} | ||
} | ||
|
||
/// Specify whether to enable the filter_null_keys rule | ||
pub fn filter_null_keys(mut self, filter_null_keys: bool) -> Self { | ||
self.options.optimizer.filter_null_join_keys = filter_null_keys; | ||
Arc::make_mut(&mut self.options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using make_mut
can cause extra inner cloning if other Arcs already points to the object https://doc.rust-lang.org/std/sync/struct.Arc.html#method.make_mut
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- I think this is unavoidable if someone wants to modify the config options
I think it is better than what is on main as today every OptimizerConfig requires a clone of the ConfigOptions
-- using make_mut
means the code will now only clone the config when it actually needs to be modified
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @Omega359
I have a few API suggestions. I sketched the major one out here
If we go with that PR I'll add another note to the upgrade guide
I also pushed a commit about the change to async UDFs to the upgrade guide
let expected_schema = Schema::new(vec![Field::new("a", DataType::Utf8, false)]); | ||
let expected = RecordBatch::try_new( | ||
SchemaRef::from(expected_schema), | ||
vec![create_array!(Utf8, ["AEST"])], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nice
FYI @findepi (who I believe is away this week but may be interested in this PR) |
Co-authored-by: Andrew Lamb <[email protected]>
Indeed. Beyond runtime switching of behaviour for custom UDF's I foresee two main use cases for the core udf's - timezone and spark-like 'ansi' mode. |
While I agree that this PR makes it easier to add more functionality (and thus bloat) to the core functions, I don't think it fundamentally changes the need for reviewers to help make that tradeoff when evaluating new features for inclusion |
Arc<ConfigOptions>
to ScalarFunctionArgs
,
Are we good to merge the PR? |
Arc<ConfigOptions>
to ScalarFunctionArgs
, Arc<ConfigOptions>
to ScalarFunctionArgs
, don't copy ConfigOptions
on each query
@@ -68,9 +72,10 @@ impl ExecutionProps { | |||
|
|||
/// Marks the execution of query started timestamp. | |||
/// This also instantiates a new alias generator. | |||
pub fn start_execution(&mut self) -> &Self { | |||
pub fn mark_start_execution(&mut self, config_options: Arc<ConfigOptions>) -> &Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think this is technically a breaking API. It would be nicer for downstream users if we kept the old function and deprecated it so the compiler told them what was happening.
I pushed a commit to do that
} | ||
} | ||
|
||
/// Specify whether to enable the filter_null_keys rule | ||
pub fn filter_null_keys(mut self, filter_null_keys: bool) -> Self { | ||
self.options.optimizer.filter_null_join_keys = filter_null_keys; | ||
Arc::make_mut(&mut self.options) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah -- I think this is unavoidable if someone wants to modify the config options
I think it is better than what is on main as today every OptimizerConfig requires a clone of the ConfigOptions
-- using make_mut
means the code will now only clone the config when it actually needs to be modified
I touched up some more docs, added another note to the upgrade guide, and filed a ticket for the TODO items I think this PR is ready to go and I plan to merge it tomorrow unless anyone else would like additional time to comment or review |
Sorry I was working on a few last touchups - I think we are good to merge it now, but was going to leave it open for one more day to gather any potential additional feedback. Perhaps this is not necessary |
🚀 |
Thanks again @Omega359 |
.config_options | ||
.entries() | ||
.iter() | ||
.sorted_by(|&l, &r| l.key.cmp(&r.key)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this code while reviewing #17078
It seems to me that adding this sort of all config options to each scalar function when looking for equality is going to be quite expensive at planning time.
Can we potentially do a pointer comparison first before deciding we need to do a deep compare by value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within eq impl it's easy to do pointer comparison first, but what about Hash impl?
Is the assumption that eq is called during planning much more often than the hash?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Within eq impl it's easy to do pointer comparison first, but what about Hash impl? Is the assumption that eq is called during planning much more often than the hash?
I think for the hash implementation we can just not hash the config_options
-- while this might result in theoretically more hash collisions it will be way faster and still correct
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for the hash implementation we can just not hash the
config_options
--
sgtm
…onfigOptions` on each query (apache#16970) * Add `ConfigOptions` to ExecutionProps when execution is started * Add ConfigOptions to ScalarFunctionArgs, refactor AsyncScalarUDF.invoke_async_with_args to remove ConfigOptions arg. * Updated OptimizerConfig.options() -> Arc<ConfigOptions> to eliminate clone() calls. Fixed an issue with SimplifyExpressions.rewrite(..) not adding config options to execution_props. Added test to verify it works * Test update. * Add note in upgrade guide * Use Arc all the way down * start_execution -> mark_start_execution * Update datafusion/expr/src/execution_props.rs Co-authored-by: Andrew Lamb <[email protected]> * Update comments * Avoid API breakage via #deprecated * Update upgrade guide for Arc<ConfigOptions> change * Apply suggestions from code review * fmt --------- Co-authored-by: Andrew Lamb <[email protected]>
Which issue does this PR close?
Closes
SessionConfig
reference toScalarFunctionArgs
#13519This is a continuation of @alamb's PR
ConfigOptions
to ExecutionProps when execution is started #16661Relates to:
SessionConfig
reference toScalarFunctionArgs
#13519datafusion.execution.time_zone
is not used for basic time zone inference #13212Alternative to:
ConfigOptions
to ExecutionProps when execution is started #16661Rationale for this change
Allow udf's to access df config to allow for their behaviour to change based on configuration. For example, allows date and timestamp udf's to use a different timezone than UTC or to allow date/timestamp parsing to have ANSI behaviour when parsing fails.
What changes are included in this PR?
Arc<ConfigOptions>
toExecutionProps
when the execution is started&Arc<ConfigOptions>
instead of &ConfigOptionsConfigOptions
toScalarFunctionArgs
&ConfigOptions
toArc<ConfigOptions>']
registry: &dyn FunctionRegistry
changed toctx: &SessionContext
Note that there is two todo's in ffi/src/udf/mod.rs related to the serdes of config options in ScalarFunctionArgs. This PR doesn't do any serdes for that.
Are these changes tested?
Existing tests + a new test 'test_config_options_work_for_scalar_func' in user_defined_scalar_functions.rs
Are there any user-facing changes?
Yes, breaking changes as listed above.