
Conversation

goldmedal
Contributor

Which issue does this PR close?

Rationale for this change

I have been working with @alamb to implement support for async UDFs.

It introduces the following trait:

#[async_trait]
pub trait AsyncScalarUDFImpl: Debug + Send + Sync {
    /// Returns the function as `Any` for downcasting
    fn as_any(&self) -> &dyn Any;

    /// The name of the function
    fn name(&self) -> &str;

    /// The signature of the function
    fn signature(&self) -> &Signature;

    /// The return type of the function
    fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType>;

    /// The ideal batch size for this function.
    ///
    /// This determines how much data is evaluated at once.
    /// If `None`, the whole batch is evaluated in a single call.
    fn ideal_batch_size(&self) -> Option<usize> {
        None
    }

    /// Invoke the function asynchronously with the async arguments
    async fn invoke_async_with_args(
        &self,
        args: AsyncScalarFunctionArgs,
        option: &ConfigOptions,
    ) -> Result<ArrayRef>;
}
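
As a rough illustration of what `ideal_batch_size` implies (a hypothetical helper, not code from this PR; plain `Vec<i64>` stands in for Arrow's `ArrayRef`), the caller can split a batch into chunks of at most that size and invoke the function once per chunk:

```rust
// Hypothetical sketch: honor `ideal_batch_size` by splitting the input into
// chunks of at most that size, invoking the function once per chunk, and
// concatenating the results. `Vec<i64>` stands in for an Arrow array.
fn invoke_chunked<F>(batch: &[i64], ideal_batch_size: Option<usize>, f: F) -> Vec<i64>
where
    F: Fn(&[i64]) -> Vec<i64>,
{
    match ideal_batch_size {
        // No preference: evaluate the whole batch at once.
        None => f(batch),
        // Otherwise invoke once per chunk (chunk size must be non-zero).
        Some(n) => batch.chunks(n).flat_map(|chunk| f(chunk)).collect(),
    }
}

fn main() {
    // A stand-in "remote" function that doubles every value.
    let double = |xs: &[i64]| xs.iter().map(|x| x * 2).collect::<Vec<_>>();
    let batch: Vec<i64> = (1..=10).collect();
    // With ideal_batch_size = 4, the function is invoked on chunks of 4, 4, 2.
    let out = invoke_chunked(&batch, Some(4), double);
    println!("{:?}", out);
}
```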

It allows users to implement UDFs that invoke an external remote function from a query.
Given an async UDF async_equal, the plan looks like:

> explain select async_equal(a.id, 1) from animal a
+---------------+----------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                   |
+---------------+----------------------------------------------------------------------------------------+
| logical_plan  | Projection: async_equal(a.id, Int64(1))                                                |
|               |   SubqueryAlias: a                                                                     |
|               |     TableScan: animal projection=[id]                                                  |
| physical_plan | ProjectionExec: expr=[__async_fn_0@1 as async_equal(a.id,Int64(1))]                    |
|               |   AsyncFuncExec: async_expr=[async_expr(name=__async_fn_0, expr=async_equal(id@0, 1))] |
|               |     CoalesceBatchesExec: target_batch_size=8192                                        |
|               |       DataSourceExec: partitions=1, partition_sizes=[1]                                |
|               |                                                                                        |
+---------------+----------------------------------------------------------------------------------------+

To reduce the number of async function invocations, the CoalesceAsyncExecInput rule coalesces the input batches of AsyncFuncExec.

See the example for detailed usage.
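
For intuition, the effect CoalesceAsyncExecInput aims for can be sketched like this (hypothetical simplified code, with `Vec<i64>` standing in for record batches): buffer small incoming batches and emit merged ones of roughly `target` rows, so the downstream async function is invoked once per large batch instead of once per small one:

```rust
// Hypothetical sketch of batch coalescing: accumulate rows from small
// incoming batches and emit a merged batch once at least `target` rows
// have been buffered, flushing any remainder at the end.
fn coalesce(batches: Vec<Vec<i64>>, target: usize) -> Vec<Vec<i64>> {
    let mut out = Vec::new();
    let mut buf: Vec<i64> = Vec::new();
    for b in batches {
        buf.extend(b);
        if buf.len() >= target {
            // Emit the buffered rows as one merged batch.
            out.push(std::mem::take(&mut buf));
        }
    }
    if !buf.is_empty() {
        out.push(buf); // flush the remainder
    }
    out
}

fn main() {
    let small = vec![vec![1, 2], vec![3], vec![4, 5, 6], vec![7]];
    // With target = 4, the seven rows arrive in two batches instead of four.
    let merged = coalesce(small, 4);
    println!("{:?}", merged);
}
```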

What changes are included in this PR?

Remaining Work

  • Support for ProjectExec
  • Support for FilterExec
  • Support for Join Expression

Maybe implement in a follow-up PR

  • Async aggregation function
  • Async window function
  • Async table function (?)

Are these changes tested?

Are there any user-facing changes?

@alamb
Contributor

alamb commented Feb 24, 2025

😮 -- thanks @goldmedal -- I'll put this on my list of things to review

@goldmedal goldmedal marked this pull request as ready for review March 12, 2025 02:54
@goldmedal
Contributor Author

@alamb Sorry for the delay. This PR is ready for review now.
I want to focus on Projection and Filter, which can currently invoke async UDFs. After ensuring the approach makes sense, I'll create follow-up PRs for the other plans.

@alamb
Contributor

alamb commented Mar 12, 2025

Thanks I'll put it on my list

@berkaysynnada
Contributor

What's the status of this PR?

@goldmedal
Contributor Author

> What's the status of this PR?

It's ready for review. I'm still waiting for someone to help review it.

@berkaysynnada
Contributor

> What's the status of this PR?

> It's ready for review. I'm still waiting for someone to help review it.

Thanks @goldmedal. We'll need this as well, so let's revive it. I'm putting this on my review list.

Contributor

@berkaysynnada berkaysynnada left a comment


Hi again @goldmedal. I finally found some time to look into this. First of all, thank you for your work. This PR is in very good shape overall, and the idea is easy to follow.

However, when I first imagined the design of this feature, I was thinking of approaching the problem from a different angle, which I believe could simplify things quite a bit:

What if we just added a new method to the PhysicalExpr trait, like evaluate_async()? We could then call this from streams that might involve async work. The default implementation would delegate to evaluate(), but in the case of ScalarFunctionExpr, we could branch depending on the function type.

This way, we wouldn't need to introduce a new physical rule or operator, which add overhead to both planning and execution. As I mentioned below, the special handling in the planner doesn't scale well IMO.

I'd love to hear your thoughts on my suggestion.

@@ -775,12 +776,44 @@ impl DefaultPhysicalPlanner {

let runtime_expr =
self.create_physical_expr(predicate, input_dfschema, session_state)?;

let filter = match self.try_plan_async_exprs(
Contributor


Do we need to apply this pattern to every operator that has PhysicalExprs inside it which need to be evaluated at runtime? I think we can figure out another way that doesn't require modifying the planner code for every such operator

Contributor


I think at a really high level this pattern is basically the same as the "Common Subexpression Elimination" and many of the other optimizer passes -- that is pulling some subset of the expressions into a new node, and rewriting the others.

If we want to avoid that, I think we could follow the model of some of the other recent optimizer passes and add a method to ExecutionPlan -- something like this perhaps

trait ExecutionPlan {
  /// Factor all async expressions in this ExecutionPlan out of any internal expressions,
  /// returning a list of such async expressions and the rewritten plan
  ///
  /// The async expression values will be provided to the rewritten plan after all the existing
  /// input columns
  fn rewrite_async(&self) -> Transformed<(Vec<AsyncExpr>, Arc<dyn ExecutionPlan>)> {
    // default to not supporting async functions
    Transformed::no()
  }
}

Contributor


> something like this perhaps

rewritten plan is (async_exec + original plan)?

> I think at a really high level this pattern is basically the same as the "Common Subexpression Elimination" and many of the other optimizer passes -- that is pulling some subset of the expressions into a new node, and rewriting the others.

I see the pattern now, but IMO for this async evaluation, adding a new operator for each async fn in the query seems a bit unnatural to me. I feel like we should encapsulate this feature at the PhysicalExpr level.

Contributor


I agree it feels unnatural

The downside, in my mind, of trying to put it in PhysicalExpr is that it complicates implementing PhysicalExpr even when most PhysicalExprs don't need to worry about it

Thus I think treating async UDFs specially, while not ideal, will make it easier to understand how different they are

@adriangb
Contributor

> What if we just added a new method to the PhysicalExpr trait, like evaluate_async()? We could then call this from streams that might involve async work. The default implementation would delegate to evaluate(), but in the case of ScalarFunctionExpr, we could branch depending on the function type.

How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?
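
To make the coloring problem concrete, here is a toy sketch (not DataFusion code): `call_llm_model_async` is a stand-in for the remote call, and `block_on` is a minimal noop-waker executor that can only drive futures which never return `Pending`. A sync `evaluate()` can only reach the async result by blocking like this, which is exactly what is dangerous inside a real async runtime and why a runtime handle would otherwise have to be passed around:

```rust
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

// Build a waker that does nothing; enough for futures that complete
// without ever needing to be woken.
fn noop_raw_waker() -> RawWaker {
    fn clone(_: *const ()) -> RawWaker { noop_raw_waker() }
    fn noop(_: *const ()) {}
    static VTABLE: RawWakerVTable = RawWakerVTable::new(clone, noop, noop, noop);
    RawWaker::new(std::ptr::null(), &VTABLE)
}

// Minimal blocking executor: polls the future in a busy loop. A real
// runtime would park the thread; blocking like this inside an async
// runtime's worker thread is the hazard discussed above.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(noop_raw_waker()) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(v) = fut.as_mut().poll(&mut cx) {
            return v;
        }
    }
}

// Toy stand-in for the remote model call; completes immediately.
async fn call_llm_model_async() -> i64 {
    1
}

// A sync evaluate() for `1 = 2 OR 1 = call_llm_model_async()`: once we are
// in the sync world, the only way back into async is to block on the future.
fn evaluate() -> bool {
    let rhs = block_on(call_llm_model_async());
    (1 == 2) || (1 == rhs)
}

fn main() {
    println!("{}", evaluate()); // prints "true"
}
```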

Contributor

@alamb alamb left a comment


Thank you @goldmedal -- I am sorry I missed this PR for so long. I think it is a great extension for DataFusion and will make using DataFusion with various new LLMs / services easier

I am approving this PR as I think it follows the existing patterns for optimizers and adds some key functionality

However, note I am quite biased as I had something to do with this pattern here goldmedal/datafusion-llm-function#1. Thus I believe that we should address @berkaysynnada and @adriangb 's concerns prior to merging

I think we should file some follow on tickets to

  1. Add support for the remaining nodes
  2. Add some more documentation / examples


@alamb alamb mentioned this pull request May 11, 2025
24 tasks
@berkaysynnada
Contributor

> How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate the effects and challenges. Does anything come to your mind?

@adriangb
Contributor

> How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

> The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate the effects and challenges. Does anything come to your mind?

I mean, that makes sense, but it sounds like a lot of churn? I'm not sure tbh; sync/async coloring is always a pain and I don't know of any good solutions :(

@berkaysynnada
Contributor

> How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

> The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate the effects and challenges. Does anything come to your mind?

> I mean, that makes sense, but it sounds like a lot of churn? I'm not sure tbh; sync/async coloring is always a pain and I don't know of any good solutions :(

I'll try a POC when I find some time, and I'd like to hear @alamb's opinion

@alamb
Contributor

alamb commented May 11, 2025

> How would that work going from sync -> async? For example: 1 = 2 OR 1 = call_llm_model_async(). I imagine this would build something like BinaryExpr(BinaryExpr(1, Eq, 2), Or, ScalarFunc(call_llm_model_async)). If we call evaluate_async on the outer BinaryExpr it would call evaluate() by default so now you're in sync world. How do you break back into async world? Do we pass around a handle to the tokio runtime?

> The easy answer is converting the original evaluate()'s to async and moving all evaluate() impls to evaluate_sync(), but I cannot fully estimate the effects and challenges. Does anything come to your mind?

> I mean, that makes sense, but it sounds like a lot of churn? I'm not sure tbh; sync/async coloring is always a pain and I don't know of any good solutions :(

> I'll try a POC when I find some time, and I'd like to hear @alamb's opinion

My feeling (without any solid data) is that using async functions is not ideal because:

  1. The async overhead (e.g. what it takes to await vs call a normal function) could be noticeable, but maybe not that big a deal
  2. The fact that everything that calls a UDF would have to be async (as only async functions can call other async functions) -- the so-called "what color is your function" problem -- would be quite disruptive.

Another benefit of the approach in this PR is that it requires no changes to any existing functions or APIs (in fact, the original POC can be implemented entirely as a DataFusion user-defined optimizer extension)

@goldmedal
Contributor Author

Hi @alamb
I have fixed the conflicts. If there are no more comments, I think we can merge it.

@alamb
Contributor

alamb commented Jun 23, 2025

Thanks @goldmedal -- I will file some follow on tickets and then merge

Contributor

@alamb alamb left a comment


I think it looks like a good start -- let's merge it and those who want to use it can iterate on it as we go

Thanks @goldmedal


fn signature(&self) -> &Signature;

/// The return type of the function
fn return_type(&self, _arg_types: &[DataType]) -> Result<DataType>;
Contributor


Since this PR was originally created, the UDF APIs have changed again (to use Field/FieldRef rather than DataType)...

@alamb alamb merged commit cdaaef7 into apache:main Jun 23, 2025
30 checks passed
@goldmedal goldmedal deleted the epic/async-udf branch June 24, 2025 01:49
@goldmedal
Contributor Author

Thanks @alamb @berkaysynnada @kylebarron @ozankabak @Omega359 @paleolimbot for reviewing and suggestions 🚀

@alamb
Contributor

alamb commented Jul 21, 2025

I made a follow-on PR to update the docs a bit:

This is so exciting

Successfully merging this pull request may close these issues.

Async User Defined Functions (UDF)