Out-of-the-box support for UDFs #11851
-
Extending dbt's support to new node types is a natural and welcome progression of dbt. After seeds and snapshots, functions are definitely needed! Being able to version control the functions - pure gold! The reference is not the prettiest one. Jinja in functions - 100% yes!
-
Top level opinions:
My pitch

Given those, my recommendation is to lean into macros and UDFs having identical calling syntax. You wouldn't be able to have a macro and a UDF with the same name, but that's fine - you can't have a seed and a model with the same name either, and they share the same invocation syntax (`{{ ref(...) }}`). So it would look like this:
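Something along these lines (a sketch using the `is_positive_int` example from elsewhere in this thread; the table and column names are placeholders):

```sql
-- the call site is identical whether is_positive_int is a macro or a dbt-managed UDF
select *
from {{ ref('my_table') }}
where {{ is_positive_int('my_column') }}
```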
I particularly like this because it makes the migration really clean and tidy for anyone who already has hardcoded UDFs in their project - they just have to wrap them in some curlies (and quote their column references 😑) and they're done.

Alternative approach

Failing that, I'd fall back to
The only benefit of that approach is it could be expanded to support versions in the future. I don't think the jankiness of double parens and more quotes is worth it though.

Bonus question

Are we supporting user-defined table functions as well? I assume I could do
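(Purely as an illustration of the question - the function name and the Snowflake-style `table()` syntax below are assumptions, not part of the proposal:)

```sql
-- hypothetical dbt-managed table function, called like any other macro
select *
from table({{ recent_orders(7) }})
```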
-
So great to hear—I'm looking forward to it! A few thoughts:

1. UDF Overloading

In Snowflake (and probably in some more engines), it's possible to create overloaded UDFs. For example:

```sql
create or replace function my_udf(a varchar);
create or replace function my_udf(a varchar, b int);
```

Both functions will coexist in the schema. That means if you reference `my_udf` by name alone, it's ambiguous which overload you mean.

2. Support for Table Functions (UDFs as Parameterized Views)

Table functions are essentially parameterized views. +1 for their support - it enables a wide range of use cases, definitely a welcome addition!

3. Schema Validation During UDF Creation

In Snowflake, UDFs are validated at creation time. For instance:

```sql
create or replace temp function my_udf() returns int as $$
select * from my_tbl
$$
```

If `my_tbl` doesn't exist (or its columns don't match), the creation fails.

This highlights an advantage of using …

4. UDFs and Unit Testing Challenges

Testing downstream models that depend on UDFs is tricky:

I am not sure how it's best to tackle it - maybe just acknowledging that current dbt unit testing does not support these patterns.

5. Named Arguments and Default Values in UDF Calls

The idea to wrap UDF input arguments in double braces (`{{ }}`) …
-
@graciegoheen you might find it interesting to see what I did with Paul Brabban's prototype... I added support for additional languages and function types to their UDF custom materialization: #10395 (comment)
-
Hey thanks for the shout-out @graciegoheen! Nice write up! I can see there's already been some discussion, but I'll try and give my own feedback and answer the OP questions with reasoning rather than start dialogues...

Quick context: I don't use dbt Cloud, I typically test and run my dbt pipelines off my CI/CD system, not a separate orchestrator, I don't know anything about "Fusion" (SDF?), and I've only used dbt with BigQuery, Athena, Snowflake and DuckDB off the top of my head. I have a few open source projects that are examples of my use of UDFs (might be useful to try out candidates), and I've been involved with data platform infra and building dbt projects with, I think, around 1500 models & tests, but I prefer smaller, more "meshy" setups.

functions/schema.yml
Unit testing

Unit testing should be easy enough to implement for UDFs, but probably needs a bit of thinking about in order to cover other types of functions. Table functions again jump out as a bit different. My ways of working mean I'm always able to run my tests in a non-production environment, so I'm not really affected by the functions being persisted for testing, although the fact that UDFs/UDAFs typically support the … Now I think about it, it seems a bit similar to a model's …

Open questions
I would kind of expect to be able to handle UDFs, UDAFs and stored procedures in a similar way - adding one seems likely to generate clamour for the others. Then there are table functions, someone else mentioned those - they feel a bit different, but I can see the connection. Might be worth bearing the others in mind when you're looking at UDFs. I published a follow-up post covering what I've done with these other types of object this week.
-
Hey folks! Dropping by to let you know about a community feedback / office hours session that we'll be running in a couple of weeks.

Thursday, 21 August, 8am Pacific: UDFs as native functionality in dbt
Some supporting resources:
-
UDFs via Jinja (macro‑compatible proposal)

Thanks to Grace and the community for kicking this off. Code‑authored UDFs are overdue. There are a dozen ways to add them; I think the simplest, most ergonomic path is: extend dbt's macro system to also define, configure, and call UDFs.

Why? Macros and functions are both procedural abstractions. You should be able to switch between them with minimal churn, just like you can in C, Rust, or Lisp. Keep one calling convention, one mental model, and reuse dbt's well-tested machinery (dispatch, packaging, docs, CI).

Core idea

Add a new Jinja block: `{% function %} … {% endfunction %}`. Functions compile to warehouse UDFs; macros still expand at compile time. One call form everywhere: `{{ function_name(args) }}`. If a macro and a function share a name, we issue a compile error.

Simple example

```sql
{% macro is_positive_int(a_string) %}
REGEXP_LIKE({{ a_string }}, '^[0-9]+$')
{% endmacro %}
```

... and its usage:

```sql
select * from my_table where {{ is_positive_int('my_column') }}
```

The same as a UDF looks like this...

```sql
{{ config(schema='public_string_functions', database='udf_db', language='sql', kind='scalar') }}
-- signature: (a_string STRING) RETURNS BOOLEAN
{% function is_positive_int(a_string) %}
REGEXP_LIKE({{ a_string }}, '^[0-9]+$')
{% endfunction %}
```

... with the same usage:

```sql
select * from my_table where {{ is_positive_int('my_column') }}
```

Files, config, and signatures

Like anywhere else, function definitions support one file‑scoped `config()` at the top (schema, database, language, runtime, kind, grants, etc.). Additional top‑level `config()` calls in the same file are errors; split files if you need different settings. We support two signature styles: an anonymous `-- signature:` comment (as in the examples above) and a named one that repeats the function name (as in the DATEADD example below).
If both appear for the same function, error. If a named signature disagrees with the function name, error.

Build order & DAG (no magic, just edges)

Parse and materialize: dbt parse introduces a Function node for each function definition. In addition, we record each function application in a new `depends_on_functions` field. Parse expands macros and functions into their corresponding syntactic forms. Function materialization takes the config and the function body and creates the proper `CREATE OR REPLACE FUNCTION` statement. dbt's topological sort guarantees that functions are uploaded to the warehouse and created before anything that calls them, and function‑to‑function calls are transitively ordered. Even functions referencing tables are supported without further ado, as long as the db adapter supports them.

Type checking & analysis

With signatures known, dbt/Fusion can typecheck function bodies (SQL) and validate calls at compile time. This works even for non‑SQL languages (Python/Java/Scala), since all functions must declare signatures so that call sites can be validated.

Non‑SQL example (Snowflake Python)

```
{{ config(schema='udf_py', database='udf_db', language='python', runtime_version='3.10', kind='scalar', handler='is_positive_int') }}
-- signature: (a_string STRING) RETURNS BOOLEAN
{% function is_positive_int(a_string) %}
def is_positive_int(a_string: str) -> bool:
    if a_string is None:
        return False
    import re
    return re.fullmatch(r'[0-9]+', a_string) is not None
{% endfunction %}
```

Aggregates, table functions, and beyond

The function configuration `kind` supports 'scalar', 'aggregate' and 'table'. Concepts and call sites stay identical. Of course, UDFs, UDAFs and UDTFs are only supported if the db adapter actually supports them.

Known limits (v1, on purpose)

This proposal does not support overloaded UDFs, i.e. UDF applications that are only distinguishable by their argument signatures. For instance, Snowflake has two functions:

```sql
-- signature: DATEADD(part STRING, offset INT, date DATE) RETURNS DATE
-- signature: DATEADD(part STRING, offset INT, ts TIMESTAMP) RETURNS TIMESTAMP
```

Since dbt doesn't do overload resolution at parse time, we can't distinguish these functions. But that restriction can easily be overcome: just define the two functions under distinct names (e.g. a date variant and a timestamp variant).

Summary: Why does this proposal work so easily?
-
Thank you for all of the feedback so far! I wanted to provide one update to how we're planning to implement UDFs in dbt. Originally, we proposed referencing a UDF using the existing `{{ ref(…) }}` macro.
We got some great feedback from our engineers (thank you @wolfram-s, @QMalcolm, @akbog, @joellabes) that even though UDFs may seem very similar to other dbt-managed objects (like models, snapshots, and seeds), there's a key distinction:
UDFs:
Because UDFs are actually conceptually different from Relations, we believe that "referencing" these should be distinct in your code (i.e. you should be able to easily scan your model files and see when you're calling a Relation vs. a Callable). So, instead of moving forward with `{{ ref() }}` for UDFs, we'll go with a distinct way of calling them. We will need to consider how much overlap there is between the two approaches.
-
Background
In our May roadmap post, we proposed a long-discussed addition to the dbt standard: Out-of-the-box support for UDFs.
Since then, we've made some improvements to Fusion, so it can now understand the UDFs you've manually defined in your warehouse for static analysis.
But there’s still appetite for providing a managed experience for UDFs in dbt - where you write your UDF logic alongside the rest of your dbt code and dbt is responsible for creating/updating them in appropriate DAG-order!
What should that spec look like? We’ve done some initial brainstorming, fueled by a lot of content from dbt community stars (THANK YOU):
We’d love to hear your feedback on our proposal before we finalize and start building!
Proposed Spec for UDFs
Let’s walk through an example (thanks Paul for the code snippet!).
The goal is simple: I want to author a new UDF in my warehouse, and I want it to be managed by dbt.
1: First, I should add the UDF to my project.
I will add the logic for my `is_positive_int` UDF to a new file in the `functions` directory. This file (like models and seeds) will have a 1:1 relationship with the created object in the warehouse, so I'll name this file `is_positive_int.sql`. I will define my argument and output types, along with other properties and configs, in an associated yaml file.
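For illustration, the pair of files might look something like this (the yaml keys here are a guess at the shape, not the final spec):

```sql
-- functions/is_positive_int.sql
regexp_like(a_string, '^[0-9]+$')
```

```yaml
# functions/schema.yml (illustrative)
functions:
  - name: is_positive_int
    description: Returns true when the input string is a positive integer.
    config:
      database: udf_db
      schema: udf_schema
    arguments:
      - name: a_string
        data_type: varchar
    returns: boolean
```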
The rendered create UDF statement would be dependent on which adapter you’re using (the benefit of having dbt be the abstraction layer!).
In Snowflake, it would look like:
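Presumably something like this (a sketch; the exact DDL dbt renders isn't settled):

```sql
create or replace function udf_db.udf_schema.is_positive_int(a_string varchar)
returns boolean
as
$$
    regexp_like(a_string, '^[0-9]+$')
$$;
```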
In Redshift, it would look like:
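And roughly like this (also a sketch; note that Redshift SQL UDFs reference their arguments positionally):

```sql
create or replace function udf_schema.is_positive_int(varchar)
returns boolean
immutable
as $$
    select $1 ~ '^[0-9]+$'
$$ language sql;
```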
2: Second, I want to use that UDF in a model.
I will reference the UDF just like I ref any other dbt-managed warehouse object (models, snapshots, seeds) in dbt, using the `{{ ref(…) }}` macro! When you run a `dbt compile`, the `{{ ref('is_positive_int') }}` is replaced by the fully qualified name of the UDF, `udf_db.udf_schema.is_positive_int`. Just like other dbt-managed warehouse objects, UDFs will respect the generate database/schema/alias macros and use my custom `schema` and `database` configurations from above.
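For example, a call site in a model might look like this (the model and column names are made up):

```sql
-- models/my_model.sql
select
    raw_value,
    {{ ref('is_positive_int') }}(raw_value) as is_positive_int
from {{ ref('raw_values') }}
```

After `dbt compile`, that call would read `udf_db.udf_schema.is_positive_int(raw_value)`.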
3: dbt now understands the dependencies between my UDFs and my models!

In my DAG, there would be a dependency between `is_positive_int` → `my_model`. So, when I run `dbt build`, `is_positive_int` would be created (or updated) before running `my_model`.

4: My UDF keeps my code DRY :)
The outputted data for `my_model` would look something like:
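Roughly this, assuming a `raw_value` input column and the `^[0-9]+$` regex from above:

| raw_value | is_positive_int |
| --- | --- |
| 8 | true |
| +8 | false |
| 1.0 | false |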
Hmm… `+8` and `1.0` are positive numbers. Maybe I should update my regex logic ;)

I can create a unit test to assert my expectations.
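A sketch of what that could look like using dbt's existing unit test spec (the model, input, and column names are made up):

```yaml
# models/schema.yml
unit_tests:
  - name: is_positive_int_counts_plus_signs_and_decimals
    model: my_model
    given:
      - input: ref('raw_values')
        rows:
          - {raw_value: "+8"}
          - {raw_value: "1.0"}
    expect:
      rows:
        - {raw_value: "+8", is_positive_int: true}
        - {raw_value: "1.0", is_positive_int: true}
```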
And when I’m ready to, I can update my logic in a single place (`is_positive_int.sql`), and it will be fixed everywhere I use this function (just like macros)!

Open Questions
For the above proposal, we went with “1:1 relationship between a file in your dbt project and a UDF in your warehouse”. But should you instead be able to define multiple UDFs within one file (like macros and snapshots), especially if you have functions that depend on other functions?
For the above proposal, we went with “new node type for functions” where they can be easily visualized in the dbt DAG. But should UDFs instead be a model materialization? A special kind of “materialized” macro? A new node type but named something different?
For the above proposal, we went with “ref your UDFs the same way you ref other objects in your warehouse”. But should we do something different?
For the above proposal, we went with “define argument and output types in a separate yaml file”. But would it be more ergonomic to define those in the same place you define your logic (e.g. `function is_positive_int(a_string str) -> boolean`)? This would be possible in the Fusion engine (via dbt-jinja), but we want a consistent spec across both engines.

Is inclusion in `dbt build` enough, or should we provide some new sub-command (`dbt function` - similar to `dbt seed` for seeds, `dbt run` for models, `dbt snapshot` for snapshots) for just creating/updating your UDFs? Note: You could alternatively accomplish this via `dbt build` and a selector (i.e. `dbt build --resource-type function`).

What other configs / properties should UDFs have?
?grants
?access
?function-kind
/kind
,overload
,volatility
,variadic
,with
,dialect
?What argument / return types should we support (these will differ from the existing ones for macros since we need to support dialect-specific data types such as
TIMESTAMP_NTZ
, see cross-database data type macros)?Should we also support adding configs directly in the functions file?
Should we support jinja in your function logic? This would enable you to write warehouse-agnostic code (for example, using cross-database macros).
What languages should we support for UDFs? SQL? Python? Java? Scala?
In the example above, we noticed a problem with our UDF logic - I was expecting `+8` and `1.0` to be counted as positive. I created a unit test on the model using the UDF to assert my expectations, which works. But, what if I could unit test UDFs directly to validate my logic! What could that spec look like?

We're looking forward to hearing your thoughts. Thank you for building with us!
Let’s. Finally. Do. It.