Out-of-the-box support for UDFs #11851
-
Extending dbt's support to new node types is a natural and welcome progression of dbt. After seeds and snapshots, functions are definitely needed! Being able to version control the functions - pure gold! The reference is not the prettiest one. Jinja in functions - 100% yes!
-
Top level opinions:
My pitch

Given those, my recommendation is to lean into macros and UDFs having identical calling syntax. You wouldn't be able to have a macro and a UDF with the same name, but that's fine - you can't have a seed and a model with the same name either, and they share the same invocation syntax (`{{ ref(...) }}`). So it would look like this:
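Something along these lines (a sketch using the `is_positive_int` example from elsewhere in this thread; the table and column names are placeholders):

```sql
-- the call site is identical whether is_positive_int is a macro or a dbt-managed UDF
select *
from {{ ref('my_table') }}
where {{ is_positive_int('my_column') }}
```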
I particularly like this because it makes the migration really clean and tidy for anyone who already has hardcoded UDFs in their project - they just have to wrap them in some curlies (and quote their column references 😑) and they're done.

Alternative approach

Failing that, I'd fall back to
The only benefit of that approach is it could be expanded to support versions in the future. I don't think the jankiness of double parens and more quotes is worth it though.

Bonus question

Are we supporting user-defined table functions as well? I assume I could do
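(Purely as an illustration of the question - the function name and the Snowflake-style `table()` syntax below are assumptions, not part of the proposal:)

```sql
-- hypothetical dbt-managed table function, called like any other macro
select *
from table({{ recent_orders(7) }})
```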
-
So great to hear—I'm looking forward to it! A few thoughts:

1. UDF Overloading

In Snowflake (and probably in some more engines), it's possible to create overloaded UDFs. For example:

```sql
create or replace function my_udf(a varchar);
create or replace function my_udf(a varchar, b int);
```

Both functions will coexist in the schema. That means if you reference `my_udf` by name alone, it's ambiguous which overload you mean.

2. Support for Table Functions (UDFs as Parameterized Views)

Table functions are essentially parameterized views. +1 for their support - it enables a wide range of use cases, definitely a welcome addition!

3. Schema Validation During UDF Creation

In Snowflake, UDFs are validated at creation time. For instance:

```sql
create or replace temp function my_udf() returns int as $$
select * from my_tbl
$$
```

If `my_tbl` doesn't exist (or its columns don't match), the creation fails.

This highlights an advantage of using …

4. UDFs and Unit Testing Challenges

Testing downstream models that depend on UDFs is tricky:

I am not sure how it's best to tackle it - maybe just acknowledging that current dbt unit testing does not support these patterns.

5. Named Arguments and Default Values in UDF Calls

The idea to wrap UDF input arguments in double braces (`{{ }}`) …
-
@graciegoheen you might find it interesting to see what I did with Paul Brabban's prototype... I added support for additional languages and function types to their UDF custom materialization: #10395 (comment)
-
Hey thanks for the shout-out @graciegoheen! Nice write up! I can see there's already been some discussion, but I'll try and give my own feedback and answer the OP questions with reasoning rather than start dialogues...

Quick context: I don't use dbt Cloud, I typically test and run my dbt pipelines off my CI/CD system, not a separate orchestrator, I don't know anything about "Fusion" (SDF?), and I've only used dbt with BigQuery, Athena, Snowflake and DuckDB off the top of my head. I have a few open source projects that are examples of my use of UDFs (might be useful to try out candidates), and I've been involved with data platform infra and building dbt projects with, I think, around 1500 models & tests, but I prefer smaller, more "meshy" setups.

functions/schema.yml
Unit testing

Unit testing should be easy enough to implement for UDFs, but probably needs a bit of thinking about in order to cover other types of functions. Table functions again jump out as a bit different. My ways of working mean I'm always able to run my tests in a non-production environment, so I'm not really affected by the functions being persisted for testing, although the fact that UDFs/UDAFs typically support the … Now I think about it, it seems a bit similar to a model's …

Open questions
I would kind of expect to be able to handle UDFs, UDAFs and stored procedures in a similar way - adding one seems likely to generate clamour for the others. Then there are table functions, someone else mentioned those - they feel a bit different, but I can see the connection. Might be worth bearing the others in mind when you're looking at UDFs. I published a follow-up post covering what I've done with these other types of object this week.
-
Hey folks! Dropping by to let you know about a community feedback / office hours session that we'll be running in a couple of weeks.

Thursday, 21 August, 8am Pacific: UDFs as native functionality in dbt
Some supporting resources:
-
UDFs via Jinja (macro‑compatible proposal)

Thanks to Grace and the community for kicking this off. Code‑authored UDFs are overdue. There are a dozen ways to add them; I think the simplest, most ergonomic path is: extend dbt's macro system to also define, configure, and call UDFs.

Why? Macros and functions are both procedural abstractions. You should be able to switch between them with minimal churn, just like you can in C, Rust, or Lisp. Keep one calling convention, one mental model, and reuse dbt's well-tested machinery (dispatch, packaging, docs, CI).

Core idea

Add a new Jinja block: `{% function %} … {% endfunction %}`. Functions compile to warehouse UDFs; macros still expand at compile time. One call form everywhere: `{{ function_name(args) }}`. If a macro and a function share a name, we issue a compile error.

Simple example

```sql
{% macro is_positive_int(a_string) %}
REGEXP_LIKE({{ a_string }}, '^[0-9]+$')
{% endmacro %}
```

... and its usage:

```sql
select * from my_table where {{ is_positive_int('my_column') }}
```

The same as a UDF looks like this...

```sql
{{ config(schema='public_string_functions', database='udf_db', language='sql', kind='scalar') }}
-- signature: (a_string STRING) RETURNS BOOLEAN
{% function is_positive_int(a_string) %}
REGEXP_LIKE({{ a_string }}, '^[0-9]+$')
{% endfunction %}
```

... with the same usage:

```sql
select * from my_table where {{ is_positive_int('my_column') }}
```

Files, config, and signatures

Like anywhere else, function definitions support one file‑scoped `config()` at the top (schema, database, language, runtime, kind, grants, etc.). Additional top‑level `config()` calls in the same file are errors; split files if you need different settings. We support two signature styles: an anonymous `-- signature:` comment (as in the examples above) and a named one that repeats the function name (as in the DATEADD example below).
If both appear for the same function, error. If a named signature disagrees with the function name, error.

Build order & DAG (no magic, just edges)

Parse and materialize: dbt parse introduces a Function node for each function definition. In addition, we record each function application in a new `depends_on_functions` field. Parse expands macros and functions into their corresponding syntactic forms. Function materialization takes the config and the function body and creates the proper `CREATE OR REPLACE FUNCTION` statement. dbt's topological sort guarantees that functions are uploaded to the warehouse and created before anything that calls them, and function‑to‑function calls are transitively ordered. Even functions referencing tables are supported without further ado, as long as the db adapter supports them.

Type checking & analysis

With signatures known, dbt/Fusion can typecheck function bodies (SQL) and validate calls at compile time. This works even for non‑SQL languages (Python/Java/Scala), since all functions must declare signatures so that call sites can be validated.

Non‑SQL example (Snowflake Python)

```
{{ config(schema='udf_py', database='udf_db', language='python', runtime_version='3.10', kind='scalar', handler='is_positive_int') }}
-- signature: (a_string STRING) RETURNS BOOLEAN
{% function is_positive_int(a_string) %}
def is_positive_int(a_string: str) -> bool:
    if a_string is None:
        return False
    import re
    return re.fullmatch(r'[0-9]+', a_string) is not None
{% endfunction %}
```

Aggregates, table functions, and beyond

The function configuration `kind` supports 'scalar', 'aggregate' and 'table'. Concepts and call sites stay identical. Of course, UDFs, UDAFs and UDTFs are only supported if the db adapter actually supports them.

Known limits (v1, on purpose)

This proposal does not support overloaded UDFs, i.e. UDF applications that are only distinguishable by their argument signatures. For instance, Snowflake has two functions:

```sql
-- signature: DATEADD(part STRING, offset INT, date DATE) RETURNS DATE
-- signature: DATEADD(part STRING, offset INT, ts TIMESTAMP) RETURNS TIMESTAMP
```

Since dbt doesn't do overload resolution at parse time, we can't distinguish these functions. But that restriction can easily be overcome: just define the two functions under distinct names (e.g. a date variant and a timestamp variant).

Summary: Why does this proposal work so easily?
-
Thank you for all of the feedback so far! I wanted to provide one update to how we're planning to implement UDFs in dbt. Originally, we proposed referencing a UDF using the existing `{{ ref(…) }}` macro.
We got some great feedback from our engineers (thank you @wolfram-s, @QMalcolm, @akbog, @joellabes) that even though UDFs may seem very similar to other dbt-managed objects (like models, snapshots, and seeds), there's a key distinction:
UDFs:
Because UDFs are actually conceptually different from Relations, we believe that "referencing" these should be distinct in your code (i.e. you should be able to easily scan your model files and see when you're calling a Relation vs. a Callable). So, instead of moving forward with `{{ ref() }}` for UDFs, we'll go with a distinct way of calling them. We will need to consider how much overlap there is between the two approaches.
-
Background
In our May roadmap post, we proposed a long-discussed addition to the dbt standard: Out-of-the-box support for UDFs.
Since then, we've made some improvements to Fusion, so it can now understand the UDFs you've manually defined in your warehouse for static analysis.
But there’s still appetite for providing a managed experience for UDFs in dbt - where you write your UDF logic alongside the rest of your dbt code and dbt is responsible for creating/updating them in appropriate DAG-order!
What should that spec look like? We’ve done some initial brainstorming, fueled by a lot of content from dbt community stars (THANK YOU):
We’d love to hear your feedback on our proposal before we finalize and start building!
Proposed Spec for UDFs
Let’s walk through an example (thanks Paul for the code snippet!).
The goal is simple: I want to author a new UDF in my warehouse, and I want it to be managed by dbt.
1: First, I should add the UDF to my project.
I will add the logic for my `is_positive_int` UDF to a new file in the `functions` directory. This file (like models and seeds) will have a 1:1 relationship with the created object in the warehouse, so I'll name this file `is_positive_int.sql`. I will define my argument and output types, along with other properties and configs, in an associated yaml file.
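For illustration, the pair of files might look something like this (the yaml keys here are a guess at the shape, not the final spec):

```sql
-- functions/is_positive_int.sql
regexp_like(a_string, '^[0-9]+$')
```

```yaml
# functions/schema.yml (illustrative)
functions:
  - name: is_positive_int
    description: Returns true when the input string is a positive integer.
    config:
      database: udf_db
      schema: udf_schema
    arguments:
      - name: a_string
        data_type: varchar
    returns: boolean
```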
The rendered create UDF statement would be dependent on which adapter you’re using (the benefit of having dbt be the abstraction layer!).
In Snowflake, it would look like:
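Presumably something like this (a sketch; the exact DDL dbt renders isn't settled):

```sql
create or replace function udf_db.udf_schema.is_positive_int(a_string varchar)
returns boolean
as
$$
    regexp_like(a_string, '^[0-9]+$')
$$;
```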
In Redshift, it would look like:
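And roughly like this (also a sketch; note that Redshift SQL UDFs reference their arguments positionally):

```sql
create or replace function udf_schema.is_positive_int(varchar)
returns boolean
immutable
as $$
    select $1 ~ '^[0-9]+$'
$$ language sql;
```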
2: Second, I want to use that UDF in a model.
I will reference the UDF just like I ref any other dbt-managed warehouse object (models, snapshots, seeds) in dbt, using the `{{ ref(…) }}` macro! When you run a `dbt compile`, the `{{ ref('is_positive_int') }}` is replaced by the fully qualified name of the UDF, `udf_db.udf_schema.is_positive_int`. Just like other dbt-managed warehouse objects, UDFs will respect the generate database/schema/alias macros and use my custom `schema` and `database` configurations from above.
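For example, a call site in a model might look like this (the model and column names are made up):

```sql
-- models/my_model.sql
select
    raw_value,
    {{ ref('is_positive_int') }}(raw_value) as is_positive_int
from {{ ref('raw_values') }}
```

After `dbt compile`, that call would read `udf_db.udf_schema.is_positive_int(raw_value)`.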
3: dbt now understands the dependencies between my UDFs and my models!

In my DAG, there would be a dependency between `is_positive_int` → `my_model`. So, when I run `dbt build`, `is_positive_int` would be created (or updated) before running `my_model`.

4: My UDF keeps my code DRY :)
The outputted data for `my_model` would look something like:
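Roughly this, assuming a `raw_value` input column and the `^[0-9]+$` regex from above:

| raw_value | is_positive_int |
| --- | --- |
| 8 | true |
| +8 | false |
| 1.0 | false |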
Hmm… `+8` and `1.0` are positive numbers. Maybe I should update my regex logic ;)

I can create a unit test to assert my expectations.
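A sketch of what that could look like using dbt's existing unit test spec (the model, input, and column names are made up):

```yaml
# models/schema.yml
unit_tests:
  - name: is_positive_int_counts_plus_signs_and_decimals
    model: my_model
    given:
      - input: ref('raw_values')
        rows:
          - {raw_value: "+8"}
          - {raw_value: "1.0"}
    expect:
      rows:
        - {raw_value: "+8", is_positive_int: true}
        - {raw_value: "1.0", is_positive_int: true}
```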
And when I’m ready to, I can update my logic in a single place (`is_positive_int.sql`), and it will be fixed everywhere I use this function (just like macros)!

Open Questions
For the above proposal, we went with “1:1 relationship between a file in your dbt project and a UDF in your warehouse”. But should you instead be able to define multiple UDFs within one file (like macros and snapshots), especially if you have functions that depend on other functions?
For the above proposal, we went with “new node type for functions” where they can be easily visualized in the dbt DAG. But should UDFs instead be a model materialization? A special kind of “materialized” macro? A new node type but named something different?
For the above proposal, we went with “ref your UDFs the same way you ref other objects in your warehouse”. But should we do something different?
For the above proposal, we went with “define argument and output types in a separate yaml file”. But would it be more ergonomic to define those in the same place you define your logic (e.g. `function is_positive_int(a_string str) -> boolean`)? This would be possible in the Fusion engine (via dbt-jinja), but we want a consistent spec across both engines.

Is inclusion in `dbt build` enough, or should we provide some new sub-command (`dbt function` - similar to `dbt seed` for seeds, `dbt run` for models, `dbt snapshot` for snapshots) for just creating/updating your UDFs? Note: You could alternatively accomplish this via `dbt build` and a selector (i.e. `dbt build --resource-type function`).

What other configs / properties should UDFs have?
?grants
?access
?function-kind
/kind
,overload
,volatility
,variadic
,with
,dialect
?What argument / return types should we support (these will differ from the existing ones for macros since we need to support dialect-specific data types such as
TIMESTAMP_NTZ
, see cross-database data type macros)?Should we also support adding configs directly in the functions file?
Should we support jinja in your function logic? This would enable you to write warehouse-agnostic code (for example, using cross-database macros).
What languages should we support for UDFs? SQL? Python? Java? Scala?
In the example above, we noticed a problem with our UDF logic - I was expecting `+8` and `1.0` to be counted as positive. I created a unit test on the model using the UDF to assert my expectations, which works. But, what if I could unit test UDFs directly to validate my logic! What could that spec look like?

We're looking forward to hearing your thoughts. Thank you for building with us!
Let’s. Finally. Do. It.