
Conversation

timsaucer
Contributor

@timsaucer timsaucer commented Aug 22, 2025

Which issue does this PR close?

Rationale for this change

There are many use cases where you have a column containing an array and you want to transform every element of that array. The current workaround is to do something like unnest and then aggregate, which is bad for both ergonomics and performance. With this work we add a function array_transform that takes a scalar function and applies it to every element in an array.

This PR is narrowly scoped as a first proof of concept. It does not address aggregation as #15882 requests, and it is limited to cases where all other arguments passed to the inner function are scalar values.

What changes are included in this PR?

Adds array_transform and unit tests.

Are these changes tested?

A unit test is provided that demonstrates both low-level testing of the invocation and a full test demonstrating it in operation with a dataframe.

Here is an example taken from the test that is included in the PR:

let udf = array_transform_udf(datafusion_functions::math::abs(), 0);
let df = df.select([col("a"), udf.call(vec![col("a")]).alias("abs(a[])")])?;

Will produce this dataframe, which shows the original and transformed data:

+-------------+-----------+
| a           | abs(a[])  |
+-------------+-----------+
| [1, -2, 3]  | [1, 2, 3] |
| [-4, 5]     | [4, 5]    |
| [-6, 7, -8] | [6, 7, 8] |
+-------------+-----------+

Are there any user-facing changes?

No

Still to do before ready to merge

  • Add additional documentation describing how all the pieces of this work
  • Create a plan for lifting the restriction that all other arguments must be scalar values
  • Create a plan for addressing the aggregation case or open an issue for something like array_aggregate
  • Address how it can be used with SQL commands instead of only dataframe operations
  • Potentially move the integrated test to a different location - dataframe may not be the right place to test a function

@github-actions github-actions bot added the core Core DataFusion crate label Aug 22, 2025
@timsaucer timsaucer self-assigned this Aug 22, 2025
@timsaucer
Contributor Author

Note to self: To support arrays (rather than only scalar values) in argument positions other than the one we apply to, see if we can use a run-end encoded array. That would avoid any data duplication: we would only need to build one new run-ends array from the lengths of the lists in the array we are applying to.

Also consider having a .new() function that takes only the scalar function and assumes we apply to index 0, then add a .with_argument_index() to change it.
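A minimal sketch of what that builder-style API could look like. The type and method names here are illustrative only (InnerFunc stands in for DataFusion's ScalarUDF); nothing in this sketch is the actual PR implementation.

```rust
// Hypothetical sketch of the .new() / .with_argument_index() builder idea.
// `InnerFunc` is a stand-in for a real scalar UDF handle.
#[derive(Debug, Clone, PartialEq)]
struct InnerFunc(&'static str);

#[derive(Debug)]
struct ArrayTransformUdf {
    inner: InnerFunc,
    /// Index of the argument holding the list array to map over.
    argument_index: usize,
}

impl ArrayTransformUdf {
    /// Wrap a scalar function, applying it to argument 0 by default.
    fn new(inner: InnerFunc) -> Self {
        Self { inner, argument_index: 0 }
    }

    /// Override which argument position holds the list array.
    fn with_argument_index(mut self, index: usize) -> Self {
        self.argument_index = index;
        self
    }
}

fn main() {
    // Default: apply to argument 0.
    let abs_udf = ArrayTransformUdf::new(InnerFunc("abs"));
    assert_eq!(abs_udf.argument_index, 0);

    // Override: e.g. round(a, b[]) where the list is argument 1.
    let round_udf = ArrayTransformUdf::new(InnerFunc("round")).with_argument_index(1);
    assert_eq!(round_udf.argument_index, 1);
}
```

The consuming builder (taking and returning `self`) keeps call sites to a single chained expression, which matches the ergonomics suggested above.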

@timsaucer
Contributor Author

After experimenting a little more I can see two paths forward for supporting cases like this dataframe:

+--------------+--------------+
| a            | b            |
+--------------+--------------+
| 0.1111111111 | [1, 2, 3]    |
| 0.2222222222 |              |
|              | [4, 5, 6, 7] |
| 0.4444444444 | []           |
+--------------+--------------+

Suppose I wanted to do a round call where I am passing column a as the value to round and column b as the number of decimal places I want to round to. Ultimately I want this to give an output like

+--------------+--------------+--------------------+
| a            | b            | round(a, b[])      |
+--------------+--------------+--------------------+
| 0.1111111111 | [1, 2, 3]    | [0.1, 0.11, 0.111] |
| 0.2222222222 |              |                    |
|              | [4, 5, 6, 7] |                    |
| 0.4444444444 | []           | []                 |
+--------------+--------------+--------------------+

A difficulty here is that each entry of a needs to map to multiple entries of b. The best way to do this appears to be run-end encoding: we could keep the ArrayRef for column a and create a small primitive array of run ends [3, 4, 8, 9], which gives us an array with the same length as the values array of column b.
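A minimal sketch of how those run ends could be derived from the per-row list lengths of column b. The treatment of null and empty rows (each still occupying one run, so no row is lost) is my assumption, inferred from the [3, 4, 8, 9] example above; the function name is illustrative.

```rust
/// Compute run ends for a run-end encoded copy of column `a`, given the
/// per-row list lengths of column `b` (`None` = null row). Null and empty
/// rows still occupy one run each so the row count is preserved (an
/// assumption inferred from the [3, 4, 8, 9] example above).
fn run_ends_from_list_lengths(lengths: &[Option<usize>]) -> Vec<usize> {
    let mut ends = Vec::with_capacity(lengths.len());
    let mut total = 0usize;
    for len in lengths {
        total += match len {
            Some(n) if *n > 0 => *n,
            _ => 1, // null or empty list: one run for the row
        };
        ends.push(total);
    }
    ends
}

fn main() {
    // Lengths of b for the example dataframe: [1,2,3], null, [4,5,6,7], []
    let ends = run_ends_from_list_lengths(&[Some(3), None, Some(4), Some(0)]);
    assert_eq!(ends, vec![3, 4, 8, 9]);
}
```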

I have tested this locally, but I ran into the problem that the existing scalar functions do not handle run-end encoded arrays. All of these functions would need to be updated, as would any UDFs that users create.

An alternative way we could do this would be to create a new array for b and simply duplicate the data as many times as necessary. This feels like it could lead to excessive memory consumption as we are duplicating values just to feed them into a function and throw them away afterwards. Yet it has the advantage that it would immediately support all scalar functions we have with no additional work.
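For contrast, a sketch of the duplication alternative: materialize an expanded copy of column a by repeating each value once per element of its row's list in b. The decision that a null list row contributes zero copies is my assumption, not something the PR specifies; the function name is illustrative.

```rust
/// Expand scalar column `a` so it lines up element-for-element with the
/// values array of list column `b`, by duplicating each value of `a` once
/// per element of its row's list. Memory-hungry, but existing scalar
/// kernels work on the result unchanged. Names are illustrative.
fn duplicate_by_lengths(values: &[Option<f64>], lengths: &[Option<usize>]) -> Vec<Option<f64>> {
    values
        .iter()
        .zip(lengths)
        // A null list row (`None`) contributes no elements here (assumption).
        .flat_map(|(v, len)| std::iter::repeat(*v).take(len.unwrap_or(0)))
        .collect()
}

fn main() {
    let a = [Some(0.1111111111), Some(0.2222222222), None, Some(0.4444444444)];
    let lens = [Some(3), None, Some(4), Some(0)];
    let expanded = duplicate_by_lengths(&a, &lens);
    // 3 copies of row 0, none for the null b row, 4 copies of the null
    // value in row 2, none for the empty list: 7 elements total.
    assert_eq!(expanded.len(), 7);
    assert_eq!(expanded[0], Some(0.1111111111));
    assert_eq!(expanded[3], None);
}
```

This makes the trade-off concrete: the expanded buffer grows with the total number of list elements, whereas the run-end approach above keeps column a untouched.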

I'm a bit torn on this. I also have alternative reasons to wish to have additional REE support throughout DataFusion. So pushing the first approach would lead to long term benefit but have a much longer tail of implementation.

@timsaucer
Contributor Author

Related PRs for run end encoding:

#16446
#16715

@github-actions github-actions bot added the common Related to common crate label Aug 30, 2025
@timsaucer timsaucer force-pushed the feat/transform-array branch from c84216c to 492cf81 Compare August 30, 2025 19:59
@timsaucer
Contributor Author

I think the best way forward is to merge this with the limited support of processing a single ListArray and only literal/scalar other arguments, and then open a separate issue for how we want to expand to handle arrays in the other argument positions. My reasoning is that I believe what we have here gives some immediate value, especially for single-input functions. Then we can make a design decision about how to expand it to work with multi-argument functions that receive a ColumnarValue::Array input.

I suspect that the run end encoding is the way to go, but that likely means an epic about building out our functions to support REE. That is overall a good thing, IMO.

@timsaucer timsaucer marked this pull request as ready for review August 30, 2025 20:09
@timsaucer timsaucer marked this pull request as draft August 30, 2025 20:10
@milenkovicm
Contributor

Tim, would we be able to implement lambda functions on top of this PR, or is more work needed?

Something like:

SELECT array_sort(array('Hello', 'World'),
  (p1, p2) -> CASE WHEN p1 = p2 THEN 0
              WHEN reverse(p1) < reverse(p2) THEN -1
              ELSE 1 END);

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-lambda-functions

@timsaucer
Contributor Author

Tim, would we be able to implement lambda functions on top of this PR, or is more work needed?

Something like:

SELECT array_sort(array('Hello', 'World'),
  (p1, p2) -> CASE WHEN p1 = p2 THEN 0
              WHEN reverse(p1) < reverse(p2) THEN -1
              ELSE 1 END);

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-lambda-functions

I hadn't considered that, but it sounds like an excellent idea to support. I'm going to think some more about it. I definitely think this would be a strong supporting argument for doing this.


@vegarsti vegarsti left a comment


FWIW Spark's version of this is called transform, not array_transform. But looks like there's a convention of having an array_ prefix on (most!) array functions (https://datafusion.apache.org/user-guide/sql/scalar_functions.html), which seems reasonable!

Comment on lines +151 to +159
// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
Ok(downcast_value!(array, ListViewArray))
}

// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
Ok(downcast_value!(array, LargeListViewArray))
}


Doc comments need triple slashes (///)! ... Although I see that the ones around these also don't have doc comments.

Suggested change
// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
Ok(downcast_value!(array, ListViewArray))
}
// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
Ok(downcast_value!(array, LargeListViewArray))
}
/// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
Ok(downcast_value!(array, ListViewArray))
}
/// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
Ok(downcast_value!(array, LargeListViewArray))
}


#[user_doc(
doc_section(label = "Array Transform"),
description = "Transform every element of an array according to a scalar function. This work is under development and currently only supports passing a single ListArray as input to the inner function. Other inputs must be scalar values.",


Out of curiosity, what other scenarios are there? This comment seems to indicate it's common for such functions to support multiple lists

Contributor Author


There's an example in this comment where you would want to pass two different columns of data: one column holds primitive data and the other holds a list array.

Another use case would be where you have two list arrays and you want to do an element by element application.

@timsaucer
Contributor Author

FWIW Spark's version of this is called transform, not array_transform. But looks like there's a convention of having an array_ prefix on (most!) array functions (https://datafusion.apache.org/user-guide/sql/scalar_functions.html), which seems reasonable!

Thanks. Yes, I'm very familiar with transform in Spark. I was trying to follow the DataFusion convention rather than Spark's.

Successfully merging this pull request may close these issues.

Implement method to apply scalar or aggregate function to Array elements