
Conversation

timsaucer
Contributor

@timsaucer timsaucer commented Aug 22, 2025

Which issue does this PR close?

Rationale for this change

There are many use cases where you have a column containing an array and you want to transform every element of that array. The current workaround is to do something like unnest and then aggregate, which is bad for both ergonomics and performance. With this work we add a function array_transform that takes a scalar function and applies it to every element in an array.

This PR is narrowly scoped as a first proof of concept. It does not address aggregation as #15882 requests, and it is limited to cases where all other arguments passed to the inner function are scalar values.

What changes are included in this PR?

Adds array_transform and unit tests.

Are these changes tested?

A unit test is provided that demonstrates both low-level testing of the invocation and a full test demonstrating it in operation with a dataframe.

Here is an example taken from the test that is included in the PR:

let udf = array_transform_udf(datafusion_functions::math::abs(), 0);
let df = df.select([col("a"), udf.call(vec![col("a")]).alias("abs(a[])")])?;

Will produce this dataframe, which shows the original and transformed data:

+-------------+-----------+
| a           | abs(a[])  |
+-------------+-----------+
| [1, -2, 3]  | [1, 2, 3] |
| [-4, 5]     | [4, 5]    |
| [-6, 7, -8] | [6, 7, 8] |
+-------------+-----------+

Are there any user-facing changes?

No

Still to do before ready to merge

  • Add additional documentation describing how all the pieces of this work
  • Create a plan for lifting the restriction that all other arguments must be scalar values
  • Create a plan for addressing the aggregation case or open an issue for something like array_aggregate
  • Address how it can be used with SQL commands instead of only dataframe operations
  • Potentially move the integrated test to a different location - dataframe may not be the right place to test a function

@github-actions github-actions bot added the core Core DataFusion crate label Aug 22, 2025
@timsaucer timsaucer self-assigned this Aug 22, 2025
@timsaucer
Contributor Author

Note to self: To support arrays (rather than only scalar values) in argument positions other than the one we apply to, see if we can use a run-end encoded array. That would avoid any data duplication: we would only need to build one new run-ends array from the lengths of the lists in the array we are applying to.

Also consider having a .new() function that takes only the scalar function and assumes we apply to index 0, then add a .with_argument_index() to change it.
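A minimal sketch of what that builder-style API could look like. The type and method names here are illustrative only (InnerFunc stands in for DataFusion's ScalarUDF); nothing in this sketch is the actual PR implementation.

```rust
// Hypothetical sketch of the .new() / .with_argument_index() builder idea.
// `InnerFunc` is a stand-in for a real scalar UDF handle.
#[derive(Debug, Clone, PartialEq)]
struct InnerFunc(&'static str);

#[derive(Debug)]
struct ArrayTransformUdf {
    inner: InnerFunc,
    /// Index of the argument holding the list array to map over.
    argument_index: usize,
}

impl ArrayTransformUdf {
    /// Wrap a scalar function, applying it to argument 0 by default.
    fn new(inner: InnerFunc) -> Self {
        Self { inner, argument_index: 0 }
    }

    /// Override which argument position holds the list array.
    fn with_argument_index(mut self, index: usize) -> Self {
        self.argument_index = index;
        self
    }
}

fn main() {
    // Default: apply to argument 0.
    let abs_udf = ArrayTransformUdf::new(InnerFunc("abs"));
    assert_eq!(abs_udf.argument_index, 0);

    // Override: e.g. round(a, b[]) where the list is argument 1.
    let round_udf = ArrayTransformUdf::new(InnerFunc("round")).with_argument_index(1);
    assert_eq!(round_udf.argument_index, 1);
}
```

The consuming builder (taking and returning `self`) keeps call sites to a single chained expression, which matches the ergonomics suggested above.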

@timsaucer
Contributor Author

After experimenting a little more I can see two paths forward for supporting cases like this dataframe:

+--------------+--------------+
| a            | b            |
+--------------+--------------+
| 0.1111111111 | [1, 2, 3]    |
| 0.2222222222 |              |
|              | [4, 5, 6, 7] |
| 0.4444444444 | []           |
+--------------+--------------+

Suppose I wanted to do a round call where I am passing column a as the value to round and column b as the number of decimal places I want to round to. Ultimately I want this to give an output like

+--------------+--------------+--------------------+
| a            | b            | round(a, b[])      |
+--------------+--------------+--------------------+
| 0.1111111111 | [1, 2, 3]    | [0.1, 0.11, 0.111] |
| 0.2222222222 |              |                    |
|              | [4, 5, 6, 7] |                    |
| 0.4444444444 | []           | []                 |
+--------------+--------------+--------------------+

A difficulty here is that each entry of a needs to map to multiple entries of b. The best way to do this appears to be run-end encoding: we could keep the ArrayRef for column a and create a small primitive array of run ends [3, 4, 8, 9], which gives us an array with the same length as the values array of column b.
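A minimal sketch of how those run ends could be derived from the per-row list lengths of column b. The treatment of null and empty rows (each still occupying one run, so no row is lost) is my assumption, inferred from the [3, 4, 8, 9] example above; the function name is illustrative.

```rust
/// Compute run ends for a run-end encoded copy of column `a`, given the
/// per-row list lengths of column `b` (`None` = null row). Null and empty
/// rows still occupy one run each so the row count is preserved (an
/// assumption inferred from the [3, 4, 8, 9] example above).
fn run_ends_from_list_lengths(lengths: &[Option<usize>]) -> Vec<usize> {
    let mut ends = Vec::with_capacity(lengths.len());
    let mut total = 0usize;
    for len in lengths {
        total += match len {
            Some(n) if *n > 0 => *n,
            _ => 1, // null or empty list: one run for the row
        };
        ends.push(total);
    }
    ends
}

fn main() {
    // Lengths of b for the example dataframe: [1,2,3], null, [4,5,6,7], []
    let ends = run_ends_from_list_lengths(&[Some(3), None, Some(4), Some(0)]);
    assert_eq!(ends, vec![3, 4, 8, 9]);
}
```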

I have tested this locally, but I ran into the problem that the existing scalar functions do not handle run-end encoded arrays. All of these functions would need to be updated, as would any UDFs that users create.

An alternative way we could do this would be to create a new array for b and simply duplicate the data as many times as necessary. This feels like it could lead to excessive memory consumption as we are duplicating values just to feed them into a function and throw them away afterwards. Yet it has the advantage that it would immediately support all scalar functions we have with no additional work.
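For contrast, a sketch of the duplication alternative: materialize an expanded copy of column a by repeating each value once per element of its row's list in b. The decision that a null list row contributes zero copies is my assumption, not something the PR specifies; the function name is illustrative.

```rust
/// Expand scalar column `a` so it lines up element-for-element with the
/// values array of list column `b`, by duplicating each value of `a` once
/// per element of its row's list. Memory-hungry, but existing scalar
/// kernels work on the result unchanged. Names are illustrative.
fn duplicate_by_lengths(values: &[Option<f64>], lengths: &[Option<usize>]) -> Vec<Option<f64>> {
    values
        .iter()
        .zip(lengths)
        // A null list row (`None`) contributes no elements here (assumption).
        .flat_map(|(v, len)| std::iter::repeat(*v).take(len.unwrap_or(0)))
        .collect()
}

fn main() {
    let a = [Some(0.1111111111), Some(0.2222222222), None, Some(0.4444444444)];
    let lens = [Some(3), None, Some(4), Some(0)];
    let expanded = duplicate_by_lengths(&a, &lens);
    // 3 copies of row 0, none for the null b row, 4 copies of the null
    // value in row 2, none for the empty list: 7 elements total.
    assert_eq!(expanded.len(), 7);
    assert_eq!(expanded[0], Some(0.1111111111));
    assert_eq!(expanded[3], None);
}
```

This makes the trade-off concrete: the expanded buffer grows with the total number of list elements, whereas the run-end approach above keeps column a untouched.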

I'm a bit torn on this. I also have alternative reasons to wish to have additional REE support throughout DataFusion. So pushing the first approach would lead to long term benefit but have a much longer tail of implementation.

@timsaucer
Contributor Author

Related PRs for run end encoding:

#16446
#16715

@github-actions github-actions bot added the common Related to common crate label Aug 30, 2025
@timsaucer timsaucer force-pushed the feat/transform-array branch from c84216c to 492cf81 Compare August 30, 2025 19:59
@timsaucer
Contributor Author

I think the best way forward is to merge this with the limited support of processing a single ListArray and only literal/scalar other arguments, and then open a separate issue for how we want to expand to handle arrays in the other argument positions. My reasoning is that I believe what we have here gives some immediate value, especially for single-input functions. Then we can make a design decision about how to expand it to work with multi-argument functions that receive a ColumnarValue::Array input.

I suspect that the run end encoding is the way to go, but that likely means an epic about building out our functions to support REE. That is overall a good thing, IMO.

@timsaucer timsaucer marked this pull request as ready for review August 30, 2025 20:09
@timsaucer timsaucer marked this pull request as draft August 30, 2025 20:10
@milenkovicm
Contributor

Tim, would we be able to implement lambda functions on top of this PR, or is more work needed?

Something like:

SELECT array_sort(array('Hello', 'World'),
  (p1, p2) -> CASE WHEN p1 = p2 THEN 0
              WHEN reverse(p1) < reverse(p2) THEN -1
              ELSE 1 END);

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-lambda-functions

@timsaucer
Contributor Author

Tim, would we be able to implement lambda functions on top of this PR, or is more work needed?

Something like:

SELECT array_sort(array('Hello', 'World'),
  (p1, p2) -> CASE WHEN p1 = p2 THEN 0
              WHEN reverse(p1) < reverse(p2) THEN -1
              ELSE 1 END);

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-lambda-functions

I hadn't considered that, but it sounds like an excellent idea to support. I'm going to think some more about it. I definitely think this would be a strong supporting argument for doing this.


@vegarsti vegarsti left a comment


FWIW Spark's version of this is called transform, not array_transform. But looks like there's a convention of having an array_ prefix on (most!) array functions (https://datafusion.apache.org/user-guide/sql/scalar_functions.html), which seems reasonable!

Comment on lines +151 to +159
// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
Ok(downcast_value!(array, ListViewArray))
}

// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
Ok(downcast_value!(array, LargeListViewArray))
}


Doc comments need triple slashes (///)! ... Although I see that the ones around these also don't have doc comments.

Suggested change
// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
Ok(downcast_value!(array, ListViewArray))
}
// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
Ok(downcast_value!(array, LargeListViewArray))
}
/// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
Ok(downcast_value!(array, ListViewArray))
}
/// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
Ok(downcast_value!(array, LargeListViewArray))
}


#[user_doc(
doc_section(label = "Array Transform"),
description = "Transform every element of an array according to a scalar function. This work is under development and currently only supports passing a single ListArray as input to the inner function. Other inputs must be scalar values.",


Out of curiosity, what other scenarios are there? This comment seems to indicate it's common for such functions to support multiple lists

Contributor Author


There's an example in this comment where you would want to pass two different columns of data: one column holds primitive data and the other holds a list array.

Another use case would be where you have two list arrays and you want to do an element by element application.

@timsaucer
Contributor Author

FWIW Spark's version of this is called transform, not array_transform. But looks like there's a convention of having an array_ prefix on (most!) array functions (https://datafusion.apache.org/user-guide/sql/scalar_functions.html), which seems reasonable!

Thanks. Yes, I'm very familiar with transform in Spark. I was trying to follow the DataFusion convention rather than Spark's.

Successfully merging this pull request may close these issues.

Implement method to apply scalar or aggregate function to Array elements