Add array_transform function #17289
Conversation
Note to self: To support extending beyond just scalar values for anything other than the array to apply to, see if we can use a run-end encoded array. That would avoid the need for any kind of data duplication; we would only need to build one new array from the lengths of the lists of the array we are applying to. Also consider having a
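A rough sketch of that idea (not code from this PR; the helper name and concrete types are made up): one value per row is wrapped in a RunArray whose run ends come from the list lengths, so nothing is physically repeated.

use arrow::array::{Int32Array, Int64Array, RunArray};
use arrow::datatypes::Int64Type;
use arrow::error::ArrowError;

// Hypothetical helper: pair a per-row value with every element of that row's
// list via run-end encoding instead of physically duplicating it.
fn row_values_as_ree(
    list_lengths: &[i64],
    row_values: &Int32Array,
) -> Result<RunArray<Int64Type>, ArrowError> {
    // Run ends are the cumulative element counts: lengths [2, 3, 1] -> [2, 5, 6].
    let mut end = 0i64;
    let run_ends: Int64Array = list_lengths
        .iter()
        .map(|len| {
            end += len;
            end
        })
        .collect();
    RunArray::try_new(&run_ends, row_values)
}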
After experimenting a little more I can see two paths forward for supporting cases like this dataframe:
Suppose I wanted to do a
A difficulty here is that we need to map the entries of

I have tested this locally, but I run into the problem that the existing scalar functions do not handle run-end encoded arrays. All of these functions would need to be implemented, as would any UDFs that customers create.

An alternative way we could do this would be to create a new array for

I'm a bit torn on this. I also have other reasons for wanting additional REE support throughout DataFusion, so pushing the first approach would bring long-term benefit but with a much longer tail of implementation.
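For contrast, the alternative path (again only a sketch with made-up names) is to physically expand the per-row value to one entry per list element, so existing scalar kernels can consume it directly at the cost of duplicating data.

use arrow::array::Int32Array;

// Hypothetical helper: repeat each row-level value once per element of that
// row's list, producing a flat array aligned with the list's child values.
fn repeat_per_element(row_values: &Int32Array, list_lengths: &[usize]) -> Int32Array {
    row_values
        .iter()
        .zip(list_lengths)
        .flat_map(|(value, &len)| std::iter::repeat(value).take(len))
        .collect()
}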
Force-pushed from c84216c to 492cf81 (compare)
I think the best way forward is to merge this in with the limited support for only processing a single ListArray and only literal/scalar other values. Then we open a separate issue for how we want to expand to handle arrays in the other argument positions.

My reasoning is that I believe what we have here gives some immediate value, especially for cases where we have single-input functions. Then we can make a design decision about how to expand it to work with multi-argument functions that get a

I suspect that run-end encoding is the way to go, but that likely means an epic about building out our functions to support REE. That is overall a good thing, IMO.
Tim, would we be able to implement lambda functions on top of this PR, or is more work needed? Something like:

SELECT array_sort(array('Hello', 'World'),
       (p1, p2) -> CASE WHEN p1 = p2 THEN 0
                        WHEN reverse(p1) < reverse(p2) THEN -1
                        ELSE 1 END);

https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-lambda-functions
I hadn't considered that, but it sounds like an excellent idea to support. I'm going to think some more about it. I definitely think this would be a strong supporting argument for doing this.
FWIW Spark's version of this is called transform, not array_transform. But it looks like there's a convention of having an array_ prefix on (most!) array functions (https://datafusion.apache.org/user-guide/sql/scalar_functions.html), which seems reasonable!
// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
    Ok(downcast_value!(array, ListViewArray))
}

// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
    Ok(downcast_value!(array, LargeListViewArray))
}
Docstrings need triple slashes (///)! ... Although I see that the ones around this also don't have docstrings.
Suggested change:

/// Downcast Array to ListViewArray
pub fn as_list_view_array(array: &dyn Array) -> Result<&ListViewArray> {
    Ok(downcast_value!(array, ListViewArray))
}

/// Downcast Array to LargeListViewArray
pub fn as_large_list_view_array(array: &dyn Array) -> Result<&LargeListViewArray> {
    Ok(downcast_value!(array, LargeListViewArray))
}
#[user_doc(
    doc_section(label = "Array Transform"),
    description = "Transform every element of an array according to a scalar function. This work is under development and currently only supports passing a single ListArray as input to the inner function. Other inputs must be scalar values.",
Out of curiosity, what other scenarios are there? This comment seems to indicate it's common for such functions to support multiple lists
There's an example in this comment where you would want to pass two different columns of data: one column is an array of primitive data and the other is a list array.
Another use case would be where you have two list arrays and you want to do an element-by-element application.
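Purely as an illustration of that second case (hypothetical syntax and invented column and function names; this PR does not support multiple array arguments):

-- element-wise application across two list columns of equal per-row length
SELECT array_transform(prices, quantities, 'multiply') AS line_totals
FROM orders;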
Thanks. Yes, I'm very familiar with
Which issue does this PR close?
Rationale for this change
There are many use cases where you have a column of data that contains an array and you want to transform every element in that array. The current workaround is to do something like unnest and then aggregate, which is bad for both ergonomics and performance. With this work we add a function array_transform that takes a scalar function and applies it to every element in an array.

This PR is narrowly scoped as a first proof of concept. It does not address aggregation as #15882 requests, and it is limited to cases where all other arguments passed to the inner function are scalar values.
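As a rough illustration of the ergonomic difference (table, column, and inner-function names are invented, and the exact array_transform signature shown is assumed rather than quoted from this PR):

-- today: unnest, apply the scalar function, then re-aggregate
-- (note: array_agg gives no ordering guarantee without an ORDER BY)
SELECT id, array_agg(abs(v)) AS transformed
FROM (SELECT id, unnest(vals) AS v FROM t)
GROUP BY id;

-- with this PR: apply the scalar function directly to each element
SELECT id, array_transform(vals, 'abs') AS transformed FROM t;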
What changes are included in this PR?
Adds array_transform and unit tests.

Are these changes tested?
A unit test is provided that demonstrates both low-level testing of the invocation and a full test showing it in operation with a dataframe.
Here is an example taken from the test that is included in the PR:
This will produce the following dataframe, which shows the original data alongside the transformed data:
Are there any user-facing changes?
No
Still to do before ready to merge
array_aggregate