Skip to content

[SPARK-52448][CONNECT] Add simplified Struct Expression.Literal #51561

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

heyihong
Copy link
Contributor

@heyihong heyihong commented Jul 18, 2025

What changes were proposed in this pull request?

This PR adds a new data_type_struct field to the protobuf definition for struct literals in Spark Connect, addressing the ambiguity issues with the existing struct_type field. The changes include:

  1. Protobuf Schema Update: Added a new data_type_struct field of type DataType.Struct to the Literal.Struct message in expressions.proto, while marking the existing struct_type field as deprecated.

  2. Enhanced Struct Conversion Logic: Updated LiteralValueProtoConverter.scala to:

    • Use the new data_type_struct field when available for more precise struct type definition
    • Maintain backward compatibility by still supporting the deprecated struct_type field
    • Add proper field metadata handling in struct conversions
    • Improve type inference for struct fields when data types can be inferred from literal values

Why are the changes needed?

The current Expression.Struct literal is somewhat overcomplicated since it duplicates most of the information its fields already have. This is bulky to send over the wire, and it can be ambiguous.

Does this PR introduce any user-facing change?

No. This PR maintains backward compatibility with existing struct literal implementations. Existing code using the deprecated struct_type field will continue to work without modification.

How was this patch tested?

build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"

Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.2.4

@heyihong heyihong force-pushed the SPARK-52448 branch 3 times, most recently from 5d8e764 to 9d68335 Compare July 18, 2025 20:23
repeated StructField fields = 3;

message StructField {
string name = 1;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for this improvement. So we will contain the field name in every element, the space complexity would be O(n) compared to O(1) in the deprecated way? So what's the perf gain from this change?

Copy link
Contributor Author

@heyihong heyihong Jul 20, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My understanding is that the space complexity in the deprecated approach is also O(n), since every element includes the field name in the struct_type field, correct?

Copy link
Contributor Author

@heyihong heyihong Jul 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On second thought, it is not necessary to include the field name/data type in every element, since not setting the data_type field should be sufficient for saving space.

@heyihong
Copy link
Contributor Author

@heyihong heyihong force-pushed the SPARK-52448 branch 6 times, most recently from 1db4711 to f38223d Compare July 21, 2025 17:29
@HyukjinKwon
Copy link
Member

Merged to master.

haoyangeng-db pushed a commit to haoyangeng-db/apache-spark that referenced this pull request Jul 22, 2025
### What changes were proposed in this pull request?

This PR adds a new `data_type_struct` field to the protobuf definition for struct literals in Spark Connect, addressing the ambiguity issues with the existing `struct_type` field. The changes include:

1. **Protobuf Schema Update**: Added a new `data_type_struct` field of type `DataType.Struct` to the `Literal.Struct` message in `expressions.proto`, while marking the existing `struct_type` field as deprecated.

2. **Enhanced Struct Conversion Logic**: Updated `LiteralValueProtoConverter.scala` to:
   - Use the new `data_type_struct` field when available for more precise struct type definition
   - Maintain backward compatibility by still supporting the deprecated `struct_type` field
   - Add proper field metadata handling in struct conversions
   - Improve type inference for struct fields when data types can be inferred from literal values

### Why are the changes needed?

The current Expression.Struct literal is somewhat overcomplicated since it duplicates most of the information its fields already have. This is bulky to send over the wire, and it can be ambiguous.

### Does this PR introduce _any_ user-facing change?

No. This PR maintains backward compatibility with existing struct literal implementations. Existing code using the deprecated `struct_type` field will continue to work without modification.

### How was this patch tested?

`build/sbt "connect/testOnly *LiteralExpressionProtoConverterSuite"`

### Was this patch authored or co-authored using generative AI tooling?

Generated-by: Cursor 1.2.4

Closes apache#51561 from heyihong/SPARK-52448.

Authored-by: Yihong He <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants