-
Notifications
You must be signed in to change notification settings - Fork 1.6k
Expose and generalize cast_column to enable struct → struct casting in more contexts #17281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
…nd schema adapter
I unfortunately will be out for the weekend and can't review until next week but marked myself for review and will look then. @kosiew are you able to integrate this into |
Yes, I can, after all the kinks in this PR is ironed out. |
…ting struct compatibility
- Modified the input array in `test_cast_column_with_options` to include `i64::MAX` to test casting behavior with maximum integer value. - Updated the options for safe casting to reflect correct behavior: test now checks for an error when safe is true and the value cannot be cast. - Added assertions to ensure that the second value in the resulting `Int32Array` is null instead of retaining an invalid state. - Added a println statement in an existing test to output error messages for better debugging.
…column function - Added a validation step to ensure field compatibility between source and target structs before casting. - Enhanced error handling in the casting of struct fields to provide more context in case of failure.
- Updated `build_statistics_record_batch` to handle cases where the input array is of type `Binary` and the statistics field is `Utf8`. - Introduced a new test `test_build_statistics_invalid_utf8_input` to verify proper handling of invalid UTF-8 byte sequences during conversion.
Removed conditional compilation of the tests module to allow tests to run without requiring the `parquet` feature. This change simplifies the test setup and ensures that all tests are executable regardless of feature flags.
Removed a debug print statement from the test case that displayed an error message when casting a struct field failed. This cleanup improves the clarity of the test's output.
Thank you I will try to review this PR today |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is amazing work @kosiew! Thank you as always for adding tests, comments and documentation.
I know this is still in draft so I did not review in detail, but the size of the PR also makes it very hard to review and keep track of what are new functions, API changes, etc. Is there any way we can split this up into smaller PRs? Isolating issue fixes, doc updates, adding tests for existing functionality and having separate PRs for new APIs or API changes + tests is super helpful in reviewing: most the the former are quick approvals while the latter requires more in depth review; if split out we can push the easy parts through and parallelize, otherwise it all gets held up together, conflicts arise, scope creeps, etc.
// Ensure nullability is compatible. It is invalid to cast a nullable | ||
// source field to a non-nullable target field as this may discard | ||
// null values. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this means that you can't cast null values into a non-nullable schema. Is that right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
ColumnarValue::Array(array) => match cast_type { | ||
// fix https://github.com/apache/datafusion/issues/17285 | ||
DataType::Struct(_) => { | ||
let field = Field::new("", cast_type.clone(), true); | ||
Ok(ColumnarValue::Array(cast_column( | ||
array, | ||
&field, | ||
&cast_options, | ||
)?)) | ||
} | ||
_ => Ok(ColumnarValue::Array(kernels::cast::cast_with_options( | ||
array, | ||
cast_type, | ||
&cast_options, | ||
)?)), | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! Would it be possible to isolate this fix into it's own PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Removed this change.
…nd pruning_predicate modules - Corrected documentation comment to remove unnecessary brackets around `cast_options` in nested_struct.rs - Removed unused import for `validate_struct_compatibility` in pruning_predicate.rs to improve code cleanliness.
…idate_struct_compatibility when both source and target fields are Struct, ensuring casting errors surface early
@kosiew -- is there any way to break this PR into smaller PRs? It is very challenging to review large PRs (as it requires a large amount of contiguous time). I think @adriangb was also hinting at something similar in #17281 (comment) |
Which issue does this PR close?
cast_column
to enable struct → struct casting in more contexts #16579Rationale for this change
Different data sources often evolve schemas independently of the query’s expected schema.
To make DataFusion more resilient, we need to coerce nested
Struct
columns between file and table schemas in a consistent and configurable way.Previously,
cast_column
was internal and limited in scope. This PR exposes it publicly, generalizes its use to schema adaptation and pruning statistics, and allows struct → struct casting with options (e.g. safe vs unsafe casting).The signature change of
reflects that the function either validates successfully or returns a detailed error. The boolean was redundant —
true
always meant success. Now,Ok(())
communicates success in a more idiomatic Rust style while still carrying precise error context for failures.What changes are included in this PR?
Exposed
cast_column
indatafusion_common
for broader use.Added
CastOptions
support to control strictness and null handling.Extended
cast_column
to recursively walk nestedStruct
arrays, aligning them to target fields, insertingNULL
s for missing fields, and dropping extras.Integrated
cast_column
intoSchemaAdapter
and pruning statistics, enabling schema evolution handling across the stack.Refactored
validate_struct_compatibility
to returnResult<()>
instead ofResult<bool>
.Added comprehensive unit tests for:
Added new user documentation page: Schema Adapter and Column Casting in the library user guide.
Are these changes tested?
✅ Yes.
Are there any user-facing changes?
cast_column
is now a public utility indatafusion_common
.validate_struct_compatibility
now returnsResult<()>
instead ofResult<bool>
. This simplifies usage and is more idiomatic but may require small adjustments in downstream code.