kosiew
Contributor

@kosiew kosiew commented Aug 22, 2025

Which issue does this PR close?

Rationale for this change

Different data sources often evolve schemas independently of the query’s expected schema.
To make DataFusion more resilient, we need to coerce nested Struct columns between file and table schemas in a consistent and configurable way.

Previously, cast_column was internal and limited in scope. This PR exposes it publicly, generalizes its use to schema adaptation and pruning statistics, and allows struct → struct casting with options (e.g. safe vs unsafe casting).

The signature change of

-) -> Result<bool>
+) -> Result<()>

reflects that the function either validates successfully or returns a detailed error. The boolean was redundant — true always meant success. Now, Ok(()) communicates success in a more idiomatic Rust style while still carrying precise error context for failures.
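
As a rough illustration of what this means for callers, here is a minimal sketch. `validate_old`, `validate_new`, and `CompatError` are hypothetical stand-ins, not the real DataFusion signatures, and the string-based "types" are only placeholders.

```rust
// Hypothetical stand-ins for illustration only; these are NOT the
// DataFusion signatures, and types are modelled as plain strings.
#[derive(Debug, PartialEq)]
struct CompatError(String);

// Old shape: the bool carries no information, because the function
// either returns Ok(true) or an error. Ok(false) is never produced.
fn validate_old(source: &str, target: &str) -> Result<bool, CompatError> {
    if source == target {
        Ok(true)
    } else {
        Err(CompatError(format!("cannot cast {} to {}", source, target)))
    }
}

// New shape: success is Ok(()), failure still carries the detailed error.
fn validate_new(source: &str, target: &str) -> Result<(), CompatError> {
    if source == target {
        Ok(())
    } else {
        Err(CompatError(format!("cannot cast {} to {}", source, target)))
    }
}

fn main() {
    // Before: callers had to inspect a bool that was always true.
    if validate_old("Int32", "Int32").unwrap() {
        println!("old: compatible");
    }

    // After: callers only handle the error path.
    match validate_new("Int32", "Utf8") {
        Ok(()) => println!("new: compatible"),
        Err(e) => println!("new: {:?}", e),
    }
}
```

Downstream code that wrote `if validate(...)? { ... }` would, under this sketch, simply become `validate(...)?;`.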

What changes are included in this PR?

  • Exposed cast_column in datafusion_common for broader use.

  • Added CastOptions support to control strictness and null handling.

  • Extended cast_column to recursively walk nested Struct arrays, aligning them to target fields, inserting NULLs for missing fields, and dropping extras.

  • Integrated cast_column into SchemaAdapter and pruning statistics, enabling schema evolution handling across the stack.

  • Refactored validate_struct_compatibility to return Result<()> instead of Result<bool>.

  • Added comprehensive unit tests for:

    • Casting primitive and struct fields
    • Nullability mismatches
    • Nested structs, arrays, and maps
    • Reordered or missing fields
    • Error scenarios (incompatible types, unsafe casts)
  • Added new user documentation page: Schema Adapter and Column Casting in the library user guide.
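
The recursive alignment described in the list above (match fields by name, fill missing fields with NULL, drop extras) can be sketched with a toy model. This is not the DataFusion implementation: the real code works on Arrow arrays, while `Value`, `Shape`, and `align` below are hypothetical types invented for the illustration.

```rust
// Toy model only: DataFusion operates on Arrow arrays, not on these types.
#[derive(Debug, Clone, PartialEq)]
enum Value {
    Null,
    Int(i64),
    Struct(Vec<(String, Value)>),
}

// Hypothetical description of the target layout.
#[derive(Debug, Clone)]
enum Shape {
    Int,
    Struct(Vec<(String, Shape)>),
}

// Align `value` to `target`: fields are matched by name, emitted in target
// order, filled with Null when missing, and dropped when not in the target.
fn align(value: &Value, target: &Shape) -> Result<Value, String> {
    match (value, target) {
        (Value::Null, _) => Ok(Value::Null),
        (Value::Int(v), Shape::Int) => Ok(Value::Int(*v)),
        (Value::Struct(fields), Shape::Struct(target_fields)) => {
            let mut out = Vec::with_capacity(target_fields.len());
            for (name, shape) in target_fields {
                let aligned = match fields.iter().find(|(n, _)| n == name) {
                    // Recurse so nested structs are aligned as well.
                    Some((_, child)) => align(child, shape)?,
                    // Field missing in the source: insert a Null.
                    None => Value::Null,
                };
                out.push((name.clone(), aligned));
            }
            Ok(Value::Struct(out))
        }
        (other, shape) => Err(format!("cannot align {:?} to {:?}", other, shape)),
    }
}

fn main() {
    let source = Value::Struct(vec![
        ("extra".to_string(), Value::Int(9)),
        ("inner".to_string(), Value::Struct(vec![("x".to_string(), Value::Int(5))])),
        ("a".to_string(), Value::Int(1)),
    ]);
    let target = Shape::Struct(vec![
        ("a".to_string(), Shape::Int),
        (
            "inner".to_string(),
            Shape::Struct(vec![("x".to_string(), Shape::Int), ("y".to_string(), Shape::Int)]),
        ),
        ("c".to_string(), Shape::Int),
    ]);
    println!("{:?}", align(&source, &target));
}
```

In this sketch the reordered field `a` moves to the front, `extra` is dropped, and both `inner.y` and `c` come back as `Null`.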

Are these changes tested?

✅ Yes.

  • Extensive new unit tests cover casting behavior, validation rules, and error cases.
  • Integration tests verify schema mapping and pruning statistic casting.
  • Doc examples compile and demonstrate real-world usage.

Are there any user-facing changes?

  • New API: cast_column is now a public utility in datafusion_common.
  • Schema Adapter behavior: Can now adapt nested struct fields across evolving schemas.
  • Docs: New guide explains schema adaptation and performance trade-offs.
  • Minor API change: validate_struct_compatibility now returns Result<()> instead of Result<bool>. This simplifies usage and is more idiomatic but may require small adjustments in downstream code.

@kosiew kosiew changed the title Expose and generalize cast_column to enable struct → struct casting in more contexts DRAFT Expose and generalize cast_column to enable struct → struct casting in more contexts Aug 22, 2025
@github-actions github-actions bot added documentation Improvements or additions to documentation common Related to common crate datasource Changes to the datasource crate labels Aug 22, 2025
@kosiew kosiew changed the title DRAFT Expose and generalize cast_column to enable struct → struct casting in more contexts Expose and generalize cast_column to enable struct → struct casting in more contexts Aug 22, 2025
@kosiew kosiew marked this pull request as ready for review August 22, 2025 07:22
@adriangb adriangb self-requested a review August 22, 2025 17:02
@adriangb
Copy link
Contributor

Unfortunately I will be out for the weekend and can't review until next week, but I have marked myself as a reviewer and will look then. @kosiew, are you able to integrate this into PhysicalExprAdapter (https://github.com/apache/datafusion/tree/main/datafusion/physical-expr-adapter)? Since we plan on deprecating SchemaAdapter in favor of PhysicalExprAdapter, it's important that the latter can handle the same things.

@kosiew
Contributor Author

kosiew commented Aug 26, 2025

@adriangb,

integrate this into PhysicalExprAdapter (https://github.com/apache/datafusion/tree/main/datafusion/physical-expr-adapter)?

Yes, I can, after all the kinks in this PR are ironed out.

kosiew added 9 commits August 26, 2025 19:29
- Modified the input array in `test_cast_column_with_options` to include `i64::MAX` to test casting behavior with maximum integer value.
- Updated the cast options to reflect the correct behavior: the test now checks for an error when `safe` is false and the value cannot be cast.
- Added assertions to ensure that the second value in the resulting `Int32Array` is null instead of retaining an invalid state.
- Added a println statement in an existing test to output error messages for better debugging.
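
For readers unfamiliar with the `safe` flag, here is a small sketch of the behaviour this test exercises. It assumes Arrow's documented `CastOptions` semantics (`safe = true` turns a failed cast into NULL, `safe = false` returns an error). `cast_i64_to_i32` is a hypothetical helper on plain vectors, not the Arrow kernel.

```rust
use std::convert::TryFrom;

// Hypothetical helper that mimics the `safe` flag on plain slices.
// safe = true  -> values that do not fit become None (NULL)
// safe = false -> the first value that does not fit aborts with an error
fn cast_i64_to_i32(values: &[i64], safe: bool) -> Result<Vec<Option<i32>>, String> {
    values
        .iter()
        .map(|&v| match i32::try_from(v) {
            Ok(n) => Ok(Some(n)),
            Err(_) if safe => Ok(None),
            Err(_) => Err(format!("value {} does not fit in Int32", v)),
        })
        .collect()
}

fn main() {
    let input = [1, i64::MAX];
    // Safe cast: the overflowing second value is replaced by NULL.
    println!("{:?}", cast_i64_to_i32(&input, true));
    // Unsafe cast: the overflowing value surfaces as an error.
    println!("{:?}", cast_i64_to_i32(&input, false));
}
```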
…column function

- Added a validation step to ensure field compatibility between source and target structs before casting.
- Enhanced error handling in the casting of struct fields to provide more context in case of failure.
- Updated `build_statistics_record_batch` to handle cases where the input
  array is of type `Binary` and the statistics field is `Utf8`.
- Introduced a new test `test_build_statistics_invalid_utf8_input` to verify
  proper handling of invalid UTF-8 byte sequences during conversion.
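
The kind of check involved when `Binary` data is reinterpreted as `Utf8` can be sketched with the standard library alone. This is an assumption-laden illustration: `binary_to_utf8` is a hypothetical helper, and returning an error on invalid bytes is one possible policy. The commit message does not say which policy `build_statistics_record_batch` applies.

```rust
// Hypothetical helper: convert raw byte values to strings, reporting the
// first row whose bytes are not valid UTF-8. Whether the real code errors
// or produces NULL for such rows is not stated in the commit message.
fn binary_to_utf8(values: &[&[u8]]) -> Result<Vec<String>, String> {
    values
        .iter()
        .enumerate()
        .map(|(row, bytes)| {
            std::str::from_utf8(bytes)
                .map(|s| s.to_string())
                .map_err(|e| format!("row {}: invalid UTF-8: {}", row, e))
        })
        .collect()
}

fn main() {
    let valid: &[u8] = b"min";
    let invalid: &[u8] = &[0xff, 0xfe];
    println!("{:?}", binary_to_utf8(&[valid]));
    println!("{:?}", binary_to_utf8(&[valid, invalid]));
}
```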
Removed conditional compilation of the tests module to allow tests to run without requiring the `parquet` feature. This change simplifies the test setup and ensures that all tests are executable regardless of feature flags.
Removed a debug print statement from the test case that displayed an error message when casting a struct field failed. This cleanup improves the clarity of the test's output.
@kosiew kosiew changed the title Expose and generalize cast_column to enable struct → struct casting in more contexts DRAFT Expose and generalize cast_column to enable struct → struct casting in more contexts Aug 26, 2025
@kosiew kosiew marked this pull request as draft August 26, 2025 11:33
@adriangb
Contributor

@adriangb,

integrate this into PhysicalExprAdapter (https://github.com/apache/datafusion/tree/main/datafusion/physical-expr-adapter)?

Yes, I can, after all the kinks in this PR are ironed out.

Thank you, I will try to review this PR today.

@github-actions github-actions bot added the logical-expr Logical plan and expressions label Aug 26, 2025
Contributor

@adriangb adriangb left a comment

This is amazing work @kosiew! Thank you as always for adding tests, comments and documentation.

I know this is still in draft so I did not review in detail, but the size of the PR also makes it very hard to review and to keep track of what is a new function, an API change, etc. Is there any way we can split this up into smaller PRs? Isolating issue fixes, doc updates, and tests for existing functionality from new APIs or API changes (plus their tests) is super helpful in reviewing: most of the former are quick approvals, while the latter require more in-depth review. If split out, we can push the easy parts through and parallelize; otherwise it all gets held up together, conflicts arise, scope creeps, etc.

Comment on lines +211 to +213
// Ensure nullability is compatible. It is invalid to cast a nullable
// source field to a non-nullable target field as this may discard
// null values.
Contributor

I assume this means that you can't cast null values into a non-nullable schema. Is that right?

Contributor Author

Yes.
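
To make the rule in this thread concrete, here is a minimal sketch of such a nullability check. `FieldInfo` and `check_nullability` are hypothetical names for the illustration; the real validation works on Arrow `Field`s.

```rust
// Hypothetical, simplified field description (the real code uses Arrow `Field`).
#[derive(Debug, Clone, Copy)]
struct FieldInfo {
    name: &'static str,
    nullable: bool,
}

// Reject nullable -> non-nullable, since source NULLs would have nowhere to go.
// Every other combination is allowed by this check.
fn check_nullability(source: FieldInfo, target: FieldInfo) -> Result<(), String> {
    if source.nullable && !target.nullable {
        return Err(format!(
            "cannot cast nullable field '{}' to non-nullable field '{}'",
            source.name, target.name
        ));
    }
    Ok(())
}

fn main() {
    let nullable = FieldInfo { name: "a", nullable: true };
    let required = FieldInfo { name: "a", nullable: false };

    println!("{:?}", check_nullability(required, nullable)); // widening is fine
    println!("{:?}", check_nullability(nullable, required)); // rejected
}
```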

Comment on lines 213 to 228
ColumnarValue::Array(array) => match cast_type {
    // fix https://github.com/apache/datafusion/issues/17285
    DataType::Struct(_) => {
        let field = Field::new("", cast_type.clone(), true);
        Ok(ColumnarValue::Array(cast_column(
            array,
            &field,
            &cast_options,
        )?))
    }
    _ => Ok(ColumnarValue::Array(kernels::cast::cast_with_options(
        array,
        cast_type,
        &cast_options,
    )?)),
},
Contributor

Nice! Would it be possible to isolate this fix into its own PR?

Contributor Author

I agree.

Contributor Author

Removed this change.

kosiew added 6 commits August 27, 2025 12:55
…nd pruning_predicate modules

- Corrected documentation comment to remove unnecessary brackets around `cast_options` in nested_struct.rs
- Removed unused import for `validate_struct_compatibility` in pruning_predicate.rs to improve code cleanliness.
…idate_struct_compatibility when both source and target fields are Struct, ensuring casting errors surface early
@kosiew kosiew changed the title DRAFT Expose and generalize cast_column to enable struct → struct casting in more contexts Expose and generalize cast_column to enable struct → struct casting in more contexts Aug 27, 2025
@kosiew kosiew marked this pull request as ready for review August 27, 2025 07:52
@alamb
Contributor

alamb commented Sep 6, 2025

@kosiew -- is there any way to break this PR into smaller PRs? It is very challenging to review large PRs (as it requires a large amount of contiguous time).

I think @adriangb was also hinting at something similar in #17281 (comment)

Successfully merging this pull request may close these issues.

Expose and generalize cast_column to enable struct → struct casting in more contexts
3 participants