71 changes: 71 additions & 0 deletions crates/iceberg/src/arrow/schema.rs
@@ -378,7 +378,24 @@ impl ArrowSchemaVisitor for ArrowSchemaConverter {
DataType::Int8 | DataType::Int16 | DataType::Int32 => {
Ok(Type::Primitive(PrimitiveType::Int))
}
// Cast unsigned types based on bit width (following Python implementation)
Contributor:

Suggested change:
- // Cast unsigned types based on bit width (following Python implementation)
+ // Cast unsigned types based on bit width to allow for no data loss

I'm not sure Python compatibility is a direct goal here?

Contributor Author:

Will incorporate this.

DataType::UInt8 | DataType::UInt16 | DataType::UInt32 => {
// Cast to next larger signed type to prevent overflow
let bit_width = p.primitive_width().unwrap_or(0) * 8; // Convert bytes to bits
Contributor:

This seems superfluous; can't you just match on the data types and map them directly?

DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),

Contributor Author:

Yes, I'll simplify this logic.
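For reference, a minimal sketch of what that direct mapping could look like as a standalone helper (hypothetical, not the committed code; it assumes the Type, PrimitiveType, Error, and ErrorKind imports already present in schema.rs):

// Hypothetical helper: widen unsigned Arrow integer types to Iceberg primitives
// by matching directly on the data type instead of computing a bit width.
fn widen_unsigned(dt: &DataType) -> Result<Type, Error> {
    match dt {
        DataType::UInt8 | DataType::UInt16 => Ok(Type::Primitive(PrimitiveType::Int)),
        DataType::UInt32 => Ok(Type::Primitive(PrimitiveType::Long)),
        DataType::UInt64 => Err(Error::new(
            ErrorKind::DataInvalid,
            "UInt64 is not supported",
        )),
        other => Err(Error::new(
            ErrorKind::DataInvalid,
            format!("Not an unsigned integer type: {other}"),
        )),
    }
}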

match bit_width {
8 | 16 => Ok(Type::Primitive(PrimitiveType::Int)), // uint8/16 → int32
32 => Ok(Type::Primitive(PrimitiveType::Long)), // uint32 → int64
_ => Ok(Type::Primitive(PrimitiveType::Int)), // fallback
}
}
DataType::Int64 => Ok(Type::Primitive(PrimitiveType::Long)),
DataType::UInt64 => {
// Block uint64 - no safe casting option
Err(Error::new(
ErrorKind::DataInvalid,
"UInt64 is not supported. Use Int64 for values ≤ 9,223,372,036,854,775,807 or Decimal(20,0) for full uint64 range.",
))
}
DataType::Float32 => Ok(Type::Primitive(PrimitiveType::Float)),
DataType::Float64 => Ok(Type::Primitive(PrimitiveType::Double)),
DataType::Decimal128(p, s) => Type::decimal(*p as u32, *s as u32).map_err(|e| {
@@ -1717,6 +1734,60 @@ mod tests {
}
}

#[test]
Contributor:

This probably isn't the right module, but it would be nice to have a test that actually exercises writing these types and then reading them back again?

Contributor Author:

I implemented an integration test for the unsigned-type roundtrip, but discovered that ParquetWriter also requires modification to handle the unsigned data conversion. The issue stems from a type mismatch between schema and data: schema conversion (arrow_schema_to_schema) transforms the metadata but leaves the actual data unchanged, so Arrow validation fails at write time because of the mismatch.
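For context, a minimal sketch of the data-side cast that would make a record batch line up with the converted (signed) schema; it assumes arrow-rs's cast kernel (arrow_cast::cast), and the column name "id" and the surrounding function are illustrative only:

use std::sync::Arc;

use arrow_array::{ArrayRef, RecordBatch, UInt32Array};
use arrow_cast::cast;
use arrow_schema::{ArrowError, DataType, Field, Schema};

// Widen an unsigned column to the signed type the converted schema expects,
// then rebuild the batch against that signed schema before writing.
fn widen_batch() -> Result<RecordBatch, ArrowError> {
    let unsigned: ArrayRef = Arc::new(UInt32Array::from(vec![1_u32, 2, 3]));
    let signed = cast(&unsigned, &DataType::Int64)?; // lossless: every u32 fits in i64
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    RecordBatch::try_new(schema, vec![signed])
}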

@CTTY (Contributor), Aug 22, 2025:

I think writing record batches that contain unsigned types is out of the scope of the original issue and can be tricky:

  • ParquetWriter uses AsyncArrowWriter under the hood.
  • AsyncArrowWriter uses an arrow schema converted from the Iceberg table schema.
  • When converting an Iceberg schema to an arrow schema, the arrow schema won't have any unsigned types (and I don't think it makes sense to add them unless there is a valid use case).
  • Because of the schema mismatch between the record batches and that arrow schema, the arrow writer will fail.

Contributor:

Thanks. From the original issue the scope seems ambiguous. This change makes it possible to create a schema from arrow with unsigned types, which might be helpful by itself, but I imagine the next thing the user would want to do is actually write the data?

It seems fine to check this in separately as long as there is a clean failure for the unsigned types (i.e. we don't silently lose data).
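A minimal sketch of what such a clean failure could look like, as a hypothetical pre-write guard over a batch's arrow schema (not part of this PR; the helper name and the String error are placeholders):

use arrow_schema::{DataType, Schema};

// Hypothetical guard: reject batches that still carry unsigned integer columns
// with a descriptive error instead of a generic schema-mismatch failure.
fn reject_unsigned_columns(schema: &Schema) -> Result<(), String> {
    for field in schema.fields() {
        if matches!(
            field.data_type(),
            DataType::UInt8 | DataType::UInt16 | DataType::UInt32 | DataType::UInt64
        ) {
            return Err(format!(
                "column '{}' has unsigned type {}; cast it to a signed type before writing",
                field.name(),
                field.data_type()
            ));
        }
    }
    Ok(())
}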

fn test_unsigned_type_casting() {
// Test UInt32 → Int64 casting
Contributor:

Is it possible to parameterize at least the non-error cases with expected input/output to avoid boilerplate?

Contributor Author:

Done.
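For reference, one hypothetical parameterized shape for the non-error cases (names and helpers as in the existing test; whether this matches what was committed is an assumption):

// Table-drive the lossless widenings: (arrow input type, expected Iceberg primitive).
let cases = [
    (DataType::UInt8, PrimitiveType::Int),
    (DataType::UInt16, PrimitiveType::Int),
    (DataType::UInt32, PrimitiveType::Long),
];
for (arrow_type, expected) in cases {
    let arrow_field = Field::new("test", arrow_type, false).with_metadata(HashMap::from([
        (PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string()),
    ]));
    let arrow_schema = ArrowSchema::new(vec![arrow_field]);
    let iceberg_schema = arrow_schema_to_schema(&arrow_schema).unwrap();
    let iceberg_field = iceberg_schema.as_struct().fields().first().unwrap();
    // assumes Type/PrimitiveType implement PartialEq, as used elsewhere in the crate
    assert_eq!(
        iceberg_field.field_type.as_ref(),
        &Type::Primitive(expected)
    );
}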

{
let arrow_field = Field::new("test", DataType::UInt32, false).with_metadata(
HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string())]),
);
let arrow_schema = ArrowSchema::new(vec![arrow_field]);

let iceberg_schema = arrow_schema_to_schema(&arrow_schema).unwrap();

// Verify UInt32 was cast to Long (int64)
let iceberg_field = iceberg_schema.as_struct().fields().first().unwrap();
assert!(matches!(
iceberg_field.field_type.as_ref(),
Type::Primitive(PrimitiveType::Long)
));
}

// Test UInt8/UInt16 → Int32 casting
{
let arrow_field = Field::new("test", DataType::UInt8, false).with_metadata(
Contributor:

nit: it doesn't look like this is testing UInt16?

Contributor Author:

I have added this scenario.

HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string())]),
);
let arrow_schema = ArrowSchema::new(vec![arrow_field]);

let iceberg_schema = arrow_schema_to_schema(&arrow_schema).unwrap();

// Verify UInt8 was cast to Int (int32)
let iceberg_field = iceberg_schema.as_struct().fields().first().unwrap();
assert!(matches!(
iceberg_field.field_type.as_ref(),
Type::Primitive(PrimitiveType::Int)
));
}

// Test UInt64 blocking
{
let arrow_field = Field::new("test", DataType::UInt64, false).with_metadata(
HashMap::from([(PARQUET_FIELD_ID_META_KEY.to_string(), "1".to_string())]),
);
let arrow_schema = ArrowSchema::new(vec![arrow_field]);

let result = arrow_schema_to_schema(&arrow_schema);
assert!(result.is_err());
assert!(
result
.unwrap_err()
.to_string()
.contains("UInt64 is not supported")
);
}
Contributor:

nit: I think the brackets here and L1755 are excessive.

}

#[test]
fn test_datum_conversion() {
{