skalwaghe-56
Contributor

Implemented automatic schema merging for collectors that:

  • Unions fields from multiple collect() calls instead of requiring identical schemas
  • Fills missing fields with null values during execution
  • Maintains consistent field ordering (alphabetically sorted by field name)
  • Preserves auto-generated UUID fields across merged schemas
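
As a rough illustration of the second bullet, here is a minimal sketch of how a row that lacks some merged fields could be aligned to the merged schema. The function name `align_row`, the map-based row, and the use of `Option<i64>` to stand in for a nullable value are all assumptions made for this example; they are not taken from this PR's code.

```rust
use std::collections::BTreeMap;

/// Align one collected row to the merged field list.
/// Hypothetical helper: fields missing from the row become `None`,
/// which stands in for a null value here.
fn align_row(merged_fields: &[&str], row: &BTreeMap<&str, i64>) -> Vec<Option<i64>> {
    merged_fields
        .iter()
        .map(|name| row.get(name).copied())
        .collect()
}
```

For a merged schema `["id", "name", "score"]` and a row that only sets `id` and `score`, the aligned row has `None` in the `name` position.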

Closes #428.

Member

@georgeh0 georgeh0 left a comment

Hi @skalwaghe-56, sorry for the late reply! I missed it somehow.

Comment on lines +287 to +297
// Prioritize UUID fields by placing them at the beginning for efficiency
fields.sort_by(|a, b| {
    let a_is_uuid = matches!(a.value_type.typ, ValueType::Basic(BasicValueType::Uuid));
    let b_is_uuid = matches!(b.value_type.typ, ValueType::Basic(BasicValueType::Uuid));

    match (a_is_uuid, b_is_uuid) {
        (true, false) => std::cmp::Ordering::Less, // UUID fields first
        (false, true) => std::cmp::Ordering::Greater, // UUID fields first
        _ => a.name.cmp(&b.name), // Then alphabetical
    }
});

I don't quite like sorting fields. I think we want to preserve the original order if possible. We only need to add one restriction: fields that appear in both schemas must appear in a consistent order (otherwise raise an error). Then we can merge them without changing the order.

One possible merging approach is (pseudo code, only to show the gist):

let mut output_fields = vec![];
let mut next_field_id_1 = 0;
let mut next_field_id_2 = 0;
for (idx, field) in schema2.fields.iter().enumerate() {
  if let Some(idx1) = field index in schema1 {
    if idx1 < next_field_id_1 {
      api_bail!("order mismatch...");
    }
    output_fields.extend(schema1.fields[next_field_id_1..idx1]);
    output_fields.extend(schema2.fields[next_field_id_2..idx]);
    output_fields.push(merged field);
    next_field_id_1 = idx1 + 1;
    next_field_id_2 = idx + 1;
  } else if field is uuid {
    // For UUID, emit it immediately to make sure it still appears first
    ....
  }
}
// Append whatever remains in either schema after the last shared field
output_fields.extend(schema1.fields[next_field_id_1..]);
output_fields.extend(schema2.fields[next_field_id_2..]);
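
To make the gist of the pseudocode above concrete, here is a minimal runnable sketch that works on field names only. It is an illustration under assumptions, not the project's implementation: the function name `merge_field_order` is hypothetical, errors are plain `String`s instead of `api_bail!`, type merging for shared fields is omitted, and the UUID special case is left out because the pseudocode elides it.

```rust
/// Merge two ordered lists of field names while preserving each list's order.
/// Returns an error if the shared fields appear in a different relative order.
/// Hypothetical sketch: names only, no type merging, no UUID special case.
fn merge_field_order(schema1: &[&str], schema2: &[&str]) -> Result<Vec<String>, String> {
    let mut output: Vec<String> = Vec::new();
    let mut next1 = 0; // first field of schema1 not yet emitted
    let mut next2 = 0; // first field of schema2 not yet emitted

    for (idx2, name) in schema2.iter().enumerate() {
        // Only shared fields act as synchronization points.
        if let Some(idx1) = schema1.iter().position(|n| n == name) {
            if idx1 < next1 {
                return Err(format!("order mismatch for field `{}`", name));
            }
            // Emit the fields unique to each side that precede the shared field.
            output.extend(schema1[next1..idx1].iter().map(|s| s.to_string()));
            output.extend(schema2[next2..idx2].iter().map(|s| s.to_string()));
            output.push(name.to_string());
            next1 = idx1 + 1;
            next2 = idx2 + 1;
        }
    }

    // Emit whatever remains after the last shared field.
    output.extend(schema1[next1..].iter().map(|s| s.to_string()));
    output.extend(schema2[next2..].iter().map(|s| s.to_string()));
    Ok(output)
}
```

With this sketch, merging `["a", "b", "d"]` with `["b", "c", "d"]` gives `["a", "b", "c", "d"]`, while merging `["a", "b"]` with `["b", "a"]` returns an error because the shared fields are ordered differently. Between two shared fields, fields unique to the first schema are emitted before those unique to the second; that tie-break is a choice made here, not something the comment specifies.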

Development

Successfully merging this pull request may close these issues.

[FEATURE] collector automatically merge and align multiple collect() called with different schema

2 participants