
Conversation

Contributor

@cecemei cecemei commented Sep 17, 2025

Description

Add a size check to DictionaryBuildingSingleValuedRowBasedKeySerdeHelper in putToKeyBuffer. Without this change, the dictionary grows without bound and can cause an OOM on Historical nodes (a rough sketch of the guard appears after the list below). Additionally:

  • add NestedDataTypeStrategy for the JSON column type
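
A minimal sketch of the kind of guard this change introduces; the class, field, and method names below are illustrative only, not Druid's actual signatures:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified dictionary-building key serde helper with a size budget.
class SizeLimitedDictionaryHelper
{
  private final Map<String, Integer> reverseDictionary = new HashMap<>();
  private final List<String> dictionary = new ArrayList<>();
  private final long maxDictionarySizeBytes;
  private long currentDictionarySizeBytes = 0;

  SizeLimitedDictionaryHelper(long maxDictionarySizeBytes)
  {
    this.maxDictionarySizeBytes = maxDictionarySizeBytes;
  }

  // Returns false when adding the key would exceed the memory budget, so the caller
  // can spill or fail instead of letting the dictionary grow without bound.
  boolean putToKeyBuffer(String key)
  {
    Integer id = reverseDictionary.get(key);
    if (id == null) {
      long estimatedKeySize = estimateStringKeySize(key);
      if (currentDictionarySizeBytes + estimatedKeySize > maxDictionarySizeBytes) {
        return false;
      }
      id = dictionary.size();
      dictionary.add(key);
      reverseDictionary.put(key, id);
      currentDictionarySizeBytes += estimatedKeySize;
    }
    // ... write id into the key buffer ...
    return true;
  }

  private long estimateStringKeySize(String key)
  {
    // rough footprint: two bytes per char plus some per-entry object overhead
    return (long) key.length() * Character.BYTES + 48;
  }
}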

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever it would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cecemei cecemei requested a review from Copilot September 18, 2025 01:07
@cecemei cecemei marked this pull request as ready for review September 18, 2025 01:08
Contributor

Copilot AI left a comment


Pull Request Overview

This PR adds dictionary size estimation to prevent out-of-memory (OOM) issues in historical nodes by implementing size checks before adding entries to dictionaries in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper and its subclasses.

Key changes:

  • Adds abstract estimatedKeySize() method to track memory usage of dictionary entries (see the sketch after this list)
  • Implements size estimation for different data types (strings, arrays, structured data)
  • Adds size validation before dictionary insertion to prevent unbounded growth
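
Roughly, the estimatedKeySize() hook described above could be shaped like the following sketch; the class names and byte estimates are simplified stand-ins, not the actual Druid helpers:

// Simplified stand-ins for the per-type size-estimation hook.
abstract class DictionaryBuildingHelper<T>
{
  // Estimated on-heap footprint of one dictionary entry for this key type.
  abstract long estimatedKeySize(T key);
}

class StringKeyHelper extends DictionaryBuildingHelper<String>
{
  @Override
  long estimatedKeySize(String key)
  {
    return (long) key.length() * Character.BYTES + 48;    // chars plus object overhead
  }
}

class StringArrayKeyHelper extends DictionaryBuildingHelper<String[]>
{
  @Override
  long estimatedKeySize(String[] key)
  {
    long size = 16;                                        // array header
    for (String element : key) {
      size += element == null ? 0 : (long) element.length() * Character.BYTES + 48;
    }
    return size;
  }
}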


} else {
// fall back to string representation for other types, this might be under-estimating for map
size += (int) estimateStringKeySize(String.valueOf(o));
}

Copilot AI Sep 18, 2025


The method only processes StructuredData objects but ignores other types in the key array. This could lead to underestimating the total size when the key contains non-StructuredData objects.

Suggested change
}
}
} else if (obj != null) {
// For non-StructuredData objects, estimate size using their string representation
size += (int) estimateStringKeySize(String.valueOf(obj));

@cecemei cecemei changed the title Add dictionary size estimate in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper Add size check in DictionaryBuildingSingleValuedRowBasedKeySerdeHelper in putToKeyBuffer Sep 23, 2025

public static final class JsonTypeStrategy implements TypeStrategy
{
private final ObjectStrategy objectStrategy;
Member


nit: it feels a bit off to me to have to make an ObjectStrategy in order to implement the TypeStrategy. Ideally I would expect both the nested column ObjectStrategy and TypeStrategy to be implemented so that they just share some common methods instead of one relying on the other. ObjectStrategyComplexTypeStrategy was a sort of backwards-compatible fallback to make it possible to fill in a TypeStrategy from a pre-existing ObjectStrategy.

Not a big deal though

Contributor Author


moved all functions to TypeStrategy, looks better for encapsulation!
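
As a toy illustration of that encapsulation point, using deliberately simplified hypothetical interfaces rather than Druid's actual TypeStrategy / ObjectStrategy contracts, the serialization logic ends up living directly on the type strategy instead of being delegated to a wrapped object strategy:

import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical, trimmed-down interface; not the real Druid contract.
interface MiniTypeStrategy<T>
{
  int write(ByteBuffer buffer, T value);
  T read(ByteBuffer buffer);
}

// The read/write logic is implemented here directly, so anything else that needs the
// same behavior (e.g. an object strategy) can call these methods rather than the
// other way around.
final class JsonMiniTypeStrategy implements MiniTypeStrategy<String>
{
  @Override
  public int write(ByteBuffer buffer, String value)
  {
    byte[] bytes = value.getBytes(StandardCharsets.UTF_8);  // stand-in for real nested-data serialization
    buffer.putInt(bytes.length);
    buffer.put(bytes);
    return Integer.BYTES + bytes.length;
  }

  @Override
  public String read(ByteBuffer buffer)
  {
    byte[] bytes = new byte[buffer.getInt()];
    buffer.get(bytes);
    return new String(bytes, StandardCharsets.UTF_8);
  }
}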

Contributor

Copilot AI left a comment


Pull Request Overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.



Comment on lines 187 to 192
int hashCompare = Long.compare(hash.getAsLong(), o.hash.getAsLong());
if (hashCompare != 0) {
return hashCompare;
}

return Integer.compare(getSizeEstimate(), o.getSizeEstimate());

Copilot AI Oct 2, 2025


Using only (hash, sizeEstimate) as the final tie-breakers means two unequal values that collide on 64-bit hash and share the same serialized length will compare as 0 and thus be considered equal (see equals implementation), violating the general contract of equals/compareTo. To make equality logically correct, add a final fallback that compares the actual serialized byte arrays when hash and size are identical before returning 0.
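
A sketch of how the comparison tail could look with that fallback; getSerializedBytes() is a hypothetical accessor for the serialized form, not an existing StructuredData method:

int hashCompare = Long.compare(hash.getAsLong(), o.hash.getAsLong());
if (hashCompare != 0) {
  return hashCompare;
}
int sizeCompare = Integer.compare(getSizeEstimate(), o.getSizeEstimate());
if (sizeCompare != 0) {
  return sizeCompare;
}
// Final tie-breaker: compare the serialized bytes so that two distinct values which
// collide on hash and length still compare as unequal (keeping equals, which delegates
// to compareTo, logically correct).
byte[] thisBytes = getSerializedBytes();
byte[] thatBytes = o.getSerializedBytes();
int lengthCompare = Integer.compare(thisBytes.length, thatBytes.length);
if (lengthCompare != 0) {
  return lengthCompare;
}
for (int i = 0; i < thisBytes.length; i++) {
  int byteCompare = Byte.compare(thisBytes[i], thatBytes[i]);
  if (byteCompare != 0) {
    return byteCompare;
  }
}
return 0;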

Comment on lines 205 to 206
// guarantees that equals is consistent with compareTo
return compareTo(that) == 0;

Copilot AI Oct 2, 2025


equals delegates entirely to compareTo, which (due to the current hash/size tie-breaker) can declare different logical values equal on hash collision plus equal length, creating a possibility of incorrect equality semantics; equals should instead directly compare underlying values (e.g., by serialized bytes or structured value comparison) and only use compareTo for ordering. Recommend implementing equals via structural/byte-array equality and letting compareTo remain a total ordering.

Suggested change
// guarantees that equals is consistent with compareTo
return compareTo(that) == 0;
return java.util.Objects.equals(this.value, that.value);

return Arrays.deepEquals((Object[]) value, (Object[]) that.value);
}
return Objects.equals(value, that.value);
// guarantees that equals is consistent with compareTo

Copilot AI Oct 2, 2025


[nitpick] Routing equals through compareTo forces the full ordering path (including hash/size computations and potential serialization) for simple equality checks, adding overhead to hash-based collections; introduce a fast path (e.g., identity, null, primitive/String direct comparison, then serialized-bytes equality) before invoking ordering logic.

Suggested change
// guarantees that equals is consistent with compareTo
// Fast path for null
if (isNull() && that.isNull()) {
return true;
} else if (isNull() || that.isNull()) {
return false;
}
// Fast path for String
if (isString() && that.isString()) {
return asString().equals(that.asString());
}
// Fast path for Number
if (isNumber() && that.isNumber()) {
// Use Double.compare for equality of numbers
return Double.compare(asNumber().doubleValue(), that.asNumber().doubleValue()) == 0;
}
// For complex objects, compare hash and size estimate first
if (hash.getAsLong() != that.hash.getAsLong()) {
return false;
}
if (getSizeEstimate() != that.getSizeEstimate()) {
return false;
}
// Fallback to compareTo for final check (should be rare)

// hashCode that relies on the object equality. Translates the hashcode to an integer as well
public int equalityHash()
{
return Longs.hashCode(hash.getAsLong());

Copilot AI Oct 2, 2025


[nitpick] hashCode truncates the 64-bit XXHash to 32 bits, increasing collision probability in hash-based collections for large datasets; consider mixing both high and low 32-bit parts (e.g., (int)(h ^ (h >>> 32))) to better distribute entropy.

Suggested change
return Longs.hashCode(hash.getAsLong());
long h = hash.getAsLong();
return (int)(h ^ (h >>> 32));

Comment on lines 139 to 151
if (this.equals(o)) {
if (this == o) {
Member


I feel like we should leave this: since StructuredData is also used as a wrapper at ingest time for all 'auto' types, primitive types can short-circuit out of this comparison before anything is forced to be serialized, which otherwise seems like it would add a bit of overhead.

I guess you're trying to make equals/hashCode/compareTo all consistently "weird" in the same way with respect to the occasional hash collision, instead of isolating that weirdness (falling back to the hash) in just compareTo?

Fwiw, I was trying to keep the strangeness of using the hash instead of performing a full byte comparison isolated in compareTo (since ideally we would be performing a full comparison). Is there a reason related to this PR why these need to change?

Contributor Author


I thought this equals/compareTo was strange since it's not consistent. equals doesn't handle nested array equality: if there's an Object[] in a nested field, two values don't compare as equal, even though compareTo returns equal since they're still the same data. But I've only seen places using compareTo, so it might be fine.

Anyway, I reverted the compareTo/equals part of the change and removed the original hashCode method, since it doesn't seem to make sense to keep two hash methods. My fix is also not a perfect solution, since: 1. it sort of requires the object to be serialized even for the equality check now, 2. it still has the issue of hash collisions.
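
A small, self-contained illustration of the inconsistency described above: a plain Objects.equals on an Object[] nested field uses identity, while Arrays.deepEquals (like an ordering based on the serialized content) treats equal contents as equal:

import java.util.Arrays;
import java.util.Map;
import java.util.Objects;

public class NestedArrayEqualityDemo
{
  public static void main(String[] args)
  {
    Map<String, Object> a = Map.of("tags", new Object[]{"x", "y"});
    Map<String, Object> b = Map.of("tags", new Object[]{"x", "y"});

    // Object[] inherits equals from Object (identity), so these print false / true
    // even though the two maps hold the same data.
    System.out.println(Objects.equals(a.get("tags"), b.get("tags")));                          // false
    System.out.println(Arrays.deepEquals((Object[]) a.get("tags"), (Object[]) b.get("tags"))); // true
  }
}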

Member


Ah yea, it totally is strange, and I wasn't necessarily encouraging reverting it, just was curious whether it was directly related or not. I think it would be ok to consider making these all consistent with each other, but we can do that in a separate PR; it also doesn't really matter much either way, since compareTo is the main thing that gets used.

Comment on lines -110 to -114
@Test
public void testEqualsAndHashcode()
{
EqualsVerifier.forClass(StructuredData.class).withIgnoredFields("hashInitialized", "hashValue", "hash").usingGetClass().verify();
}
Member


why did we need to remove this?

Contributor Author


Something with EqualsVerifier: it doesn't allow hashCode to be delegated to an abstract method. I'm not quite sure, but it looks like it doesn't like the hash LongSupplier.

@cecemei cecemei merged commit 6afb7dc into apache:master Oct 8, 2025
109 of 110 checks passed
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025