Conversation

@cecemei (Contributor) commented Sep 30, 2025

Description

This PR introduces support for computing JSON values directly from dictionary or index structures, allowing ingestion to skip persisting raw JSON data entirely. This reduces on-disk storage size.
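
For a concrete sense of how the new mode is opted into, here is a minimal sketch mirroring the test changes quoted later in this review thread (the AutoTypeColumnSchema arguments and the NONE_OBJECT_STORAGE constant are taken from that test diff; the surrounding dimensionsBuilder is a hypothetical placeholder, not an API introduced by this PR):

```java
// Hedged sketch, not the definitive API: the third AutoTypeColumnSchema argument
// selects how the raw JSON object is stored. NONE_OBJECT_STORAGE (as used in this
// PR's tests) skips persisting the raw JSON, so values are rebuilt from the
// column's dictionary/index structures at read time.
dimensionsBuilder
    .add(new AutoTypeColumnSchema("nest", null, NONE_OBJECT_STORAGE))
    .add(new AutoTypeColumnSchema("nester", null, NONE_OBJECT_STORAGE));
```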


Key changed/added classes in this PR
  • StructuredDataBuilder
  • StructuredDataBuilderTest
  • NestedDataColumnSerializer
  • CompressedNestedDataComplexColumn
  • CompressedVariableSizedBlobColumnSerializer

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cecemei cecemei changed the title from "derive-json" to "Adds support for compute JSON from dictionary. JSON data can now be computed from its dictionary/ index, with an option to skip storing the raw JSON entirely" Oct 1, 2025
@cecemei cecemei marked this pull request as ready for review October 2, 2025 00:50

@clintropolis (Member) left a comment

partial review, still need to look at StructuredDataBuilder

Comment on lines 343 to 350
final List<StructuredDataBuilder.Element> elements =
fieldPathMap.keySet().stream()
.map(path -> StructuredDataBuilder.Element.of(
path,
(Objects.requireNonNull(getColumnHolder(path)).getColumn()).makeColumnValueSelector(offset)
.getObject()
))
.collect(Collectors.toList());

Member

Making a new offset and column value selector for every row seems pretty unchill... While I don't think much should be calling this method since we implement makeColumnValueSelector, could you look further into whether or not something will call this? If something does, it would be good to know, so we can either consider changing it or confirm it's OK. Or, if nothing is expected to call this, maybe we throw an exception.

@cecemei cecemei requested a review from clintropolis October 3, 2025 00:28
pair.lhs.fieldName,
pair.lhs.fieldIndex
).getColumn();
return StructuredDataBuilder.Element.of(pair.rhs, column.lookupObject(rowNum));

Member

This is not correct, I think. It looks like the lookupObject implementation takes a dictionaryId, but you're passing a row number in here.

To get the dictionaryId of the NestedFieldDictionaryEncodedColumn, you need to get the value at rowNum from NestedFieldDictionaryEncodedColumn.column, so you want to be doing column.lookupObject(column.getSingleRowValue(rowNum)), I think.

Did you ever figure out whether this method gets called for real? It must not be called often, at least, or else this would have caused problems, I think.
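
For reference, a minimal sketch of the correction suggested above (getSingleRowValue is assumed to return the int dictionary id stored at rowNum, as described in the comment):

```java
// Incorrect (as in the quoted snippet): rowNum is passed where a dictionary id is expected.
// return StructuredDataBuilder.Element.of(pair.rhs, column.lookupObject(rowNum));

// Suggested: resolve the dictionary id stored at rowNum first, then look up the object.
final int dictionaryId = column.getSingleRowValue(rowNum);
return StructuredDataBuilder.Element.of(pair.rhs, column.lookupObject(dictionaryId));
```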

Contributor Author

Thanks for the catch! It seems this wasn't covered by a test; added test coverage in NestedDataColumnSupplierTest.

Comment on lines +441 to +456
final List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
allFields.stream()
.map(pair -> Pair.of(
pair.rhs,
((DictionaryEncodedColumn) getColumnHolder(
pair.lhs.fieldName,
pair.lhs.fieldIndex
).getColumn()).makeColumnValueSelector(readableAtomicOffset)
))
.collect(Collectors.toList());
valueProvider = () -> {
List<StructuredDataBuilder.Element> elements = fieldSelectors
.stream()
.map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
.collect(Collectors.toList());
return new StructuredDataBuilder(elements).build();

Member

Doesn't have to be done in this PR, but my experience has been that Java streams are less performant than plain loops in hot code paths, so it might be worth measuring this in the future (I don't think we need to worry about it in this PR).
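
For illustration, a hedged sketch of what a loop-based version of the valueProvider above could look like (same types and calls as the quoted snippet; nothing this PR needs to change):

```java
// Same behavior as the stream-based version above, written as a plain loop to
// avoid the per-row stream pipeline and collector allocations.
valueProvider = () -> {
  final List<StructuredDataBuilder.Element> elements = new ArrayList<>(fieldSelectors.size());
  for (Pair<List<NestedPathPart>, ColumnValueSelector> field : fieldSelectors) {
    elements.add(StructuredDataBuilder.Element.of(field.lhs, field.rhs.getObject()));
  }
  return new StructuredDataBuilder(elements).build();
};
```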

@clintropolis (Member) commented Oct 7, 2025

It would be nice to get some additional test coverage on this. Could you consider adding tests in this mode to NestedDataScanQueryTest.java or CalciteNestedDataQueryTest.java? Ideally replicas of existing tests that rely on the raw column, so we can compare the results between the two modes.

@cecemei (Contributor Author) commented Oct 7, 2025

Added test coverage for both classes.

@cecemei cecemei requested a review from clintropolis October 8, 2025 00:57

@clintropolis (Member) left a comment

Approving, but please do the SQL test change suggested in the comment before merging.

Also, not a blocker for merging this, but it would be very useful to take some measurements of both the segment size and the query speed difference (the difference in compaction task running time is probably also interesting), so that cluster operators know what they are getting themselves into.

For query speed, there is SqlNestedDataBenchmark, which could add just a simple scan query on the nested column. Though it isn't very complex schema-wise, it would still be a useful starting point, I think, so consider doing this in a follow-up PR.

Comment on lines 162 to 166
.add(new AutoTypeColumnSchema("string", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("nest", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("nester", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("long", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("string_sparse", null, NONE_OBJECT_STORAGE))

Member

Hmm, I don't think we should replace the existing tests with the new mode. Could you add this as a separate datasource and just dupe the tests (or parameterize on table name or something) so we have coverage for both modes?

Contributor Author

Added as a nested test; it's a bit tricky to use a parameterized test since the component supplier has to be static.

Comment on lines +372 to +388
List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
getAllParsedNestedFields().stream()
.map(pair -> Pair.of(
pair.rhs,
((DictionaryEncodedColumn) getColumnHolder(
pair.lhs.fieldName,
pair.lhs.fieldIndex
).getColumn()).makeColumnValueSelector(offset)
))
.collect(Collectors.toList());
valueProvider = () -> {
List<StructuredDataBuilder.Element> elements = fieldSelectors
.stream()
.map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
.collect(Collectors.toList());
return new StructuredDataBuilder(elements).build();
};

Member

Been thinking a bit on this, and I don't think we need to change anything in this PR since this is sufficient for testing and experimentation, but I think this could probably be done a lot more efficiently: just build the object on the fly while iterating over the set of fields only once per row produced, without the extra transient object instantiations that need to be collected (the builder, elements, etc.).

The field positions in fieldsSupplier are the same as the internal column names we get from getColumnHolder, so it would map pretty well to a plain for loop that just looks up the List<NestedPathPart> for the field position and uses that to add the field value to the object. It would be more or less inlining the buildObject method of the builder, I think, so less pleasant to look at but also a fair bit less overhead, I would think.
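
A rough sketch of the shape described above, assuming a hypothetical putAtPath helper that walks a List<NestedPathPart> and places a value into nested maps/lists (roughly what inlining the builder's buildObject would amount to):

```java
// Illustrative only: build the row's object in one pass over the field selectors,
// without the intermediate Element list or a per-row StructuredDataBuilder.
valueProvider = () -> {
  final Map<String, Object> root = new LinkedHashMap<>();
  for (Pair<List<NestedPathPart>, ColumnValueSelector> field : fieldSelectors) {
    // putAtPath is a hypothetical helper: it follows field.lhs (the nested path)
    // and sets the selector's current value at that position in the structure.
    putAtPath(root, field.lhs, field.rhs.getObject());
  }
  return root; // the real implementation would wrap/return this as the column's object value
};
```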

@cecemei cecemei merged commit 259e62e into apache:master Oct 9, 2025
60 checks passed
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025
cecemei added a commit to cecemei/druid that referenced this pull request Oct 21, 2025
…omputed from its dictionary/ index, with an option to skip storing the raw JSON entirely (apache#18589)

* derive-json
* default-read-raw
* object-encoding
* default
* buffer
* format
* lazy-supplier
* revert-column-config
* serializer
* supplier
* value-provider
* test
* javadoc
* get-row-value
* nested
* format
* test
* trigger ci / empty commit
* static

@cecemei cecemei removed this from the 35.0.0 milestone Oct 21, 2025

@cryptoe (Contributor) commented Oct 28, 2025

@cecemei
Could you please update the description with the release notes? How to use this feature is missing from the release notes.
