Conversation

@cecemei (Contributor) commented Sep 30, 2025

Description

This PR introduces support for computing JSON values directly from dictionary or index structures, allowing ingestion to skip persisting raw JSON data entirely. This reduces on-disk storage size.
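
For a concrete sense of how the new mode is opted into, here is a minimal sketch mirroring the test changes quoted later in this review thread (the AutoTypeColumnSchema arguments and the NONE_OBJECT_STORAGE constant are taken from that test diff; the surrounding dimensionsBuilder is a hypothetical placeholder, not an API introduced by this PR):

```java
// Hedged sketch, not the definitive API: the third AutoTypeColumnSchema argument
// selects how the raw JSON object is stored. NONE_OBJECT_STORAGE (as used in this
// PR's tests) skips persisting the raw JSON, so values are rebuilt from the
// column's dictionary/index structures at read time.
dimensionsBuilder
    .add(new AutoTypeColumnSchema("nest", null, NONE_OBJECT_STORAGE))
    .add(new AutoTypeColumnSchema("nester", null, NONE_OBJECT_STORAGE));
```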


Key changed/added classes in this PR
  • StructuredDataBuilder
  • StructuredDataBuilderTest
  • NestedDataColumnSerializer
  • CompressedNestedDataComplexColumn
  • CompressedVariableSizedBlobColumnSerializer

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@cecemei cecemei changed the title from "derive-json" to "Adds support for compute JSON from dictionary. JSON data can now be computed from its dictionary/ index, with an option to skip storing the raw JSON entirely" Oct 1, 2025
@cecemei cecemei marked this pull request as ready for review October 2, 2025 00:50

@clintropolis (Member) left a comment

partial review, still need to look at StructuredDataBuilder

Comment on lines 343 to 350
final List<StructuredDataBuilder.Element> elements =
fieldPathMap.keySet().stream()
.map(path -> StructuredDataBuilder.Element.of(
path,
(Objects.requireNonNull(getColumnHolder(path)).getColumn()).makeColumnValueSelector(offset)
.getObject()
))
.collect(Collectors.toList());

Member

Making a new offset and column value selector for every row seems pretty unchill... While I don't think much should be calling this method since we implement makeColumnValueSelector, could you look further into whether or not something will call this? If something does, it would be good to know, so we can either consider changing it or confirm it's OK. Or, if nothing is expected to call this, maybe we throw an exception.

@cecemei cecemei requested a review from clintropolis October 3, 2025 00:28
pair.lhs.fieldName,
pair.lhs.fieldIndex
).getColumn();
return StructuredDataBuilder.Element.of(pair.rhs, column.lookupObject(rowNum));

Member

This is not correct, I think. It looks like the lookupObject implementation takes a dictionaryId, but you're passing a row number in here.

To get the dictionaryId of the NestedFieldDictionaryEncodedColumn, you need to get the value at rowNum from NestedFieldDictionaryEncodedColumn.column, so you want to be doing column.lookupObject(column.getSingleRowValue(rowNum)), I think.

Did you ever figure out whether this method gets called for real? It must not be called often, at least, or else this would have caused problems, I think.
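
For reference, a minimal sketch of the correction suggested above (getSingleRowValue is assumed to return the int dictionary id stored at rowNum, as described in the comment):

```java
// Incorrect (as in the quoted snippet): rowNum is passed where a dictionary id is expected.
// return StructuredDataBuilder.Element.of(pair.rhs, column.lookupObject(rowNum));

// Suggested: resolve the dictionary id stored at rowNum first, then look up the object.
final int dictionaryId = column.getSingleRowValue(rowNum);
return StructuredDataBuilder.Element.of(pair.rhs, column.lookupObject(dictionaryId));
```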

Contributor Author

Thanks for the catch! It seems this wasn't covered by a test; added test coverage in NestedDataColumnSupplierTest.

Comment on lines +441 to +456
final List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
allFields.stream()
.map(pair -> Pair.of(
pair.rhs,
((DictionaryEncodedColumn) getColumnHolder(
pair.lhs.fieldName,
pair.lhs.fieldIndex
).getColumn()).makeColumnValueSelector(readableAtomicOffset)
))
.collect(Collectors.toList());
valueProvider = () -> {
List<StructuredDataBuilder.Element> elements = fieldSelectors
.stream()
.map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
.collect(Collectors.toList());
return new StructuredDataBuilder(elements).build();

Member

Doesn't have to be done in this PR, but my experience has been that Java streams are less performant than plain loops in hot code paths, so it might be worth measuring this in the future (I don't think we need to worry about it in this PR).
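
For illustration, a hedged sketch of what a loop-based version of the valueProvider above could look like (same types and calls as the quoted snippet; nothing this PR needs to change):

```java
// Same behavior as the stream-based version above, written as a plain loop to
// avoid the per-row stream pipeline and collector allocations.
valueProvider = () -> {
  final List<StructuredDataBuilder.Element> elements = new ArrayList<>(fieldSelectors.size());
  for (Pair<List<NestedPathPart>, ColumnValueSelector> field : fieldSelectors) {
    elements.add(StructuredDataBuilder.Element.of(field.lhs, field.rhs.getObject()));
  }
  return new StructuredDataBuilder(elements).build();
};
```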

@clintropolis (Member) commented Oct 7, 2025

It would be nice to get some additional test coverage on this. Could you consider adding tests in this mode to NestedDataScanQueryTest.java or CalciteNestedDataQueryTest.java? Ideally replicas of existing tests that rely on the raw column, so we can compare the results between the two modes.

@cecemei (Contributor Author) commented Oct 7, 2025

Added test coverage for both classes.

@cecemei cecemei requested a review from clintropolis October 8, 2025 00:57

@clintropolis (Member) left a comment

Approving, but please do the SQL test change suggested in the comment before merging.

Also, not a blocker for merging this, but it would be very useful to take some measurements of both the segment size and the query speed difference (the difference in compaction task running time is probably also interesting), so that cluster operators know what they are getting themselves into.

For query speed, there is SqlNestedDataBenchmark, which could add just a simple scan query on the nested column. Though it isn't very complex schema-wise, it would still be a useful starting point, I think, so consider doing this in a follow-up PR.

Comment on lines 162 to 166
.add(new AutoTypeColumnSchema("string", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("nest", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("nester", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("long", null, NONE_OBJECT_STORAGE))
.add(new AutoTypeColumnSchema("string_sparse", null, NONE_OBJECT_STORAGE))

Member

Hmm, I don't think we should replace the existing tests with the new mode. Could you add this as a separate datasource and just dupe the tests (or parameterize on table name or something) so we have coverage for both modes?

Contributor Author

Added as a nested test; it's a bit tricky to use a parameterized test since the component supplier has to be static.

Comment on lines +372 to +388
List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
getAllParsedNestedFields().stream()
.map(pair -> Pair.of(
pair.rhs,
((DictionaryEncodedColumn) getColumnHolder(
pair.lhs.fieldName,
pair.lhs.fieldIndex
).getColumn()).makeColumnValueSelector(offset)
))
.collect(Collectors.toList());
valueProvider = () -> {
List<StructuredDataBuilder.Element> elements = fieldSelectors
.stream()
.map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
.collect(Collectors.toList());
return new StructuredDataBuilder(elements).build();
};

Member

Been thinking a bit on this, and I don't think we need to change anything in this PR since this is sufficient for testing and experimentation, but I think this could probably be done a lot more efficiently: just build the object on the fly while iterating over the set of fields only once per row produced, without the extra transient object instantiations that need to be collected (the builder, elements, etc.).

The field positions in fieldsSupplier are the same as the internal column names we get from getColumnHolder, so it would map pretty well to a plain for loop that just looks up the List<NestedPathPart> for the field position and uses that to add the field value to the object. It would be more or less inlining the buildObject method of the builder, I think, so less pleasant to look at but also a fair bit less overhead, I would think.
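
A rough sketch of the shape described above, assuming a hypothetical putAtPath helper that walks a List<NestedPathPart> and places a value into nested maps/lists (roughly what inlining the builder's buildObject would amount to):

```java
// Illustrative only: build the row's object in one pass over the field selectors,
// without the intermediate Element list or a per-row StructuredDataBuilder.
valueProvider = () -> {
  final Map<String, Object> root = new LinkedHashMap<>();
  for (Pair<List<NestedPathPart>, ColumnValueSelector> field : fieldSelectors) {
    // putAtPath is a hypothetical helper: it follows field.lhs (the nested path)
    // and sets the selector's current value at that position in the structure.
    putAtPath(root, field.lhs, field.rhs.getObject());
  }
  return root; // the real implementation would wrap/return this as the column's object value
};
```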

@cecemei cecemei merged commit 259e62e into apache:master Oct 9, 2025
60 checks passed
@cecemei cecemei added this to the 35.0.0 milestone Oct 21, 2025
cecemei added a commit to cecemei/druid that referenced this pull request Oct 21, 2025
…omputed from its dictionary/ index, with an option to skip storing the raw JSON entirely (apache#18589)

* derive-json
* default-read-raw
* object-encoding
* default
* buffer
* format
* lazy-supplier
* revert-column-config
* serializer
* supplier
* value-provider
* test
* javadoc
* get-row-value
* nested
* format
* test
* trigger ci / empty commit
* static

@cecemei cecemei removed this from the 35.0.0 milestone Oct 21, 2025

@cryptoe (Contributor) commented Oct 28, 2025

@cecemei
Could you please update the description with the release notes? How to use this feature is missing from the release notes.
