Adds support for computing JSON from dictionary. JSON data can now be computed from its dictionary/index, with an option to skip storing the raw JSON entirely #18589
Conversation
partial review, still need to look at StructuredDataBuilder
final List<StructuredDataBuilder.Element> elements =
    fieldPathMap.keySet().stream()
                .map(path -> StructuredDataBuilder.Element.of(
                    path,
                    Objects.requireNonNull(getColumnHolder(path)).getColumn()
                           .makeColumnValueSelector(offset)
                           .getObject()
                ))
                .collect(Collectors.toList());
Making a new offset and column value selector for every row seems pretty unchill... While I don't think much should be calling this method since we implement makeColumnValueSelector, could you look further into whether or not something will call this? If something does, it would be good to know, so we can either consider changing it or confirm that it would be ok. If nothing is expected to call this, maybe we throw an exception.
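If nothing turns out to call this path, a minimal sketch of the fail-fast option mentioned above (the method name and message are illustrative assumptions, not the PR's actual code):

```java
// Hypothetical fail-fast variant: avoid building a fresh offset and value selector
// per row if no caller is expected to use this per-row object lookup.
@Override
public Object getRowValue(int rowNum)
{
  throw new UnsupportedOperationException(
      "Per-row object lookup is not supported; use makeColumnValueSelector(offset) instead"
  );
}
```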
    pair.lhs.fieldName,
    pair.lhs.fieldIndex
).getColumn();
return StructuredDataBuilder.Element.of(pair.rhs, column.lookupObject(rowNum));
This is not correct, I think. It looks like the lookupObject implementation takes a dictionaryId, but you're passing a row number in here.
To get the dictionaryId of the NestedFieldDictionaryEncodedColumn, you need to get the value at rowNum of NestedFieldDictionaryEncodedColumn.column, so you want to be doing column.lookupObject(column.getSingleRowValue(rowNum)), I think.
Did you ever figure out if this method gets called for real? It must not be called commonly, at least, or else this would have caused problems, I think.
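For reference, a sketch of the fix along the lines suggested above, applied to the quoted fragment (exact types are assumed):

```java
// Sketch: translate the row number into the field's dictionary id first,
// then look the value up in the dictionary, per the suggestion above.
return StructuredDataBuilder.Element.of(
    pair.rhs,
    column.lookupObject(column.getSingleRowValue(rowNum))
);
```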
Thanks for the catch! It seems like this wasn't covered by a test; added test coverage in NestedDataColumnSupplierTest.
final List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
    allFields.stream()
             .map(pair -> Pair.of(
                 pair.rhs,
                 ((DictionaryEncodedColumn) getColumnHolder(
                     pair.lhs.fieldName,
                     pair.lhs.fieldIndex
                 ).getColumn()).makeColumnValueSelector(readableAtomicOffset)
             ))
             .collect(Collectors.toList());
valueProvider = () -> {
  List<StructuredDataBuilder.Element> elements = fieldSelectors
      .stream()
      .map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
      .collect(Collectors.toList());
  return new StructuredDataBuilder(elements).build();
This doesn't have to be done in this PR, but my experience has been that Java streams are less performant than plain loops in hot code paths, so it might be worth measuring this in the future (I don't think we need to worry about it here).
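For illustration, the element-building stream in the quoted snippet could be rewritten as a plain loop roughly like this (a sketch only, reusing the names and imports from the snippet above):

```java
// Sketch: same element-building logic as the stream above, written as a plain loop.
valueProvider = () -> {
  final List<StructuredDataBuilder.Element> elements = new ArrayList<>(fieldSelectors.size());
  for (Pair<List<NestedPathPart>, ColumnValueSelector> field : fieldSelectors) {
    elements.add(StructuredDataBuilder.Element.of(field.lhs, field.rhs.getObject()));
  }
  return new StructuredDataBuilder(elements).build();
};
```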
It would be nice to get some additional test coverage on this. Could you consider adding tests for this mode to NestedDataScanQueryTest.java or CalciteNestedDataQueryTest.java? Ideally replicas of existing tests that rely on the raw column, so we can compare the results between the two modes.
Added test coverage for both classes.
Approving, but please do the SQL test change suggested in the comment before merging.
Also, not a blocker for merging this, but it would be very useful to take some measurements of both the segment size and the query speed difference (the compaction task running time difference is probably also interesting), so that cluster operators know what they are getting themselves into.
For query speed, there is SqlNestedDataBenchmark, to which a simple scan query on the nested column could be added. Though its schema isn't very complex, it would still be a useful starting point, so consider doing this in a follow-up PR.
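For example, the kind of query added there might be as simple as a raw scan of the nested column (the datasource and column names below are assumptions about the benchmark, not actual code):

```java
// Hypothetical addition to the benchmark's query list: a plain scan of the nested
// column, so the raw-stored and dictionary-reconstructed modes can be compared.
private static final String SCAN_NESTED_COLUMN = "SELECT nested FROM foo";
```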
| .add(new AutoTypeColumnSchema("string", null, NONE_OBJECT_STORAGE)) | ||
| .add(new AutoTypeColumnSchema("nest", null, NONE_OBJECT_STORAGE)) | ||
| .add(new AutoTypeColumnSchema("nester", null, NONE_OBJECT_STORAGE)) | ||
| .add(new AutoTypeColumnSchema("long", null, NONE_OBJECT_STORAGE)) | ||
| .add(new AutoTypeColumnSchema("string_sparse", null, NONE_OBJECT_STORAGE)) |
Hmm, I don't think we should replace the existing tests with the new mode. Could you add this as a separate datasource and just duplicate the tests (or parameterize on table name or something) so we have coverage for both modes?
Added as a nested test; it's a bit tricky to use a parameterized test since the component supplier has to be static.
List<Pair<List<NestedPathPart>, ColumnValueSelector>> fieldSelectors =
    getAllParsedNestedFields().stream()
                              .map(pair -> Pair.of(
                                  pair.rhs,
                                  ((DictionaryEncodedColumn) getColumnHolder(
                                      pair.lhs.fieldName,
                                      pair.lhs.fieldIndex
                                  ).getColumn()).makeColumnValueSelector(offset)
                              ))
                              .collect(Collectors.toList());
valueProvider = () -> {
  List<StructuredDataBuilder.Element> elements = fieldSelectors
      .stream()
      .map(c -> StructuredDataBuilder.Element.of(c.lhs, c.rhs.getObject()))
      .collect(Collectors.toList());
  return new StructuredDataBuilder(elements).build();
};
I've been thinking a bit about this, and I don't think we need to change anything in this PR, since this is sufficient for testing and experimentation. But I think this could probably be done a lot more efficiently: build the object on the fly while iterating over the set of fields only once per row produced, without extra transient object instantiations that need to be collected (the builder, the elements, etc.).
The field positions in fieldsSupplier are the same as the internal column names we get from getColumnHolder, so it would map pretty well to a plain for loop that looks up the List for each field position and uses it to add the field value to the object. It would be more or less inlining the buildObject method of the builder, I think, so less pleasant to look at but with a fair bit less overhead.
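A rough sketch of that single-pass idea (the path-setting helper and the exact return type are assumptions; it essentially inlines what StructuredDataBuilder.buildObject would do):

```java
// Sketch: build the object directly while iterating the field selectors once per row,
// without collecting intermediate Element objects or instantiating a builder.
valueProvider = () -> {
  final Map<String, Object> result = new LinkedHashMap<>();
  for (Pair<List<NestedPathPart>, ColumnValueSelector> field : fieldSelectors) {
    final Object value = field.rhs.getObject();
    if (value != null) {
      // setValueAtPath is a hypothetical helper standing in for the inlined buildObject logic:
      // walk the parsed path (List<NestedPathPart>) and create intermediate maps/lists as needed.
      setValueAtPath(result, field.lhs, value);
    }
  }
  // Wrapping assumes the provider should yield a StructuredData, as the builder does.
  return StructuredData.wrap(result);
};
```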
…omputed from its dictionary/index, with an option to skip storing the raw JSON entirely (apache#18589)
* derive-json
* default-read-raw
* object-encoding
* default
* buffer
* format
* lazy-supplier
* revert-column-config
* serializer
* supplier
* value-provider
* test
* javadoc
* get-row-value
* nested
* format
* test
* trigger ci / empty commit
* static
@cecemei
Description
This PR introduces support for computing JSON values directly from dictionary or index structures, allowing ingestion to skip persisting raw JSON data entirely. This reduces on-disk storage size.
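As a usage sketch, based on the test snippet quoted earlier in this conversation (the builder wiring and the NONE_OBJECT_STORAGE constant are taken from that snippet; the exact option surface is whatever the PR defines):

```java
// Sketch: declare an auto-typed nested column that skips storing the raw JSON,
// mirroring the test setup quoted earlier in this thread.
DimensionsSpec dimensions = DimensionsSpec.builder()
    .setDimensions(ImmutableList.of(
        new AutoTypeColumnSchema("nester", null, NONE_OBJECT_STORAGE)
    ))
    .build();
```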
Key changed/added classes in this PR
- StructuredDataBuilder
- StructuredDataBuilderTest
- NestedDataColumnSerializer
- CompressedNestedDataComplexColumn
- CompressedVariableSizedBlobColumnSerializer