Add support for (shredded) variants to DuckLake #750
Merged
Mytherin merged 33 commits into duckdb:main from … on Feb 6, 2026
This PR adds support for the `variant` type to DuckLake. This type is similar to JSON, with a few differences: … `DATE` and `TIMESTAMP` …
Variants in Parquet are stored according to the Parquet specification - see duckdb/duckdb#19336 for the implementation in DuckDB. The main changes required in DuckLake are handling of statistics around `VARIANT` values.
Variant Statistics
The main goal behind variant statistics is for shredded variants to have the same performance as native primitive types. As such, we need to be able to store native stats for the sub-fields of a variant that are fully shredded, and do things like skipping files based on these statistics.
For every file we write, we extract the stats for all fully shredded sub-fields within a variant. Fully shredded means that, for every row, either (1) that field is present and stored as a primitive of that specific type, (2) that field is missing, or (3) that field is `NULL`.
Variant stats are stored in a new table: `ducklake_file_variant_stats`. For every fully shredded field for each file, this stores the `variant_path` and `shredded_type` together with the corresponding stats. For example, here's what a Parquet file from a shredded lineitem table stored in a single variant field looks like:
Named fields are always quoted in `variant_path`. Quotes in named fields are escaped with a double quote (`""`). There are two special names in the `variant_path`:
- `root`: signifies the root variant element. If this is set, the variant itself is not nested, but is a primitive type at the root (e.g. an integer, etc.)
- `element`: signifies an array, where the stats pertain to the child elements of that array
Here's some more example stats:
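The "fully shredded" definition and the `variant_path` quoting rule above can be modeled in a few lines of Python. This is only a simplified sketch over logical rows, not the DuckLake implementation: the function names are hypothetical, Python type names stand in for the real `shredded_type` values, and how multiple path segments are joined is deliberately left out because the description does not state it.

```python
from typing import Any, Optional

# The two special names from the PR description; they are never quoted.
SPECIAL_NAMES = {"root", "element"}

MISSING = object()  # sentinel: the field is absent from this row


def quote_field(name: str) -> str:
    """Quote a named field, doubling any embedded double quotes."""
    return '"' + name.replace('"', '""') + '"'


def unquote_field(segment: str) -> str:
    """Inverse of quote_field for a single quoted segment."""
    if len(segment) < 2 or not (segment.startswith('"') and segment.endswith('"')):
        raise ValueError(f"not a quoted field: {segment!r}")
    return segment[1:-1].replace('""', '"')


def fully_shredded_type(rows: list[dict[str, Any]], field: str) -> Optional[str]:
    """Return the single primitive type shared by every present, non-NULL
    value of `field`, or None if the field is not fully shredded.

    Assumption: if no row carries a typed value at all, this sketch
    reports None because there is no type to record.
    """
    seen: Optional[type] = None
    for row in rows:
        value = row.get(field, MISSING)
        if value is MISSING or value is None:  # cases (2) and (3)
            continue
        if isinstance(value, (dict, list)):  # nested value, not a primitive
            return None
        if seen is None:
            seen = type(value)
        elif type(value) is not seen:  # mixed primitive types
            return None
    return seen.__name__ if seen is not None else None
```

Because named fields are always quoted, a field that happens to be called `root` is written as `"root"` and cannot be confused with the special name `root`.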
Global Statistics
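As a rough illustration of the rule this section describes (global stats keep only fields that are shredded in every single file), the merge could look like the sketch below. The dictionary layout, the `merge_global_stats` name, and the requirement that `shredded_type` match across files are assumptions made for the example; the actual JSON layout of `extra_stats` is not shown in this description.

```python
import json


def merge_global_stats(per_file):
    """Combine per-file variant stats into table-level stats.

    per_file: one dict per data file, mapping variant_path ->
    {"shredded_type": str, "min": value, "max": value}.
    A path survives only if every file has it with the same type.
    Assumption: min/max are always set and mutually comparable.
    """
    if not per_file:
        return {}
    merged = {path: dict(stats) for path, stats in per_file[0].items()}
    for file_stats in per_file[1:]:
        for path in list(merged):
            other = file_stats.get(path)
            if other is None or other["shredded_type"] != merged[path]["shredded_type"]:
                # Not consistently shredded: drop it from the global stats.
                del merged[path]
                continue
            merged[path]["min"] = min(merged[path]["min"], other["min"])
            merged[path]["max"] = max(merged[path]["max"], other["max"])
    return merged


files = [
    {
        '"a"': {"shredded_type": "int32", "min": 1, "max": 5},
        '"b"': {"shredded_type": "varchar", "min": "x", "max": "z"},
    },
    {'"a"': {"shredded_type": "int32", "min": 3, "max": 9}},
]
# Serialized to JSON, as the global stats are kept in a JSON field.
print(json.dumps(merge_global_stats(files)))
```

In this example `"b"` is missing from the second file, so only `"a"` remains in the merged result.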
The global stats for variants are stored in the `extra_stats` field in the `ducklake_table_column_stats` table. The stats stored are the same as those stored in `ducklake_file_variant_stats`; however, they are stored in JSON format, similar to how Geometry stats are stored. The global stats cover only fields that are shredded in every single file. If we write any inconsistent data for a field, that field drops out, so it is quite easy to end up with no global variant stats at all from that point on. Nevertheless, the global stats can be useful for variants that have at least a subset of consistent fields.
Data Inlining (Future Work)
Data inlining of variants is currently supported only for DuckDB. It is not yet supported for other databases (e.g. Postgres). In DuckDB, variants can just be stored as variants and the type works natively. For systems that don't support a variant type the main challenge is that variants do not round-trip to strings as type information is lost - as such we need to come up with a way of storing variants in these systems while maintaining type information.
There are two options:
`{"a": 42}`
Can be stored as:
`{"a": {"type": "int32", "value": 42}}`
This is straightforward but harder to explain / document. On the flip side, DuckLake is already relatively tied to Parquet, so tying the variant inline format to the Parquet binary format is not that far-fetched.
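The typed-JSON wrapping shown in the example above can be sketched as a pair of encode/decode helpers, to show how type information that plain JSON would lose (for example a date) survives a round trip through a string. Only the `int32` tag comes from the description; every other tag name, and the helper names themselves, are hypothetical.

```python
import json
from datetime import date, datetime

INT32_MIN, INT32_MAX = -(2**31), 2**31 - 1


def encode_typed(value):
    """Wrap every primitive leaf as {"type": ..., "value": ...}."""
    if isinstance(value, dict):
        return {key: encode_typed(child) for key, child in value.items()}
    if isinstance(value, list):
        return [encode_typed(child) for child in value]
    if value is None:
        return {"type": "null", "value": None}
    if isinstance(value, bool):  # must precede int: bool is a subclass of int
        return {"type": "boolean", "value": value}
    if isinstance(value, int):
        tag = "int32" if INT32_MIN <= value <= INT32_MAX else "int64"
        return {"type": tag, "value": value}
    if isinstance(value, float):
        return {"type": "double", "value": value}
    if isinstance(value, datetime):  # must precede date: datetime subclasses date
        return {"type": "timestamp", "value": value.isoformat()}
    if isinstance(value, date):
        return {"type": "date", "value": value.isoformat()}
    if isinstance(value, str):
        return {"type": "varchar", "value": value}
    raise TypeError(f"unsupported value: {value!r}")


def decode_typed(node):
    """Inverse of encode_typed. A leaf wrapper is the only dict whose
    "type" entry is a string; real object fields are always wrapped."""
    if isinstance(node, list):
        return [decode_typed(child) for child in node]
    if isinstance(node.get("type"), str):
        tag, raw = node["type"], node["value"]
        if tag == "date":
            return date.fromisoformat(raw)
        if tag == "timestamp":
            return datetime.fromisoformat(raw)
        return raw
    return {key: decode_typed(child) for key, child in node.items()}


doc = {"a": 42, "shipped": date(2026, 2, 6)}
text = json.dumps(encode_typed(doc))  # what would be stored inline
restored = decode_typed(json.loads(text))
```

The cost of this design is verbosity: every leaf carries its own type tag, which is the trade-off against reusing a binary format.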
File Pruning (Future Work)
The per-file stats can be used to exclude files, e.g. for a query like this:
We can use the following query to find the list of files that definitely cannot satisfy this predicate:
This is future work; for now it serves as a proof of concept that file pruning is possible given the stats we are writing.
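The pruning idea can also be sketched outside SQL. The example below is a hypothetical illustration, assuming a predicate of the form `field > bound` and per-file rows that carry a path, a type, and min/max values. The field names on `VariantFileStats` are guesses for the sake of the example and are not confirmed columns of `ducklake_file_variant_stats`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class VariantFileStats:
    # Hypothetical shape of one per-file stats row.
    data_file_id: int
    variant_path: str
    shredded_type: str
    min_value: Optional[int]
    max_value: Optional[int]


def files_to_skip(stats, path, bound):
    """Return ids of files that provably hold no row with field > bound.

    A file is skipped only when it has a stats row for `path` (so the
    field is fully shredded there) and its maximum is not above `bound`.
    Files without a stats row, or without a max, are kept to stay safe.
    """
    skipped = []
    for row in stats:
        if row.variant_path != path:
            continue
        if row.max_value is not None and row.max_value <= bound:
            skipped.append(row.data_file_id)
    return skipped


stats = [
    VariantFileStats(1, '"l_quantity"', "int32", 1, 10),
    VariantFileStats(2, '"l_quantity"', "int32", 5, 50),
    VariantFileStats(3, '"l_tax"', "int32", 0, 8),
]
skippable = files_to_skip(stats, '"l_quantity"', 20)
```

Here only file 1 can be excluded: file 2 may contain matching rows, and file 3 has no stats for the path, so nothing can be concluded about it.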
Parquet Variant Select Pushdown (Future Work)
While variants are already shredded to an optimized primitive representation, DuckDB does not yet support pushing down selections into these shredded fields directly for Parquet files. As such, selections on variants will be slow even if they are shredded, because we always reconstruct the full variant and then re-extract the field from it. This is expected to be fixed upstream in DuckDB in the near future.