Add support for (shredded) variants to DuckLake #750

Merged
Mytherin merged 33 commits into duckdb:main from Mytherin:variant_support
Feb 6, 2026

Mytherin commented Feb 5, 2026

This PR adds support for the variant type to DuckLake. This type is similar to JSON, with a few differences:

  • Variant is more strongly typed internally, and has support for many more types than JSON - including types like DATE and TIMESTAMP
  • Variant is stored in a binary encoded format, not as a string
  • Variant can be "shredded" into primitive types, allowing variants with consistent schemas to be stored and queried much more efficiently

Variants in Parquet are stored according to the Parquet specification - see duckdb/duckdb#19336 for the implementation in DuckDB. The main change required in DuckLake is the handling of statistics for VARIANT values.

Variant Statistics

The main goal behind variant statistics is for shredded variants to have the same performance as native primitive types. As such, we need to be able to store native stats for the sub-fields of a variant that are fully shredded, and do things like skipping files based on these statistics.

For every file we write, we extract the stats for all fully shredded sub-fields within a variant. Fully shredded means that, for every row, either (1) that field is present and stored as a primitive of that specific type, (2) that field is missing, or (3) that field is NULL.
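
As a rough illustration of the "fully shredded" rule above, here is a minimal Python sketch. It is not the DuckLake implementation: the `MISSING` marker, the `type_name` mapping and the function names are hypothetical, and rows are modelled as plain dicts.

```python
from typing import Any, Dict, List, Optional

# Hypothetical marker for "field is absent from this row" (illustration only).
MISSING = object()


def type_name(value: Any) -> str:
    """Map a Python value to an illustrative type name (not DuckDB's real type system)."""
    if isinstance(value, bool):  # bool must be checked before int
        return "boolean"
    if isinstance(value, int):
        return "int64"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "varchar"
    return "other"  # nested or unsupported values


def fully_shredded_type(rows: List[Dict[str, Any]], field: str) -> Optional[str]:
    """Return the single primitive type of `field` if it is fully shredded, else None.

    Also returns None when no row has a non-NULL value, since no type was observed.
    """
    seen: Optional[str] = None
    for row in rows:
        value = row.get(field, MISSING)
        if value is MISSING or value is None:
            continue  # cases (2) and (3): missing or NULL never breaks shredding
        current = type_name(value)
        if current == "other":
            return None  # not a primitive
        if seen is None:
            seen = current
        elif seen != current:
            return None  # two different primitive types in the same file
    return seen


rows = [{"a": 1, "b": "x"}, {"a": None, "b": 2}, {"b": "y"}]
print(fully_shredded_type(rows, "a"))  # int64
print(fully_shredded_type(rows, "b"))  # None
```

Here `a` is fully shredded because every row either holds an integer, holds NULL, or omits the field, while `b` mixes strings and integers and therefore gets no stats.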

Variant stats are stored in a new table: ducklake_file_variant_stats.

CREATE TABLE ducklake_file_variant_stats(
    data_file_id BIGINT,
    table_id BIGINT,
    column_id BIGINT,
    variant_path VARCHAR,
    shredded_type VARCHAR,
    column_size_bytes BIGINT,
    value_count BIGINT,
    null_count BIGINT,
    min_value VARCHAR,
    max_value VARCHAR,
    contains_nan BOOLEAN,
    extra_stats VARCHAR);

For every fully shredded field in each file, this table stores the variant_path and shredded_type together with the corresponding stats. For example, here is what the stats look like for a Parquet file written from a lineitem table stored in a single shredded variant column:

┌──────────────┬──────────┬───────────┬───────────────────┬───────────────┬───────────────────┬─────────────┬────────────┬─────────────┬────────────────────────────┬──────────────┬─────────────┐
│ data_file_id │ table_id │ column_id │   variant_path    │ shredded_type │ column_size_bytes │ value_count │ null_count │  min_value  │         max_value          │ contains_nan │ extra_stats │
│    int64     │  int64   │   int64   │      varchar      │    varchar    │       int64       │    int64    │   int64    │   varchar   │          varchar           │   boolean    │   varchar   │
├──────────────┼──────────┼───────────┼───────────────────┼───────────────┼───────────────────┼─────────────┼────────────┼─────────────┼────────────────────────────┼──────────────┼─────────────┤
│            0 │        1 │         1 │ "l_tax"           │ decimal(15,2) │            806245 │     1597440 │          0 │ 0.00        │ 0.08                       │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_suppkey"       │ int64         │           3323282 │     1597440 │          0 │ 1           │ 10000                      │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_shipinstruct"  │ varchar       │            607016 │     1597440 │          0 │ COLLECT COD │ TAKE BACK RETURN           │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_shipdate"      │ date          │           2534434 │     1597440 │          0 │ 1992-01-02  │ 1998-12-01                 │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_returnflag"    │ varchar       │            375008 │     1597440 │          0 │ A           │ R                          │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_linenumber"    │ int64         │            387885 │     1597440 │          0 │ 1           │ 7                          │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_receiptdate"   │ date          │           2535152 │     1597440 │          0 │ 1992-01-04  │ 1998-12-31                 │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_orderkey"      │ int64         │           2489515 │     1597440 │          0 │ 122308      │ 1965540                    │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_linestatus"    │ varchar       │            250352 │     1597440 │          0 │ F           │ O                          │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_partkey"       │ int64         │           7498395 │     1597440 │          0 │ 1           │ 200000                     │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_discount"      │ decimal(15,2) │            806339 │     1597440 │          0 │ 0.00        │ 0.10                       │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_comment"       │ varchar       │          20316934 │     1597440 │          0 │  Tiresias   │ zzle? furiously iro        │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_commitdate"    │ date          │           2531440 │     1597440 │          0 │ 1992-01-31  │ 1998-10-31                 │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_shipmode"      │ varchar       │            606740 │     1597440 │          0 │ AIR         │ TRUCK                      │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_quantity"      │ decimal(15,2) │           1208096 │     1597440 │          0 │ 1.00        │ 50.00                      │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_extendedprice" │ decimal(15,2) │           8241564 │     1597440 │          0 │ 904.00      │ 104899.50                  │ NULL         │ NULL        │
└──────────────┴──────────┴───────────┴───────────────────┴───────────────┴───────────────────┴─────────────┴────────────┴─────────────┴────────────────────────────┴──────────────┴─────────────┘

Named fields are always quoted in variant_path. Quotes inside field names are escaped by doubling them (""). There are two special names in the variant_path:

  • root - signifies the root variant element. If this is set, the variant itself is not nested but is a primitive type at the root (e.g. an integer)
  • element - signifies an array, where the stats pertain to the child elements of that array
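
The quoting rules above can be sketched in a few lines of Python. This is an illustration only, not code from the PR: `quote_field` and `build_variant_path` are hypothetical helpers, `None` is used here to stand for an array step, and joining steps with a dot is inferred from the element."a" example below, so how deeper nesting is rendered is an assumption.

```python
from typing import List, Optional


def quote_field(name: str) -> str:
    """Quote a named field, escaping embedded double quotes by doubling them."""
    return '"' + name.replace('"', '""') + '"'


def build_variant_path(steps: List[Optional[str]]) -> str:
    """Render a path; None stands for an array step, an empty list for the root.

    Assumption: steps are joined with '.', as in the element."a" example.
    """
    if not steps:
        return "root"
    return ".".join("element" if step is None else quote_field(step) for step in steps)


print(build_variant_path([]))            # root
print(build_variant_path(["l_tax"]))     # "l_tax"
print(build_variant_path([None, "a"]))   # element."a"
print(build_variant_path(['we"ird']))    # "we""ird"
```

Because named fields are always quoted, a field that is literally called root or element renders as "root" or "element" and stays distinguishable from the two special names.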

Here are some more example stats:

create table primitive_variant(v variant);
insert into primitive_variant select i from range(100) t(i);
┌──────────────┬──────────┬───────────┬───────────────────┬───────────────┬───────────────────┬─────────────┬────────────┬─────────────┬────────────────────────────┬──────────────┬─────────────┐
│ data_file_id │ table_id │ column_id │   variant_path    │ shredded_type │ column_size_bytes │ value_count │ null_count │  min_value  │         max_value          │ contains_nan │ extra_stats │
│    int64     │  int64   │   int64   │      varchar      │    varchar    │       int64       │    int64    │   int64    │   varchar   │          varchar           │   boolean    │   varchar   │
├──────────────┼──────────┼───────────┼───────────────────┼───────────────┼───────────────────┼─────────────┼────────────┼─────────────┼────────────────────────────┼──────────────┼─────────────┤
│            2 │        2 │         1 │ root              │ int64         │               452 │         100 │          0 │ 0           │ 99                         │ NULL         │ NULL        │
└──────────────┴──────────┴───────────┴───────────────────┴───────────────┴───────────────────┴─────────────┴────────────┴─────────────┴────────────────────────────┴──────────────┴─────────────┘

create table list_of_structs_variant(v variant);
insert into list_of_structs_variant select [{'a': i}, {'a': i + 100}] from range(100) t(i);
┌──────────────┬──────────┬───────────┬───────────────────┬───────────────┬───────────────────┬─────────────┬────────────┬─────────────┬────────────────────────────┬──────────────┬─────────────┐
│ data_file_id │ table_id │ column_id │   variant_path    │ shredded_type │ column_size_bytes │ value_count │ null_count │  min_value  │         max_value          │ contains_nan │ extra_stats │
│    int64     │  int64   │   int64   │      varchar      │    varchar    │       int64       │    int64    │   int64    │   varchar   │          varchar           │   boolean    │   varchar   │
├──────────────┼──────────┼───────────┼───────────────────┼───────────────┼───────────────────┼─────────────┼────────────┼─────────────┼────────────────────────────┼──────────────┼─────────────┤
│            3 │        3 │         1 │ element."a"       │ int64         │               972 │         200 │          0 │ 0           │ 199                        │ NULL         │ NULL        │
└──────────────┴──────────┴───────────┴───────────────────┴───────────────┴───────────────────┴─────────────┴────────────┴─────────────┴────────────────────────────┴──────────────┴─────────────┘

Global Statistics

The global stats for variants are stored in the extra_stats field of the ducklake_table_column_stats table. They contain the same information as ducklake_file_variant_stats, but are stored in JSON format, similar to how this is done for Geometry. The global stats only cover fields that are shredded in every single file. This means that writing a single file with inconsistent data for a field is enough to lose the global stats for that field, and from that point on they are gone. Nevertheless, the global stats can be useful for variants that have at least a subset of consistent fields.
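
To make the "shredded in every single file" rule concrete, here is a small Python sketch of how per-file stats could be folded into global stats. It is a simplified model, not the actual DuckLake code: `FileStats` and `merge_global_stats` are hypothetical names, min/max are assumed to be already-typed integers, and dropping a field when its shredded type differs between files is an assumption.

```python
from typing import Dict, List, Tuple

# variant_path -> (shredded_type, min, max); hypothetical, simplified stats record.
FileStats = Dict[str, Tuple[str, int, int]]


def merge_global_stats(files: List[FileStats]) -> FileStats:
    """Keep only paths that are shredded in every file, widening min/max as we go."""
    if not files:
        return {}
    merged = dict(files[0])
    for stats in files[1:]:
        for path in list(merged):
            entry = stats.get(path)
            # Assumption: a differing shredded type also invalidates the global stats.
            if entry is None or entry[0] != merged[path][0]:
                del merged[path]  # once dropped, the path never comes back
                continue
            shredded_type, lo, hi = merged[path]
            merged[path] = (shredded_type, min(lo, entry[1]), max(hi, entry[2]))
    return merged


file_a = {'"a"': ("int64", 0, 10), '"b"': ("int64", 5, 6)}
file_b = {'"a"': ("int64", -3, 4)}  # "b" is not fully shredded in this file
print(merge_global_stats([file_a, file_b]))  # {'"a"': ('int64', -3, 10)}
```

The field "b" loses its global stats as soon as one file lacks shredded stats for it, while "a" keeps stats that cover both files.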

Data Inlining (Future Work)

Data inlining of variants is currently supported only for DuckDB. It is not yet supported for other databases (e.g. Postgres). In DuckDB, variants can just be stored as variants and the type works natively. For systems that don't support a variant type the main challenge is that variants do not round-trip to strings as type information is lost - as such we need to come up with a way of storing variants in these systems while maintaining type information.

There are two options:

  • Invent a new string representation that includes types, e.g.:
{"a": 42}

Can be stored as:

{"a": {"type": "int32", "value": 42}}
  • Store variants as the binary format that is also used in Parquet, encoded as e.g. base64, e.g.:
memory D select variant_to_parquet_variant({'a': 42}::VARIANT) pq;
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                       pq                                        │
│                                 parquet_variant                                 │
├─────────────────────────────────────────────────────────────────────────────────┤
│ {'metadata': \x11\x01\x00\x01a, 'value': \x02\x01\x00\x00\x05\x14*\x00\x00\x00} │
└─────────────────────────────────────────────────────────────────────────────────┘

The second option is straightforward but harder to explain and document. On the flip side, DuckLake is already relatively tied to Parquet, so tying the variant inline format to the Parquet binary format is not that far-fetched.
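
For comparison, the first option (a typed string representation) could look roughly like the Python sketch below. Nothing here is a decided format: the `encode`/`decode` helpers are hypothetical, typed leaves are modelled as `(type_name, value)` tuples, and the only layout taken from the example above is the `{"type": ..., "value": ...}` wrapper.

```python
import json
from typing import Any


def encode(node: Any) -> Any:
    """Turn (type_name, value) leaves into {"type": ..., "value": ...} wrappers."""
    if isinstance(node, tuple):
        type_name, value = node
        return {"type": type_name, "value": value}
    if isinstance(node, dict):
        return {key: encode(child) for key, child in node.items()}
    if isinstance(node, list):
        return [encode(child) for child in node]
    raise TypeError(f"unsupported node: {node!r}")


def decode(node: Any) -> Any:
    """Inverse of encode. A wrapper is recognised by its string-valued "type" key;
    an object that merely has fields named "type" and "value" holds dicts there."""
    if isinstance(node, list):
        return [decode(child) for child in node]
    if isinstance(node, dict):
        if set(node) == {"type", "value"} and isinstance(node["type"], str):
            return (node["type"], node["value"])
        return {key: decode(child) for key, child in node.items()}
    raise TypeError(f"unsupported node: {node!r}")


doc = {"a": ("int32", 42), "d": ("date", "2026-02-05")}
text = json.dumps(encode(doc))
print(text)  # {"a": {"type": "int32", "value": 42}, "d": {"type": "date", "value": "2026-02-05"}}
assert decode(json.loads(text)) == doc
```

The sketch shows why this option needs documenting: every consumer has to know the wrapper convention and the set of type names, whereas the binary format is already specified by Parquet.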

File Pruning (Future Work)

The per-file stats can be used to exclude files, e.g. for a query like this:

SELECT * FROM lineitem_variant WHERE lineitem.l_orderkey = 7;

We can use the following query to find the files that are guaranteed not to satisfy this predicate:

SELECT data_file_id FROM ducklake_file_variant_stats WHERE variant_path = '"l_orderkey"' AND shredded_type = 'int64' AND 7 NOT BETWEEN min_value::BIGINT AND max_value::BIGINT;

This is future work; the query above is just a proof of concept showing that pruning is possible given the stats we are writing now.
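
The same pruning logic, written out in Python to make the casting step explicit. This is only a sketch: `files_to_skip` and the `CASTS` table are hypothetical, the stats rows are made-up examples rather than real catalog contents, and only int64 is handled.

```python
from typing import Any, Callable, Dict, List

# Hypothetical cast table: min_value/max_value are stored as VARCHAR,
# so they must be cast according to shredded_type before comparing.
CASTS: Dict[str, Callable[[str], Any]] = {"int64": int}


def files_to_skip(stats_rows: List[Dict[str, Any]], variant_path: str,
                  shredded_type: str, needle: Any) -> List[int]:
    """Return the files whose [min, max] range excludes `needle` for the given path.

    Files without a stats row for this path are never returned: the field is
    not fully shredded there, so nothing can be concluded about them.
    """
    cast = CASTS[shredded_type]
    skip = []
    for row in stats_rows:
        if row["variant_path"] != variant_path or row["shredded_type"] != shredded_type:
            continue
        if row["min_value"] is None or row["max_value"] is None:
            continue  # no usable bounds
        if not (cast(row["min_value"]) <= needle <= cast(row["max_value"])):
            skip.append(row["data_file_id"])
    return skip


# Made-up stats rows for illustration.
stats = [
    {"data_file_id": 10, "variant_path": '"l_orderkey"', "shredded_type": "int64",
     "min_value": "1", "max_value": "100"},
    {"data_file_id": 11, "variant_path": '"l_orderkey"', "shredded_type": "int64",
     "min_value": "500", "max_value": "900"},
]
print(files_to_skip(stats, '"l_orderkey"', "int64", 7))  # [11]
```

Comparing the VARCHAR bounds without the cast would give wrong answers (as strings, "7" sorts after "500"), which is why the SQL query above casts min_value and max_value to BIGINT.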

Parquet Variant Select Pushdown (Future Work)

While variants are already shredded to an optimized primitive representation, DuckDB does not yet support pushing down selections into these shredded fields directly for Parquet files. As such, selections on variants will be slow even if they are shredded as we will always reconstruct the full variant and then re-extract the field from the variant. This is something that will be fixed upstream in DuckDB in the near future.

Tishj and others added 30 commits January 19, 2026 12:56
…e provided stats, actually.. WIP: need to make sure the type is always present, even if not a root column
… have the types if the VARIANT is a child of a list for example
Mytherin merged commit b0846a2 into duckdb:main on Feb 6, 2026
40 of 41 checks passed