Add support for (shredded) variants to DuckLake by Mytherin · Pull Request #750 · duckdb/ducklake

Mytherin · 2026-02-05T20:09:32Z

This PR adds support for the variant type to DuckLake. This type is similar to JSON, with a few differences:

Variant is more strongly typed internally, and has support for many more types than JSON - including types like DATE and TIMESTAMP
Variant is stored in a binary encoded format, not as a string
Variant can be "shredded" into primitive types, allowing variants with consistent schemas to be stored and queried much more efficiently

Variants in Parquet are stored according to the Parquet specification - see duckdb/duckdb#19336 for the implementation in DuckDB. The main change required in DuckLake is the handling of statistics for VARIANT values.

Variant Statistics

The main goal behind variant statistics is for shredded variants to have the same performance as native primitive types. As such, we need to be able to store native stats for the sub-fields of a variant that are fully shredded, and do things like skipping files based on these statistics.

For every file we write, we extract the stats for all fully shredded sub-fields within a variant. Fully shredded means that, for every row, either (1) that field is present and stored as a primitive of that specific type, (2) that field is missing, or (3) that field is NULL.
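The "fully shredded" condition above can be illustrated with a small Python sketch. It is not DuckLake's implementation: the `MISSING` sentinel, the function name, and the use of Python types as a stand-in for shredded types are all assumptions made for illustration.

```python
from typing import Any, Iterable, Optional

# Hypothetical sentinel meaning "this field is not present in this row".
MISSING = object()


def fully_shredded_type(values: Iterable[Any]) -> Optional[type]:
    """Return the one primitive type shared by every present, non-NULL value.

    Returns None when the field is not fully shredded (mixed types or a
    nested value), or when no typed value was observed at all.
    """
    seen: Optional[type] = None
    for v in values:
        if v is MISSING or v is None:
            # Cases (2) and (3): a missing field or a NULL never breaks shredding.
            continue
        if isinstance(v, (dict, list)):
            # Nested values are not primitives.
            return None
        if seen is None:
            seen = type(v)
        elif type(v) is not seen:
            # Case (1) violated: two different primitive types in one file.
            return None
    return seen
```

With this sketch, `fully_shredded_type([1, MISSING, None, 7])` returns `int`, while a column that mixes integers and strings returns `None`.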

Variant stats are stored in a new table: ducklake_file_variant_stats.

CREATE TABLE ducklake_file_variant_stats(
    data_file_id BIGINT,
    table_id BIGINT,
    column_id BIGINT,
    variant_path VARCHAR,
    shredded_type VARCHAR,
    column_size_bytes BIGINT,
    value_count BIGINT,
    null_count BIGINT,
    min_value VARCHAR,
    max_value VARCHAR,
    contains_nan BOOLEAN,
    extra_stats VARCHAR);

For each fully shredded field in each file, this table stores the variant_path and shredded_type together with the corresponding stats. For example, here are the stats for a Parquet file from a shredded lineitem table stored in a single variant field:

┌──────────────┬──────────┬───────────┬───────────────────┬───────────────┬───────────────────┬─────────────┬────────────┬─────────────┬────────────────────────────┬──────────────┬─────────────┐
│ data_file_id │ table_id │ column_id │   variant_path    │ shredded_type │ column_size_bytes │ value_count │ null_count │  min_value  │         max_value          │ contains_nan │ extra_stats │
│    int64     │  int64   │   int64   │      varchar      │    varchar    │       int64       │    int64    │   int64    │   varchar   │          varchar           │   boolean    │   varchar   │
├──────────────┼──────────┼───────────┼───────────────────┼───────────────┼───────────────────┼─────────────┼────────────┼─────────────┼────────────────────────────┼──────────────┼─────────────┤
│            0 │        1 │         1 │ "l_tax"           │ decimal(15,2) │            806245 │     1597440 │          0 │ 0.00        │ 0.08                       │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_suppkey"       │ int64         │           3323282 │     1597440 │          0 │ 1           │ 10000                      │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_shipinstruct"  │ varchar       │            607016 │     1597440 │          0 │ COLLECT COD │ TAKE BACK RETURN           │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_shipdate"      │ date          │           2534434 │     1597440 │          0 │ 1992-01-02  │ 1998-12-01                 │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_returnflag"    │ varchar       │            375008 │     1597440 │          0 │ A           │ R                          │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_linenumber"    │ int64         │            387885 │     1597440 │          0 │ 1           │ 7                          │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_receiptdate"   │ date          │           2535152 │     1597440 │          0 │ 1992-01-04  │ 1998-12-31                 │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_orderkey"      │ int64         │           2489515 │     1597440 │          0 │ 122308      │ 1965540                    │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_linestatus"    │ varchar       │            250352 │     1597440 │          0 │ F           │ O                          │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_partkey"       │ int64         │           7498395 │     1597440 │          0 │ 1           │ 200000                     │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_discount"      │ decimal(15,2) │            806339 │     1597440 │          0 │ 0.00        │ 0.10                       │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_comment"       │ varchar       │          20316934 │     1597440 │          0 │  Tiresias   │ zzle? furiously iro        │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_commitdate"    │ date          │           2531440 │     1597440 │          0 │ 1992-01-31  │ 1998-10-31                 │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_shipmode"      │ varchar       │            606740 │     1597440 │          0 │ AIR         │ TRUCK                      │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_quantity"      │ decimal(15,2) │           1208096 │     1597440 │          0 │ 1.00        │ 50.00                      │ NULL         │ NULL        │
│            0 │        1 │         1 │ "l_extendedprice" │ decimal(15,2) │           8241564 │     1597440 │          0 │ 904.00      │ 104899.50                  │ NULL         │ NULL        │
└──────────────┴──────────┴───────────┴───────────────────┴───────────────┴───────────────────┴─────────────┴────────────┴─────────────┴────────────────────────────┴──────────────┴─────────────┘

Named fields are always quoted in variant_path. Quotes inside a field name are escaped by doubling them (""). There are two special names in the variant_path:

root - signifies the root variant element. If this is set, the variant itself is not nested but is a primitive type at the root (e.g. an integer)
element - signifies an array, where the stats pertain to the child elements of that array
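The quoting rules above can be sketched in Python as follows. The helper names and the `ELEMENT` sentinel are hypothetical, and joining steps with a dot is an assumption based on the `element."a"` example shown in this PR.

```python
from typing import List, Union

# Hypothetical marker for "descend into the elements of an array".
ELEMENT = object()


def quote_field(name: str) -> str:
    """Quote a named field, escaping embedded quotes by doubling them."""
    return '"' + name.replace('"', '""') + '"'


def variant_path(steps: List[Union[str, object]]) -> str:
    """Build a variant_path from field names and ELEMENT markers.

    An empty list of steps stands for a primitive at the root of the variant.
    """
    if not steps:
        return "root"
    parts = ["element" if s is ELEMENT else quote_field(s) for s in steps]
    # Assumption: path steps are separated by a dot.
    return ".".join(parts)
```

For example, `variant_path(["l_tax"])` gives `"l_tax"` (including the quotes), and `variant_path([ELEMENT, "a"])` gives `element."a"`.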

Here's some more example stats:

create table primitive_variant(v variant);
insert into primitive_variant select i from range(100) t(i);
┌──────────────┬──────────┬───────────┬───────────────────┬───────────────┬───────────────────┬─────────────┬────────────┬─────────────┬────────────────────────────┬──────────────┬─────────────┐
│ data_file_id │ table_id │ column_id │   variant_path    │ shredded_type │ column_size_bytes │ value_count │ null_count │  min_value  │         max_value          │ contains_nan │ extra_stats │
│    int64     │  int64   │   int64   │      varchar      │    varchar    │       int64       │    int64    │   int64    │   varchar   │          varchar           │   boolean    │   varchar   │
├──────────────┼──────────┼───────────┼───────────────────┼───────────────┼───────────────────┼─────────────┼────────────┼─────────────┼────────────────────────────┼──────────────┼─────────────┤
│            2 │        2 │         1 │ root              │ int64         │               452 │         100 │          0 │ 0           │ 99                         │ NULL         │ NULL        │
└──────────────┴──────────┴───────────┴───────────────────┴───────────────┴───────────────────┴─────────────┴────────────┴─────────────┴────────────────────────────┴──────────────┴─────────────┘

create table list_of_structs_variant(v variant);
insert into list_of_structs_variant select [{'a': i}, {'a': i + 100}] from range(100) t(i);
┌──────────────┬──────────┬───────────┬───────────────────┬───────────────┬───────────────────┬─────────────┬────────────┬─────────────┬────────────────────────────┬──────────────┬─────────────┐
│ data_file_id │ table_id │ column_id │   variant_path    │ shredded_type │ column_size_bytes │ value_count │ null_count │  min_value  │         max_value          │ contains_nan │ extra_stats │
│    int64     │  int64   │   int64   │      varchar      │    varchar    │       int64       │    int64    │   int64    │   varchar   │          varchar           │   boolean    │   varchar   │
├──────────────┼──────────┼───────────┼───────────────────┼───────────────┼───────────────────┼─────────────┼────────────┼─────────────┼────────────────────────────┼──────────────┼─────────────┤
│            3 │        3 │         1 │ element."a"       │ int64         │               972 │         200 │          0 │ 0           │ 199                        │ NULL         │ NULL        │
└──────────────┴──────────┴───────────┴───────────────────┴───────────────┴───────────────────┴─────────────┴────────────┴─────────────┴────────────────────────────┴──────────────┴─────────────┘

Global Statistics

The global stats for variants are stored in the extra_stats field of the ducklake_table_column_stats table. They contain the same stats as ducklake_file_variant_stats, but are stored in JSON format, similar to how Geometry is stored. The global stats cover only fields that are shredded in every single file. If we write inconsistent data for a field even once, that field drops out of the global stats, so a variant can quite easily end up with no global stats at all. Nevertheless, the global stats can be useful for variants that have at least a subset of consistent fields.
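The "shredded in every single file" rule can be sketched as a merge over per-file stats. This is an illustration only: the tuple layout, the integer-only min/max, and the requirement that the shredded type matches across files are assumptions, not the actual DuckLake merge logic.

```python
from typing import Dict, List, Tuple

# Hypothetical per-field stats: (shredded_type, min_value, max_value).
FieldStats = Tuple[str, int, int]


def merge_global(files: List[Dict[str, FieldStats]]) -> Dict[str, FieldStats]:
    """Merge per-file variant stats, keyed by variant_path, into global stats.

    A path survives only if every file has stats for it with the same type.
    """
    if not files:
        return {}
    merged = dict(files[0])
    for stats in files[1:]:
        for path in list(merged):
            other = stats.get(path)
            if other is None or other[0] != merged[path][0]:
                # Not shredded consistently in this file: drop the path for good.
                del merged[path]
                continue
            typ, lo, hi = merged[path]
            merged[path] = (typ, min(lo, other[1]), max(hi, other[2]))
    return merged
```

A single file that lacks a path removes it permanently, which mirrors how one inconsistent write can leave a variant without global stats.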

Data Inlining (Future Work)

Data inlining of variants is currently supported only for DuckDB. It is not yet supported for other databases (e.g. Postgres). In DuckDB, variants can just be stored as variants and the type works natively. For systems that don't support a variant type the main challenge is that variants do not round-trip to strings as type information is lost - as such we need to come up with a way of storing variants in these systems while maintaining type information.

There are two options:

1. Invent a new string representation that includes types, e.g.:

{"a": 42}

Can be stored as:

{"a": {"type": "int32", "value": 42}}

2. Store variants in the binary format that is also used in Parquet, encoded as e.g. base64:

memory D select variant_to_parquet_variant({'a': 42}::VARIANT) pq;
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                       pq                                        │
│                                 parquet_variant                                 │
├─────────────────────────────────────────────────────────────────────────────────┤
│ {'metadata': \x11\x01\x00\x01a, 'value': \x02\x01\x00\x00\x05\x14*\x00\x00\x00} │
└─────────────────────────────────────────────────────────────────────────────────┘

This option is straightforward but harder to explain and document. On the other hand, DuckLake is already relatively tied to Parquet, so tying the variant inline format to the Parquet binary format is not that far-fetched.
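To make the typed string representation more concrete, here is a minimal Python sketch of a round trip. Only the `{"type": ..., "value": ...}` wrapper and the `int32` tag come from the example above. The other type names, the set of supported types, and the rule for picking `int32` versus `int64` are assumptions.

```python
import json
from datetime import date
from typing import Any


def encode(value: Any) -> Any:
    """Wrap every primitive in {"type": ..., "value": ...}; recurse into containers."""
    if isinstance(value, dict):
        return {k: encode(v) for k, v in value.items()}
    if isinstance(value, list):
        return [encode(v) for v in value]
    if isinstance(value, bool):  # must come before int: bool is a subclass of int
        return {"type": "boolean", "value": value}
    if isinstance(value, int):
        fits32 = -2**31 <= value < 2**31
        return {"type": "int32" if fits32 else "int64", "value": value}
    if isinstance(value, date):
        return {"type": "date", "value": value.isoformat()}
    if isinstance(value, str):
        return {"type": "varchar", "value": value}
    raise TypeError(f"unsupported type in this sketch: {type(value).__name__}")


def decode(node: Any) -> Any:
    """Invert encode() on data parsed back from JSON."""
    if isinstance(node, list):
        return [decode(v) for v in node]
    # A wrapper has exactly the keys type/value and a *string* type tag. An
    # encoded object with fields named "type" and "value" has a dict there.
    if set(node) == {"type", "value"} and isinstance(node["type"], str):
        if node["type"] == "date":
            return date.fromisoformat(node["value"])
        return node["value"]
    return {k: decode(v) for k, v in node.items()}
```

Here `json.dumps(encode({"a": 42}))` produces the string shown above, and decoding restores a `date` that plain JSON would have flattened to a string.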

File Pruning (Future Work)

The per-file stats can be used to exclude files, e.g. for a query like this:

SELECT * FROM lineitem_variant WHERE lineitem.l_orderkey = 7;

We can use the following query to find the files that cannot possibly satisfy this predicate:

SELECT data_file_id FROM ducklake_file_variant_stats WHERE variant_path = '"l_orderkey"' AND shredded_type = 'int64' AND 7 NOT BETWEEN min_value::BIGINT AND max_value::BIGINT;

This is future work; the query above is just a proof of concept showing that pruning is possible given the stats we are writing now.
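The same pruning idea, sketched in Python from the reader's side. The function and its argument layout are hypothetical. The important detail is that a file without a stats row for the path must still be scanned, because the absence of stats proves nothing.

```python
from typing import Dict, Iterable, List, Tuple


def files_to_scan(
    all_files: Iterable[int],
    stats: Dict[int, Tuple[str, str]],
    value: int,
) -> List[int]:
    """Return the data_file_ids that may contain rows where the field equals value.

    stats maps data_file_id -> (min_value, max_value) for one variant_path with
    shredded_type int64, stored as strings as in ducklake_file_variant_stats.
    """
    keep = []
    for file_id in all_files:
        entry = stats.get(file_id)
        if entry is None:
            # Field not fully shredded in this file: it cannot be skipped.
            keep.append(file_id)
            continue
        lo, hi = int(entry[0]), int(entry[1])
        if lo <= value <= hi:
            keep.append(file_id)
    return keep


# File 0 uses the l_orderkey range from the table above; files 1 and 2 are made up.
example_stats = {0: ("122308", "1965540"), 1: ("1", "100000")}
remaining = files_to_scan([0, 1, 2], example_stats, 7)
```

In this example file 0 is skipped because 7 lies outside its range, file 1 is kept because 7 lies inside its range, and file 2 is kept because it has no stats.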

Parquet Variant Select Pushdown (Future Work)

While variants are already shredded to an optimized primitive representation, DuckDB does not yet support pushing down selections into these shredded fields directly for Parquet files. As such, selections on variants will be slow even if they are shredded, because we always reconstruct the full variant and then re-extract the field from it. This is something that will be fixed upstream in DuckDB in the near future.

Tishj and others added 30 commits January 19, 2026 12:56

WIP: adding support for VARIANT, stats is work in progress

b32a950

WIP: adding nested fields to the variant stats

e16896b

add missing files

d0e856a

building the variant stats fields, then populating them by parsing th…

4fb0380

…e provided stats, actually.. WIP: need to make sure the type is always present, even if not a root column

add the type to the node while creating them, this way we should also…

d8a6e6c

… have the types if the VARIANT is a child of a list for example

serialized the variant stats to JSON

cc0b454

implement deserialize for the variant stats

084350c

convert ducklake stats to BaseStatistics

8ab5a5e

update submodule

15e57f2

added merging of the stats

7f470e6

some cleanup and bug fixes

5f94b8c

add tests

c2a0dc3

fix stats issues

4a6006d

adapt to the change to RETURN_STATS not producing 'column_types' but …

74687f1

…only producing 'variant_layouts' instead

adjust to changes

b2b9c6b

Merge branch 'main' into variant_support

e59e4a2

adjust to changes

513cd84

Merge branch 'variant_support' of https://github.com/Tishj/ducklake i…

a510391

…nto variant_support

Compiling but broken

c7203a7

Move parse stats to extra stats

f4b312f

Add back support for parsing variant stats

ac20a4f

Move variant stats to custom file

91a95b2

Clean-up extra stats, move to separate file

1763d35

Add support for writing variant stats

73a4e4f

Variants mostly working

7418592

All variant tests working

aa2bb9a

Make variant inlining work

42ea210

Explicitly disable variant inlining for non-DuckDB catalogs for now, …

12d3048

…and skip several tests that don't work

WIP: (de)serialize variant stats to JSON

f23993e

Global variant stats working

0f9da35

Mytherin added 3 commits February 5, 2026 18:22

More test fixes

28715f8

Skip failing postgres tests

237220c

Remove blob stats

3312582

Mytherin added the Needs Documentation label Feb 5, 2026

duckdblabs-bot mentioned this pull request Feb 5, 2026

[ducklake/#750] - Add support for (shredded) variants to DuckLake needs documentation duckdb/ducklake-web#279

Open

Mytherin merged commit b0846a2 into duckdb:main Feb 6, 2026
40 of 41 checks passed

Mytherin merged 33 commits into duckdb:main from Mytherin:variant_support
