
Ordered compaction and inlining with full catalog integration #642

Merged
pdet merged 66 commits into duckdb:main from Alex-Monahan:ordered-compaction-catalog on Jan 27, 2026

Conversation

@Alex-Monahan (Contributor) commented Dec 22, 2025

Hi folks!

This is a new PR meant to solve the same use case as #593, but addressing the PR feedback! Thank you for the guidance - it was super helpful. I am still open to any changes you recommend!

PR Overview

The purpose of this PR is to sort data as it is written, in order to speed up selective read queries later.

This uses the pre-existing DuckDB SET SORTED BY syntax (from duckdb/duckdb#16714) to sort data when it is compacted or when inlined data is flushed. For example:

ALTER TABLE ducklake.my_table SET SORTED BY (sort_key_1 ASC, sort_key_2 DESC);

Then, when either ducklake_merge_adjacent_files or ducklake_flush_inlined_data is called, those operations sort the data prior to writing it out as Parquet.
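
For illustration, a minimal end-to-end sketch (the table and column names here are placeholders, and the catalog-name argument to the two maintenance functions is an assumption about their signatures):

-- Attach a DuckLake catalog and create a table (names are hypothetical)
ATTACH 'ducklake:metadata.ducklake' AS ducklake;
CREATE TABLE ducklake.my_table (sort_key_1 INTEGER, sort_key_2 VARCHAR);
INSERT INTO ducklake.my_table VALUES (2, 'b'), (1, 'a');
-- Declare the sort order used by future rewrites
ALTER TABLE ducklake.my_table SET SORTED BY (sort_key_1 ASC, sort_key_2 DESC);
-- Both maintenance operations now sort rows before writing them out as Parquet
CALL ducklake_flush_inlined_data('ducklake');
CALL ducklake_merge_adjacent_files('ducklake');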

New Tables in DuckLake Spec

This adds 2 new tables to the DuckLake spec: ducklake_sort_info and ducklake_sort_expression.

ducklake_sort_info keeps a version history of the sort settings for tables over time. It has one row for each time a new sort setting is applied to a table. (The prior PR used an option for this, but that has been removed based on the feedback.)

CREATE TABLE {METADATA_CATALOG}.ducklake_sort_info(
  sort_id BIGINT,
  table_id BIGINT,
  begin_snapshot BIGINT,
  end_snapshot BIGINT
);

ducklake_sort_expression tracks the details of that sort. Each time a new sort setting is applied, this table gets one row per expression in the ORDER BY. (If I order by column3 ASC, column42 DESC, column64 ASC, there will be 3 rows.)

CREATE TABLE {METADATA_CATALOG}.ducklake_sort_expression(
  sort_id BIGINT,
  table_id BIGINT,
  sort_key_index BIGINT,     -- The sequence the SORTED BY expressions are evaluated in
  expression VARCHAR,
  dialect VARCHAR,
  sort_direction VARCHAR,    -- ASC or DESC
  null_order VARCHAR         -- NULLS_LAST or NULLS_FIRST
);
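
To see how the two tables combine, here is a hypothetical inspection query (it assumes that end_snapshot IS NULL marks the currently active sort setting, following the usual DuckLake versioning convention, and uses table_id = 1 as a stand-in):

SELECT e.sort_key_index, e.expression, e.sort_direction, e.null_order
FROM {METADATA_CATALOG}.ducklake_sort_info i
JOIN {METADATA_CATALOG}.ducklake_sort_expression e USING (sort_id, table_id)
WHERE i.table_id = 1            -- stand-in table id
  AND i.end_snapshot IS NULL    -- currently active sort setting
ORDER BY e.sort_key_index;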

Future Work / Limitations

There are still a few limitations with this PR:

  1. This does not order during insert (only during compaction and inline flush).
    • I would love to try to do a follow up PR to add this!
    • A user can still specify their own ORDER BY during an insert, so there is a workaround for the moment (see the sketch after this list)
  2. Only explicit column names can be used in the sorting, not expressions.
    • I have plans to add this in a follow up PR
    • There is a friendly error message (and tests) to document this limitation.
    • The spec has an expression column, so the intention was to make the spec itself forward-compatible with expression-oriented sorting.
  3. Files are still selected for compaction based on insertion order. It could be better to sort the list of files by min/max metadata before selecting files for compaction.
    • Let me know if this is desirable and I can work on it after the "order during insert" in number 1!
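
As mentioned in item 1, a user-supplied ORDER BY is the interim workaround for sorting at insert time. A sketch (staging is a hypothetical source table):

-- Sort explicitly at insert time until ordered inserts are supported
INSERT INTO ducklake.my_table
SELECT * FROM staging
ORDER BY sort_key_1 ASC, sort_key_2 DESC;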

I believe that I made this fully compatible with the batching code, but I was only testing locally against a DuckDB catalog, not Postgres. Any extra eyes on that side would be great!

If this looks good, I can also do any docs PRs that you recommend - happy to help there.

Thanks folks! CC @philippmd as an FYI as well.

@Alex-Monahan (Contributor Author):

So, to fix the assertion issues on my fork's CI, I had to relax an assertion. Please let me know if I am off base, but I think the assertion was too strict.
In src/storage/ducklake_catalog.cpp lines 439-461, the table_entry_map can have views added to it. However, there was an assertion in DuckLakeCatalogSet::GetEntryById(TableIndex index) that required a table (and not a view). Allowing a view there appears to solve things.

> Could you point me to which test broke this requirement?

> From what I can tell, there are other parts of the code that require this to return a table, as we dereference it to a DuckLakeTableEntry, e.g.:

unique_ptr<DuckLakeStats> DuckLakeCatalog::ConstructStatsMap(vector<DuckLakeGlobalStatsInfo> &global_stats,
                                                             DuckLakeCatalogSet &schema) {
	auto lake_stats = make_uniq<DuckLakeStats>();
	for (auto &stats : global_stats) {
		// find the referenced table entry
		auto table_entry = schema.GetEntryById(stats.table_id);

Sure! The tests that broke are in this CI run on my fork. They were all running a query like

COMMENT ON VIEW ducklake.comment_view IS 'con1';
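
For context, a minimal reproduction sketch (the view definition is hypothetical; the COMMENT ON VIEW statement is from the failing tests):

-- Creating a view adds a view entry to table_entry_map;
-- the COMMENT then looks the entry up by id, hitting the assertion
CREATE VIEW ducklake.comment_view AS SELECT 1 AS i;
COMMENT ON VIEW ducklake.comment_view IS 'con1';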

I can create a view-specific GetEntryById function if that would be better!

@Alex-Monahan (Contributor Author) commented Jan 19, 2026

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
>
> Maybe this is something we can test with concurrentloops? (see test/sql/snapshot_info/ducklake_last_commit.test)

Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?

@pdet (Collaborator) commented Jan 19, 2026

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
>
> Maybe this is something we can test with concurrentloops? (see test/sql/snapshot_info/ducklake_last_commit.test)
>
> Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?

Could we achieve this with multiple connections then? Because that's also possible within sqltests

@Alex-Monahan (Contributor Author):

> As I am thinking more about this, I'm wondering if the updates I made to the catalog cache would be safe across processes. Is it safe to not update the schema_version? Could other processes use a stale sort order, or reset it back to being unsorted?
>
> Maybe this is something we can test with concurrentloops? (see test/sql/snapshot_info/ducklake_last_commit.test)
>
> Unfortunately, I believe that concurrentloop uses the same DuckDB instance with multiple threads, and the DuckLake catalog is only created once per ATTACH, so it is shared across all threads. I think the issue that might exist from no longer incrementing schema_version would be when two totally separate DuckLakeCatalog instances have different sort information in their cache (with the same schema_version). Is there a multi-process version of concurrentloop? Or maybe a C++ or Python test? Do you want me to remove the schema_version modifications and save them for a later PR?
>
> Could we achieve this with multiple connections then? Because that's also possible within sqltests

I am not sure! If you have a spot where I can find an example, I can give it a shot.

To understand the behavior, I made a Python script that kicks off 2 separate CLI processes. I found that the sort is ignored by the other process if the schema was already cached ahead of time. The good news is that the catalog DB itself continues to hold the right values, but the cache does not get invalidated correctly.

The flow is (sketched in SQL after this list):

  • Process 1: connects, creates the table and inserts into it
  • Process 2: connects and runs an ALTER TABLE ADD COLUMN, which caches the catalog
  • Process 1: ALTER TABLE SET SORTED BY
  • Process 1: Completes / exits
  • Process 2: Compacts (using the cached catalog)
  • Process 2: Pulls updated table (which does not show the right order, since the cached catalog was used)
  • Process 2: Completes / exits
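
In SQL terms, the flow above looks roughly like this (reconstructed from the output below; the statements are illustrative, not the exact script):

-- Process 1
CREATE TABLE ducklake.sort_on_compaction (unique_id BIGINT, sort_key_1 BIGINT, sort_key_2 VARCHAR);
INSERT INTO ducklake.sort_on_compaction VALUES (0, 0, 'woot0'); -- etc.
-- Process 2: this ALTER caches the catalog in process 2
ALTER TABLE ducklake.sort_on_compaction ADD COLUMN bonus VARCHAR;
-- Process 1
ALTER TABLE ducklake.sort_on_compaction SET SORTED BY (sort_key_1 DESC, sort_key_2 DESC);
-- Process 2: compacts with the stale cached catalog, so the sort is ignored
CALL ducklake_merge_adjacent_files('ducklake');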

If I omit the ALTER TABLE ADD COLUMN step in process 2, then there is no issue and the sort occurs correctly.

What do you recommend I do? I've thought about 3 options, but option 3 would need some help!

  1. Keep the schema_version from incrementing, but accept this concurrency behavior.
  2. Allow the schema_version to increment but accept that a compaction barrier gets put in when sort is changed
  3. Keep the schema_version from incrementing but find some other way to correctly invalidate the cache or use a different key for the cache

ducklake_set_sorted_multiprocess_add_column.py

The logs for process 1 get printed before those for process 2, but I logged some timestamps to show the true order:

uv run ./test/ducklake_set_sorted_multiprocess_add_column.py
┌─────────┬──────────────────────┐
│ process │     sum("range")     │
│ varchar │        int128        │
├─────────┼──────────────────────┤
│ sql_1   │ 12499999997500000000 │
└─────────┴──────────────────────┘
┌─────────┬─────────────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_1 finished SET SORTED BY') │
│ varchar │                           varchar                           │
├─────────┼─────────────────────────────────────────────────────────────┤
│ sql_1   │ 2026-01-19 19:16:39.22008-07 sql_1 finished SET SORTED BY   │
└─────────┴─────────────────────────────────────────────────────────────┘


┌─────────┬─────────────────────┐
│ process │    sum("range")     │
│ varchar │       int128        │
├─────────┼─────────────────────┤
│ sql_2   │ 1999999999000000000 │
└─────────┴─────────────────────┘
┌─────────┬─────────────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_2 finished adding column') │
│ varchar │                           varchar                           │
├─────────┼─────────────────────────────────────────────────────────────┤
│ sql_2   │ 2026-01-19 19:16:35.416129-07 sql_2 finished adding column  │
└─────────┴─────────────────────────────────────────────────────────────┘
┌─────────┬──────────────────────┐
│ process │     sum("range")     │
│ varchar │        int128        │
├─────────┼──────────────────────┤
│ sql_2   │ 40499999995500000000 │
└─────────┴──────────────────────┘
┌─────────┬───────────────────────────────────────────────────────┐
│ process │ (CAST(now() AS VARCHAR) || ' sql_2 about to compact') │
│ varchar │                        varchar                        │
├─────────┼───────────────────────────────────────────────────────┤
│ sql_2   │ 2026-01-19 19:16:46.373997-07 sql_2 about to compact  │
└─────────┴───────────────────────────────────────────────────────┘
┌─────────┐
│ Success │
│ boolean │
├─────────┤
│ 0 rows  │
└─────────┘
┌─────────┬───────────┬────────────┬────────────┬─────────┐
│ process │ unique_id │ sort_key_1 │ sort_key_2 │  bonus  │
│ varchar │   int64   │   int64    │  varchar   │ varchar │
├─────────┼───────────┼────────────┼────────────┼─────────┤
│ sql_2   │         3 │          1 │ woot3      │ NULL    │
│ sql_2   │         2 │          0 │ woot2      │ NULL    │
│ sql_2   │         1 │          1 │ woot1      │ NULL    │
│ sql_2   │         0 │          0 │ woot0      │ NULL    │
│ sql_2   │         7 │          1 │ woot7      │ NULL    │
│ sql_2   │         6 │          0 │ woot6      │ NULL    │
│ sql_2   │         5 │          1 │ woot5      │ NULL    │
│ sql_2   │         4 │          0 │ woot4      │ NULL    │
└─────────┴───────────┴────────────┴────────────┴─────────┘
┌─────────┬─────────────┬────────────────┬────────────────────────────────────────────┐
│ process │ snapshot_id │ schema_version │                  changes                   │
│ varchar │    int64    │     int64      │          map(varchar, varchar[])           │
├─────────┼─────────────┼────────────────┼────────────────────────────────────────────┤
│ sql_2   │           0 │              0 │ {schemas_created=[main]}                   │
│ sql_2   │           1 │              1 │ {tables_created=[main.sort_on_compaction]} │
│ sql_2   │           2 │              1 │ {tables_inserted_into=[1]}                 │
│ sql_2   │           3 │              1 │ {tables_inserted_into=[1]}                 │
│ sql_2   │           4 │              2 │ {tables_altered=[1]}                       │
│ sql_2   │           5 │              2 │ {tables_altered=[1]}                       │
│ sql_2   │           6 │              2 │ {}                                         │
└─────────┴─────────────┴────────────────┴────────────────────────────────────────────┘
┌─────────┬──────────┬────────────────┬──────────────┬────────────────┬────────────┬────────────────┬────────────┐
│ process │ table_id │ begin_snapshot │ end_snapshot │ sort_key_index │ expression │ sort_direction │ null_order │
│ varchar │  int64   │     int64      │    int64     │     int64      │  varchar   │    varchar     │  varchar   │
├─────────┼──────────┼────────────────┼──────────────┼────────────────┼────────────┼────────────────┼────────────┤
│ sql_2   │        1 │              5 │ NULL         │              0 │ sort_key_1 │ DESC           │ NULLS_LAST │
│ sql_2   │        1 │              5 │ NULL         │              1 │ sort_key_2 │ DESC           │ NULLS_LAST │
└─────────┴──────────┴────────────────┴──────────────┴────────────────┴────────────┴────────────────┴────────────┘

@pdet (Collaborator) commented Jan 21, 2026

Hi @Alex-Monahan, I had another pass on your PR, and it is looking great! I think there was a slight miscommunication wrt the snapshot changes. What I meant is that Sort/Comment etc. should not impact the ducklake_schema_versions table, because that table is used to ensure we only compact tables that have the same data schema, i.e. the same number and types of columns.

@Alex-Monahan (Contributor Author) commented Jan 21, 2026

> Hi @Alex-Monahan, I had another pass on your PR, and it is looking great! I think there was a slight miscommunication wrt the snapshot changes. What I meant is that Sort/Comment etc. should not impact the ducklake_schema_versions table, because that table is used to ensure we only compact tables that have the same data schema, i.e. the same number and types of columns.

Hmm, so it is ok if they increment the schema_version inside of ducklake_snapshot, just not ducklake_schema_versions? I'm having trouble detangling how to only prevent updating the schema_version selectively.

@pdet (Collaborator) commented Jan 21, 2026

> Hi @Alex-Monahan, I had another pass on your PR, and it is looking great! I think there was a slight miscommunication wrt the snapshot changes. What I meant is that Sort/Comment etc. should not impact the ducklake_schema_versions table, because that table is used to ensure we only compact tables that have the same data schema, i.e. the same number and types of columns.
>
> Hmm, so it is ok if they increment the schema_version inside of ducklake_snapshot, just not ducklake_schema_versions? I'm having trouble detangling how to only prevent updating the schema_version selectively.

I think it's fine that they increase the global counter of the schema_version. I'm really sorry for the confusion and added work!

@Alex-Monahan (Contributor Author) commented Jan 26, 2026

@pdet, I believe I finally understood your advice! I am now incrementing the global schema_version, but not adding to the ducklake_schema_versions table if the only change was a comment or a SET SORTED BY.
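
To illustrate with a hypothetical check in the spirit of the updated tests (table names follow the {METADATA_CATALOG} placeholder from the description):

-- Baseline: number of recorded schema versions
SELECT count(*) FROM {METADATA_CATALOG}.ducklake_schema_versions;
-- A sort-only change...
ALTER TABLE ducklake.my_table SET SORTED BY (sort_key_1 ASC);
-- ...advances the snapshot's global schema_version, but the row count of
-- ducklake_schema_versions should be unchanged
SELECT count(*) FROM {METADATA_CATALOG}.ducklake_schema_versions;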

My Python test script works correctly now, and I updated the tests to focus on the ducklake_schema_versions table.

I believe this is ready for another review (but CI is still running on my fork). Thank you!!

@Alex-Monahan (Contributor Author) commented Jan 26, 2026

Ok, the previously failing tests on my CI are now green! All I had to change were the tests: I added some ORDER BYs for SQLite and used unique folder names so that I get accurate file counts.

Sorry for being optimistic about CI being a straight shot...

@pdet (Collaborator) left a comment

Hey Alex, thanks again for all the great work! I think this is pretty much ready, just had a couple of last comments.

One thing I'm also wondering is: what happens if we set the exact same order twice in a row?

E.g.,

CREATE TABLE ducklake.renamed_columns_test (unique_id INTEGER, sort_key_1 INTEGER, sort_key_2 VARCHAR);

ALTER TABLE ducklake.renamed_columns_test SET SORTED BY (sort_key_1 ASC NULLS LAST, sort_key_2 ASC NULLS LAST);

-- should this error? do nothing? add a new entry?
ALTER TABLE ducklake.renamed_columns_test SET SORTED BY (sort_key_1 ASC NULLS LAST, sort_key_2 ASC NULLS LAST);

Can you also add to the description the schema of the tables you added, plus a brief comment on each? That will make @guillesd's work easier for the docs!

struct DuckLakeSortFieldInfo {
    idx_t sort_key_index = 0;
    // TODO: Validate that expression is case insensitive when stored
    string expression;
@pdet (Collaborator):

We should already handle case insensitivity for the column names in this PR, I think, along with a test:

CREATE TABLE t (MyColumn INT, AnotherCol VARCHAR);

ALTER TABLE t SET SORTED BY (mycolumn ASC);

ALTER TABLE t SET SORTED BY (MyColumn ASC);

@Alex-Monahan (Contributor Author):

Yes, that is handled already! I added some tests and removed that outdated TODO.

@Alex-Monahan (Contributor Author):

> Hey Alex, thanks again for all the great work! I think this is pretty much ready, just had a couple of last comments.
>
> One thing I'm also wondering is: what happens if we set the exact same order twice in a row?
>
> E.g.,
>
> CREATE TABLE ducklake.renamed_columns_test (unique_id INTEGER, sort_key_1 INTEGER, sort_key_2 VARCHAR);
>
> ALTER TABLE ducklake.renamed_columns_test SET SORTED BY (sort_key_1 ASC NULLS LAST, sort_key_2 ASC NULLS LAST);
>
> -- should this error? do nothing? add a new entry?
> ALTER TABLE ducklake.renamed_columns_test SET SORTED BY (sort_key_1 ASC NULLS LAST, sort_key_2 ASC NULLS LAST);
>
> Can you also add to the description the schema of the tables you added, plus a brief comment on each? That will make @guillesd's work easier for the docs!

Thanks for the review! I have a test for that over in test/sql/sorted_table/merge_adjacent_sorted_repeated.test. I chose the "do nothing" path: I do a deduplication check so that we don't add redundant entries to the catalog, as sketched below. Let me know if that is ok!
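
Roughly, the deduplication behaves like this (a sketch; the table name and the table_id filter are stand-ins):

ALTER TABLE ducklake.my_table SET SORTED BY (sort_key_1 ASC NULLS LAST);
-- Setting the identical order again is a no-op for the catalog
ALTER TABLE ducklake.my_table SET SORTED BY (sort_key_1 ASC NULLS LAST);
-- Only one ducklake_sort_info row exists for the table
SELECT count(*) FROM {METADATA_CATALOG}.ducklake_sort_info WHERE table_id = 1;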

I'm happy to update the description, and @guillesd, let me know if I can be helpful on the docs!

@guillesd (Contributor):

Hey @Alex-Monahan, if you can provide an EDIT in the PR description detailing a bit of your final implementation, and adding the syntax (and options, if applicable), that would be great. I tagged the PR, so an issue should now be linked in the ducklake docs page.

@Alex-Monahan (Contributor Author):

> Hey @Alex-Monahan, if you can provide an EDIT in the PR description detailing a bit of your final implementation, and adding the syntax (and options, if applicable), that would be great. I tagged the PR, so an issue should now be linked in the ducklake docs page.

Thank you! I've updated the description! No new options, but 2 new DuckLake spec tables.

@Alex-Monahan (Contributor Author):

It looks like CI failed on the Docker Build step. Mind giving it a re-run? CI runs smoothly on my fork now!

@pdet (Collaborator) commented Jan 27, 2026

Thanks!

pdet merged commit 0b8f1cf into duckdb:main on Jan 27, 2026, with 51 of 63 checks passed.
@redox (Contributor) commented Jan 27, 2026

Great job @Alex-Monahan @pdet - an impressive amount of work we've been following from the sidelines!
