Conversation

@arthurpassos (Collaborator) commented Jul 28, 2025

Changelog category (leave one):

  • New Feature

Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):

Implement exporting partitions from MergeTree tables to object storage in a different format (e.g., Parquet). The files are converted to the destination format in memory.

Syntax: ALTER TABLE merge_tree_table EXPORT PARTITION ID 'ABC' TO TABLE 's3_hive_table'.

Related settings: export_merge_tree_partition_background_execution and allow_experimental_export_merge_tree_partition.

  1. The destination file names and paths are, for now, decided by the destination engine (I am only testing and thinking about S3 with Hive-style partitioning, so <table_root>/pkey1=pvalue1/.../pkeyn=pvaluen/<snowflakeid>.parquet). Most likely we will not be using Snowflake IDs for the filenames in the future.
  2. A commit file should be uploaded at the end of the execution to signal the completion of the transaction; the filename is commit_<partition_id>_<transaction_id>. It shall contain the list of files that were uploaded in that transaction (see the sketch after this list).
  3. A partition cannot be exported twice. The limitation comes from the fact that, upon re-export, we don't have a reliable way of telling which parts should be exported (we can't duplicate data): parts might have been merged with un-exported parts, etc. Perhaps we could lock these parts from merges and mutations forever? That is a question for the audience.
  4. While a partition is being exported, the set of parts collected for that export cannot be merged or mutated.
  5. Exports should be able to recover from hard failures/disasters (hard restart or crash). This is controlled using export manifests that are written to disk.
  6. Upon restart, exports are scheduled based on when they were created.
  7. For now, exports are scheduled in the same queue as disk moves. I still need to decide whether I'll create yet another queue or re-use one of the existing ones.
  8. I have not tested how it behaves on soft failures (e.g., one stream out of several failing). I suspect that is not properly implemented yet.
  9. Export manifests are written on anyDisk for now.
  10. The number of streams should be equal to max_threads.
  11. There is some half-baked observability via system.exports and system.part_log.
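
To make the commit-file contract in item 2 concrete, here is a minimal, self-contained sketch. Only the commit_<partition_id>_<transaction_id> pattern and the "list of uploaded files" body come from the description above; the helper names are hypothetical.

// Hypothetical helpers mirroring the documented commit-file contract.
#include <string>
#include <vector>

/// Object key of the commit marker: commit_<partition_id>_<transaction_id>.
std::string makeCommitFileName(const std::string & partition_id, const std::string & transaction_id)
{
    return "commit_" + partition_id + "_" + transaction_id;
}

/// Body of the commit file: the files uploaded in that transaction, one per line.
std::string makeCommitFileContents(const std::vector<std::string> & uploaded_files)
{
    std::string contents;
    for (const auto & file : uploaded_files)
        contents += file + "\n";
    return contents;
}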

Documentation entry for user-facing changes

...

Exclude tests:

  • Fast test
  • Integration Tests
  • Stateless tests
  • Stateful tests
  • Performance tests
  • All with ASAN
  • All with TSAN
  • All with MSAN
  • All with UBSAN
  • All with Coverage
  • All with Aarch64
  • All Regression
  • Disable CI Cache

github-actions bot commented Jul 28, 2025

Workflow [PR], commit [34f7130]

@svb-alt added the labels enhancement (New feature or request) and tiered storage (Antalya Roadmap: Tiered Storage) on Jul 30, 2025
@svb-alt linked an issue on Aug 8, 2025 that may be closed by this pull request
manifest->items.reserve(data_parts.size());
for (const auto & data_part : data_parts)
manifest->items.push_back({data_part->name, ""});
manifest->write();
Member:

check fsync_metadata
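
For context on the fsync_metadata remark: the idea is that the manifest write should be flushed to stable storage when that setting asks for it, so the manifest survives a hard crash. A rough, self-contained sketch using plain POSIX calls (not ClickHouse's Disk interface; names are illustrative, error handling elided):

#include <fcntl.h>
#include <unistd.h>
#include <string>

void writeManifestDurably(const std::string & path, const std::string & contents, bool fsync_metadata)
{
    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return;

    ssize_t written = ::write(fd, contents.data(), contents.size());
    (void)written;

    if (fsync_metadata)
        ::fsync(fd);    // flush the manifest to stable storage before considering it written

    ::close(fd);
}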


if (stats.status.code != 0)
{
LOG_INFO(getLogger("ExportMergeTreePartitionToObjectStorageTask"), "Error importing part {}: {}", stats.part->name, stats.status.message);
Member:

exporting?

Member:

a bit confusing import vs export.

Collaborator Author:

These are just stubs. I will polish the entire PR once we are OK with the approach and I have fixed all the concurrency issues.


std::vector<ExportsList::EntryPtr> export_list_entries;

for (const auto & data_part : data_parts)
Member:

Sequential iteration? I think we can make several parts run in parallel.

Collaborator Author:

They run in parallel. Each part gets its own pipeline composed of ReadFromMergeTree -> StorageObjectStorageMergeTreeImporterSink.

The N pipelines created for the N parts in a given partition are put under a single QueryPipeline export_pipeline that will execute the individual pipelines in parallel.

Member:

setNumThreads impacts the parallelism of the pipeline at different moments.
And you don't control how the work is distributed between the threads across the processors in the pipeline.

throw Exception(ErrorCodes::LOGICAL_ERROR, "Root pipeline is not completed");
}

export_pipeline.setNumThreads(local_context->getSettingsRef()[Setting::max_threads]);
Member:

I think every single export should be single-threaded (similar to merges).
We can get many threads by exporting more files in parallel (again - similar to merges).

This way it's simpler to control the parallelism / resources used by that BG work

Collaborator Author:

I think this code already does what you are asking for: each part is single-threaded, and many parts are parallelized according to max_threads.
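
To make this parallelism model concrete, here is a minimal, self-contained sketch of "each part single-threaded, up to max_threads parts in flight at once". It uses plain std::thread rather than ClickHouse's QueryPipeline, and all names are illustrative.

#include <algorithm>
#include <atomic>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

/// Placeholder for the per-part pipeline: read the part, convert, upload.
void exportPart(const std::string & part_name)
{
    (void)part_name;
}

void exportPartsInParallel(const std::vector<std::string> & parts, size_t max_threads)
{
    std::atomic<size_t> next{0};
    std::vector<std::thread> workers;
    const size_t num_workers = std::min(max_threads, parts.size());

    for (size_t i = 0; i < num_workers; ++i)
    {
        workers.emplace_back([&]
        {
            /// Each worker handles one part at a time, single-threaded.
            for (size_t idx = next.fetch_add(1); idx < parts.size(); idx = next.fetch_add(1))
                exportPart(parts[idx]);
        });
    }

    for (auto & worker : workers)
        worker.join();
}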


if (!already_exported_partition_ids.emplace(partition_id).second)
{
throw Exception(ErrorCodes::PART_IS_LOCKED, "Partition {} has already been exported", partition_id);
Member:

option to reexport after changes?

Collaborator Author:

Well, while you were out we established that a partition cannot be exported more than once.

{
for (const auto & disk : getDisks())
{
for (auto it = disk->iterateDirectory(relative_data_path); it->isValid(); it->next())
Member:

we need a cleanup of old ones

Collaborator Author:

Initially, I was deleting the manifests as soon as the commit file was uploaded. But then we changed the requirements so that a partition can be exported only once. To be able to lock these partitions upon restart, I opted for leaving the export manifests on disk.

If we change that requirement, then I'll delete them for sure.
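
A rough, self-contained sketch of the restart path described here, using std::filesystem rather than the IDisk interface; the manifest file naming is assumed, not taken from the PR. The point is only that the manifests left on disk can be scanned on start-up to rebuild the set of already-exported partitions, so a second export of the same partition can be rejected.

#include <filesystem>
#include <set>
#include <string>

std::set<std::string> collectExportedPartitionIds(const std::filesystem::path & data_path)
{
    std::set<std::string> exported;
    for (const auto & entry : std::filesystem::directory_iterator(data_path))
    {
        const std::string name = entry.path().filename().string();
        const std::string prefix = "export_manifest_";    /// assumed naming scheme
        if (name.rfind(prefix, 0) == 0)
            exported.insert(name.substr(prefix.size(), name.find('.') - prefix.size()));
    }
    return exported;
}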

void StorageObjectStorageMergeTreePartImporterSink::onException(std::exception_ptr)
{
/// we should not reach here
std::terminate();
Member:

are you sure?

Collaborator Author:

Nope, just stubs for now.

Part of the logic in this class is very hackish to keep the exceptions contained so that a single pipeline failure does not cause all the other pipelines to abort.
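
One generic way to contain a per-part failure without std::terminate is to capture the std::exception_ptr for each part and surface all failures once every part has been attempted, so a single failing pipeline does not abort the others. A sketch with illustrative names, not the PR's actual sink code:

#include <exception>
#include <iostream>
#include <string>
#include <vector>

struct PartExportResult
{
    std::string part_name;
    std::exception_ptr error;    /// null on success
};

void reportFailures(const std::vector<PartExportResult> & results)
{
    for (const auto & result : results)
    {
        if (!result.error)
            continue;
        try
        {
            std::rethrow_exception(result.error);
        }
        catch (const std::exception & e)
        {
            std::cerr << "Export of part " << result.part_name << " failed: " << e.what() << '\n';
        }
    }
}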

@@ -205,6 +214,15 @@ class IStorage : public std::enable_shared_from_this<IStorage>, public TypePromo
virtuals.set(std::make_unique<VirtualColumnsDescription>(std::move(virtuals_)));
}

virtual void commitExportPartitionTransaction(
Member:

Maybe some better place for that? IStorage is too generic.

throw Exception(ErrorCodes::PART_IS_LOCKED, "Partition {} has already been exported", partition_id);
}

auto exports_tagger = std::make_shared<CurrentlyExportingPartsTagger>(std::move(all_parts), *this);
Member:

It will probably be problematic to do the same with replicated tables without messing with the replication queue.

I think that just holding the references to the parts should be enough (AFAIR they will stay on disk as inactive while you hold the reference, even if they get merged).

Labels: antalya-25.6, enhancement (New feature or request), tiered storage (Antalya Roadmap: Tiered Storage)
Projects: None yet
Development: Successfully merging this pull request may close the issue "ALTER TABLE EXPORT to external table"
3 participants