[WIP] Export merge tree partition to object storage #939
base: antalya-25.6.5

Conversation
manifest->items.reserve(data_parts.size());
for (const auto & data_part : data_parts)
    manifest->items.push_back({data_part->name, ""});
manifest->write();
check fsync_metadata
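For context, a minimal sketch of what honouring an `fsync_metadata`-style flag when persisting the manifest could look like. This uses plain POSIX I/O and an illustrative `writeManifest` helper, not the ClickHouse WriteBuffer/IDisk code path:

```cpp
#include <fcntl.h>
#include <unistd.h>
#include <stdexcept>
#include <string>
#include <vector>

/// Illustrative manifest writer: shows where an fsync_metadata-style flag
/// must be honoured so a crash right after write() cannot leave a missing
/// or truncated manifest.
void writeManifest(const std::string & path, const std::vector<std::string> & part_names, bool fsync_metadata)
{
    std::string payload;
    for (const auto & name : part_names)
        payload += name + "\n";

    int fd = ::open(path.c_str(), O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        throw std::runtime_error("cannot open manifest " + path);

    if (::write(fd, payload.data(), payload.size()) != static_cast<ssize_t>(payload.size()))
    {
        ::close(fd);
        throw std::runtime_error("short write to manifest " + path);
    }

    /// Only pay the fsync cost when the setting asks for durable metadata.
    if (fsync_metadata && ::fsync(fd) != 0)
    {
        ::close(fd);
        throw std::runtime_error("fsync failed for manifest " + path);
    }

    ::close(fd);
}
```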
if (stats.status.code != 0)
{
    LOG_INFO(getLogger("ExportMergeTreePartitionToObjectStorageTask"), "Error importing part {}: {}", stats.part->name, stats.status.message);
exporting?
a bit confusing import vs export.
These are just stubs; I will polish the entire PR once we are OK with the approach and I fix all the concurrency issues, etc.
std::vector<ExportsList::EntryPtr> export_list_entries;

for (const auto & data_part : data_parts)
Sequential iteration? I think we can make several parts run in parallel.
They run in parallel. Each part gets its own pipeline composed of ReadFromMergeTree -> StorageObjectStorageMergeTreeImporterSink. The N pipelines created for the N parts in a given partition are put under a single QueryPipeline export_pipeline that will execute the individual pipelines in parallel.
setNumThreads impacts the parallelism of the pipeline at different moments, and you don't control how the work among the processors in the pipeline is distributed between the threads.
    throw Exception(ErrorCodes::LOGICAL_ERROR, "Root pipeline is not completed");
}

export_pipeline.setNumThreads(local_context->getSettingsRef()[Setting::max_threads]);
I think every single export should be single-threaded (similar to merges).
We can get many threads by exporting more files in parallel (again - similar to merges).
This way it's simpler to control the parallelism / resources used by that BG work
I think this code already does what you are asking for: each part is single-threaded, and many parts are parallelized according to max_threads.
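For illustration, a self-contained sketch of that scheduling model, using plain std::thread and placeholder names instead of the actual QueryPipeline executor: each part is exported by one thread, and at most `max_threads` part exports run concurrently.

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <functional>
#include <string>
#include <thread>
#include <vector>

/// Model of the scheduling described above: one single-threaded job per part,
/// at most max_threads part exports in flight at the same time.
void exportPartsInParallel(
    const std::vector<std::string> & part_names,
    const std::function<void(const std::string &)> & export_one_part,
    size_t max_threads)
{
    std::atomic<size_t> next_part{0};
    const size_t num_workers = std::min<size_t>(max_threads, part_names.size());

    std::vector<std::thread> workers;
    workers.reserve(num_workers);
    for (size_t i = 0; i < num_workers; ++i)
    {
        workers.emplace_back([&]
        {
            /// Each worker processes parts one by one, so any individual
            /// part export never uses more than a single thread.
            for (size_t idx = next_part.fetch_add(1); idx < part_names.size(); idx = next_part.fetch_add(1))
                export_one_part(part_names[idx]);
        });
    }

    for (auto & worker : workers)
        worker.join();
}
```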
if (!already_exported_partition_ids.emplace(partition_id).second)
{
    throw Exception(ErrorCodes::PART_IS_LOCKED, "Partition {} has already been exported", partition_id);
Option to re-export after changes?
Well, while you were out we established that a partition could not be exported more than once.
{
    for (const auto & disk : getDisks())
    {
        for (auto it = disk->iterateDirectory(relative_data_path); it->isValid(); it->next())
We need a cleanup of the old ones.
Initially, I was deleting the manifests as soon as the commit file was uploaded. But then we changed the requirements so that a partition could be exported only once. To be able to lock these partitions upon restart, I opted for leaving the export manifests.
If we change that requirement, then I'll delete them for sure.
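A rough sketch of what rebuilding that lock set from leftover manifests could look like on restart. The `export_manifest_<partition_id>` naming is hypothetical, and std::filesystem stands in for the IDisk iteration used in the real code:

```cpp
#include <filesystem>
#include <set>
#include <string>

/// Rebuild the set of already-exported partitions from leftover manifests on
/// startup, so they can be locked against re-export.
std::set<std::string> collectExportedPartitions(const std::filesystem::path & relative_data_path)
{
    static const std::string prefix = "export_manifest_";

    std::set<std::string> exported;
    for (const auto & entry : std::filesystem::directory_iterator(relative_data_path))
    {
        const std::string file_name = entry.path().filename().string();
        if (file_name.starts_with(prefix))
            exported.insert(file_name.substr(prefix.size()));
    }
    return exported;
}
```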
void StorageObjectStorageMergeTreePartImporterSink::onException(std::exception_ptr)
{
    /// we should not reach here
    std::terminate();
are you sure?
Nope, just stubs for now.
Part of the logic in this class is very hackish to keep the exceptions contained so that a single pipeline failure does not cause all the other pipelines to abort.
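For illustration, a minimal sketch of that containment idea with placeholder types (not the actual sink classes): each per-part failure is converted into a recorded status so the sibling exports keep running instead of the whole export aborting.

```cpp
#include <exception>
#include <string>
#include <utility>
#include <vector>

struct PartExportStatus
{
    std::string part_name;
    int code = 0;          /// 0 means success
    std::string message;
};

/// Run one export callback per part and turn any exception into a recorded
/// status instead of letting it propagate and abort the remaining exports.
template <typename ExportOnePart>
std::vector<PartExportStatus> exportWithContainedFailures(
    const std::vector<std::string> & part_names, ExportOnePart && export_one_part)
{
    std::vector<PartExportStatus> statuses;
    statuses.reserve(part_names.size());

    for (const auto & part_name : part_names)
    {
        PartExportStatus status{part_name};
        try
        {
            export_one_part(part_name);
        }
        catch (const std::exception & e)
        {
            status.code = 1;
            status.message = e.what();
        }
        statuses.push_back(std::move(status));
    }
    return statuses;
}
```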
@@ -205,6 +214,15 @@ class IStorage : public std::enable_shared_from_this<IStorage>, public TypePromo
        virtuals.set(std::make_unique<VirtualColumnsDescription>(std::move(virtuals_)));
    }

    virtual void commitExportPartitionTransaction(
Maybe some better place for that? IStorage is too generic.
src/Storages/StorageMergeTree.cpp
    throw Exception(ErrorCodes::PART_IS_LOCKED, "Partition {} has already been exported", partition_id);
}

auto exports_tagger = std::make_shared<CurrentlyExportingPartsTagger>(std::move(all_parts), *this);
It will probably be problematic to do the same with replicated tables without messing with the replication queue.
I think that just holding the references to the parts should be enough (AFAIR they will stay on disk inactive while you hold the reference, even if they get merged).
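A rough sketch of the RAII idea the comment refers to, with placeholder types: as long as the tagger holds shared references, the exported parts cannot be cleaned up from disk even if a merge makes them inactive.

```cpp
#include <memory>
#include <string>
#include <utility>
#include <vector>

/// Placeholder for the real data part type.
struct DataPart
{
    std::string name;
};
using DataPartPtr = std::shared_ptr<const DataPart>;

/// RAII holder: keeps shared ownership of the parts being exported, so they
/// stay available until the export finishes and the tagger is destroyed.
class CurrentlyExportingPartsTagger
{
public:
    explicit CurrentlyExportingPartsTagger(std::vector<DataPartPtr> parts_)
        : parts(std::move(parts_))
    {
    }

private:
    std::vector<DataPartPtr> parts;
};
```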
Changelog category (leave one):
Changelog entry (a user-readable short description of the changes that goes to CHANGELOG.md):
Implement exporting partitions from merge tree tables to object storage in a different format (e.g., Parquet). The files are converted to the destination format in-memory.

Syntax: `ALTER TABLE merge_tree_table EXPORT PARTITION ID 'ABC' TO TABLE 's3_hive_table'`. Related settings: `export_merge_tree_partition_background_execution` and `allow_experimental_export_merge_tree_partition`.

Exported files follow the hive-style layout `<table_root>/pkey1=pvalue1/.../pkeyn=pvaluen/<snowflakeid>.parquet`. Most likely in the future we'll not be using snowflake ids for the filenames.

A commit file named `commit_<partition_id>_<transaction_id>` is uploaded. It shall contain the list of files that were uploaded in that transaction.

Background execution is handled similarly to `disk moves`. I still need to decide if I'll create yet another queue or re-use one of the existing ones.

Disk selection: `anyDisk` for now.

Parallelism is controlled by `max_threads`.

Exports are reflected in `system.exports` and `system.part_log`.
Documentation entry for user-facing changes
...
Exclude tests: