Feature/metadata deduplication 2130 #2216
Open
ForeverAngry wants to merge 61 commits into apache:main from ForeverAngry:feature/metadata-deduplication-2130
…h a new Expired Snapshot class. Updated tests.
ValueError: Cannot expire snapshot IDs {3051729675574597004} as they are currently referenced by table refs.
Moved expiration-related methods from `ExpireSnapshots` to `ManageSnapshots` for improved organization and clarity. Updated corresponding pytest tests to reflect these changes.
Re-ran the `poetry run pre-commit run --all-files` command on the project.
Moved: the functions for expiring snapshots to their own class.
…ng it in a separate issue. Fixed: unrelated changes caused by a fork/branch sync issue.
Co-authored-by: Fokko Driesprong <[email protected]>
Implemented logic to protect snapshots that are the HEAD of a branch or are referenced by a tag from being expired by the `expire_snapshot_by_id` method.
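The guard described above can be illustrated with a minimal sketch. It assumes the table refs are available as a plain mapping from ref name to snapshot ID; `protected_snapshot_ids` and `check_expirable` are hypothetical helper names used only for illustration, not the functions added in this PR. The error text mirrors the `ValueError` quoted earlier in this commit list.

```python
from typing import Dict, Iterable, Set


def protected_snapshot_ids(refs: Dict[str, int]) -> Set[int]:
    """Return every snapshot ID that a branch or tag currently points at.

    `refs` is an assumed stand-in: ref name -> snapshot ID.
    """
    return set(refs.values())


def check_expirable(snapshot_ids: Iterable[int], refs: Dict[str, int]) -> None:
    """Raise ValueError if any requested ID is still referenced by a ref."""
    blocked = set(snapshot_ids) & protected_snapshot_ids(refs)
    if blocked:
        raise ValueError(
            f"Cannot expire snapshot IDs {blocked} as they are currently "
            "referenced by table refs."
        )


if __name__ == "__main__":
    refs = {"main": 3051729675574597004, "release-tag": 42}
    check_expirable([7, 8], refs)  # unreferenced IDs pass silently
    try:
        check_expirable([3051729675574597004], refs)
    except ValueError as exc:
        print(exc)
```

Checking the whole batch before expiring anything keeps the operation all-or-nothing: either every requested ID is expirable or none is touched.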
(1) apache#2130 with addition of the new `deduplicate_data_files` function to the `MaintenanceTable` class.
(2) apache#2151 with the removal of the errant member variable from the `ManageSnapshots` class.
(3) apache#2150 by adding the additional functions to be at parity with the Java API.
- **Duplicate File Remediation apache#2130**
  - Added `deduplicate_data_files` to the `MaintenanceTable` class.
  - Enables detection and removal of duplicate data files, improving table hygiene and storage efficiency.
- **Support `retainLast` and `setMinSnapshotsToKeep` Snapshot Retention Policies apache#2150**
  - Added new snapshot retention methods to `MaintenanceTable` for feature parity with the Java API:
    - `retain_last_n_snapshots(n)`: Retain only the last N snapshots.
    - `expire_snapshots_older_than_with_retention(timestamp_ms, retain_last_n=None, min_snapshots_to_keep=None)`: Expire snapshots older than a timestamp, with additional retention constraints.
    - `expire_snapshots_with_retention_policy(timestamp_ms=None, retain_last_n=None, min_snapshots_to_keep=None)`: Unified retention policy supporting time-based and count-based constraints.
  - All retention logic respects protected snapshots (branches/tags) and includes guardrails to prevent over-aggressive expiration.

### Bug Fixes & Cleanups

- **Remove unrelated instance variable from the `ManageSnapshots` class apache#2151**
  - Removed an errant member variable from the `ManageSnapshots` class, aligning the implementation with the intended design and the Java reference.

### Testing & Documentation

- Consolidated all snapshot expiration and retention tests into a single file (`test_retention_strategies.py`), covering:
  - Basic expiration by ID and timestamp.
  - Protection of branch/tag snapshots.
  - Retention guardrails and combined policies.
  - Deduplication of data files.
- Added and updated documentation to describe all new retention strategies, deduplication, and API parity improvements.
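How the time-based and count-based constraints might combine can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: snapshots are modeled as a mapping from snapshot ID to `timestamp_ms`, `select_snapshots_to_expire` is a hypothetical name, and the choice to expire nothing when no constraint is given is an assumption of this sketch.

```python
from typing import Dict, List, Optional, Set


def select_snapshots_to_expire(
    snapshots: Dict[int, int],  # assumed model: snapshot ID -> timestamp_ms
    protected: Set[int],        # IDs referenced by branches or tags
    timestamp_ms: Optional[int] = None,
    retain_last_n: Optional[int] = None,
    min_snapshots_to_keep: Optional[int] = None,
) -> List[int]:
    """Return the snapshot IDs to expire, oldest first."""
    if timestamp_ms is None and retain_last_n is None and min_snapshots_to_keep is None:
        return []  # assumption: no policy means nothing is expired

    # Order snapshot IDs from newest to oldest.
    newest_first = sorted(snapshots, key=lambda sid: snapshots[sid], reverse=True)

    # Protected snapshots and the newest N are always kept.
    keep = set(protected)
    if retain_last_n is not None:
        keep.update(newest_first[:retain_last_n])

    # Remaining snapshots are candidates if they are older than the cutoff.
    candidates = [
        sid
        for sid in reversed(newest_first)  # oldest first
        if sid not in keep
        and (timestamp_ms is None or snapshots[sid] < timestamp_ms)
    ]

    # Guardrail: never drop the table below min_snapshots_to_keep snapshots.
    if min_snapshots_to_keep is not None:
        max_expirable = max(0, len(snapshots) - min_snapshots_to_keep)
        candidates = candidates[:max_expirable]
    return candidates


if __name__ == "__main__":
    history = {1: 100, 2: 200, 3: 300, 4: 400, 5: 500}
    print(select_snapshots_to_expire(history, protected={1}, timestamp_ms=350))
```

Expiring oldest-first means that when the guardrail truncates the candidate list, the most recent snapshots are the ones that survive.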
…intenance operations
…Table

The deduplicate_data_files() method was not properly removing duplicate data file references from Iceberg tables. After deduplication, multiple references to the same data file remained instead of the expected single reference.

Root causes:
1. _get_all_datafiles() was scanning ALL snapshots instead of current only
2. Incorrect transaction API usage that didn't leverage snapshot updates
3. Missing proper overwrite logic to create clean deduplicated snapshots

Key fixes:
- Modified _get_all_datafiles() to scan only current snapshot manifests
- Implemented proper transaction pattern using update_snapshot().overwrite()
- Added explicit delete_data_file() calls for duplicates + append_data_file() for unique files
- Removed unused helper methods _get_all_datafiles_with_context() and _detect_duplicates()

Technical details:
- Deduplication now operates on ManifestEntry objects from current snapshot only
- Files are grouped by basename and first occurrence is kept as canonical reference
- New snapshot created atomically replaces current snapshot with deduplicated file list
- Proper Iceberg transaction semantics ensure data consistency

Tests: All deduplication tests now pass including the previously failing test_deduplicate_data_files_removes_duplicates_in_current_snapshot

Fixes: Table maintenance deduplication functionality
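The grouping rule from the technical details (group by basename, keep the first occurrence as canonical) can be sketched on plain path strings. `split_duplicates` is a hypothetical helper for illustration; the actual change operates on manifest entries and commits the result through a snapshot overwrite, which this sketch does not attempt to reproduce.

```python
import posixpath
from typing import Dict, List, Tuple


def split_duplicates(file_paths: List[str]) -> Tuple[List[str], List[str]]:
    """Split paths into (canonical, duplicates), grouping by basename.

    The first path seen for a basename is kept as the canonical reference;
    every later path with the same basename is reported as a duplicate.
    """
    canonical: Dict[str, str] = {}  # basename -> first path seen
    duplicates: List[str] = []
    for path in file_paths:
        name = posixpath.basename(path)
        if name in canonical:
            duplicates.append(path)
        else:
            canonical[name] = path
    # Dicts preserve insertion order, so canonical paths keep scan order.
    return list(canonical.values()), duplicates


if __name__ == "__main__":
    paths = [
        "s3://bucket/data/a.parquet",
        "s3://bucket/data/b.parquet",
        "s3://bucket/data/a.parquet",
    ]
    unique, dupes = split_duplicates(paths)
    print(unique, dupes)
```

Note that keying on the basename alone treats two files with the same name in different directories as duplicates; that follows the rule stated in the commit message rather than a full-path comparison.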
…ion context in MaintenanceTable
…nce deduplication tests
Main Changes

1. Deduplication Logic Improvements
   - Fixed MaintenanceTable._get_all_datafiles() to properly handle DataFile objects
   - Improved handling of duplicate file references in current snapshot
   - Added proper SQLite connection cleanup in tests
   - Addressed resource warnings and connection leaks
2. Retention Strategy Optimization
   - Consolidated snapshot expiration logic
   - Fixed protected snapshot identification
   - Improved refs handling for branch and tag snapshots
   - Added comprehensive test coverage for retention scenarios
3. Code Quality & Test Infrastructure
   - Added proper Apache license headers to test files
   - Fixed test cleanup and resource management
   - Improved test assertions and error messages
   - Enhanced integration test setup

PR Review Responses

Resource Management
✅ Added proper connection cleanup in test_deduplicate_data_files_removes_duplicates_in_current_snapshot
✅ Fixed SQLite connection leaks in tests

Code Duplication
✅ Consolidated duplicate code between _get_protected_snapshot_ids implementations
✅ Improved reuse of common functionality

Test Coverage
✅ Added comprehensive tests for retention strategies
✅ Enhanced deduplication test cases
✅ Improved test assertions and error handling

Documentation
✅ Added detailed docstrings
✅ Improved code comments
✅ Added proper license headers

Testing Status
✅ All deduplication tests passing
✅ All retention strategy tests passing
✅ Integration tests configured (pending pyarrow dependency fix)
✅ No resource warnings or connection leaks
…connections using engine.dispose() in test fixtures

Test Resource Management: Added try/finally blocks to ensure cleanup happens even if tests fail
Catalog Connection Handling: Modified both the iceberg_catalog and prepopulated_table fixtures to properly clean up database connections
Mock Catalog Cleanup: Added cleanup for tests that replace the table catalog with mock objects
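The try/finally cleanup pattern can be shown with a small standard-library sketch. It swaps in a raw `sqlite3` connection for the SQLAlchemy engine mentioned above so the example stays self-contained; `catalog_connection` is a hypothetical name, and the real fixtures are not reproduced here.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def catalog_connection(database: str = ":memory:") -> Iterator[sqlite3.Connection]:
    """Yield a SQLite connection and always close it afterwards."""
    conn = sqlite3.connect(database)
    try:
        yield conn
    finally:
        # Runs whether the body succeeded or raised, so no connection leaks.
        conn.close()


if __name__ == "__main__":
    with catalog_connection() as conn:
        conn.execute("CREATE TABLE t (x INTEGER)")
        conn.execute("INSERT INTO t VALUES (1)")
        print(conn.execute("SELECT x FROM t").fetchall())
```

A pytest fixture would have the same shape: `yield` the catalog inside `try`, and release the resource in `finally` (for a SQLAlchemy engine, by calling `engine.dispose()`).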
…ntenance and test files (ran: make lint)
…parate pr as suggested.
…dd corresponding tests
…e deduplication logic
…a files

- Added a new method `rebuild_current_snapshot` in the MaintenanceTable class to create a new snapshot with unique data files, removing duplicates while preserving unique entries.
- Integrated retry logic using the `tenacity` library to handle transient commit failures during the rebuild process.
- Enhanced the `_get_all_datafiles` method to utilize parallel processing for manifest file handling.
- Introduced comprehensive unit tests to validate the functionality of the new method, including scenarios with and without duplicates, as well as retry mechanisms.
- Updated `pyproject.toml` to include `cython` as a dependency and added mypy overrides for various modules to suppress missing import errors.
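Parallel manifest handling of the kind mentioned above could look roughly like this. The sketch assumes each manifest can be read independently by some callable; `collect_data_files` and `read_manifest` are hypothetical names, and the manifests are modeled as plain strings rather than the library's manifest objects.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def collect_data_files(
    manifests: Iterable[str],
    read_manifest: Callable[[str], List[str]],  # hypothetical reader
    max_workers: int = 4,
) -> List[str]:
    """Read every manifest in a thread pool and flatten the file paths.

    Executor.map returns results in input order, so the flattened list is
    deterministic even though the reads run concurrently.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        per_manifest = pool.map(read_manifest, manifests)
        return [path for paths in per_manifest for path in paths]


if __name__ == "__main__":
    fake_store = {
        "m1.avro": ["data/a.parquet", "data/b.parquet"],
        "m2.avro": ["data/c.parquet"],
    }
    print(collect_data_files(fake_store.keys(), fake_store.__getitem__))
```

Threads suit this workload because reading manifests is dominated by I/O; an exception raised by any single read propagates when its result is consumed.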
…mmary and remove rebuild_current_snapshot method
Rationale for this change
This PR addresses deduplicating snapshot metadata as outlined in #2130. It is currently stacked on #2142 and depends on that PR being merged first.
Are these changes tested?
Yes.
Are there any user-facing changes?
No.