Feature/metadata deduplication 2130 #2216

ForeverAngry · 2025-07-16T00:14:28Z

Rationale for this change

This PR deduplicates snapshot metadata, as outlined in #2130. It is currently stacked on #2142 and depends on that PR being merged first.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

1. apache#2130 with addition of the new `deduplicate_data_files` function to the `MaintenanceTable` class.
2. apache#2151 with the removal of the errant member variable from the `ManageSnapshots` class.
3. apache#2150 by adding the additional functions to be at parity with the Java API.

- **Duplicate File Remediation apache#2130**
  - Added `deduplicate_data_files` to the `MaintenanceTable` class.
  - Enables detection and removal of duplicate data files, improving table hygiene and storage efficiency.
- **Support `retainLast` and `setMinSnapshotsToKeep` Snapshot Retention Policies apache#2150**
  - Added new snapshot retention methods to `MaintenanceTable` for feature parity with the Java API:
    - `retain_last_n_snapshots(n)`: Retain only the last N snapshots.
    - `expire_snapshots_older_than_with_retention(timestamp_ms, retain_last_n=None, min_snapshots_to_keep=None)`: Expire snapshots older than a timestamp, with additional retention constraints.
    - `expire_snapshots_with_retention_policy(timestamp_ms=None, retain_last_n=None, min_snapshots_to_keep=None)`: Unified retention policy supporting time-based and count-based constraints.
  - All retention logic respects protected snapshots (branches/tags) and includes guardrails to prevent over-aggressive expiration.

### Bug Fixes & Cleanups

- **Remove unrelated instance variable from the `ManageSnapshots` class apache#2151**
  - Removed an errant member variable from the `ManageSnapshots` class, aligning the implementation with the intended design and the Java reference.

### Testing & Documentation

- Consolidated all snapshot expiration and retention tests into a single file (`test_retention_strategies.py`), covering:
  - Basic expiration by ID and timestamp.
  - Protection of branch/tag snapshots.
  - Retention guardrails and combined policies.
  - Deduplication of data files.
- Added and updated documentation to describe all new retention strategies, deduplication, and API parity improvements.
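As a rough illustration of how the time-based and count-based constraints listed above might combine, here is a minimal standalone sketch. `Snap` and `select_snapshots_to_expire` are hypothetical names invented for this example; they are not the PyIceberg API, and the actual `MaintenanceTable` methods may resolve conflicting constraints differently. The sketch assumes that protected (branch/tag) snapshots are never expired and that `min_snapshots_to_keep` counts every snapshot in the table.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set


@dataclass(frozen=True)
class Snap:
    """Hypothetical stand-in for a table snapshot (not a PyIceberg class)."""

    snapshot_id: int
    timestamp_ms: int


def select_snapshots_to_expire(
    snapshots: Iterable[Snap],
    protected_ids: Set[int],
    timestamp_ms: Optional[int] = None,
    retain_last_n: Optional[int] = None,
    min_snapshots_to_keep: Optional[int] = None,
) -> List[int]:
    """Return the snapshot IDs that a combined retention policy would expire."""
    if timestamp_ms is None and retain_last_n is None:
        raise ValueError("Provide timestamp_ms and/or retain_last_n")

    # Newest first, so slicing gives "the last N snapshots".
    ordered = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)

    # Snapshots referenced by branches/tags are always kept (assumption).
    keep: Set[int] = set(protected_ids)
    if retain_last_n is not None:
        keep.update(s.snapshot_id for s in ordered[:retain_last_n])

    candidates = [
        s
        for s in ordered
        if s.snapshot_id not in keep
        and (timestamp_ms is None or s.timestamp_ms < timestamp_ms)
    ]

    # Guardrail: never let the table drop below min_snapshots_to_keep.
    if min_snapshots_to_keep is not None:
        max_expirable = max(len(ordered) - min_snapshots_to_keep, 0)
        # Expire the oldest candidates first.
        candidates = sorted(candidates, key=lambda s: s.timestamp_ms)[:max_expirable]

    return sorted(s.snapshot_id for s in candidates)


history = [Snap(i, i * 100) for i in range(1, 6)]  # IDs 1..5, timestamps 100..500
print(
    select_snapshots_to_expire(
        history,
        protected_ids={1},
        timestamp_ms=450,
        retain_last_n=2,
        min_snapshots_to_keep=3,
    )
)  # [2, 3]
```

In this example snapshot 1 is protected, snapshots 4 and 5 are retained as the last two, and the guardrail allows at most two expirations, so only snapshots 2 and 3 are selected.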

…Table

The `deduplicate_data_files()` method was not properly removing duplicate data file references from Iceberg tables. After deduplication, multiple references to the same data file remained instead of the expected single reference.

Root causes:

1. `_get_all_datafiles()` was scanning ALL snapshots instead of the current one only
2. Incorrect transaction API usage that didn't leverage snapshot updates
3. Missing proper overwrite logic to create clean deduplicated snapshots

Key fixes:

- Modified `_get_all_datafiles()` to scan only current snapshot manifests
- Implemented proper transaction pattern using `update_snapshot().overwrite()`
- Added explicit `delete_data_file()` calls for duplicates + `append_data_file()` for unique files
- Removed unused helper methods `_get_all_datafiles_with_context()` and `_detect_duplicates()`

Technical details:

- Deduplication now operates on `ManifestEntry` objects from the current snapshot only
- Files are grouped by basename and the first occurrence is kept as the canonical reference
- The newly created snapshot atomically replaces the current snapshot with the deduplicated file list
- Proper Iceberg transaction semantics ensure data consistency

Tests: All deduplication tests now pass, including the previously failing `test_deduplicate_data_files_removes_duplicates_in_current_snapshot`

Fixes: Table maintenance deduplication functionality
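The grouping rule described above (group by basename, keep the first occurrence as canonical) can be sketched in plain Python. `split_duplicates` is a hypothetical helper written for this illustration; it works on path strings rather than `ManifestEntry` objects and is not the code from this PR.

```python
import posixpath
from typing import Dict, List, Tuple


def split_duplicates(paths: List[str]) -> Tuple[List[str], List[str]]:
    """Split file paths into (canonical, duplicates), grouping by basename.

    The first path seen for each basename is kept as the canonical
    reference; every later path with the same basename is a duplicate.
    """
    canonical: Dict[str, str] = {}  # basename -> first path seen
    duplicates: List[str] = []
    for path in paths:
        name = posixpath.basename(path)
        if name in canonical:
            duplicates.append(path)
        else:
            canonical[name] = path
    # dicts preserve insertion order, so canonical paths keep their original order
    return list(canonical.values()), duplicates


unique, dupes = split_duplicates(
    [
        "s3://bucket/tbl/data/a.parquet",
        "s3://bucket/tbl/data/b.parquet",
        "s3://bucket/tbl/data/a.parquet",
    ]
)
print(unique)  # ['s3://bucket/tbl/data/a.parquet', 's3://bucket/tbl/data/b.parquet']
print(dupes)   # ['s3://bucket/tbl/data/a.parquet']
```

One consequence of grouping by basename worth keeping in mind: two distinct files that share a file name but live in different directories would be treated as duplicates under this rule.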

### Main Changes

1. Deduplication Logic Improvements
   - Fixed `MaintenanceTable._get_all_datafiles()` to properly handle `DataFile` objects
   - Improved handling of duplicate file references in current snapshot
   - Added proper SQLite connection cleanup in tests
   - Addressed resource warnings and connection leaks
2. Retention Strategy Optimization
   - Consolidated snapshot expiration logic
   - Fixed protected snapshot identification
   - Improved refs handling for branch and tag snapshots
   - Added comprehensive test coverage for retention scenarios
3. Code Quality & Test Infrastructure
   - Added proper Apache license headers to test files
   - Fixed test cleanup and resource management
   - Improved test assertions and error messages
   - Enhanced integration test setup

### PR Review Responses

- **Resource Management**
  - ✅ Added proper connection cleanup in `test_deduplicate_data_files_removes_duplicates_in_current_snapshot`
  - ✅ Fixed SQLite connection leaks in tests
- **Code Duplication**
  - ✅ Consolidated duplicate code between `_get_protected_snapshot_ids` implementations
  - ✅ Improved reuse of common functionality
- **Test Coverage**
  - ✅ Added comprehensive tests for retention strategies
  - ✅ Enhanced deduplication test cases
  - ✅ Improved test assertions and error handling
- **Documentation**
  - ✅ Added detailed docstrings
  - ✅ Improved code comments
  - ✅ Added proper license headers

### Testing Status

- ✅ All deduplication tests passing
- ✅ All retention strategy tests passing
- ✅ Integration tests configured (pending pyarrow dependency fix)
- ✅ No resource warnings or connection leaks

- …connections using `engine.dispose()` in test fixtures
- Test Resource Management: Added try/finally blocks to ensure cleanup happens even if tests fail
- Catalog Connection Handling: Modified both the `iceberg_catalog` and `prepopulated_table` fixtures to properly clean up database connections
- Mock Catalog Cleanup: Added cleanup for tests that replace the table catalog with mock objects
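The commit above refers to `engine.dispose()` (SQLAlchemy) inside pytest fixtures. To show the same try/finally shape without extra dependencies, the sketch below swaps in the standard library's `sqlite3` and a context manager; in a pytest suite the generator would typically carry `@pytest.fixture` instead. `catalog_connection` and `is_closed` are hypothetical names, not fixtures from this PR.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def catalog_connection(path: str = ":memory:") -> Iterator[sqlite3.Connection]:
    """Yield a SQLite connection and always close it afterwards."""
    conn = sqlite3.connect(path)
    try:
        yield conn
    finally:
        # Runs even when the body of the `with` block (the "test") raises.
        conn.close()


def is_closed(conn: sqlite3.Connection) -> bool:
    """Report whether a connection has been closed.

    sqlite3 raises ProgrammingError when a closed connection is used.
    """
    try:
        conn.execute("SELECT 1")
    except sqlite3.ProgrammingError:
        return True
    return False


# Simulate a failing test: the connection is still cleaned up.
try:
    with catalog_connection() as conn:
        held = conn
        raise RuntimeError("simulated test failure")
except RuntimeError:
    pass

print(is_closed(held))  # True
```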

…a files

- Added a new method `rebuild_current_snapshot` in the `MaintenanceTable` class to create a new snapshot with unique data files, removing duplicates while preserving unique entries.
- Integrated retry logic using the `tenacity` library to handle transient commit failures during the rebuild process.
- Enhanced the `_get_all_datafiles` method to utilize parallel processing for manifest file handling.
- Introduced comprehensive unit tests to validate the functionality of the new method, including scenarios with and without duplicates, as well as retry mechanisms.
- Updated `pyproject.toml` to include `cython` as a dependency and added mypy overrides for various modules to suppress missing import errors.
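The commit above names the `tenacity` library for retrying transient commit failures. The sketch below is a plain-Python stand-in for that idea (retry with exponential backoff, re-raise after the last attempt); it is neither the tenacity API nor the code from this PR. `CommitFailed` and `retry_commit` are hypothetical names.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


class CommitFailed(Exception):
    """Hypothetical stand-in for a transient commit error."""


def retry_commit(
    commit: Callable[[], T],
    attempts: int = 3,
    base_delay_s: float = 0.0,
    retry_on: Tuple[Type[BaseException], ...] = (CommitFailed,),
) -> T:
    """Call `commit` until it succeeds or `attempts` is exhausted."""
    for attempt in range(1, attempts + 1):
        try:
            return commit()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            # Exponential backoff: base, 2*base, 4*base, ...
            time.sleep(base_delay_s * 2 ** (attempt - 1))
    raise ValueError("attempts must be >= 1")


calls = {"n": 0}


def flaky_commit() -> str:
    """Fail twice with a transient error, then succeed."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise CommitFailed("conflicting commit, try again")
    return "committed"


print(retry_commit(flaky_commit, attempts=3))  # committed
```

Errors outside `retry_on` propagate immediately, so only failures that are expected to be transient are retried.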

ForeverAngry and others added 30 commits March 28, 2025 20:23

Added initial unit tests and Class for Removing a Snapshot

0a94d96

Added methods needed to expire snapshots by id, and optionally cleanup data

5f0b62b

Update test_expire_snapshots.py

f995daa

Added the builder method to __init__.py, updated the snapshot api with a new Expired Snapshot class. updated tests.

65365e1

Snapshots are not being transacted on, but need to re-assign refs

e28815f

ValueError: Cannot expire snapshot IDs {3051729675574597004} as they are currently referenced by table refs.

Fixed the test case.

4628ede

adding print statements to help with debugging

e80c41c

Draft ready

cb9f0c9

Applied suggestions to Fix CICD

ebcff2b

Merge branch 'main' into main

97399bf

Rebuild the poetry lock file.

95e5af2

Merge branch 'main' into main

5ab5890

Refactor implementation of ExpireSnapshots

5acd690

Moved expiration-related methods from `ExpireSnapshots` to `ManageSnapshots` for improved organization and clarity. Updated corresponding pytest tests to reflect these changes.

Fixed format and linting issues

d30a08c

Re-ran the `poetry run pre-commit run --all-files` command on the project.

Merge branch 'main' into main

e62ab58

Fixed format and linting issues

1af3258

Re-ran the `poetry run pre-commit run --all-files` command on the project.

Merge branch 'main' of https://github.com/ForeverAngry/iceberg-python

352b48f

Merge branch 'main' into main

382e0ea

rebased: from main

549c183

fixed: typo

386cb15

removed errant files

12729fa

Added: public method signature to the init table file.

ce3515c

Moved: the functions for expiring snapshots to their own class.

Removed: expire_snapshots_older_than method, in favor of implementing it in a separate issue. Fixed: unrelated changes caused by a fork/branch sync issues.

28fce4b

Update tests/table/test_expire_snapshots.py

2c3153e

Co-authored-by: Fokko Driesprong <[email protected]>

Removed: unrelated changes, Added: logic to expire snapshot method.

27c3ece

Implemented logic to protect the HEAD branches or Tagged branches from being expired by the `expire_snapshot_by_id` method.

feat: implement deduplication of data files in Iceberg table and remove obsolete test

fe73a34

refactor: remove obsolete expire_snapshots_older_than method

42e55c9

feat: enhance table maintenance with deduplication and snapshot retention features

0e6d45c

ForeverAngry added 30 commits July 5, 2025 01:20

Update .gitignore

fba592d

Update test_writes.py

b837f86

Merge branch 'main' into refactor/consolidate-snapshot-expiration

4605a04

refactor: remove obsolete test file for snapshot expiration

536528e

wip: enhance deduplication logic and improve data file handling in maintenance operations

6036e12

wip - refactor: update deduplication tests to use file names instead of full paths

9dc9c82

fix(tests): ensure commit_table is not called when no snapshots are expired

73658e0

refactor: remove unused expire_snapshots method and clean up transaction context in MaintenanceTable

a9a01ee

refactor: streamline data file retrieval in MaintenanceTable and enhance deduplication tests

8c906d2

Reverted changes back to prior commit version for _get_all_datafiles

0e72ccc

refactor: simplify snapshot expiration logic and clean up unused imports

cfb4061

Merge branch 'main' into refactor/consolidate-snapshot-expiration

9371bca

fix: add missing newline in API documentation for clarity

881fab9

refactor: update license header in test_retention_strategies.py

acb70da

feat: add license header to test_overwrite_files.py

54c1f7f

Update test_literals.py

4c6f86c

fix: update typing-extensions and mkdocs-material versions

03acf03

fix: update mkdocs-material and typing-extensions versions

55a156f

fix: remove unused parameter from _get_protected_snapshot_ids method

3a5c8e4

fix: remove unnecessary whitespace and improve code formatting in maintenance and test files (ran: make lint)

2e7e4cb

Merge branch 'main' into refactor/consolidate-snapshot-expiration

93a79b9

Moved the deduplicate logic found here: apache#2130 (comment) to a separate pr as suggested.

6cfc329

feat: implement deduplication of data files in MaintenanceTable and add corresponding tests

9ea7070

feat: add method to collect all DataFiles across snapshots and enhance deduplication logic

bd429dc

fix: clean up whitespace and improve formatting in deduplication logic and tests

df455ea

refactor: Update deduplicate_data_files method to return operation summary and remove rebuild_current_snapshot method

a0ec514
