
Feature/metadata deduplication 2130 #2216

Open: wants to merge 61 commits into base: main

ForeverAngry (Contributor) commented Jul 16, 2025

Rationale for this change

This PR deduplicates snapshot metadata as outlined in #2130. It is currently stacked on #2142 and depends on that PR being merged first.

Are these changes tested?

Yes.

Are there any user-facing changes?

No.

ForeverAngry and others added 30 commits March 28, 2025 20:23
…h a new Expired Snapshot class. updated tests.
 ValueError: Cannot expire snapshot IDs {3051729675574597004} as they are currently referenced by table refs.
Moved expiration-related methods from `ExpireSnapshots` to `ManageSnapshots` for improved organization and clarity.

Updated corresponding pytest tests to reflect these changes.
Re-ran the `poetry run pre-commit run --all-files` command on the project.
Moved: the functions for expiring snapshots to their own class.
…ng it in a separate issue.

Fixed: unrelated changes caused by a fork/branch sync issue.
Implemented logic to protect the HEAD branches or Tagged branches from being expired by the `expire_snapshot_by_id` method.
(1) apache#2130 with the addition of the new `deduplicate_data_files` function to the `MaintenanceTable` class.

(2) apache#2151 with the removal of the errant member variable from the `ManageSnapshots` class.

(3) apache#2150 by adding the additional functions to be at parity with the Java API.
- **Duplicate File Remediation apache#2130**
  - Added `deduplicate_data_files` to the `MaintenanceTable` class.
  - Enables detection and removal of duplicate data files, improving table hygiene and storage efficiency.

- **Support `retainLast` and `setMinSnapshotsToKeep` Snapshot Retention Policies apache#2150**
  - Added new snapshot retention methods to `MaintenanceTable` for feature parity with the Java API:
    - `retain_last_n_snapshots(n)`: Retain only the last N snapshots.
    - `expire_snapshots_older_than_with_retention(timestamp_ms, retain_last_n=None, min_snapshots_to_keep=None)`: Expire snapshots older than a timestamp, with additional retention constraints.
    - `expire_snapshots_with_retention_policy(timestamp_ms=None, retain_last_n=None, min_snapshots_to_keep=None)`: Unified retention policy supporting time-based and count-based constraints.
  - All retention logic respects protected snapshots (branches/tags) and includes guardrails to prevent over-aggressive expiration.
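
The combined policy described above can be sketched in plain Python. This is a minimal illustration, not the code in this PR: `Snap` and `select_expired` are hypothetical names, and the semantics are assumptions (protected snapshots are always kept, `retain_last_n` keeps the newest N regardless of age, and `min_snapshots_to_keep` is treated as a floor on how many snapshots remain).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Snap:
    """Hypothetical stand-in for a snapshot: an ID and a commit timestamp."""
    snapshot_id: int
    timestamp_ms: int


def select_expired(
    snapshots: list[Snap],
    protected_ids: set[int],
    timestamp_ms: Optional[int] = None,
    retain_last_n: Optional[int] = None,
    min_snapshots_to_keep: Optional[int] = None,
) -> set[int]:
    """Return the snapshot IDs that may be expired under the assumed policy."""
    # With neither a time-based nor a count-based constraint, expire nothing.
    if timestamp_ms is None and retain_last_n is None:
        return set()

    all_ids = {s.snapshot_id for s in snapshots}
    newest_first = sorted(snapshots, key=lambda s: s.timestamp_ms, reverse=True)

    # Snapshots referenced by a branch or tag are always kept.
    keep = protected_ids & all_ids

    # Count-based constraint: keep the newest N snapshots.
    if retain_last_n is not None:
        keep.update(s.snapshot_id for s in newest_first[:retain_last_n])

    # Time-based constraint: keep everything at or after the cutoff.
    if timestamp_ms is not None:
        keep.update(s.snapshot_id for s in newest_first if s.timestamp_ms >= timestamp_ms)

    # Guardrail: top up from newest to oldest until the floor is met.
    if min_snapshots_to_keep is not None:
        for s in newest_first:
            if len(keep) >= min_snapshots_to_keep:
                break
            keep.add(s.snapshot_id)

    return all_ids - keep
```

Computing a "keep" set first and subtracting it at the end means every constraint can only ever protect more snapshots, which is one simple way to get the "no over-aggressive expiration" property the bullet describes.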

### Bug Fixes & Cleanups

- **Remove unrelated instance variable from the `ManageSnapshots` class apache#2151**
  - Removed an errant member variable from the `ManageSnapshots` class, aligning the implementation with the intended design and the Java reference.

### Testing & Documentation

- Consolidated all snapshot expiration and retention tests into a single file (`test_retention_strategies.py`), covering:
  - Basic expiration by ID and timestamp.
  - Protection of branch/tag snapshots.
  - Retention guardrails and combined policies.
  - Deduplication of data files.
- Added and updated documentation to describe all new retention strategies, deduplication, and API parity improvements.
…Table

The deduplicate_data_files() method was not properly removing duplicate
data file references from Iceberg tables. After deduplication, multiple
references to the same data file remained instead of the expected single
reference.

Root causes:
1. _get_all_datafiles() was scanning ALL snapshots instead of current only
2. Incorrect transaction API usage that didn't leverage snapshot updates
3. Missing proper overwrite logic to create clean deduplicated snapshots

Key fixes:
- Modified _get_all_datafiles() to scan only current snapshot manifests
- Implemented proper transaction pattern using update_snapshot().overwrite()
- Added explicit delete_data_file() calls for duplicates + append_data_file() for unique files
- Removed unused helper methods _get_all_datafiles_with_context() and _detect_duplicates()

Technical details:
- Deduplication now operates on ManifestEntry objects from current snapshot only
- Files are grouped by basename and first occurrence is kept as canonical reference
- New snapshot created atomically replaces current snapshot with deduplicated file list
- Proper Iceberg transaction semantics ensure data consistency
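
The grouping step in the technical details can be sketched on plain path strings. This is an illustration only: `split_duplicates` is a hypothetical helper, and it works on strings rather than on the `ManifestEntry` objects the commit operates on.

```python
import posixpath
from collections import defaultdict


def split_duplicates(file_paths: list[str]) -> tuple[list[str], list[str]]:
    """Split paths into (kept, duplicates), keeping the first path seen per basename."""
    by_name: dict[str, list[str]] = defaultdict(list)
    for path in file_paths:
        # Group by the final path component, e.g. "a.parquet".
        by_name[posixpath.basename(path)].append(path)

    kept: list[str] = []
    duplicates: list[str] = []
    for paths in by_name.values():
        kept.append(paths[0])         # first occurrence is the canonical reference
        duplicates.extend(paths[1:])  # every later occurrence is a duplicate
    return kept, duplicates
```

In the flow the commit message describes, the duplicates would go to `delete_data_file()` and the kept files to `append_data_file()` inside a single `update_snapshot().overwrite()` transaction. Note that grouping by basename alone also treats identically named files in different directories as duplicates, which may or may not be intended.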

Tests: All deduplication tests now pass including the previously failing
test_deduplicate_data_files_removes_duplicates_in_current_snapshot

Fixes: Table maintenance deduplication functionality
### Main Changes

1. Deduplication Logic Improvements
   - Fixed `MaintenanceTable._get_all_datafiles()` to properly handle `DataFile` objects
   - Improved handling of duplicate file references in current snapshot
   - Added proper SQLite connection cleanup in tests
   - Addressed resource warnings and connection leaks
2. Retention Strategy Optimization
   - Consolidated snapshot expiration logic
   - Fixed protected snapshot identification
   - Improved refs handling for branch and tag snapshots
   - Added comprehensive test coverage for retention scenarios
3. Code Quality & Test Infrastructure
   - Added proper Apache license headers to test files
   - Fixed test cleanup and resource management
   - Improved test assertions and error messages
   - Enhanced integration test setup

### PR Review Responses

- **Resource Management**
  - ✅ Added proper connection cleanup in `test_deduplicate_data_files_removes_duplicates_in_current_snapshot`
  - ✅ Fixed SQLite connection leaks in tests
- **Code Duplication**
  - ✅ Consolidated duplicate code between `_get_protected_snapshot_ids` implementations
  - ✅ Improved reuse of common functionality
- **Test Coverage**
  - ✅ Added comprehensive tests for retention strategies
  - ✅ Enhanced deduplication test cases
  - ✅ Improved test assertions and error handling
- **Documentation**
  - ✅ Added detailed docstrings
  - ✅ Improved code comments
  - ✅ Added proper license headers

### Testing Status

- ✅ All deduplication tests passing
- ✅ All retention strategy tests passing
- ✅ Integration tests configured (pending pyarrow dependency fix)
- ✅ No resource warnings or connection leaks
…connections using engine.dispose() in test fixtures

Test Resource Management: Added try/finally blocks to ensure cleanup happens even if tests fail

Catalog Connection Handling: Modified both the iceberg_catalog and prepopulated_table fixtures to properly clean up database connections

Mock Catalog Cleanup: Added cleanup for tests that replace the table catalog with mock objects
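
The try/finally cleanup pattern can be sketched with the standard library alone. This sketch deliberately swaps in `sqlite3` and `contextlib` for SQLAlchemy and pytest, so `temp_catalog_db` is a hypothetical helper; in a SQLAlchemy-backed fixture the `finally` block would call `engine.dispose()` where this one calls `conn.close()`.

```python
import sqlite3
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def temp_catalog_db() -> Iterator[sqlite3.Connection]:
    """Yield an in-memory SQLite connection and always close it afterwards."""
    conn = sqlite3.connect(":memory:")
    try:
        yield conn
    finally:
        # Runs even if the body of the `with` block raises,
        # so a failing test cannot leak the connection.
        conn.close()
```

A pytest fixture written with `yield` follows the same shape: setup before the `yield`, cleanup in the `finally` after it.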
…a files

- Added a new method `rebuild_current_snapshot` in the MaintenanceTable class to create a new snapshot with unique data files, removing duplicates while preserving unique entries.
- Integrated retry logic using the `tenacity` library to handle transient commit failures during the rebuild process.
- Enhanced the `_get_all_datafiles` method to utilize parallel processing for manifest file handling.
- Introduced comprehensive unit tests to validate the functionality of the new method, including scenarios with and without duplicates, as well as retry mechanisms.
- Updated `pyproject.toml` to include `cython` as a dependency and added mypy overrides for various modules to suppress missing import errors.
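
The retry behavior around the commit can be sketched without any third-party dependency. The commit itself uses `tenacity`; the hand-rolled loop below is a stand-in for illustration, and `TransientCommitError` and `commit_with_retry` are hypothetical names, not part of PyIceberg or this PR.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientCommitError(Exception):
    """Hypothetical error type standing in for a retryable commit failure."""


def commit_with_retry(commit: Callable[[], T], attempts: int = 3, base_delay: float = 0.1) -> T:
    """Call `commit`, retrying on TransientCommitError with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return commit()
        except TransientCommitError:
            if attempt == attempts:
                raise  # Out of attempts: surface the last failure.
            # Back off: base_delay, then 2x, 4x, ... before the next try.
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Only the designated transient error is retried; any other exception propagates immediately, so genuine bugs are not masked by the retry loop.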
…mmary and remove rebuild_current_snapshot method