-
Notifications
You must be signed in to change notification settings - Fork 1.8k
fix(web-analytics): handle multi-set partitions on asset backfills #35231
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Greptile Summary
This PR fixes an important issue with partition handling in web analytics pre-aggregation, specifically addressing multi-day partition scenarios. Previously, when running backfills across multiple days, only the first day's partition was being dropped, which could lead to data inconsistencies. The changes ensure that all partitions within a given date range are properly dropped before new data insertion.
Key changes:
- Modified partition dropping logic to iterate through each day in the date range
- Added comprehensive test coverage for various partition scenarios (single day, multi-day, week, month)
- Improved error handling and logging for partition operations
- Added test cases for edge conditions like month/year transitions and invalid inputs
Confidence score: 5/5
- This PR is extremely safe to merge as it adds proper handling for a missing edge case with extensive test coverage
- The high score is justified by the comprehensive test suite covering various scenarios and edge cases, with proper error handling in place
- The changes are focused and well-contained, with the most attention needed on:
- dags/web_preaggregated_daily.py - verify partition dropping logic
- dags/tests/test_web_preaggregated_partitions.py - ensure all critical scenarios are covered
3 files reviewed, 1 comment
Edit PR Review Bot Settings | Greptile
# Partition might not exist when running for the first time or when running in a empty backfill, which is fine | ||
context.log.info(f"Partition for {date_start} doesn't exist or couldn't be dropped: {drop_error}") | ||
# For time windows: start is inclusive, end is exclusive (except for single-day partitions) | ||
while current_date < end_date or (current_date == start_datetime.date() == end_date): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a limit on the difference of days? like could current_date = end_date - a lot days ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, can't be more than the partition definition:
posthog/dags/web_preaggregated_daily.py
Line 43 in 3bfb9b7
partition_def = DailyPartitionsDefinition(start_date="2020-01-01") |
This could still be a lot of days and although this won't be the usual case, it is possible that we may want to execute it like that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, small comment but not blocker just a warning 🚀
Problem
We usually use single day partitions but we can run backfill for multiple partitions, specially on dev or the US cluster. We need to make sure we're dropping all partitions on that timeframe correctly.
Changes
How did you test this code?
Manually and covering the unit tests
Did you write or update any docs for this change?
👉 Stay up-to-date with PostHog coding conventions for a smoother review.