fix(web-analytics): handle multi-set partitions on asset backfills #35231

Merged
merged 2 commits into from
Jul 18, 2025

Conversation

lricoy
Member

@lricoy lricoy commented Jul 17, 2025

Problem

We usually use single-day partitions, but a backfill can span multiple partitions, especially on dev or the US cluster. We need to make sure we drop every partition in that timeframe correctly.

Changes

  • Handle multi-date-range partition dropping correctly

How did you test this code?

Manually, and by covering the change with unit tests

Did you write or update any docs for this change?



Contributor

@greptile-apps greptile-apps bot left a comment

Greptile Summary

This PR fixes an important issue with partition handling in web analytics pre-aggregation, specifically addressing multi-day partition scenarios. Previously, when running backfills across multiple days, only the first day's partition was being dropped, which could lead to data inconsistencies. The changes ensure that all partitions within a given date range are properly dropped before new data insertion.

Key changes:

  • Modified partition dropping logic to iterate through each day in the date range
  • Added comprehensive test coverage for various partition scenarios (single day, multi-day, week, month)
  • Improved error handling and logging for partition operations
  • Added test cases for edge conditions like month/year transitions and invalid inputs

Confidence score: 5/5

  1. This PR is extremely safe to merge as it adds proper handling for a missing edge case with extensive test coverage
  2. The high score is justified by the comprehensive test suite covering various scenarios and edge cases, with proper error handling in place
  3. The changes are focused and well-contained, with the most attention needed on:
    • dags/web_preaggregated_daily.py - verify partition dropping logic
    • dags/tests/test_web_preaggregated_partitions.py - ensure all critical scenarios are covered

3 files reviewed, 1 comment

@lricoy lricoy enabled auto-merge (squash) July 17, 2025 23:22
@lricoy lricoy requested a review from rafaeelaudibert July 17, 2025 23:22
@lricoy lricoy changed the title fix(web-analytics): handle multi-set partition correctly fix(web-analytics): handle multi-set partitions for backfill Jul 17, 2025
@lricoy lricoy changed the title fix(web-analytics): handle multi-set partitions for backfill fix(web-analytics): handle multi-set partitions on backfills Jul 17, 2025
@lricoy lricoy changed the title fix(web-analytics): handle multi-set partitions on backfills fix(web-analytics): handle multi-set partitions on asset backfills Jul 17, 2025
@lricoy lricoy requested review from a team, robbie-c and jabahamondes and removed request for a team July 17, 2025 23:43
# Partition might not exist when running for the first time or when running in an empty backfill, which is fine
context.log.info(f"Partition for {date_start} doesn't exist or couldn't be dropped: {drop_error}")
# For time windows: start is inclusive, end is exclusive (except for single-day partitions)
while current_date < end_date or (current_date == start_datetime.date() == end_date):
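
A minimal sketch of the day-by-day iteration that the loop condition above implies, assuming the loop body advances `current_date` by one day per pass. The helper name `iter_partition_dates` is hypothetical and is not taken from the PR; only the inclusive-start, exclusive-end rule and the single-day special case come from the snippet.

```python
from datetime import date, datetime, timedelta
from typing import Iterator


def iter_partition_dates(start_datetime: datetime, end_datetime: datetime) -> Iterator[date]:
    """Yield each day in [start, end), or the single day when both fall on the same date.

    Hypothetical helper: the name and signature are assumptions for illustration.
    """
    current_date = start_datetime.date()
    end_date = end_datetime.date()

    # Single-day partition: start and end share a date, so yield it exactly once.
    if current_date == end_date:
        yield current_date
        return

    # Time window: start is inclusive, end is exclusive.
    while current_date < end_date:
        yield current_date
        current_date += timedelta(days=1)


if __name__ == "__main__":
    # A window that crosses a year boundary.
    for day in iter_partition_dates(datetime(2024, 12, 30), datetime(2025, 1, 2)):
        print(day.isoformat())
```

Each yielded date would correspond to one partition to drop before inserting new data. How a date maps to a partition identifier depends on the table definition and is not shown here.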
Contributor

Is there a limit on the number of days in the range? For example, could current_date be many days before end_date?

Member Author

Yep, can't be more than the partition definition:

partition_def = DailyPartitionsDefinition(start_date="2020-01-01")

That can still be a lot of days. It won't be the usual case, but we may occasionally want to run a backfill over a range that large.
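
For a rough sense of scale, here is a small sketch that counts the days between the partition definition's start date and a given date. The constant and function names are hypothetical; only the `2020-01-01` start date comes from the definition above, and the exact set of partitions Dagster exposes at a given moment may differ slightly from this count.

```python
from datetime import date

# Start date taken from DailyPartitionsDefinition(start_date="2020-01-01").
PARTITION_START = date(2020, 1, 1)


def max_days_in_backfill(as_of: date) -> int:
    """Upper bound on loop iterations: days from PARTITION_START up to, but not including, as_of.

    Hypothetical helper for illustration only.
    """
    return max((as_of - PARTITION_START).days, 0)


if __name__ == "__main__":
    print(max_days_in_backfill(date(2025, 7, 18)))
```

So a backfill covering the whole definition would issue on the order of a couple of thousand drop attempts, one per day.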

Contributor

@jabahamondes jabahamondes left a comment

LGTM. One small comment, not a blocker, just a warning 🚀

@lricoy lricoy merged commit a4c6ea1 into master Jul 18, 2025
167 checks passed
@lricoy lricoy deleted the fix/handle-multi-set-partition branch July 18, 2025 17:33