Skip to content

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented Jun 25, 2025

NOTE: this PR relies on #150 which introduces the TIMDEXDatasetMetadata class. Merged!

Purpose and background context

Summary of the migration:

"""
Description:

This migration will ensure that all rows, for all parquet files, for a given ETL run have
the same run_timestamp.

When migration 001 was performed, it did not take into consideration that ETL runs with
100k+ records -- e.g. full Alma or DSpace -- would span multiple parquet files.  With
multiple parquet files came multiple S3 object dates which is what was used to backfill
that run_timestamp column.

This discovery led to TIMX-509, which ensures that a single run_timestamp can be applied
to all rows / files for a given ETL run.  Now that TIMX-509 is complete and deployed,
ensuring all *future* rows are written correctly, this migration is needed to update
a subset of run_timestamp values from migration 001.

The approach is fairly simple and uses the new TIMDEXDatasetMetadata class:

1. retrieve metadata for all records
2. for a given ETL run (run_id), find the earliest run_timestamp
3. apply that run_timestamp to all rows / files for that run_id
"""

If and when approved, this migration will be run manually during a window in the middle of the day when the TIMDEX StepFunction is not running (though it would not be problematic if it was).

How can a reviewer manually see the effects of these changes?

This migration has been applied to a clone of the dataset in Dev at s3://timdex-extract-dev-222053980223/dataset_backups/prod-2025-06-07/.

The following shows counts of records and current records before the update:

┌───────────────────┬───────────────┬─────────────────┐
│      source       │ total_records │ current_records │
│      varchar      │     int64     │      int64      │
├───────────────────┼───────────────┼─────────────────┤
│ alma              │       4633029 │          667106 │  <-----
│ aspace            │          3448 │            1300 │
│ dspace            │        149443 │           47636 │        
│ gismit            │          2061 │            2061 │
│ gisogm            │        181582 │           46075 │
│ libguides         │          1274 │             371 │
│ researchdatabases │          3912 │             907 │
└───────────────────┴───────────────┴─────────────────┘

Note the low number of current records for alma. Due to how we identify current records -- sorting by this important run_timestamp column -- the different run_timestamp values for a large, full alma incorrectly suggest only the last parquet file written is the "most current" full run.

The following show counts after the migration is applied, where all rows / files for a given ETL run have the same run_timestamp, and thus our identifying of current records is accurate:

┌───────────────────┬───────────────┬─────────────────┐
│      source       │ total_records │ current_records │
│      varchar      │     int64     │      int64      │
├───────────────────┼───────────────┼─────────────────┤
│ alma              │       4572861 │         3906194 │  <-----
│ aspace            │          3341 │            1300 │
│ dspace            │        148869 │          146034 │
│ gismit            │          2061 │            2061 │
│ gisogm            │        168157 │          114019 │
│ libguides         │          1196 │             371 │
│ researchdatabases │          3902 │             906 │
└───────────────────┴───────────────┴─────────────────┘

NOTE: the total_records count is slightly off, as the pre-fix counts are actually from a more recent dataset clone.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

YES: after the migration is applied any context that is accessing current records will be accurate

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

@ghukill ghukill marked this pull request as ready for review June 27, 2025 15:04
@ghukill ghukill requested a review from a team June 27, 2025 15:04
Copy link
Contributor

@jonavellecuerdo jonavellecuerdo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed the migration script and the query with @ghukill and it's all making sense to me. Approved! 🎉

@ghukill ghukill merged commit 9298e7e into main Jun 27, 2025
2 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants