TIMX 508 - run timestamp data migration #151
Merged
NOTE: this PR relies on #150, which introduces the TIMDEXDatasetMetadata class.
Purpose and background context
Summary of the migration:
If and when approved, this migration will be run manually during a window in the middle of the day when the TIMDEX StepFunction is not running (though it would not be problematic if it were).
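As a rough illustration of what the migration is meant to achieve, here is a minimal sketch in plain Python. It is not the code in this PR: the helper names (normalize_run_timestamps, rows_in_latest_full_run), the use of run_id to group rows, the choice of the earliest timestamp as the shared value, and the sample rows are all assumptions made for illustration. The "current records" check is a simplified stand-in that only looks at rows carrying the newest full-run run_timestamp.

```python
from datetime import datetime


def normalize_run_timestamps(rows):
    """Return copies of rows where every row of a run shares one run_timestamp.

    Assumption: rows are grouped by a run_id field, and the earliest
    timestamp seen for a run is used as the shared value.
    """
    earliest = {}
    for row in rows:
        run_id = row["run_id"]
        ts = row["run_timestamp"]
        if run_id not in earliest or ts < earliest[run_id]:
            earliest[run_id] = ts
    return [{**row, "run_timestamp": earliest[row["run_id"]]} for row in rows]


def rows_in_latest_full_run(rows, source):
    """Simplified stand-in for identifying current records.

    Picks the newest run_timestamp among full-run rows for a source and
    returns only the rows that carry exactly that timestamp.
    """
    full = [r for r in rows if r["source"] == source and r["run_type"] == "full"]
    if not full:
        return []
    latest = max(r["run_timestamp"] for r in full)
    return [r for r in full if r["run_timestamp"] == latest]


# Hypothetical rows: one full alma run written as three parquet files,
# each file stamped a few seconds apart.
rows = [
    {"timdex_record_id": "alma:1", "source": "alma", "run_type": "full",
     "run_id": "run-a", "run_timestamp": datetime(2025, 6, 1, 12, 0, 0)},
    {"timdex_record_id": "alma:2", "source": "alma", "run_type": "full",
     "run_id": "run-a", "run_timestamp": datetime(2025, 6, 1, 12, 0, 5)},
    {"timdex_record_id": "alma:3", "source": "alma", "run_type": "full",
     "run_id": "run-a", "run_timestamp": datetime(2025, 6, 1, 12, 0, 9)},
]

before = rows_in_latest_full_run(rows, "alma")
after = rows_in_latest_full_run(normalize_run_timestamps(rows), "alma")
print(len(before), len(after))  # 1 3
```

Before normalization only the row from the last-written file looks current; once every row in the run shares a single run_timestamp, the whole run is picked up.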
How can a reviewer manually see the effects of these changes?
This migration has been applied to a clone of the dataset in Dev at s3://timdex-extract-dev-222053980223/dataset_backups/prod-2025-06-07/.
The following shows counts of records and current records before the update:
Note the low number of current records for alma. Because we identify current records by sorting on the important run_timestamp column, the differing run_timestamp values across a large, full alma run incorrectly suggest that only the last parquet file written belongs to the "most current" full run.
The following shows counts after the migration is applied, where all rows / files for a given ETL run have the same run_timestamp, and thus our identification of current records is accurate:
NOTE: the total_records count is slightly off, as the pre-fix counts are actually from a more recent dataset clone.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
YES: after the migration is applied, any context that accesses current records will get accurate results
What are the relevant tickets?
Developer
Code Reviewer(s)