TIMX 496 - Add run_timestamp column to dataset #146
Conversation
Why these changes are being introduced:

During work on yielding only "current" records from the dataset, where ordering of the ETL runs in the dataset is critical, it was determined that more time granularity was needed for each ETL run. Currently we store the YYYY-MM-DD for each run, but if multiple runs occur on the same day, we are unable to order them more granularly.

How this addresses that need:

* Adds new run_timestamp to parquet dataset schema
* Timestamp is minted before any runs are written, and then used for each row in the ETL run

Side effects of this change:

* All TIMDEX components that use this library for reading and writing will need a terraform rebuild to pick up this change. Otherwise, they need no further modification.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-496
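Purely as an illustration of the "minted once, used for each row" behavior described above, here is a minimal sketch using pyarrow directly. The field name mirrors the PR, but the timestamp type, the record shape, and everything else are assumptions; this is not the library's actual write path.

```python
# Sketch only: mint one UTC timestamp per ETL run and stamp it onto every row
# written for that run. Schema and records are illustrative, not the real ones.
from datetime import datetime, timezone

import pyarrow as pa

schema = pa.schema(
    [
        pa.field("timdex_record_id", pa.string()),
        pa.field("run_timestamp", pa.timestamp("us", tz="UTC")),  # assumed type
    ]
)

run_timestamp = datetime.now(tz=timezone.utc)  # minted once, before any rows are written
records = [{"timdex_record_id": "rec-1"}, {"timdex_record_id": "rec-2"}]  # hypothetical rows

table = pa.Table.from_pylist(
    [{**record, "run_timestamp": run_timestamp} for record in records],
    schema=schema,
)
print(table.column("run_timestamp"))  # the same timestamp for every row in the run
```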
Pull Request Test Coverage Report for Build 15349942489

💛 - Coveralls
@MITLibraries/dataeng - for the backfill script work, we will have accurate [...] Using [...]
The following is a draft of the script that will be used to backfill the column:

```python
"""
Example usage:

PYTHONPATH=. pipenv run python output/timestamp_backfill/backfill_run_timestamp_column.py \
    /Users/ghukill/dev/mit/data/timdex_dataset/prod_small_subset_1 \
    output/timestamp_backfill/overrides.csv \
    --dry-run
"""

import argparse
import json
import os
from datetime import datetime

import boto3
import pandas as pd
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

from timdex_dataset_api.dataset import TIMDEXDataset, TIMDEX_DATASET_SCHEMA
from timdex_dataset_api.config import configure_dev_logger, configure_logger

configure_dev_logger()
logger = configure_logger(__name__)


def backfill_parquet_file(
    parquet_filepath: str,
    dataset: ds.Dataset,
    run_timestamp_overrides: dict,
    dry_run: bool = False,
) -> tuple[bool, dict | str]:
    parquet_file = pq.ParquetFile(parquet_filepath, filesystem=dataset.filesystem)

    if "run_timestamp" in parquet_file.schema.names:
        logger.info(
            f"Parquet file already has 'run_timestamp' column, skipping: {parquet_filepath}"
        )
        return (True, {"file_path": parquet_filepath})

    try:
        table = parquet_file.read()
        run_id = table.column("run_id")[0].as_py()

        # Check if run_id exists in overrides - error if not found
        if run_id not in run_timestamp_overrides:
            error_msg = f"run_id '{run_id}' not found in timestamp overrides CSV"
            logger.error(error_msg)
            raise ValueError(error_msg)

        run_timestamp_override = run_timestamp_overrides[run_id]
        run_timestamp = datetime.fromisoformat(
            run_timestamp_override.replace("Z", "+00:00")
        )

        # create run_timestamp column using the exact schema definition
        num_rows = len(table)
        run_timestamp_field = TIMDEX_DATASET_SCHEMA.field("run_timestamp")
        run_timestamp_array = pa.array(
            [run_timestamp] * num_rows, type=run_timestamp_field.type
        )

        # add the run_timestamp column to the table
        table_with_timestamp = table.append_column("run_timestamp", run_timestamp_array)

        # write the updated table back to the same file
        if not dry_run:
            pq.write_table(
                table_with_timestamp,
                parquet_filepath,
                filesystem=dataset.filesystem,
            )

        update_details = {
            "file_path": parquet_filepath,
            "run_id": run_id,
            "rows_updated": num_rows,
            "run_timestamp_added": run_timestamp.isoformat(),
        }
        return (True, update_details)

    except Exception as e:
        logger.error(f"Error processing parquet file {parquet_filepath}: {e}")
        return (False, str(e))


def backfill_dataset(location: str, timestamp_csv: str, dry_run: bool = False) -> None:
    td = TIMDEXDataset(location)
    td.load()
    parquet_files = td.dataset.files
    logger.info(f"Found {len(parquet_files)} parquet files in dataset.")

    # Always load timestamp overrides CSV (required)
    logger.info(f"Loading timestamp overrides CSV: {timestamp_csv}")
    overrides_df = pd.read_csv(timestamp_csv)
    overrides = dict(overrides_df.values)
    logger.info(f"Loaded {len(overrides)} timestamp overrides")

    for i, parquet_file in enumerate(parquet_files):
        logger.info(
            f"Working on parquet file {i + 1}/{len(parquet_files)}: {parquet_file}"
        )
        result = backfill_parquet_file(
            parquet_file, td.dataset, run_timestamp_overrides=overrides, dry_run=dry_run
        )
        logger.info(json.dumps(result))


def generate_timestamps_overrides_csv(filepath: str) -> str:
    state_machine_arn = os.getenv("STEPFUNCTION_ARN")
    if not state_machine_arn:
        raise ValueError("STEPFUNCTION_ARN environment variable is required")

    stepfunctions = boto3.client("stepfunctions")

    # get all executions for the state machine
    executions = []
    paginator = stepfunctions.get_paginator("list_executions")
    for page in paginator.paginate(stateMachineArn=state_machine_arn):
        executions.extend(page["executions"])

    # create dataframe
    data = []
    for execution in executions:
        execution_id = execution["executionArn"].split(":")[-1]

        # format timestamp to UTC with microseconds and append
        start_date = execution["startDate"]
        timestamp_utc = start_date.strftime("%Y-%m-%d %H:%M:%S.%f")
        data.append({"run_id": execution_id, "run_timestamp": timestamp_utc})

    df = pd.DataFrame(data)
    df.to_csv(filepath, index=False)
    logger.info(f"Generated timestamp overrides CSV with {len(data)} entries: {filepath}")
    return filepath


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Backfill run_timestamp column in TIMDEX parquet files"
    )
    parser.add_argument(
        "--dry-run",
        action="store_true",
        help="Scan files and report what would be done without making changes",
    )
    parser.add_argument(
        "dataset_location", help="Path to the dataset (local path or s3://bucket/path)"
    )
    parser.add_argument(
        "timestamp_csv",
        help="Path where the timestamp CSV will be generated and used for overrides",
    )
    args = parser.parse_args()

    logger.info("Generating timestamp overrides CSV from StepFunction executions...")
    timestamp_csv = generate_timestamps_overrides_csv(args.timestamp_csv)

    backfill_dataset(
        args.dataset_location, timestamp_csv=timestamp_csv, dry_run=args.dry_run
    )
```
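For reference, the overrides CSV produced by `generate_timestamps_overrides_csv()` (and consumed by `backfill_dataset()` via `dict(overrides_df.values)`) boils down to a `run_id` → `run_timestamp` mapping. A small sketch with hypothetical values:

```python
# Sketch of the overrides CSV contents (run_ids and timestamps are hypothetical):
#
#   run_id,run_timestamp
#   abc123-execution,2025-01-01 12:00:00.000000
#   def456-execution,2025-01-02 09:30:15.123456
#
# After pd.read_csv() + dict(overrides_df.values), this becomes:
overrides = {
    "abc123-execution": "2025-01-01 12:00:00.000000",
    "def456-execution": "2025-01-02 09:30:15.123456",
}
```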
ehanson8
left a comment
Looks good and the associated backfill work is well-considered!
tests/test_dataset.py (Outdated)

```python
# assert TIMDEXDataset.write() applies current time as run_timestamp
row_dict = next(dataset_with_same_day_runs.read_dicts_iter())
assert "run_timestamp" in row_dict
assert row_dict["run_timestamp"] == datetime(
```
Totally optional: you could use strftime() so it's immediately obvious that it matches @pytest.mark.freeze_time("2025-05-22 01:23:45.567890")
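For illustration, the suggested assertion might read something like the sketch below; the frozen-time string is copied from the decorator mentioned above, but this is not code from the PR itself.

```python
# with @pytest.mark.freeze_time("2025-05-22 01:23:45.567890") applied to the test,
# comparing formatted strings makes the relationship to the frozen time obvious
assert (
    row_dict["run_timestamp"].strftime("%Y-%m-%d %H:%M:%S.%f")
    == "2025-05-22 01:23:45.567890"
)
```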
@jonavellecuerdo, @ehanson8 - if and when this code change is merged, as a next step I am thinking that I would like to introduce a new root level location for dataset migrations. Example structure:

No real bearing on this PR, but noting I'm planning a followup PR for this backfill work as a formalized dataset migration.
jonavellecuerdo
left a comment
@ghukill I think this is looking good! I have a change request for an update to one of the new tests that I think would be helpful in understanding the main reason for this update. 🤓
Purpose and background context
1- TDA Library Code Changes
During work on yielding only "current" records from the dataset, where ordering of the ETL runs in the dataset is critical, it was determined that more time granularity was needed for each ETL run.
Currently we store the `YYYY-MM-DD` for each run in the `run_date` column, but if multiple runs occur on the same day, we are unable to order them more granularly from this alone.

How this addresses that need:

- Adds new `run_timestamp` to parquet dataset schema

2- Bulk data work
In addition to this code change, we will need to retroactively add a `run_timestamp` to all past ETL runs in the parquet dataset. A Python script with one-time code will be used. This code won't be committed to this repository -- unless it makes sense to build some kind of bulk editing utility into this library? -- but will be shared with the team for input before running.

The flow can be summarized as the following:

- Prepare an input CSV of `run_timestamp`'s for specific `run_id`'s
  - if the `run_id` is not included in the input CSV, set a `run_timestamp` of `YYYY-MM-DD 12:00:01:0000`; one second after midnight (see the sketch below)
  - if the `run_id` is included in the input CSV, use the explicit `run_timestamp` provided
- Rewrite each parquet file, adding the `run_timestamp` from the CSV for the `run_id` associated with the parquet file

An interesting quality of parquet datasets is that "atomic updates" (i.e. adding a new column + value for rows) is not something well supported. So this script will effectively rewrite the dataset, parquet file-by-file, to introduce this new column. The upside to this work is establishing some experience and workflows for performing such a bulk update.
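As a point of comparison, the draft script shared above raises an error when a `run_id` is missing from the CSV rather than applying the fallback described in the first sub-bullet. A minimal sketch of that fallback, assuming "one second after midnight" means 00:00:01 UTC on the run's `run_date` (the function name is illustrative, not code from this PR):

```python
# Sketch only: derive a fallback run_timestamp from an existing run_date value
# when no override is available for the run_id.
from datetime import date, datetime, time, timezone


def derive_fallback_run_timestamp(run_date: date) -> datetime:
    """Return run_date at 00:00:01 UTC, i.e. one second after midnight."""
    return datetime.combine(run_date, time(0, 0, 1), tzinfo=timezone.utc)


print(derive_fallback_run_timestamp(date(2025, 1, 1)))  # 2025-01-01 00:00:01+00:00
```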
The production parquet dataset has ~~roughly 400+ parquet files, probably around 10gb total data~~ ~597 parquet files @ 10gb. But an interesting dimension here is that most files in the dataset are static and not touched after an ETL run. So we could theoretically be writing these updated files to the dataset even while ETL runs (with the new code + column!) are occurring. Which is all to say, actually performing this work shouldn't require much of any coordination.

3- Order of work

1. Merge and deploy this code change, so new ETL runs write the `run_timestamp` column to new parquet files
2. Run the backfill script to retroactively add `run_timestamp` to older parquet files

A strength of parquet datasets is schema evolution, meaning there is nothing inherently wrong with some parquet files missing a column. Any writes after the code goes live will include this new column, but reads are not expecting or using it yet, so it's okay if older files are missing it. This means we are not rushed to run our bulk update script.
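A small, self-contained sketch of that schema-evolution behavior, using a throwaway two-column schema and temp files rather than the real TIMDEX dataset: when a read supplies an explicit schema, files written before the column existed simply return nulls for it.

```python
# Sketch: reading a dataset where only some parquet files have run_timestamp.
from datetime import datetime, timezone
from pathlib import Path
import tempfile

import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

with tempfile.TemporaryDirectory() as tmp:
    # one file written before the new column existed, one written after
    old = pa.table({"run_id": ["run-1"]})
    new = pa.table(
        {
            "run_id": ["run-2"],
            "run_timestamp": pa.array(
                [datetime(2025, 1, 1, tzinfo=timezone.utc)],
                type=pa.timestamp("us", tz="UTC"),
            ),
        }
    )
    pq.write_table(old, Path(tmp) / "old.parquet")
    pq.write_table(new, Path(tmp) / "new.parquet")

    # reading with an explicit schema fills the missing column with nulls
    schema = pa.schema(
        [("run_id", pa.string()), ("run_timestamp", pa.timestamp("us", tz="UTC"))]
    )
    table = ds.dataset(tmp, format="parquet", schema=schema).to_table()
    for row in table.to_pylist():
        print(row)  # run-1's run_timestamp is None; run-2's has a real value
```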
How can a reviewer manually see the effects of these changes?
0- Set env vars:
1- Open IPython shell
2- Perform two "full" writes that occur on the same day, `2025-01-01`:

After these writes, we can see two parquet files under the same day partitions:
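The exact TIMDEXDataset write calls from the original walkthrough aren't reproduced in this excerpt; the pyarrow-only sketch below just illustrates the end state being described (two "full" runs, 5 seconds apart, landing under the same `run_date=2025-01-01` partition). The local path and column values are hypothetical.

```python
# Sketch only: two same-day "full" runs written under one run_date partition.
from datetime import date, datetime, timezone

import pyarrow as pa
import pyarrow.dataset as ds

location = "/tmp/timdex_dataset_demo"  # hypothetical local dataset location

for run_id, seconds in [("run-1", 0), ("run-2", 5)]:  # run-2 is 5 seconds later
    table = pa.table(
        {
            "timdex_record_id": [f"{run_id}-record-1"],
            "run_date": [date(2025, 1, 1)],
            "run_type": ["full"],
            "run_id": [run_id],
            "run_timestamp": [datetime(2025, 1, 1, 9, 0, seconds, tzinfo=timezone.utc)],
        }
    )
    ds.write_dataset(
        table,
        location,
        format="parquet",
        partitioning=ds.partitioning(
            pa.schema([("run_date", pa.date32())]), flavor="hive"
        ),
        existing_data_behavior="overwrite_or_ignore",
        basename_template=f"{run_id}-{{i}}.parquet",
    )

print(ds.dataset(location, format="parquet", partitioning="hive").files)
# two parquet files, both under .../run_date=2025-01-01/
```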
3- Load a dataframe with all rows and inspect new `run_timestamp` column values:

- the `run_timestamp` value for `run-2` is 5 seconds after `run-1`
- `run_timestamp` has microsecond accuracy, which should be more than sufficient for even virtually simultaneous ETL runs
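One way to do that inspection with plain pyarrow/pandas (the dataset location is the hypothetical one from the sketch above; the library's own read call from the walkthrough isn't reproduced here):

```python
# Sketch: load all rows to pandas and compare run_timestamp values per run_id.
import pyarrow.dataset as ds

location = "/tmp/timdex_dataset_demo"  # hypothetical local dataset location
df = ds.dataset(location, format="parquet", partitioning="hive").to_table().to_pandas()

print(df[["run_id", "run_timestamp"]].drop_duplicates().sort_values("run_timestamp"))
# run-2's run_timestamp is 5 seconds after run-1's, with microsecond precision
```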
load(current_records=True)to see thatrun_timestampis utilized to correctly to only getrun-2records given it's the most recent "full" run:Includes new or updated dependencies?
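And a sketch of that final check; `load(current_records=True)` and `read_dicts_iter()` are taken from this PR's own references, while the location and expected output are assumptions:

```python
# Sketch: current_records=True should yield only rows from the most recent
# "full" run (run-2 here), with run_timestamp providing the ordering.
from timdex_dataset_api.dataset import TIMDEXDataset

td = TIMDEXDataset("/tmp/timdex_dataset_demo")  # hypothetical location
td.load(current_records=True)

run_ids = {row["run_id"] for row in td.read_dicts_iter()}
print(run_ids)  # expected: {"run-2"}
```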
YES
Changes expectations for external applications?
YES: All TIMDEX components that use this library for reading and writing will need a terraform rebuild to pick up this change. Otherwise, they need no further modification.
What are the relevant tickets?

- https://mitlibraries.atlassian.net/browse/TIMX-496
Developer
Code Reviewer(s)