Skip to content

Conversation

@ghukill
Copy link
Contributor

@ghukill ghukill commented Dec 5, 2024

Purpose and background context

The primary purpose of this PR is to move the run_id partition before the action partition. Also included are (inconsequential) updates to testing utilities and some updates to logging after write.

Firstly, this re-ordering matches the original ordering proposed in the engineering plan for partitions. Unsure how/why these changed order along the way, but basically reverting to the original proposal.

Second, a more recent discussion surfaced why this ordering makes sense, outlined here as a development note in the engineering plan. In short: where possible, we'll want to avoid loading the entire dataset from the root of the dataset. As the number of parquet files and partitions grow in S3, it will increase the load time to scan files and read their metadata.

Virtually always in a TIMDEX run context we'll be interested in only the files for the current run. And for that run, we'll know the partitions: [source, run_date, run_type, run_id]. We will not know action, as Transmogrifier somewhat dynamically sets this for each record.

Given the partition values we know, and the ability of pyarrow.dataset to load a prefix -- confirmed works for local and S3 locations -- we can load only a subset of the full dataset. The effect is similar to loading the full thing, then filtering based on partitions, but we're short-circuiting the part where lots of files are touched just to be filtered out.

With run_id before `action, we can load a dataset like this nearly instantly, confident it will contain all records for the run:

from pyarrow.dataset import ds
dataset = ds.dataset("s3://timdex-bucket/dataset/source=alma/run-date=2024-12-01/run-type=full/run-id=abc123")

TIMX-425 is focused on the ergonomics of this, where the TIMDEXDataset.load() will allow passing of partition values, but under the hood will apply this prefix approach. All of this hinges on run_id being "above" or "before" the action partition.

How can a reviewer manually see the effects of these changes?

1- start ipython shell

pipenv run ipython

2- create a small dataset

from tests.utils import generate_sample_records_with_simulated_partitions
from timdex_dataset_api import TIMDEXDataset

td = TIMDEXDataset(location="/tmp/test-dataset-ordering")
sample_records = generate_sample_records_with_simulated_partitions(5_000)
written_files = td.write(sample_records)

Releated to logging update, note the logged statistics from the run:

Dataset write complete - elapsed: 0.05s, total files: 99, total rows: 5000, total size: 159248

3- Observe the filename of the parquet file created, how run_id=XXX is before action=index

written_files[0].path
# Out[3]: '/tmp/test-dataset/source=alma/run_date=2024-01-01/run_type=full/run_id=7dc872e8-634b-4804-8580-083a586508d0/action=index/2e99bf8b-f827-45dd-8439-a778425546a8-0.parquet'

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

Developer

  • All new ENV is documented in README
  • All new ENV has been added to staging and production environments
  • All related Jira tickets are linked in commit message(s)
  • Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

  • The commit message is clear and follows our guidelines (not just this PR message)
  • There are appropriate tests covering any new functionality
  • The provided documentation is sufficient for understanding any new functionality introduced
  • Any manual tests have been performed or provided examples verified
  • New dependencies are appropriate or there were no changes

Why these changes are being introduced:

Discussions around retrieving subsets of records from the dataset suggested
that it would be beneficial to have the run_id partition before the
action partition.  This will allow using a prefix approach of partition
names and values when loading a dataset, that will pinpoint a particular
run even before the dataset is fully loaded.

This ordering was originally proposed in the engineering plan for this
library, but it switched somewhere along the way; so moving back to
agreed upon ordering.

How this addresses that need:
* Moves run_id before action in ordered partition columns

Side effects of this change:
* Omitting the action partition, and using everything until the
run_id partition, is sufficient for getting all records from a run.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-424
@coveralls
Copy link

coveralls commented Dec 5, 2024

Pull Request Test Coverage Report for Build 12186536063

Details

  • 12 of 12 (100.0%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 98.592%

Totals Coverage Status
Change from base Build 12185489246: 0.07%
Covered Lines: 140
Relevant Lines: 142

💛 - Coveralls

Copy link

@ehanson8 ehanson8 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Greatly appreciate the context for this change, works as expected!

)
yield batch

def log_write_statistics(self, start_time: float) -> None:
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very helpful log message!

@ghukill ghukill merged commit 5769260 into main Dec 9, 2024
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants