TIMX 424 - reorder partition columns #12

ghukill · 2024-12-05T18:56:25Z

Purpose and background context

The primary purpose of this PR is to move the run_id partition before the action partition. Also included are (inconsequential) updates to testing utilities and some updates to logging after write.

Firstly, this re-ordering matches the original ordering proposed in the engineering plan for partitions. Unsure how/why these changed order along the way, but basically reverting to the original proposal.

Second, a more recent discussion surfaced why this ordering makes sense, outlined here as a development note in the engineering plan. In short: where possible, we'll want to avoid loading the entire dataset from the root of the dataset. As the number of parquet files and partitions grow in S3, it will increase the load time to scan files and read their metadata.

Virtually always in a TIMDEX run context we'll be interested in only the files for the current run. And for that run, we'll know the partitions: [source, run_date, run_type, run_id]. We will not know action, as Transmogrifier somewhat dynamically sets this for each record.

Given the partition values we know, and the ability of pyarrow.dataset to load a prefix -- confirmed works for local and S3 locations -- we can load only a subset of the full dataset. The effect is similar to loading the full thing, then filtering based on partitions, but we're short-circuiting the part where lots of files are touched just to be filtered out.

With run_id before `action, we can load a dataset like this nearly instantly, confident it will contain all records for the run:

from pyarrow.dataset import ds
dataset = ds.dataset("s3://timdex-bucket/dataset/source=alma/run-date=2024-12-01/run-type=full/run-id=abc123")

TIMX-425 is focused on the ergonomics of this, where the TIMDEXDataset.load() will allow passing of partition values, but under the hood will apply this prefix approach. All of this hinges on run_id being "above" or "before" the action partition.

How can a reviewer manually see the effects of these changes?

1- start ipython shell

pipenv run ipython

2- create a small dataset

from tests.utils import generate_sample_records_with_simulated_partitions
from timdex_dataset_api import TIMDEXDataset

td = TIMDEXDataset(location="/tmp/test-dataset-ordering")
sample_records = generate_sample_records_with_simulated_partitions(5_000)
written_files = td.write(sample_records)

Releated to logging update, note the logged statistics from the run:

Dataset write complete - elapsed: 0.05s, total files: 99, total rows: 5000, total size: 159248

3- Observe the filename of the parquet file created, how run_id=XXX is before action=index

written_files[0].path
# Out[3]: '/tmp/test-dataset/source=alma/run_date=2024-01-01/run_type=full/run_id=7dc872e8-634b-4804-8580-083a586508d0/action=index/2e99bf8b-f827-45dd-8439-a778425546a8-0.parquet'

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

https://mitlibraries.atlassian.net/browse/TIMX-424

Developer

All new ENV is documented in README
All new ENV has been added to staging and production environments
All related Jira tickets are linked in commit message(s)
Stakeholder approval has been confirmed (or is not needed)

Code Reviewer(s)

The commit message is clear and follows our guidelines (not just this PR message)
There are appropriate tests covering any new functionality
The provided documentation is sufficient for understanding any new functionality introduced
Any manual tests have been performed or provided examples verified
New dependencies are appropriate or there were no changes

Why these changes are being introduced: Discussions around retrieving subsets of records from the dataset suggested that it would be beneficial to have the run_id partition before the action partition. This will allow using a prefix approach of partition names and values when loading a dataset, that will pinpoint a particular run even before the dataset is fully loaded. This ordering was originally proposed in the engineering plan for this library, but it switched somewhere along the way; so moving back to agreed upon ordering. How this addresses that need: * Moves run_id before action in ordered partition columns Side effects of this change: * Omitting the action partition, and using everything until the run_id partition, is sufficient for getting all records from a run. Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-424

coveralls · 2024-12-05T19:10:26Z

Pull Request Test Coverage Report for Build 12186536063

Details

12 of 12 (100.0%) changed or added relevant lines in 3 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.07%) to 98.592%

Totals
Change from base Build 12185489246:	0.07%
Covered Lines:	140
Relevant Lines:	142

💛 - Coveralls

ehanson8

Greatly appreciate the context for this change, works as expected!

ehanson8 · 2024-12-05T21:12:47Z

timdex_dataset_api/dataset.py

            )
            yield batch
+
+    def log_write_statistics(self, start_time: float) -> None:


Very helpful log message!

ghukill added 3 commits December 5, 2024 13:37

Improve accuracy of simulated records

7bec349

Save written files to TIMDEXDataset object and log

d7a1b29

ghukill requested review from ehanson8 and jonavellecuerdo December 5, 2024 18:57

bump version to v0.2.0

ddbd5fd

ehanson8 approved these changes Dec 5, 2024

View reviewed changes

timdex_dataset_api/dataset.py

)

yield batch

def log_write_statistics(self, start_time: float) -> None:

Copy link

ehanson8 Dec 5, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very helpful log message!

jonavellecuerdo approved these changes Dec 9, 2024

View reviewed changes

ghukill merged commit 5769260 into main Dec 9, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

TIMX 424 - reorder partition columns #12

TIMX 424 - reorder partition columns #12

Uh oh!

ghukill commented Dec 5, 2024 •

edited

Loading

Uh oh!

coveralls commented Dec 5, 2024 •

edited

Loading

Uh oh!

ehanson8 left a comment

Uh oh!

ehanson8 Dec 5, 2024

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

TIMX 424 - reorder partition columns #12

TIMX 424 - reorder partition columns #12

Uh oh!

Conversation

ghukill commented Dec 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Purpose and background context

How can a reviewer manually see the effects of these changes?

Includes new or updated dependencies?

Changes expectations for external applications?

What are the relevant tickets?

Developer

Code Reviewer(s)

Uh oh!

coveralls commented Dec 5, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Pull Request Test Coverage Report for Build 12186536063

Details

💛 - Coveralls

Uh oh!

ehanson8 left a comment

Choose a reason for hiding this comment

Uh oh!

ehanson8 Dec 5, 2024

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

ghukill commented Dec 5, 2024 •

edited

Loading

coveralls commented Dec 5, 2024 •

edited

Loading