TIMX 424 - reorder partition columns #12
Merged
Purpose and background context
The primary purpose of this PR is to move the `run_id` partition before the `action` partition. Also included are (inconsequential) updates to testing utilities and some updates to logging after write.
First, this reordering matches the original ordering proposed in the engineering plan for partitions. It is unclear how or why the order changed along the way, but this PR essentially reverts to the original proposal.
Second, a more recent discussion surfaced why this ordering makes sense, outlined here as a development note in the engineering plan. In short: where possible, we'll want to avoid loading the entire dataset from its root. As the number of parquet files and partitions in S3 grows, so does the time needed to scan those files and read their metadata.
Virtually always in a TIMDEX run context we'll be interested only in the files for the current run. And for that run, we'll know the partitions `[source, run_date, run_type, run_id]`. We will not know `action`, as Transmogrifier somewhat dynamically sets this for each record.
Given the partition values we know, and the ability of `pyarrow.dataset` to load a prefix (confirmed to work for both local and S3 locations), we can load only a subset of the full dataset. The effect is similar to loading the full thing and then filtering based on partitions, but we're short-circuiting the part where lots of files are touched just to be filtered out.
With `run_id` before `action`, we can load a dataset like this nearly instantly, confident it will contain all records for the run:
TIMX-425 is focused on the ergonomics of this, where `TIMDEXDataset.load()` will allow passing of partition values but under the hood will apply this prefix approach. All of this hinges on `run_id` being "above" or "before" the `action` partition.
How can a reviewer manually see the effects of these changes?
1- start ipython shell
2- create a small dataset
Related to the logging update, note the logged statistics from the run:
3- Observe the filename of the parquet file created, and how `run_id=XXX` is before `action=index`
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?
Developer
Code Reviewer(s)