TIMX 497 - filtering current records #145
Conversation
Why these changes are being introduced:

Formerly, an instance of TIMDEXRunManager expected a TIMDEXDataset on init, where it would use only the pyarrow `TIMDEXDataset.dataset`. This resulted in unneeded tight coupling between these classes.

How this addresses that need:

* TIMDEXRunManager updated to expect only a pyarrow Dataset

Side effects of this change:

* None

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-496
Why these changes are being introduced:

Unexpected behavior was possible when using `load(current_records=True)` and then applying additional filtering to the dataset before reading. In short, a non-current record could be yielded if filtering removed the truly current version of the record. This happened because the reverse chronological marking of "seen" records would not "see" this record and happily yield an older version.

How this addresses that need:

When `load(current_records=True)` is used, a clone of the dataset is saved to the TIMDEXDataset object before any additional filtering is applied. This dataset is just metadata, not expensive to store. Then, during any read methods, this dataset is used to provide an exhaustive and ordered list of timdex_record_ids. Even if a record has been filtered out by the read method (e.g. limiting records to only `action="index"`), this secondary list of timdex_record_ids is used as the authoritative list of "seen" timdex_record_ids. There is a bit of network overhead to this parallel batch reading, but it is fairly minimal as we are only retrieving the `timdex_record_id` column; perhaps 1-2 MB of IO per million records.

Side effects of this change:

* Applications like TIM that will likely use this new functionality to yield only "current" records can do so confidently, and optionally with additional filtering, knowing they will only encounter current versions of a record from the dataset.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-497
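The failure mode described in this commit message can be reproduced in miniature with plain Python. The tuples and `dedupe` helper below are illustrative stand-ins, not the actual TIMDEXDataset code:

```python
# Records as (timdex_record_id, run_date, action) tuples, ordered
# newest-first, mirroring the reverse-chronological walk described above.
records = [
    ("alma:9", "2025-01-04", "delete"),  # the truly current version
    ("alma:9", "2025-01-03", "index"),   # an older, stale version
]

def dedupe(rows):
    """Yield only the first (i.e. newest) version of each record id seen."""
    seen = set()
    for rid, run_date, action in rows:
        if rid not in seen:
            seen.add(rid)
            yield (rid, run_date, action)

# Without extra filtering, dedup correctly keeps only the "delete" version.
assert [r[2] for r in dedupe(records)] == ["delete"]

# The bug: filtering to action="index" *before* dedup removes the current
# version, so the stale 2025-01-03 row is yielded as if it were current.
filtered = [r for r in records if r[2] == "index"]
assert [r[1] for r in dedupe(filtered)] == ["2025-01-03"]
```

The fix described in this commit message sidesteps this by driving the "seen" set from an unfiltered stream of `timdex_record_id`s rather than from the filtered rows.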
Pull Request Test Coverage Report for Build 15309482286 (Coveralls)
Quoted diff context from the new test:

```python
assert df.action.value_counts().to_dict() == {"index": 99}

def test_dataset_current_records_index_filtering_accurate_records_yielded(
```
I think this test is the most verbose and precise in pinpointing how these changes address the issue.
@jonavellecuerdo, @ehanson8 - I saved this logging output during the development work, thinking it might help illustrate how this code change works. At the risk of complicating things, opting to include it here. NOTE: the records and counts here are not identical to the test discussed above. The most illuminating lines are the 2nd and 3rd, I think. Imagine this is moving through runs and batches like normal. Some line commentary:
And finally, the dataframe at the end shows that we never yielded a non-current record.
ehanson8
left a comment
Looks good, few optional suggestions!
timdex_dataset_api/dataset.py (Outdated)

Quoted diff context:

```python
def _yield_deduped_batches(
    self, batches: Iterator[pa.RecordBatch]

def _yield_current_record_deduped_batches(
```
Optional: I think `_yield_current_record_batches` or even `_yield_current_records` would be clear enough
My inclination is to leave as-is to try and communicate that a) it's still related to "batches" and b) the work is getting performed within each batch.
I worry that _yield_current_record_batches() suggests some batches are skipped entirely, and _yield_current_records() loses the idea that it's yielding batches still.
But open to additional feedback and discussion, it has so far defied a nice tidy name...
jonavellecuerdo
left a comment
Looks good to me!
@jonavellecuerdo - new commit with updated "maybe filtered record batches" and "definitely unfiltered id batches". And FYI @ehanson8, one of your early comments about a method name is included in this change too. Thanks for the feedback all!
Purpose and background context
This PR allows for filtering -- either during `.load()` or by any read method -- in conjunction with using `.load(current_records=True)`. With these changes, only the most recent version of a record will be yielded, or it will not be yielded at all if filtering removes the most recent version for whatever reason.

Jira ticket TIMX-497 has a fairly in-depth explanation of the behavior, which will not be restated here.
The mechanics of how this works are outlined in the second commit message above.
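Those mechanics can be sketched roughly as follows. These are simplified stand-ins for the real method names and batch types, and the sketch assumes a record id appears at most once per batch (as when each batch comes from a single run):

```python
from typing import Iterator

def yield_current_records(
    filtered_batches: Iterator[list[tuple[str, str]]],
    id_batches: Iterator[list[str]],
) -> Iterator[list[tuple[str, str]]]:
    """Walk filtered (record_id, action) batches in parallel with the
    exhaustive, unfiltered record-id batches; the id stream, not the
    filtered rows, is the authoritative source of "seen" record ids."""
    seen: set[str] = set()
    for rows, all_ids in zip(filtered_batches, id_batches):
        yield [(rid, action) for rid, action in rows if rid not in seen]
        seen.update(all_ids)  # includes ids whose rows were filtered out

# Newest batch first: alma:9's current (delete) row was filtered out of the
# first batch, but its id is still marked seen, so the older alma:9 row in
# the second batch is never yielded.
filtered = iter([[("alma:8", "index")], [("alma:9", "index")]])
ids = iter([["alma:8", "alma:9"], ["alma:9"]])
flat = [row for batch in yield_current_records(filtered, ids) for row in batch]
assert flat == [("alma:8", "index")]
```

The small cost of this design is the parallel read of the id stream; the benefit is that "seen" bookkeeping is unaffected by whatever filtering the caller applies.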
Ultimately, this should make usage of `.load(current_records=True)` less surprising. No matter what kind of filtering is applied -- `action="index"` or `run_date="2025-02-15"` or `run_id="abc123"` -- any records that are yielded will be the current version of the record in the dataset, or not yielded at all if filtering removes that version. Previously, it was possible to encounter non-current versions of a record even if `.load(current_records=True)` was set.

How can a reviewer manually see the effects of these changes?
It's difficult to pinpoint a tidy example in the real data, but a test fixture can get at this pretty easily.
The test fixture `dataset_with_runs_location` provides a dataset with simulated `alma` and `dspace` rows. Focusing on `alma`, we see these "runs" (number of records, `source`, `run_date`, `run_type`, `action` (for all records in run), `run_id`):

```python
[
    (40, "alma", "2024-12-01", "full", "index", "run-1"),
    (20, "alma", "2024-12-15", "daily", "index", "run-2"),
    (100, "alma", "2025-01-01", "full", "index", "run-3"),
    (50, "alma", "2025-01-02", "daily", "index", "run-4"),
    (25, "alma", "2025-01-03", "daily", "index", "run-5"),
    (10, "alma", "2025-01-04", "daily", "delete", "run-6"),
    (9, "alma", "2025-01-05", "daily", "index", "run-7"),
]
```

Runs `run-5` through `run-7` are worth focusing on. In natural language, these runs did the following:

- `run-5` created/modified 25 records, `alma:0` through `alma:24`, in the dataset
- `run-6` deleted 10 records, `alma:0` through `alma:9`
- `run-7` recreated 9 records, `alma:0` through `alma:8`

We can kind of ignore `run-1` through `run-4` for this example.

The net effect of these runs is that `alma:9` no longer exists in Opensearch. The most recent version of it in the dataset is from `run-6`, where it had `action="delete"`.

So, what happens if we do the following?
Before the changes in this PR, `alma:9` from `run-5` would have inaccurately still been yielded. That's because filtering fully removed `run-6`, and so we had not "seen" it yet by the time we reached that run.

After the changes in this PR, `alma:9` is no longer yielded based on this filtering. The "current" version of `alma:9` is `action="delete"`, so any kind of `action="index"` filtering must necessarily not include it.

Any kind of manual testing or runs should work, yield records, etc. But all said, it's this test case scenario that best exemplifies the purpose and effect of this PR.
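The run table above can be replayed in plain Python to confirm what the "current" version of each record should be. This is a from-scratch illustration of the expected state, not the library's implementation:

```python
# Replay runs 5-7 chronologically; later runs overwrite earlier versions.
runs = [
    ("2025-01-03", "index", range(0, 25)),   # run-5: alma:0 - alma:24
    ("2025-01-04", "delete", range(0, 10)),  # run-6: alma:0 - alma:9
    ("2025-01-05", "index", range(0, 9)),    # run-7: alma:0 - alma:8
]
current: dict[str, tuple[str, str]] = {}
for run_date, action, ids in runs:
    for i in ids:
        current[f"alma:{i}"] = (run_date, action)

# alma:9 was recreated by run-7 for ids 0-8 only, so its current version
# remains the run-6 delete; alma:8's current version is the run-7 index.
assert current["alma:9"] == ("2025-01-04", "delete")
assert current["alma:8"] == ("2025-01-05", "index")

# Under action="index" filtering of current versions, alma:9 disappears.
index_current = {rid for rid, (_, a) in current.items() if a == "index"}
assert "alma:9" not in index_current and "alma:8" in index_current
```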
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO: nothing is using this new functionality yet
What are the relevant tickets?
Developer
Code Reviewer(s)