-
Notifications
You must be signed in to change notification settings - Fork 0
TIMX 526 - projected views #159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Pull Request Test Coverage Report for Build 16837003612Details
💛 - Coveralls |
| self._attach_database_file(conn) | ||
| self._create_append_deltas_view(conn) | ||
| self._create_records_union_view(conn) | ||
| self._create_current_records_view(conn) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is the primary work of setting up our DuckDB context. Each of these builds tables and views in the in-memory connection. Note that each one of these is virtually instant, where no data is transferred; they are all effectively "lazy" tables and views.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Small note: in TIMX-529, these will be slotted under a metadata schema in the DuckDB context. Minor change, but helps when we get into SQL queries where we'll also have a new data schema.
Why these changes are being introduced: Much of the refactor work has been building to provide metadata views for all records and the current version of a given TIMDEX record, views we had previously but calculated on demand each time. How this addresses that need: When setting up the DuckDB context for TIMDEXDatasetMetadata, we create views that build from a) the static metadata database file and b) the append deltas, providing a projection over all metadata records. Two primary views are added: 'records': all records in the ETL parquet dataset 'current_records': filter to the most recent version of any timdex_record_id from 'records' These views will provide the metadata for future work that (re)implements filtering to current records during read. Side effects of this change: * Views are created on TIMDEXDatasetMetadata initialization Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-526
Why these changes are being introduced: The test suite was built piecemeal as the library grew, and over time the fixture names were becoming clunky and confusing. How this addresses that need: Rename, simplify, and reorganize test fixtures. This requires coordinated changes in tests, nearly entirely just pointing at new fixture names. Side effects of this change: * None Relevant ticket(s): * https://mitlibraries.atlassian.net/browse/TIMX-526
05383bc to
43e5350
Compare
jonavellecuerdo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approving with one minor question + acknowledgement that I will continue to look at the test suite with the upcoming PRs and when I work on TIMX-528.
| logger.warning( | ||
| f"Static metadata database not found @ '{self.metadata_database_path}'. " | ||
| "Please recreate via TIMDEXDatasetMetadata.recreate_database_file()." | ||
| ) | ||
| return conn |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Realized first half of this comment was lost in the mix as the thread focused on the "static" part of the method name. Re-asking question:
Is TIMDEXDatasetMetadata.recreate_static_database_file() also used when the database is created for the first time? 🤔 If so, I feel the name is a bit misleading.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would maintain that "static" still works! As in, "I'm going to use a static SQLite file I expect to be at /foo/db.sqlite.... oh no! the file isn't there! my static database is missing!"
It's not "static" as in never changing, it's static as in a single, encapsulated database file.
In Django/Rails you could -- and very often do! -- try and reference a "static" file that doesn't exist.
Purpose and background context
This PR takes the work from TIMX-530 (create static metadata database file) and TIMX-527 (writing append deltas) and now provides dynamic database views that project over these data sources.
The following sketch attempts to show all the primary components at this point:
TIMDEXDatasetMetadatais initializedrecordsandcurrent_recordsare designed to be used directly, the rest are building blocksNote: the relatively high line count change is mostly part of this commit which refactors the test fixtures which is 99% just moving and renaming.
Next steps:
TIMDEXDataset.load()and dataset "location", TIMX-533How can a reviewer manually see the effects of these changes?
1- Set AWS Dev
TimdexManagerscredentials2- Set env vars:
3- Start Ipython shell with
pipenv run ipythonand do some setup:4- Fully recreate dataset metadata for a clean slate:
5- Simulate an ETL write while will write some append deltas:
After both
TIMDEXDataset.write()methods, the metadata context has been updated. This allows immediate metadata querying based on the new append deltas.5- Perform some queries that demonstrate append deltas as present and used:
The last query is particularly neat. This is demonstrating the use of pure SQL to identify TIMDEX rows (the simulated runs were in the future), and it's pulling from data we haven't yet merged into the static metadata database file.
The work if a future PR, TIMX-529, will be to allow retrieving full records (e.g. source and transformed data) using similar SQL.
Includes new or updated dependencies?
NO
Changes expectations for external applications?
NO
What are the relevant tickets?