@ghukill ghukill commented Aug 7, 2025

Purpose and background context

This PR takes the work from TIMX-530 (create static metadata database file) and TIMX-527 (writing append deltas) and now provides dynamic database views that project over these data sources.

The following sketch attempts to show all the primary components at this point:

[diagram: timx_526_model]
  • a single DuckDB context (connection) can be "attached" to multiple databases
    • we attach remotely in a readonly fashion to the static metadata file in S3
    • the other in-memory database is provided by default on connect, and is the primary database used
  • the "TIMDEX Dataset" in the middle shows the data as it sits in S3
  • the "DuckDB Context" on the right shows what the TDA library creates
    • this existed prior to this PR, but had minimal tables + views
    • this PR more fully realizes this DuckDB context
    • this "context" is purely in-memory and is rebuilt each time TIMDEXDatasetMetadata is initialized; no data is transferred until queries are performed
    • only the records and current_records are designed to be used directly, the rest are building blocks
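The attach-plus-lazy-views pattern described above can be sketched with the standard library's sqlite3 module (the library itself uses DuckDB, which attaches the static file in S3 read-only; sqlite3 is used here only because the mechanics are analogous and the sketch is self-contained). The table and view names mirror the PR, but the schema is invented for illustration:

```python
import os
import sqlite3
import tempfile

# Throwaway file standing in for the static metadata database.
static_path = os.path.join(tempfile.mkdtemp(), "static_metadata.db")
static = sqlite3.connect(static_path)
static.execute("CREATE TABLE records (timdex_record_id TEXT, run_date TEXT)")
static.execute("INSERT INTO records VALUES ('rec-1', '2025-08-01')")
static.commit()
static.close()

# The primary database is in-memory; the static file is attached to the
# same connection under its own name (DuckDB would attach it READ_ONLY).
conn = sqlite3.connect(":memory:")
conn.execute(f"ATTACH DATABASE '{static_path}' AS static")

# Append deltas live in the primary database; 'records' is a view
# unioning both sources -- a pure definition, so no data moves until
# the view is actually queried.
conn.execute("CREATE TABLE append_deltas (timdex_record_id TEXT, run_date TEXT)")
conn.execute(
    """
    CREATE TEMP VIEW records AS
    SELECT * FROM static.records
    UNION ALL
    SELECT * FROM append_deltas
    """
)
conn.execute("INSERT INTO append_deltas VALUES ('rec-2', '2025-09-01')")
print(conn.execute("SELECT count(*) FROM records").fetchone()[0])  # 2
```

A TEMP view is used because in SQLite only temporary objects may reference tables across attached databases; DuckDB has no such restriction.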

Note: the relatively high line-count change is mostly from the commit that refactors the test fixtures, which is 99% just moving and renaming.

Next steps:

  • Fully rework TIMDEXDataset.load() and dataset "location", TIMX-533
  • Rework read methods to use SQL + metadata, TIMX-529
  • Merge append deltas into static DB, TIMX-528

How can a reviewer manually see the effects of these changes?

1- Set AWS Dev TimdexManagers credentials

2- Set env vars:

TDA_LOG_LEVEL=DEBUG
WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3,MARKDOWN
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset_scratch

3- Start an IPython shell with pipenv run ipython and do some setup:

import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger
from tests.utils import generate_sample_records

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])

4- Fully recreate dataset metadata for a clean slate:

td.metadata.recreate_static_database_file()

5- Simulate ETL writes, which will write some append deltas:

td.write(
    generate_sample_records(
        num_records=1000,
        source="gismit",
        run_date="2025-09-01",
        run_type="full",
    )
)
td.write(
    generate_sample_records(
        num_records=1,
        source="gismit",
        run_date="2025-09-02",
        run_type="daily",
        action="delete",
    )
)

After both TIMDEXDataset.write() calls, the metadata context has been updated, allowing immediate metadata querying against the new append deltas.
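Because records is a view over the append deltas rather than a materialized copy, newly written delta rows are visible to the very next query with no refresh step. A minimal sqlite3 sketch of that property (table names mirror the PR; the schema is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE append_deltas (timdex_record_id TEXT, action TEXT)")
conn.execute("CREATE VIEW records AS SELECT * FROM append_deltas")

# Simulate two ETL writes landing new delta rows.
conn.executemany(
    "INSERT INTO append_deltas VALUES (?, ?)",
    [("rec-1", "index"), ("rec-2", "delete")],
)

# The view reflects the new rows immediately -- no refresh required.
print(conn.execute("SELECT count(*) FROM records").fetchone()[0])  # 2
```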

6- Perform some queries that demonstrate the append deltas are present and used:

# show tables/views
td.metadata.conn.query("""show tables;""")
"""
Out[4]: 
┌─────────────────┐
│      name       │
│     varchar     │
├─────────────────┤
│ append_deltas   │
│ current_records │
│ records         │
└─────────────────┘
"""

td.metadata.conn.query("""select count(*) from append_deltas;""")
"""
Out[5]: 
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         1001 │
└──────────────┘
"""

# show distinct filenames in append deltas, expecting two from the two ETL writes
td.metadata.conn.query("""select distinct filename from append_deltas;""")
"""
Out[6]: 
┌────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┐
│                                                                  filename                                                                  │
│                                                                  varchar                                                                   │
├────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ s3://timdex-extract-dev-222053980223/dataset_scratch/data/records/year=2025/month=09/day=02/3023210d-d3d3-4a06-9bd4-26bac33a7d3d-0.parquet │
│ s3://timdex-extract-dev-222053980223/dataset_scratch/data/records/year=2025/month=09/day=01/49652d90-1616-47da-bfd3-a355e0d43109-0.parquet │
└────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┘
"""

# select metadata records that pull from append deltas, via the records view,
# which is a union of static + append delta data
td.metadata.conn.query("""select count(*) from records where run_date > '2025-08-07';""")
"""
Out[7]: 
┌──────────────┐
│ count_star() │
│    int64     │
├──────────────┤
│         1001 │
└──────────────┘
"""

The last query is particularly neat: it demonstrates using pure SQL to identify TIMDEX rows (the simulated runs were dated in the future), and it pulls from data we haven't yet merged into the static metadata database file.
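The current_records view mentioned earlier filters records down to the most recent version of each timdex_record_id. The library's actual DuckDB view definition isn't shown in this PR description, but a generic most-recent-version pattern using a window function looks like this (sketched with stdlib sqlite3; column names follow the PR, the data is invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE records (timdex_record_id TEXT, run_date TEXT, action TEXT)"
)
conn.executemany(
    "INSERT INTO records VALUES (?, ?, ?)",
    [
        ("rec-1", "2025-09-01", "index"),
        ("rec-1", "2025-09-02", "delete"),  # later version supersedes the first
        ("rec-2", "2025-09-01", "index"),
    ],
)

# Rank each record's versions newest-first, then keep only rank 1.
conn.execute(
    """
    CREATE VIEW current_records AS
    SELECT timdex_record_id, run_date, action FROM (
        SELECT *,
               ROW_NUMBER() OVER (
                   PARTITION BY timdex_record_id
                   ORDER BY run_date DESC
               ) AS rn
        FROM records
    ) WHERE rn = 1
    """
)
rows = conn.execute(
    "SELECT timdex_record_id, action FROM current_records "
    "ORDER BY timdex_record_id"
).fetchall()
print(rows)  # [('rec-1', 'delete'), ('rec-2', 'index')]
```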

The work of a future PR, TIMX-529, will be to allow retrieving full records (e.g. source and transformed data) using similar SQL.

Includes new or updated dependencies?

NO

Changes expectations for external applications?

NO

What are the relevant tickets?

@ghukill ghukill changed the base branch from main to epic-TIMX-515 August 7, 2025 13:40
coveralls commented Aug 7, 2025

Pull Request Test Coverage Report for Build 16837003612

Details

  • 33 of 33 (100.0%) changed or added relevant lines in 2 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.7%) to 94.558%

Totals Coverage Status
Change from base Build 16836984985: 0.7%
Covered Lines: 417
Relevant Lines: 441

💛 - Coveralls

@ghukill ghukill changed the base branch from epic-TIMX-515 to TIMX-527-write-append-deltas August 7, 2025 13:58
Comment on lines +283 to +286
self._attach_database_file(conn)
self._create_append_deltas_view(conn)
self._create_records_union_view(conn)
self._create_current_records_view(conn)
Contributor Author
This is the primary work of setting up our DuckDB context. Each of these builds tables and views in the in-memory connection. Note that each one of these is virtually instant, where no data is transferred; they are all effectively "lazy" tables and views.

Contributor Author
Small note: in TIMX-529, these will be slotted under a metadata schema in the DuckDB context. Minor change, but helps when we get into SQL queries where we'll also have a new data schema.

ghukill added 3 commits August 8, 2025 13:48
Why these changes are being introduced:

Much of the refactor work has been building to provide metadata
views for all records and the current version of a given TIMDEX
record, views we had previously but calculated on demand each time.

How this addresses that need:

When setting up the DuckDB context for TIMDEXDatasetMetadata,
we create views that build from a) the static metadata database
file and b) the append deltas, providing a projection over all
metadata records.

Two primary views are added:
'records': all records in the ETL parquet dataset
'current_records': filter to the most recent version of any
timdex_record_id from 'records'

These views will provide the metadata for future work that
(re)implements filtering to current records during read.

Side effects of this change:
* Views are created on TIMDEXDatasetMetadata initialization

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
Why these changes are being introduced:

The test suite was built piecemeal as the library grew,
and over time the fixture names were becoming clunky and
confusing.

How this addresses that need:

Rename, simplify, and reorganize test fixtures.  This
requires coordinated changes in tests, nearly entirely
just pointing at new fixture names.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-526
@ghukill ghukill force-pushed the TIMX-526-projected-views branch from 05383bc to 43e5350 Compare August 8, 2025 17:48
@ghukill ghukill changed the base branch from TIMX-527-write-append-deltas to epic-TIMX-515 August 8, 2025 17:49
@ghukill ghukill marked this pull request as ready for review August 8, 2025 17:49
@jonavellecuerdo jonavellecuerdo self-assigned this Aug 8, 2025
@jonavellecuerdo jonavellecuerdo left a comment

Approving with one minor question + acknowledgement that I will continue to look at the test suite with the upcoming PRs and when I work on TIMX-528.

Comment on lines 277 to +281
logger.warning(
f"Static metadata database not found @ '{self.metadata_database_path}'. "
"Please recreate via TIMDEXDatasetMetadata.recreate_database_file()."
)
return conn
Realized first half of this comment was lost in the mix as the thread focused on the "static" part of the method name. Re-asking question:

Is TIMDEXDatasetMetadata.recreate_static_database_file() also used when the database is created for the first time? 🤔 If so, I feel the name is a bit misleading.

@ghukill ghukill Aug 8, 2025
I would maintain that "static" still works! As in, "I'm going to use a static SQLite file I expect to be at /foo/db.sqlite.... oh no! the file isn't there! my static database is missing!"

It's not "static" as in never changing, it's static as in a single, encapsulated database file.

In Django/Rails you could -- and very often do! -- try and reference a "static" file that doesn't exist.

@ghukill ghukill merged commit 0a80a24 into epic-TIMX-515 Aug 8, 2025
2 checks passed