Conversation

Contributor

@ghukill ghukill commented Aug 12, 2025

Purpose and background context

High level, this PR updates all TIMDEXDataset read methods to use DuckDB for record querying and retrieval. This completes the primary work of epic TIMX-515, with remaining tickets focusing on metadata management (e.g. creating the first static database file and merging deltas) and deployment.

The read method signatures remain quite similar, but under the hood, querying and retrieval now follow a new two-step process:

  1. Perform a "metadata" query that identifies records and specific parquet files that match the query; fast and lightweight
  2. Perform a "data" query that utilizes that information and streams batches of actual record rows from parquet files

This is sketched in a new "Reading Data from TIMDEXDataset" document, under the section "How reading works (two-step process)".

The changes in this PR build on updates from previous PRs, including reinstating some temporarily skipped tests. It is recommended to review by looking at individual commits. The most meaningful changes come from the commit "Rework read methods to utilize metadata", which updated the read methods to use dataset metadata + DuckDB for querying and retrieval. The commits that follow are mostly updating tests, reinstating skipped tests, some work on fixtures to dramatically decrease testing time via fixture scoping, and documentation.

Changes for Transmogrifier?

The TIMDEXDataset.write() method signature remains the same, but now append deltas will get added to the dataset as a side effect for all writes; no code changes required.

Changes for Pipeline Lambdas?

Changes for TIM?

How can a reviewer manually see the effects of these changes?

1- Set AWS Dev TimdexManagers credentials

2- Set env vars:

TDA_LOG_LEVEL=DEBUG
WARNING_ONLY_LOGGERS=asyncio,botocore,urllib3,s3transfer,boto3,MARKDOWN
TIMDEX_DATASET_LOCATION=s3://timdex-extract-dev-222053980223/dataset_scratch

3- Start an IPython shell with pipenv run ipython and do some setup:

import os

from timdex_dataset_api import TIMDEXDataset
from timdex_dataset_api.config import configure_dev_logger
from tests.utils import generate_sample_records

configure_dev_logger()

td = TIMDEXDataset(os.environ["TIMDEX_DATASET_LOCATION"])

4- Perform a metadata query to get a sense of some ETL runs:

td.metadata.conn.query(
    """
    select source, run_id, count(*) as row_count
    from metadata.records
    group by source, run_id
    order by row_count desc;
    """
)
"""
┌───────────┬──────────────────────────────────────┬───────────┐
│  source   │                run_id                │ row_count │
│  varchar  │               varchar                │   int64   │
├───────────┼──────────────────────────────────────┼───────────┤
...
│ alma      │ 9fdeab75-03cb-4568-9215-36f8d024de74 │       250 │
...
"""

5- Simulate how TIM will provide a run_id as a "simple" key/value filter to get records to index:

for transformed_record in td.read_transformed_records_iter(
    run_id="9fdeab75-03cb-4568-9215-36f8d024de74",
    action="index",
):
    print(f"Index record: {transformed_record['timdex_record_id']}")
"""
DEBUG:timdex_dataset_api.metadata:SELECT metadata.records.timdex_record_id, metadata.records.run_id, metadata.records.run_record_offset, metadata.records.filename FROM metadata.records WHERE metadata.records.run_id = '9fdeab75-03cb-4568-9215-36f8d024de74' AND metadata.records.action = 'index' ORDER BY metadata.records.filename, metadata.records.run_record_offset
DEBUG:timdex_dataset_api.dataset:Metadata query identified 248 rows, across 2 parquet files, elapsed: 2.44s
"""

This is an interesting one! We no longer need to provide source or even run_date to efficiently get this data. With just the run_id, the metadata query pinpoints the records + parquet files to use.

The discerning eye may notice the metadata query above showed 250 rows, but the debug statement says "Metadata query identified 248 rows, across 2 parquet files". This is because we added action="index" to this query, which filtered out a couple of action="delete" rows.

6- We can use "simple" key/value filters or the new where= SQL filtering and demonstrate the results are the same:

run_id = "9fdeab75-03cb-4568-9215-36f8d024de74"
df1 = td.read_dataframe(source="alma", run_id=run_id)
df2 = td.read_dataframe(where=f"""source='alma' and run_id='{run_id}'""")
assert len(df1) == len(df2)
# cannot do .equals() because the ordering may be different...
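
Since both results are pandas DataFrames with a timdex_record_id column (an assumption based on the examples above), one way to compare them more strictly is to normalize row order first; a minimal sketch:

df1_sorted = df1.sort_values("timdex_record_id").reset_index(drop=True)
df2_sorted = df2.sort_values("timdex_record_id").reset_index(drop=True)
# with row order normalized, the id columns should match exactly
assert df1_sorted["timdex_record_id"].equals(df2_sorted["timdex_record_id"])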

7- Next, we can show the memory-safe reading of a large number of records. This yields every timdex_record_id from alma records:

for batch in td.read_batches_iter(
    table="records", columns=["timdex_record_id"], source="alma"
):
    continue
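
As a variation on the loop above (assuming, per the type hints elsewhere in this PR, that read_batches_iter() yields pyarrow RecordBatch objects), we can count the records without ever materializing them all in memory:

total = 0
for batch in td.read_batches_iter(
    table="records", columns=["timdex_record_id"], source="alma"
):
    total += batch.num_rows  # each batch is a pyarrow RecordBatch
print(f"Total alma records yielded: {total}")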

8- Lastly, we can demonstrate complex filtering that was not formerly possible with just key/value equality filters:

td.read_dataframe(
    table="records",
    where="""
    -- no gisogm records
    source != 'gisogm'
    and (
        -- any deletes from January
        (
            date_part('month', run_date)=1
            and action='delete'
        )
        -- libguides indexes from June
        or (
            date_part('month', run_date)=6
            and action='index'
            and source='libguides'
        )
    )
    """,
)
"""
DEBUG:timdex_dataset_api.dataset:Metadata query identified 773 rows, across 11 parquet files, elapsed: 1.19s
DEBUG:timdex_dataset_api.dataset:read_batches_iter batch 1, yielded: 773 @ 428 records/second, total yielded: 773
"""

Includes new or updated dependencies?

YES: these changes introduce sqlalchemy for programmatic query building

Changes expectations for external applications?

YES: as noted above, tickets have been made for small syntactical changes that pipeline lambdas and TIM will require

What are the relevant tickets?

Why these changes are being introduced:

TIMDEXDatasetMetadata (TDM) has a DuckDB context for metadata
attachments and views.  TIMDEXDataset (TD) will need one for
DuckDB queries that return actual ETL data, not just metadata.

How this addresses that need:
* TD reuses the TDM.conn DuckDB connection and builds upon it
* TDM DuckDB connection builds metadata related views in a
"metadata" schema
* TD will build views in a "data" schema

Side effects of this change:
* All TDM metadata views are under a "metadata" schema

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529

coveralls commented Aug 12, 2025

Pull Request Test Coverage Report for Build 16942404938

Details

  • 118 of 130 (90.77%) changed or added relevant lines in 4 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage decreased (-1.3%) to 93.496%

Changes Missing Coverage           Covered Lines   Changed/Added Lines   %
timdex_dataset_api/metadata.py     29              30                    96.67%
timdex_dataset_api/dataset.py      51              53                    96.23%
timdex_dataset_api/utils.py        35              44                    79.55%

Totals
Change from base Build 16887999011: -1.3%
Covered Lines: 460
Relevant Lines: 492

💛 - Coveralls

Why these changes are being introduced:

This commit is a culmination of work to elevate metadata about
ETL records to the point it can be used to improve the speed
and efficiency of data queries.

While the signature of the read methods will remain mostly
the same, it exposes a 'where' clause that accepts raw SQL to
filter the results, allowing for more advanced querying beyond
the simple key/value DatasetFilters.

Additionally, and equally important, data retrieval now
comes directly from DuckDB instead of lower-level pyarrow
dataset reads.  Overall complexity remains about the same, but
we have shifted focus into DuckDB table and view preparation
and SQL construction, which also pays dividends in other contexts.

It is anticipated this will set us up well for other data we may
add to the TIMDEX dataset, e.g. vector embeddings or fulltext,
which we may want to query and retrieve.

How this addresses that need:

As before, all read methods eventually call
TIMDEXDataset.read_batches_iter(), which now performs a two-part
process of first quickly querying metadata records, then using that
information to prune the heavier data retrieval.

SQLAlchemy is used to model DuckDB tables and views such
that we can preserve the simpler key/value DatasetFilters, e.g.
source='libguides' or run_type='daily', which will likely
represent the majority of the public API needs, by converting those
key/value pairs into a SQL WHERE clause programmatically.  This is
done without the need for complex string interpolation and
escaping.

The overall input and output signatures are largely the same, but
the underlying approach to querying the ETL parquet records now
utilizes DuckDB much more heavily, while also providing a SQL
'escape hatch' if the keyword filters don't suffice.

Side effects of this change:
* None!  Transmog and TIM can call TDA in the same way as before.
The underlying approach is different, but the signatures are
mostly the same.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529

Why these changes are being introduced:

During the refactor to use dataset metadata for querying,
we had to temporarily skip tests that tested for dataset
filtering and current records limiting.  With the SQL backed
querying now in place, these tests can be reinstated.

Note that a future commit will likely *add* a couple more
tests for the new, optional 'WHERE' clause functionality.

How this addresses that need:
* No tests are skipped.
* Dataset filtering tests still remain, but the key/value
filters are just handled differently under the hood.
* Tests for current records no longer use
.load(current_records=True) but instead utilize the
DuckDB table via table='current_records' within a
read method.

Side effects of this change:
* None

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-529
@ghukill ghukill force-pushed the TIMX-529-sql-based-read-methods branch from e1762ec to 262a910 Compare August 12, 2025 14:24
Comment on lines +88 to +89
@pytest.fixture(scope="module")
def timdex_dataset_multi_source(tmp_path_factory) -> TIMDEXDataset:
Contributor Author

Focusing the comment here, but it applies elsewhere too.

Two things are happening here:

  1. By setting scope="module" this fixture is only created once per testing file, e.g. all tests in test_read.py will get the same object. This greatly reduces testing time.

  2. We can only do this if the fixture does not depend on any function-scoped fixture (the default scope). Apparently tmp_path is a function-scoped convenience fixture, so using tmp_path_factory is the workaround; a minimal sketch of the pattern follows below.
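
A minimal sketch of that pattern (the fixture name and body here are illustrative, not the repo's actual fixture):

import pytest

@pytest.fixture(scope="module")
def dataset_location(tmp_path_factory) -> str:
    # tmp_path_factory is session-scoped, so it can safely back a
    # module-scoped fixture; the function-scoped tmp_path could not.
    return str(tmp_path_factory.mktemp("timdex_dataset"))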

conn = self.metadata.conn

# create data schema
conn.execute("""create schema data;""")
Contributor Author

@ghukill ghukill Aug 12, 2025

Currently, TIMDEXDataset does not create any tables or views in the data schema, but I think it provides some nice symmetry with TIMDEXDatasetMetadata creating a metadata schema. It establishes a pattern that anything utilizing this shared DuckDB context should create schemas to encapsulate their tables/views. I think this will be useful when we get into storing vector embeddings and fulltext in the TIMDEX dataset.
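
For example (purely hypothetical, not part of this PR), a future feature storing embeddings could follow the same pattern by namespacing its objects under the data schema:

# hypothetical future usage of the shared DuckDB connection
conn.execute("""create table data.embeddings (timdex_record_id varchar, embedding float[]);""")
conn.execute("""select count(*) from data.embeddings;""")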

@ghukill ghukill requested a review from a team August 12, 2025 15:45
Comment on lines 437 to 452
for i, meta_chunk_df in enumerate(self._iter_meta_chunks(meta_df)):
    batch_time = time.perf_counter()
    batch_yield_count = len(meta_chunk_df)
    total_yield_count += batch_yield_count

    if batch_yield_count == 0:
        continue

    self.conn.register("meta_chunk", meta_chunk_df)
    data_query = self._build_data_query_for_chunk(
        columns,
        meta_chunk_df,
        registered_metadata_chunk="meta_chunk",
    )
    yield from self._stream_data_query_batches(data_query)
    self.conn.unregister("meta_chunk")
Contributor Author

This is perhaps the most key part of these changes.

We go into this code block with a meta_df dataframe that contains metadata about all the rows we want to return from this read method. Instead of a single DuckDB query, we batch the metadata into chunks and perform smaller queries. Very often this batch may only contain a single parquet file, so the reading + filtering will be very quick.

To get that metadata "inside" the DuckDB context to join on and use, we register it as a temporary view/table. Then we perform the data query + joining on metadata. Then after yielding the data, we de-register that metadata chunk for the next iteration.

The larger the metadata chunk we register --> join --> yield, the faster, but the more memory is required. In some fairly substantial testing it feels like the current approach will yield records about as fast as we could ever do anything with them.
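
A generic, standalone illustration of that register --> query --> unregister pattern (not the library's actual code), using a small pandas DataFrame as the stand-in metadata chunk:

import duckdb
import pandas as pd

conn = duckdb.connect()
meta_chunk_df = pd.DataFrame(
    {"filename": ["a.parquet", "a.parquet"], "run_record_offset": [0, 1]}
)

# register the in-memory dataframe so SQL can reference/join against it...
conn.register("meta_chunk", meta_chunk_df)
row_count = conn.execute("select count(*) from meta_chunk").fetchone()[0]
# ...then de-register it before the next chunk
conn.unregister("meta_chunk")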

@ghukill ghukill marked this pull request as ready for review August 12, 2025 19:56
Comment on lines +476 to +479
def build_meta_query(
    self, table: str, where: str | None, **filters: Unpack["DatasetFilters"]
) -> str:
    """Build SQL query using SQLAlchemy against metadata schema tables and views."""
Contributor Author

This is admittedly pretty dense, but it's an operation that is common to many applications where you're translating key/value style arguments into a SQL-like dialect.

The price of those key/value filters -- which will likely be the most common way the dataset is filtered -- is the complexity here of converting that into a SQL WHERE clause.

While SQLAlchemy reflection + query building may seem at a glance more complicated than building a WHERE query string, it quickly becomes apparent that's not the case as you start getting into escaping, logic for every possible SQL operator, etc.

This is a good method to spend some time reading and thinking about, as it's also pretty key for this new approach.
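
A minimal sketch of the general idea (the hand-declared table and columns below are illustrative stand-ins, not the actual reflected metadata.records model):

from sqlalchemy import Column, MetaData, String, Table, select

metadata_obj = MetaData(schema="metadata")
records = Table(
    "records",
    metadata_obj,
    Column("timdex_record_id", String),
    Column("source", String),
    Column("run_type", String),
)

# convert key/value filters into a WHERE clause without manual escaping
filters = {"source": "libguides", "run_type": "daily"}
stmt = select(records.c.timdex_record_id)
for column, value in filters.items():
    stmt = stmt.where(records.c[column] == value)

print(stmt.compile(compile_kwargs={"literal_binds": True}))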

Comment on lines +116 to +119
engine = create_engine(
    "duckdb://",
    creator=lambda: ConnectionWrapper(conn),
)
Contributor Author

This is a bit of a hack, but I'm unsure of another way to do it. We need a SQLAlchemy engine object that we can use for reflection/introspection... but we already have a DuckDB connection, and we don't want the engine to create a new one for us.

This was inspired by this comment from the duckdb_engine SQLAlchemy library creator.

What's important to remember is that we don't use this engine for anything except introspection, so it's not terribly critical.

@ehanson8 ehanson8 self-assigned this Aug 13, 2025

@ehanson8 ehanson8 left a comment

Works as expected, 2 very minor comments!

"month": run_date_obj.strftime("%m"),
"day": run_date_obj.strftime("%d"),
}
def _stream_data_query_batches(self, data_query: str) -> Iterator[pa.RecordBatch]:

Could this go below read_batches_iter? I think placing private methods below the methods that call them is more readable.

Contributor Author

Yeah, I was on the fence here. If we move this method, I think we'd want to move all methods that are called by read_batches_iter():

Screenshot 2025-08-13 at 11 44 44 AM

The reasoning behind the current arrangement was that read_batches_iter() is followed immediately by all the other read methods:

Screenshot 2025-08-13 at 11 46 15 AM

Honestly, I think I like having the required sub-methods immediately following it as well. I'll make the change, and we can always tweak it later if it's not better.

@ghukill ghukill requested a review from ehanson8 August 13, 2025 15:49

@ehanson8 ehanson8 left a comment

Approved!

@ghukill ghukill merged commit 781e4e0 into epic-TIMX-515 Aug 13, 2025
2 checks passed