Commit d9df13b

Load pyarrow dataset on TIMDEXDataset init
Why these changes are being introduced:

As TIMDEXDatasetMetadata becomes more integrated, there is less need to be explicit about how we load the pyarrow dataset. Formerly, the .load() method had to be called manually and supported options like 'current_records' or 'include_parquet_files'. This also reflected a time when 'TIMDEXDataset.load()' suggested that "loading" meant the pyarrow dataset only. With the introduction of metadata, it is better to be specific that we are loading a pyarrow dataset, which is only one of many assets associated with a TIMDEXDataset instance.

How this addresses that need:

Renames .load() to .load_pyarrow_dataset() to be explicit about what is happening. We no longer store the pyarrow dataset filesystem or paths on self, as they are only used briefly during the dataset load. We can get them anytime via .dataset.

Most importantly, we limit the root 'location' used to init a TIMDEXDataset instance to a string only: the root of the dataset. Now that we don't allow a list of strings at that level, we can trust self.location to be a string and the root of the TIMDEX dataset.

Side effects of this change:

* TIMDEXDataset and TIMDEXDatasetMetadata can only be initialized with a string, which is the root of the TIMDEX dataset. From there, both know where their assets can be found.
* You cannot "pre-filter" the pyarrow dataset when loading, which had confusing overlap with the read methods; the read methods themselves may change somewhat dramatically now that we have metadata to use.

Relevant ticket(s):

* https://mitlibraries.atlassian.net/browse/TIMX-533
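The pattern described in the commit message (accept only a string root, load on init, expose the result via `.dataset`) can be sketched roughly as below. This is a minimal illustration, not the actual `TIMDEXDataset` code: the class name `SketchDataset` is hypothetical, and the body of `load_pyarrow_dataset()` is an assumption that only lists parquet file paths with `pathlib` instead of building a real pyarrow dataset, so the sketch stays dependency-free.

```python
from pathlib import Path


class SketchDataset:
    """Hypothetical stand-in for the pattern; not the real TIMDEXDataset."""

    def __init__(self, location: str) -> None:
        # Only a single string (the dataset root) is accepted, never a list of paths.
        if not isinstance(location, str):
            raise TypeError(f"location must be a str, got {type(location).__name__}")
        self.location = location
        # Loading happens on init, so callers no longer call a separate .load().
        self.dataset = self.load_pyarrow_dataset()

    def load_pyarrow_dataset(self) -> list[str]:
        # Assumption: the real method builds a pyarrow dataset. This sketch only
        # discovers parquet file paths under the root, sorted for stable output.
        root = Path(self.location)
        return sorted(str(path) for path in root.rglob("*.parquet"))
```

With something like this, `SketchDataset("/path/to/dataset")` is ready to use immediately after construction, and passing a list such as `["a.parquet", "b.parquet"]` fails fast with a `TypeError` rather than leaving `self.location` ambiguous.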
1 parent 05383bc commit d9df13b

8 files changed: +346 -598 lines changed

README.md

Lines changed: 0 additions & 6 deletions
@@ -110,12 +110,6 @@ timdex_dataset = TIMDEXDataset("s3://my-bucket/path/to/dataset")
 
 # or, local dataset (e.g. testing or development)
 timdex_dataset = TIMDEXDataset("/path/to/dataset")
-
-# load the dataset, which discovers all parquet files
-timdex_dataset.load()
-
-# or, load the dataset but ensure that only current records are ever yielded
-timdex_dataset.load(current_records=True)
 ```
 
 All read methods for `TIMDEXDataset` allow for the same group of filters which are defined in `timdex_dataset_api.dataset.DatasetFilters`. Examples are shown below.

pyproject.toml

Lines changed: 3 additions & 1 deletion
@@ -54,7 +54,7 @@ line-length = 90
 [tool.mypy]
 disallow_untyped_calls = true
 disallow_untyped_defs = true
-exclude = ["tests/", "output/"]
+exclude = ["tests/", "output/", "migrations/"]
 
 [[tool.mypy.overrides]]
 module = []
@@ -95,6 +95,8 @@ ignore = [
     "PLR0915",
     "S321",
     "S608",
+    "TD002",
+    "TD003",
     "TRY003"
 ]

tests/conftest.py

Lines changed: 0 additions & 7 deletions
@@ -82,7 +82,6 @@ def timdex_dataset(tmp_path, timdex_dataset_config) -> TIMDEXDataset:
         ),
         write_append_deltas=False,
     )
-    dataset.load()
     return dataset
 
 
@@ -110,8 +109,6 @@ def timdex_dataset_multi_source(tmp_path) -> TIMDEXDataset:
         ),
         write_append_deltas=False,
     )
-
-    dataset.load()
     return dataset
 
 
@@ -165,8 +162,6 @@ def timdex_dataset_with_runs(tmp_path, timdex_dataset_config_small) -> TIMDEXDat
         ),
         write_append_deltas=False,
     )
-
-    dataset.load()
     return dataset
 
 
@@ -202,8 +197,6 @@ def timdex_dataset_same_day_runs(tmp_path) -> TIMDEXDataset:
         ),
         write_append_deltas=False,
     )
-
-    dataset.load()
     return dataset
 
 