Commit f787266

Add bulk-update command to index/delete records from TIMDEX parquet dataset
Why these changes are being introduced:
* The timdex-index-manager (TIM) needs to support the v2 parquet dataset, which now contains records for both indexing and deleting. The new CLI command performs a "bulk update" given a subset of the dataset (filtered by 'run_date' and 'run_id') and uses the timdex-dataset-api library to read records from the TIMDEXDataset. By introducing a new CLI command, it doesn't require the feature flagging approach, allowing the existing CLI commands and helper functions to remain untouched for v1 purposes.

How this addresses that need:
* Implement 'bulk-update' CLI command
* Add unit tests for 'bulk-update'
* Move v1 CLI unit tests to 'test_cli_v1'
* Update README

Side effects of this change:
* TIM remains backwards v1 compatible but will now support v2 runs.

Relevant ticket(s):
* https://mitlibraries.atlassian.net/browse/TIMX-428

1 parent b6e8893 commit f787266
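
The "bulk update" described in the commit message can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not TIM's implementation: the record shape (dicts with `timdex_record_id`, `action`, `run_date`, and `run_id` keys) and the helper names `select_run` and `partition_by_action` are hypothetical, and the real command reads records through the timdex-dataset-api library rather than from an in-memory list.

```python
from collections.abc import Iterable, Iterator


def select_run(records: Iterable[dict], run_date: str, run_id: str) -> Iterator[dict]:
    """Yield only the records that belong to one run (hypothetical filter)."""
    for record in records:
        if record["run_date"] == run_date and record["run_id"] == run_id:
            yield record


def partition_by_action(records: Iterable[dict]) -> tuple[list[dict], list[dict]]:
    """Split records into (to_index, to_delete) using an assumed 'action' field."""
    to_index: list[dict] = []
    to_delete: list[dict] = []
    for record in records:
        if record["action"] == "index":
            to_index.append(record)
        elif record["action"] == "delete":
            to_delete.append(record)
        else:
            # Fail loudly rather than silently dropping an unknown action.
            raise ValueError(f"Unsupported action: {record['action']!r}")
    return to_index, to_delete


if __name__ == "__main__":
    # Illustrative data only; the identifiers and dates are made up.
    sample = [
        {"timdex_record_id": "a", "action": "index", "run_date": "2024-12-01", "run_id": "r1"},
        {"timdex_record_id": "b", "action": "delete", "run_date": "2024-12-01", "run_id": "r1"},
        {"timdex_record_id": "c", "action": "index", "run_date": "2024-12-02", "run_id": "r2"},
    ]
    to_index, to_delete = partition_by_action(select_run(sample, "2024-12-01", "r1"))
    print(len(to_index), "to index;", len(to_delete), "to delete")
```

Keeping the filter and the partition as separate steps means a single pass over one run's records can feed both the indexing and the deleting side, which is the property that lets one command replace two.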

9 files changed, with 1391 additions and 998 deletions

Makefile

Lines changed: 10 additions & 1 deletion
@@ -90,4 +90,13 @@ dist-stage:
 publish-stage:
 	docker login -u AWS -p $$(aws ecr get-login-password --region us-east-1) $(ECR_URL_STAGE)
 	docker push $(ECR_URL_STAGE):latest
-	docker push $(ECR_URL_STAGE):`git describe --always`
+	docker push $(ECR_URL_STAGE):`git describe --always`
+
+##############################
+# Local Opensearch commands
+##############################
+
+local-opensearch: # Run a local instance of Opensearch via Docker Compose
+	docker pull opensearchproject/opensearch:latest
+	docker pull opensearchproject/opensearch-dashboards:latest
+	docker compose --env-file .env up

Pipfile

Lines changed: 1 addition & 0 deletions
@@ -21,6 +21,7 @@ mypy = "*"
 pre-commit = "*"
 pytest = "*"
 ruff = "*"
+timdex-dataset-api = { git = "git+https://github.com/MITLibraries/timdex-dataset-api.git"}
 vcrpy = "*"
 
 [requires]

Pipfile.lock

Lines changed: 968 additions & 739 deletions
Some generated files are not rendered by default.

README.md

Lines changed: 13 additions & 24 deletions
@@ -9,6 +9,7 @@ TIMDEX! Index Manager (TIM) is a Python CLI application for managing TIMDEX indi
 - To update dependencies: `make update`
 - To run unit tests: `make test`
 - To lint the repo: `make lint`
+- To run local OpenSearch with Docker: `make local-opensearch`
 - To run the app: `pipenv run tim --help`
 
 **Important note:** The sections that follow provide instructions for running OpenSearch **locally with Docker**. These instructions are useful for testing. Please make sure the environment variable `TIMDEX_OPENSEARCH_ENDPOINT` is **not** set before proceeding.
@@ -92,34 +93,21 @@ For a more detailed example with test data, please refer to the Confluence docum
 
 ### Required ENV
 
-```
-# Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
-WORKSPACE=dev
+```shell
+WORKSPACE=### Set to `dev` for local development, this will be set to `stage` and `prod` in those environments by Terraform.
 ```
 
 ## Optional ENV
 
-```
-# Only needed if AWS region changes from the default of us-east-1.
-AWS_REGION=
-
-# Chunk size limit for sending requests to the bulk indexing endpoint, in bytes. Defaults to 104857600 (100 * 1024 * 1024) if not set.
-OPENSEARCH_BULK_MAX_CHUNK_BYTES=
-
-# Maximum number of retries when sending requests to the bulk indexing endpoint. Defaults to 50 if not set.
-OPENSEARCH_BULK_MAX_RETRIES=
-
-# Only used for OpenSearch requests that tend to take longer than the default timeout of 10 seconds, such as bulk or index refresh requests. Defaults to 120 seconds if not set.
-OPENSEARCH_REQUEST_TIMEOUT=
-
-# The ingest process logs the # of records indexed every nth record. Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging. Defaults to 1000 if not set.
-STATUS_UPDATE_INTERVAL=
-
-# If using a local Docker OpenSearch instance, this isn't needed. Otherwise set to OpenSearch instance endpoint without the http scheme (e.g., "search-timdex-env-1234567890.us-east-1.es.amazonaws.com"). Can also be passed directly to the CLI via the `--url` option.
-TIMDEX_OPENSEARCH_ENDPOINT=
-
-# If set to a valid Sentry DSN, enables Sentry exception monitoring This is not needed for local development.
-SENTRY_DSN=
+```shell
+AWS_REGION=### Only needed if AWS region changes from the default of us-east-1.
+OPENSEARCH_BULK_MAX_CHUNK_BYTES=### Chunk size limit for sending requests to the bulk indexing endpoint, in bytes. Defaults to 104857600 (100 * 1024 * 1024) if not set.
+OPENSEARCH_BULK_MAX_RETRIES=### Maximum number of retries when sending requests to the bulk indexing endpoint. Defaults to 50 if not set.
+OPENSEARCH_INITIAL_ADMIN_PASSWORD=###If using a local Docker OpenSearch instance, this must be set (for versions >= 2.12.0).
+OPENSEARCH_REQUEST_TIMEOUT=### Only used for OpenSearch requests that tend to take longer than the default timeout of 10 seconds, such as bulk or index refresh requests. Defaults to 120 seconds if not set.
+STATUS_UPDATE_INTERVAL=### The ingest process logs the # of records indexed every nth record. Set this env variable to any integer to change the frequency of logging status updates. Can be useful for development/debugging. Defaults to 1000 if not set.
+TIMDEX_OPENSEARCH_ENDPOINT=### If using a local Docker OpenSearch instance, this isn't needed. Otherwise set to OpenSearch instance endpoint without the http scheme (e.g., "search-timdex-env-1234567890.us-east-1.es.amazonaws.com"). Can also be passed directly to the CLI via the `--url` option.
+SENTRY_DSN=### If set to a valid Sentry DSN, enables Sentry exception monitoring This is not needed for local development.
 ```
 
 ## CLI commands
@@ -153,6 +141,7 @@ Usage: tim [OPTIONS] COMMAND [ARGS]...
 ╭─ Bulk record processing commands ───────────────────────────────────────────────────────────────────────────────────────────────────╮
 │ bulk-index Bulk index records into an index. │
 │ bulk-delete Bulk delete records from an index. │
+│ bulk-update Bulk update records from an index. │
 ╰─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
 ```
 
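
The optional settings documented in the README diff above fall back to stated defaults when unset. One way such a lookup could work is sketched below. The variable names and default values come from the README text; the `DEFAULTS` table and the `read_int_setting` helper are assumptions for illustration and are not taken from TIM's code.

```python
import os
from collections.abc import Mapping

# Defaults as documented in the README's "Optional ENV" section.
DEFAULTS: dict[str, int] = {
    "OPENSEARCH_BULK_MAX_CHUNK_BYTES": 100 * 1024 * 1024,  # 104857600 bytes
    "OPENSEARCH_BULK_MAX_RETRIES": 50,
    "OPENSEARCH_REQUEST_TIMEOUT": 120,  # seconds
    "STATUS_UPDATE_INTERVAL": 1000,  # records between status log lines
}


def read_int_setting(name: str, env: Mapping[str, str] = os.environ) -> int:
    """Return an integer setting, using the documented default when unset or blank."""
    raw = env.get(name)
    if raw is None or raw.strip() == "":
        return DEFAULTS[name]
    return int(raw)


if __name__ == "__main__":
    for setting in DEFAULTS:
        print(setting, "=", read_int_setting(setting))
```

Treating a blank value the same as an unset one matters here because the README lists each variable with an empty right-hand side, so a copied `.env` file may define the names without values.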

compose.yaml

Lines changed: 2 additions & 0 deletions
@@ -10,6 +10,7 @@ services:
 - discovery.type=single-node
 - bootstrap.memory_lock=true
 - "OPENSEARCH_JAVA_OPTS=-Xms512m -Xmx512m"
+- OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
 volumes:
 - opensearch-local-data:/usr/share/opensearch/data
 networks:
@@ -21,6 +22,7 @@ services:
 environment:
 - "DISABLE_SECURITY_DASHBOARDS_PLUGIN=true"
 - 'OPENSEARCH_HOSTS=["http://opensearch:9200"]'
+- OPENSEARCH_INITIAL_ADMIN_PASSWORD=${OPENSEARCH_INITIAL_ADMIN_PASSWORD}
 networks:
 - opensearch-local-net
 volumes:

pyproject.toml

Lines changed: 2 additions & 3 deletions
@@ -7,7 +7,7 @@ disallow_untyped_defs = true
 exclude = ["tests/"]
 
 [[tool.mypy.overrides]]
-module = ["ijson", "smart_open"]
+module = ["ijson", "smart_open", "timdex_dataset_api"]
 ignore_missing_imports = true
 
 [tool.pytest.ini_options]
@@ -27,8 +27,6 @@ select = ["ALL", "PT"]
 
 ignore = [
 # default
-"ANN101",
-"ANN102",
 "COM812",
 "D107",
 "N812",
@@ -41,6 +39,7 @@ ignore = [
 "D102",
 "D103",
 "D104",
+"G004",
 "PLR0912",
 "PLR0913",
 "PLR0915",
