Skip to content
This repository was archived by the owner on Mar 13, 2020. It is now read-only.

Commit 2a6f430

Browse files
authored
Merge pull request #24 from PageUpPeopleOrg/feature/OSC-1362-add-exec-steps-and-stats
[OSC-1362] - Add ability to track execution steps and statistics
2 parents 1b9e486 + c47a29c commit 2a6f430

25 files changed

+798
-327
lines changed

Pipfile

Lines changed: 1 addition & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -8,6 +8,7 @@ pytest = "*"
88
pep8 = "*"
99
pylint = "*"
1010
autopep8 = "*"
11+
rope = "*"
1112

1213
[packages]
1314
psycopg2-binary = "==2.8.2"

Pipfile.lock

Lines changed: 60 additions & 61 deletions
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.

README.md

Lines changed: 87 additions & 14 deletions
Original file line numberDiff line numberDiff line change
@@ -7,29 +7,31 @@ A utility that persists state of a data pipeline execution and uses them to dete
77
## Usage
88

99
```
10-
$ python -m dpo [options] {db-connection-string} <command> [command-parameters]
10+
$ python -m dpo [options] <db-connection-string> <command> [command-args]
1111
```
1212

1313
- `options` include:
1414
- `--help | -h`: displays help menu.
1515
- `--log-level | -l`: choose program's logging level, from CRITICAL, ERROR, WARNING, INFO, DEBUG; default is INFO.
1616
- `db-connection-string`: a [PostgreSQL Db Connection String](http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html#module-sqlalchemy.dialects.postgresql.psycopg2) of the format `postgresql+psycopg2://user:password@host:port/dbname`
1717
- `command` is the function to be performed by the utility. The currently supported values are:
18-
- `init-execution`: Marks the start of a new execution by creating a record for the same in the given database. Returns an `execution-id` which is a GUID identifier of the new execution.
19-
- `get-last-successful-execution`: Finds the last successful data pipeline execution. Returns an `execution-id` which is a GUID identifier of the new execution, if found; else returns and empty string.
20-
- `get-execution-last-updated-timestamp`: Returns the `last-updated-on` timestamp with timezone of the given `execution-id`. Raises error if given `execution-id` is invalid.
21-
- `execution-id`: a GUID identifier of an existing data pipeline execution.
22-
- `persist-models`: Saves models of the given `model-type` within the given `execution-id` by persisting hashed checksums of the given models.
18+
- `init-execution`: Marks the start of a new execution. Returns an `execution-id` which is a GUID identifier of the new execution.
19+
- `get-last-successful-execution`: Finds the last successful execution. Returns an `execution-id` which is a GUID identifier of the new execution, if found; else returns and empty string.
20+
- `get-execution-completion-timestamp`: Returns the `last-updated-on` timestamp with timezone of the given `execution-id`. Raises error if given `execution-id` is invalid.
21+
- `execution-id`: a GUID identifier of an existing execution.
22+
- `init-step`: Saves models of the given `model-type` within the given `execution-id` by persisting hashed checksums of the given models.
2323
- `execution-id`: identifier of an existing data pipeline execution, ideally as returned by the `init` command.
24-
- `model-type`: type of models being processed, choose from `LOAD`, `TRANSFORM`.
24+
- `step-name`: name of step being processed, choose from `LOAD`, `TRANSFORM`.
2525
- `base-path`: absolute or relative path to the models e.g.: `./load`, `/home/local/transform`, `C:/path/to/models`
2626
- `model-patterns`: one or more unix-style search patterns _(relative to `base-path`)_ for model files. models within a model-type must be named uniquely regardless of their file extension. e.g.: `*.txt`, `**/*.txt`, `./relative/path/to/some_models/**/*.csv`, `relative/path/to/some/more/related/models/**/*.sql`
27-
- `compare-models`: Compares the hashed checksums of models between two executions. Returns comma-separated string of changed model names.
28-
- `previous-execution-id`: identifier of an existing data pipeline execution, ideally as returned by the `get-last-successful-execution` command.
29-
- `current-execution-id`: identifier of an existing data pipeline execution, ideally as returned by the `init` command.
30-
- `model-type`: type of models being processed, choose from `LOAD`, `TRANSFORM`.
31-
- `complete-execution`: Marks the completion of an existing execution by updating a record for the same in the given database. Returns nothing unless there's an error.
32-
- `execution-id`: a GUID identifier of an existing data pipeline execution as returned by the `init` command.
27+
- `compare-step-models`: Compares the hashed checksums of models between two executions' steps. Returns comma-separated string of changed model names.
28+
- `step-id`: identifier of an existing execution's step, as returned by the `init-step` command.
29+
- `previous-execution-id`: identifier of an existing execution, ideally as returned by the `get-last-successful-execution` command.
30+
- `complete-step`: Marks the completion of an existing execution's step. Returns nothing unless there's an error.
31+
- `step-id`: a GUID identifier of an existing execution's step as returned by the `init-step` command.
32+
- `rows-processed`: an optional numeric value to indicate the number of rows processed during this step. supports a postgresql BIGINT type value.
33+
- `complete-execution`: Marks the completion of an existing execution. Returns nothing unless there's an error.
34+
- `execution-id`: a GUID identifier of an existing execution as returned by the `init-execution` command.
3335

3436
To get help, use:
3537

@@ -110,7 +112,7 @@ $ pytest
110112

111113
Before running integration tests, please ensure the following information is configured correctly:
112114

113-
- `tests/integration/test_integration.sh:7`
115+
- `tests/integration/test_integration.sh:8`
114116

115117
Please ensure that the database connection string points to a valid PostgreSQL instance with valid parameters.
116118

@@ -147,3 +149,74 @@ If you do not have `make` installed, you can substitute `make` with:
147149
```
148150
$ ./tests/integration/test_integration.sh
149151
```
152+
153+
## Alembic
154+
155+
### To upgrade to the latest schema
156+
157+
```bash
158+
alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL upgrade head
159+
```
160+
161+
### Updating the schema
162+
163+
Ensure any new tables inherit from the same Base used in `alembic/env.py`
164+
165+
```python
166+
from dpo.Shared import BaseEntity
167+
```
168+
169+
Whenever you make a schema change, run
170+
171+
```bash
172+
pipenv install .
173+
alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL revision -m "$REVISION_MESSAGE" --autogenerate
174+
```
175+
176+
check that the new version in `alembic/versions` is correct
177+
178+
### Downgrading the schema
179+
180+
Whenever you want to downgrade the schema
181+
182+
```bash
183+
alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL history # see the list of revision ids
184+
alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL current # see the current revision id
185+
alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL downgrade -1 # revert back one revision
186+
alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL downgrade $revision_id # revert back to a revision id, found using the history command
187+
```
188+
189+
### Inaccurate autogenerated revisions
190+
191+
Does your autogenerated revision not look right?
192+
193+
Try editing the function `use_schema` in `alembic/env.py`, this determines what alembic looks for in the database.
194+
195+
[Relevant Documentation](https://alembic.sqlalchemy.org/en/latest/api/runtime.html?highlight=include_schemas#alembic.runtime.environment.EnvironmentContext.configure.params.include_object)
196+
197+
### New models aren't showing up in upgrade section
198+
199+
Ensure all model classes inherit from the same Base that `alembic/env.py` imports, and that the following class
200+
properties are set
201+
202+
```python
203+
__tablename__ = 'your_mapped_table_name'
204+
__table_args__ = {'schema': Constants.DATA_PIPELINE_EXECUTION_SCHEMA_NAME}
205+
```
206+
207+
Also try importing the models into `alembic/env.py`, eg
208+
209+
```python
210+
from dpo.entities import ModelChecksumEntity
211+
from dpo.entities import DataPipelineExecutionEntity
212+
```
213+
214+
### Alembic won't pick up my change
215+
216+
[Alembic only supports some changes](https://alembic.sqlalchemy.org/en/latest/autogenerate.html#what-does-autogenerate-detect-and-what-does-it-not-detect)
217+
218+
Try adding raw sql in the `upgrade()` and `downgrade()` functions of your revision
219+
220+
```python
221+
op.execute(RAW_SQL)
222+
```

0 commit comments

Comments
 (0)