pageuppeople-opensource
diff --git a/Pipfile b/Pipfile
Lines changed: 1 addition & 0 deletions
diff --git a/Pipfile.lock b/Pipfile.lock
Lines changed: 60 additions & 61 deletions
diff --git a/README.md b/README.md
Lines changed: 87 additions & 14 deletions
@@ -8,6 +8,7 @@ pytest = "*"
 pep8 = "*"
 pylint = "*"
 autopep8 = "*"
+rope = "*"
 
 [packages]
 psycopg2-binary = "==2.8.2"
 
@@ -7,29 +7,31 @@ A utility that persists state of a data pipeline execution and uses them to dete
 ## Usage
 
 ```
-$ python -m dpo [options] {db-connection-string} <command> [command-parameters]
+$ python -m dpo [options] <db-connection-string> <command> [command-args]
 ```
 
 - `options` include:
   - `--help | -h`: displays help menu.
   - `--log-level | -l`: choose program's logging level, from CRITICAL, ERROR, WARNING, INFO, DEBUG; default is INFO.
 - `db-connection-string`: a [PostgreSQL Db Connection String](http://docs.sqlalchemy.org/en/latest/dialects/postgresql.html#module-sqlalchemy.dialects.postgresql.psycopg2) of the format `postgresql+psycopg2://user:password@host:port/dbname`
 - `command` is the function to be performed by the utility. The currently supported values are:
-  - `init-execution`: Marks the start of a new execution by creating a record for the same in the given database. Returns an `execution-id` which is a GUID identifier of the new execution.
-  - `get-last-successful-execution`: Finds the last successful data pipeline execution. Returns an `execution-id` which is a GUID identifier of the new execution, if found; else returns and empty string.
-  - `get-execution-last-updated-timestamp`: Returns the `last-updated-on` timestamp with timezone of the given `execution-id`. Raises error if given `execution-id` is invalid.
-    - `execution-id`: a GUID identifier of an existing data pipeline execution.
-  - `persist-models`: Saves models of the given `model-type` within the given `execution-id` by persisting hashed checksums of the given models.
+  - `init-execution`: Marks the start of a new execution. Returns an `execution-id` which is a GUID identifier of the new execution.
+  - `get-last-successful-execution`: Finds the last successful execution. Returns an `execution-id`, the GUID identifier of that execution, if found; otherwise returns an empty string.
+  - `get-execution-completion-timestamp`: Returns the `last-updated-on` timestamp with timezone of the given `execution-id`. Raises an error if the given `execution-id` is invalid.
+    - `execution-id`: a GUID identifier of an existing execution.
+  - `init-step`: Saves models of the given `step-name` within the given `execution-id` by persisting hashed checksums of the given models. Returns a `step-id` identifying the new step.
     - `execution-id`: identifier of an existing data pipeline execution, ideally as returned by the `init` command.
-    - `model-type`: type of models being processed, choose from `LOAD`, `TRANSFORM`.
+    - `step-name`: name of the step being processed; choose from `LOAD`, `TRANSFORM`.
     - `base-path`: absolute or relative path to the models e.g.: `./load`, `/home/local/transform`, `C:/path/to/models`
     - `model-patterns`: one or more unix-style search patterns _(relative to `base-path`)_ for model files. models within a model-type must be named uniquely regardless of their file extension. e.g.: `*.txt`, `**/*.txt`, `./relative/path/to/some_models/**/*.csv`, `relative/path/to/some/more/related/models/**/*.sql`
-  - `compare-models`: Compares the hashed checksums of models between two executions. Returns comma-separated string of changed model names.
-    - `previous-execution-id`: identifier of an existing data pipeline execution, ideally as returned by the `get-last-successful-execution` command.
-    - `current-execution-id`: identifier of an existing data pipeline execution, ideally as returned by the `init` command.
-    - `model-type`: type of models being processed, choose from `LOAD`, `TRANSFORM`.
-  - `complete-execution`: Marks the completion of an existing execution by updating a record for the same in the given database. Returns nothing unless there's an error.
-    - `execution-id`: a GUID identifier of an existing data pipeline execution as returned by the `init` command.
+  - `compare-step-models`: Compares the hashed checksums of models between two executions' steps. Returns a comma-separated string of changed model names.
+    - `step-id`: identifier of an existing execution's step, as returned by the `init-step` command.
+    - `previous-execution-id`: identifier of an existing execution, ideally as returned by the `get-last-successful-execution` command.
+  - `complete-step`: Marks the completion of an existing execution's step. Returns nothing unless there's an error.
+    - `step-id`: a GUID identifier of an existing execution's step as returned by the `init-step` command.
+    - `rows-processed`: an optional numeric value indicating the number of rows processed during this step. Supports a PostgreSQL `BIGINT` value.
+  - `complete-execution`: Marks the completion of an existing execution. Returns nothing unless there's an error.
+    - `execution-id`: a GUID identifier of an existing execution as returned by the `init-execution` command.
 
 To get help, use:
 
@@ -110,7 +112,7 @@ $ pytest
 
 Before running integration tests, please ensure the following information is configured correctly:
 
-- `tests/integration/test_integration.sh:7`
+- `tests/integration/test_integration.sh:8`
 
 Please ensure that the database connection string points to a valid PostgreSQL instance with valid parameters.
 
@@ -147,3 +149,74 @@ If you do not have `make` installed, you can substitute `make` with:
 ```
 $ ./tests/integration/test_integration.sh
 ```
+
+## Alembic
+
+### To upgrade to the latest schema
+
+```bash
+alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL upgrade head
+```
+
+### Updating the schema
+
+Ensure any new tables inherit from the same Base used in `alembic/env.py`:
+
+```python
+from dpo.Shared import BaseEntity
+```
+
+Whenever you make a schema change, run:
+
+```bash
+pipenv install .
+alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL revision -m "$REVISION_MESSAGE" --autogenerate
+```
+
+Check that the new version in `alembic/versions` is correct.
+
+### Downgrading the schema
+
+Whenever you want to downgrade the schema, run:
+
+```bash
+alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL history # see the list of revision ids
+alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL current # see the current revision id
+alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL downgrade -1 # revert back one revision
+alembic -c dpo/alembic.ini -x $DESTINATION_DB_URL downgrade $revision_id # revert back to a revision id, found using the history command
+```
+
+### Inaccurate autogenerated revisions
+
+Does your autogenerated revision not look right?
+
+Try editing the function `use_schema` in `alembic/env.py`; it determines what Alembic looks for in the database.
+
+[Relevant Documentation](https://alembic.sqlalchemy.org/en/latest/api/runtime.html?highlight=include_schemas#alembic.runtime.environment.EnvironmentContext.configure.params.include_object)
+
+### New models aren't showing up in upgrade section
+
+Ensure all model classes inherit from the same Base that `alembic/env.py` imports, and that the following class
+properties are set:
+
+```python
+__tablename__ = 'your_mapped_table_name'
+__table_args__ = {'schema': Constants.DATA_PIPELINE_EXECUTION_SCHEMA_NAME}
+```
+
+Also try importing the models into `alembic/env.py`, e.g.:
+
+```python
+from dpo.entities import ModelChecksumEntity
+from dpo.entities import DataPipelineExecutionEntity
+```
+
+### Alembic won't pick up my change
+
+[Alembic autogenerate only detects some changes](https://alembic.sqlalchemy.org/en/latest/autogenerate.html#what-does-autogenerate-detect-and-what-does-it-not-detect).
+
+Try adding raw SQL in the `upgrade()` and `downgrade()` functions of your revision:
+
+```python
+op.execute(RAW_SQL)
+```