Question mark chrysalis #157
Merged
84 commits
3ae0d3d
transformation service epic specs
sbilge dc627fc
transformation service image
sbilge a360e37
question
sbilge 43fb6b7
removed open question
sbilge 03f79bc
detailed
sbilge 5239e62
minor
sbilge 35e2f95
more specific types
sbilge b290675
examples added to user journeys
sbilge 72fbd6d
ets revisited
sbilge ffa654e
deleted outdated image
sbilge 12fe137
collection naming convention
sbilge 64c372f
stricter graph rules
sbilge a993831
Update 81-question-mark-chrysalis/technical_specification.md
sbilge a9be559
Update 81-question-mark-chrysalis/technical_specification.md
sbilge c42aebd
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 9f85310
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 362b677
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 7509573
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 1279660
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 1311b2b
removed mongodb specific terms
sbilge f3d6d7b
upper case on objects
sbilge cdfe689
minor Schema model change
sbilge 6afb1ea
minor workflow model change
sbilge 721f3e6
workflow naming schema
sbilge b5ae4c8
removed type
sbilge b3006eb
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 7492204
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 2b23a26
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 6f0c0ea
Update 81-question-mark-chrysalis/technical_specification.md
sbilge c789cd7
Update 81-question-mark-chrysalis/technical_specification.md
sbilge b101625
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 2269b12
validation steps detailed
sbilge 2ce8a21
Update 81-question-mark-chrysalis/technical_specification.md
sbilge e1240a1
Update 81-question-mark-chrysalis/technical_specification.md
sbilge f21dd38
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 26bd754
attribute naming
sbilge bfa95c9
Update 81-question-mark-chrysalis/technical_specification.md
sbilge b799edb
obsolete sentence from previous version removed
sbilge 6f4d74f
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 5c89da3
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 2088a69
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 91b802c
Update 81-question-mark-chrysalis/technical_specification.md
sbilge ac6e5bd
configuration change logic
sbilge 4658982
removed manually populating collections
sbilge 9febd3b
attribute naming
sbilge 5b0e6f6
wording
sbilge d9ac88f
cross reference
sbilge a26ddc3
naming, plus configuration logic
sbilge 776be06
edited configs definition
sbilge 0fd0a9f
more explanation on created attribute
sbilge 9c83cde
future extensions are added
sbilge 23cf0e7
remove collections
sbilge 0c3dbb4
Update 81-question-mark-chrysalis/technical_specification.md
sbilge e3ef19b
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 77905d0
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 7322dad
attribute renamed
sbilge d952859
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 391f864
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 8d56804
suggested renaming applied to entire text
sbilge 7496fcb
Update 81-question-mark-chrysalis/technical_specification.md
sbilge ab54e9e
Update 81-question-mark-chrysalis/technical_specification.md
sbilge f49024e
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 42a4ac4
updated not included section
sbilge 2e6c337
re-wording on validation
sbilge a2c8255
editing the entire document
sbilge 54d5b3e
minor points added
sbilge 8090173
db in memory
sbilge 385cf2c
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 5c7b67b
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 06d7879
Update 81-question-mark-chrysalis/technical_specification.md
sbilge c3f6abb
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 8cf5e57
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 855a86d
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 2ae0133
Update 81-question-mark-chrysalis/technical_specification.md
sbilge f48f066
Model class separation
sbilge d65312c
Config model renamed
sbilge 6480fa0
Update 81-question-mark-chrysalis/technical_specification.md
sbilge dc62fe7
merged user journeys
sbilge fe21c97
BaseModel clarification
sbilge 6728af3
final step on the first journey
sbilge 0c2bb9d
Update 81-question-mark-chrysalis/technical_specification.md
sbilge 127af7e
epic number bump
sbilge d10cade
index update
sbilge 90ad67e
Merge branch 'main' into question_mark_chrysalis
sbilge
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,254 @@ | ||
| # Preliminary Experimental Metadata (EM) Transformation Service (Question Mark Chrysalis) | ||
| **Epic Type:** Implementation Epic | ||
|
|
||
| Epic planning and implementation follow the | ||
| [Epic Planning and Marathon SOP](https://ghga.pages.hzdr.de/internal.ghga.de/main/sops/development/epic_planning/). | ||
|
|
||
| ## Scope | ||
|
|
||
| ### Outline: | ||
|
|
||
| The goal of this epic is to implement a service for the transformation of Experimental Metadata (EM) from one representation/model to another. | ||
| This service shall provide functionality around the configurable workflow concept from the `metldata` library to enable these transformations. | ||
| To this end, it needs to keep track of the transformation workflows, the original and derived data and schemas, and the workflow routes in its database. | ||
|
|
||
|
|
||
| ### Terminology | ||
|
|
||
| `Experimental Metadata (EM)`: Data describing the experimental process of generating the Research Data archived in GHGA. | ||
|
|
||
| `Experimental Metadata Ingress Models (EMIM)`: Schemas that define the structure and format of experimental metadata as it enters the GHGA system from external sources. They follow the SchemaPack format. Currently, GHGA supports only a single data ingress model for experimental metadata, which is the [ghga-metadata-schema](https://github.com/ghga-de/ghga-metadata-schema), but this epic is part of the effort to lift that restriction. | ||
|
|
||
| `Universal Discovery Model`: A common target representation to which all the EMs are transformed, serving as a basis for data queries and data display via the GHGA Data Portal. | ||
|
|
||
| `EMPack`: Experimental metadata in datapack format. It follows a relational data schema corresponding to one of the EMIMs. | ||
|
|
||
| `AnnotatedEMPack`: A datapack enriched with additional information such as accessions that help integrate the EMPack into the archive. | ||
|
|
||
| `Workflow`: Instructions on how to produce a certain datapack/schemapack representation from another datapack/schemapack. The workflows are defined by the `metldata` library on which the transformation service will be built. | ||
|
|
||
| `Route`: Information on which specific workflow is used to produce a datapack compliant with the output model when presented with a datapack compliant with the input model. | ||
|
|
||
| `Model`: An object that holds all information necessary for the transformation service regarding an EMIM or transformed model. | ||
|
|
||
| ### Included/Required: | ||
|
||
|
|
||
| - Implement a service that persists the `AnnotatedEMPack`, `Model`, `Workflow`, and `Route` entities. | ||
| - Implement logic to process incoming AnnotatedEMPacks, execute the corresponding workflows and store the resulting AnnotatedEMPacks if required. | ||
| - Implement code to manage changes in models, workflows and routes, e.g. by creating universal schema descriptions from the ingress schema descriptions and re-transforming the AnnotatedEMPacks. | ||
|
|
||
|
|
||
| ### Core entities | ||
|
|
||
| #### `Model` | ||
|
|
||
| - purpose: describes how models are represented in the service. | ||
|
|
||
| data structure: | ||
```python
from pydantic import BaseModel
from schemapack.spec.schemapack import SchemaPack  # import path assumed


class RawModel(BaseModel):
    """Model as deserialized from the YAML configuration."""

    name: str
    description: str | None
    is_ingress: bool
    version: str | None
    schema: SchemaPack | None
    publish: bool


class Model(RawModel):
    """Validated model as stored in the database."""

    schema: SchemaPack
    order: int
```
| - RawModel.name: Unique identifier / human-readable name of the model. | ||
| - RawModel.description: Human-readable description. | ||
| - RawModel.is_ingress: Boolean, true for EMIMs. | ||
| - RawModel.version: Schema version, None if it is not an EMIM | ||
| - RawModel.schema: Schema in SchemaPack format, None if it is not an EMIM and not yet computed | ||
| - RawModel.publish: Boolean indicating whether the AnnotatedEMPacks conforming to the schema should be published | ||
| - Model.order: Order in a topological ordering of the schemas in the transformation graph | ||
| - Model.schema: Schema in SchemaPack format, always defined after derivation. | ||
|
|
||
| `RawModel` is used when deserializing the configuration from YAML files, as it does not include the `order` field that is derived after configuration validation. `RawModel` must include a validator to ensure that EMIMs (`is_ingress = true`) always have a `schema` defined. For non-EMIMs, `schema` is None when the configuration is deserialized. | ||
|
|
||
|
||
| `Model` is used for the DAO. Since the models are stored in the database only after the config is validated, the `Model` class dictates that `order` and `schema` are not null. | ||
|
|
||
| #### `Workflow` | ||
|
|
||
| - Purpose: holds a metldata-compatible workflow definition used to transform schema/data. | ||
|
|
||
| data structure: | ||
```python
import metldata.workflow.base  # module path as referenced in this spec
from pydantic import BaseModel


class Workflow(BaseModel):
    name: str
    description: str | None
    workflow: metldata.workflow.base.Workflow
```
|
|
||
| - name: Unique identifier / human-readable name for the workflow. Indicates the purpose of the workflow for easier debugging and understanding of the operations. | ||
| - description: Optional longer description of the workflow. | ||
| - workflow: Workflow definition in metldata format. | ||
|
|
||
|
|
||
| #### `Route` | ||
|
|
||
| - Purpose: describe the routes for transforming models and their corresponding data by referencing the workflow, the input and output models involved in each transformation by name. | ||
|
|
||
| data structure: | ||
|
|
||
```python
from pydantic import BaseModel


class Route(BaseModel):
    name: str
    input_model_name: str
    output_model_name: str
    workflow_name: str
```
|
|
||
| - name: Unique identifier / human-readable name for the workflow route. Follows the format of `{input_model_name}:{workflow_name}:{output_model_name}`. | ||
|
|
||
| The model must include a validator to ensure name consistency. Only one representation of the name should be provided in the transformation configuration. The validator automatically derives any missing parts: if only the composite name is given, it extracts the individual names; if only the individual names are provided, it constructs the composite name. | ||
|
||
|
|
||
| - input_model_name: Name of the input model accepted by the route. | ||
| - output_model_name: Name of the output model produced by the route. | ||
| - workflow_name: Name of the workflow to apply on the route. | ||
|
|
||
|
|
||
| #### `RawConfig` | ||
|
|
||
| - Purpose: describes a new transformation configuration (without derived information) | ||
|
|
||
| data structure: | ||
|
|
||
```python
from pydantic import BaseModel


class RawConfig(BaseModel):
    models: list[RawModel]
    workflows: list[Workflow]
    routes: list[Route]
```
|
|
||
| - RawConfig.models: List of RawModel objects defining the transformation graph. | ||
| - RawConfig.workflows: List of Workflow objects available. | ||
| - RawConfig.routes: List of Route objects composing the graph. | ||
|
|
||
|
||
| #### `AnnotatedEMPack` | ||
|
|
||
| Purpose: holds incoming and derived annotated EM datapacks that are to be processed or published. | ||
|
|
||
| data structure: | ||
|
|
||
```python
from uuid import UUID

from pydantic import BaseModel
from schemapack.spec.datapack import DataPack  # import path assumed


class AnnotatedEMPack(BaseModel):
    id: UUID
    model_name: str
    original_id: str | None
    data: DataPack
    annotation: dict
```
|
||
| - id: Unique identifier for the AnnotatedEMPack. | ||
| - model_name: Unique name of the model the EMPack conforms to | ||
| - original_id: ID of the original incoming EMPack it was derived from. None if it is an original EMPack | ||
| - data: EMPack conforming to the model identified by `model_name` | ||
| - annotation: Object with information from other models held by the service | ||
|
|
||
|
|
||
| #### Database Layer | ||
|
|
||
| The em-transformation-service interacts with a database containing the core entities described above. | ||
|
|
||
| On startup, the service reads all Models, Workflows and Routes from a config YAML and from the database. These objects should be kept completely in memory and they are only written back to the database if they are in a consistent, validated state with all necessary information (derived schemas, ordering) already computed. | ||
|
|
||
| If the content of the config YAML file corresponds to what is stored in the database, the service continues to use the known-valid and pre-computed config from the database. If there are any changes in the YAML config, it is validated and the missing information (derived schemas, ordering) is re-computed. If there are any errors, the YAML config is rejected; otherwise, it is stored in the database as the new configuration. | ||
|
|
||
| AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory only as long as it is needed for running the full transformation graph corresponding to a single original piece of data, as explained in a later section. | ||
|
|
||
| #### Transformation Configuration | ||
|
|
||
| The transformations are configured through the models, workflows and routes. | ||
|
|
||
| Routes define how to transform an input model into an output model by specifying the workflow to apply and the expected output model. | ||
|
|
||
| Routes also define a graph where models are nodes and routes are directed edges. The source nodes of this graph are the EMIMs. The graph must not contain any "diamonds", i.e. there must be at most one directed path between any two models in the graph. This also implies that the graph does not contain any cycles, i.e. is a directed acyclic graph (DAG). | ||
|
|
||
| We enforce this stronger unique-path property because "diamond" shapes indicate unnecessary redundancy that should be avoided, and because it eliminates any ambiguity in how data are transformed. | ||
|
|
||
| This graph structure is used to validate the transformation configuration and to determine the processing order of schemas during model derivation and configuration changes. | ||
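One way to check the unique-path property is to start a traversal from every model and reject the graph as soon as any model is reached a second time. The sketch below is illustrative only: `has_unique_paths` is a hypothetical helper, and routes are reduced to `(input_model_name, output_model_name)` tuples.

```python
def has_unique_paths(routes: list[tuple[str, str]]) -> bool:
    """Return True if at most one directed path exists between any two models."""
    successors: dict[str, list[str]] = {}
    nodes: set[str] = set()
    for source, target in routes:
        successors.setdefault(source, []).append(target)
        nodes.update((source, target))

    for start in nodes:
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for succ in successors.get(node, []):
                # Reaching a model twice from the same start means a second
                # path (a "diamond") or a cycle leading back to a seen model.
                if succ in seen:
                    return False
                seen.add(succ)
                stack.append(succ)
    return True
```

Because every model is used as a starting point, the check also catches cycles that are not reachable from any EMIM. Two routes between the same pair of models are rejected as well, since they would make the transformation ambiguous.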
|
|
||
| #### User Journeys: Transformation Configuration Validation & Model Derivation | ||
|
|
||
| When manually triggered—or when the configuration changes—the service derives the output schemas for all routes. | ||
|
|
||
| 1. Validate the transformation configuration: | ||
|
|
||
| 1. Verify that all workflows and EMIMs (i.e., models with `is_ingress = true`) referenced by the routes exist in the configuration. | ||
| 2. Ensure that EMIMs do not appear as the output models of any routes. | ||
| 3. Validate schemas using the SchemaPack library and workflows using the metldata library. | ||
| 4. Confirm that the graph meets the unique-path requirement and is therefore acyclic. | ||
| 5. Compute a topological order of the graph’s nodes—e.g. using Kahn’s algorithm—so that each model is processed only after all its dependencies when visiting the models in that order. | ||
| 6. Update the order field of each model based on the computed topological order. | ||
| 7. If the validation fails: | ||
| 1. Reject the configuration. | ||
| 2. Re-load the previous valid configuration from the database. | ||
| 3. If there is no configuration stored in the database yet, stop the service | ||
|
|
||
| 2. Traverse the transformation graph starting from the EMIMs, following the topological order. For each route: | ||
| 1. Use metldata to run the workflow identified by `workflow_name` on the input model’s schema referenced by `input_model_name` and compute the derived schema. | ||
| 2. Update the output model’s schema accordingly. | ||
| 3. If any errors or conflicts occur, abort the operation and report them. | ||
|
|
||
| 3. Store the validated configuration with the derived schemas and ordering in the database. | ||
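The ordering step of the validation can be sketched with Kahn's algorithm. The function name `topological_order` and the reduction of models to names and routes to `(input_model_name, output_model_name)` tuples are assumptions made for illustration; the sketch also assumes that every model referenced by a route has already been verified to exist.

```python
from collections import deque


def topological_order(models: list[str], routes: list[tuple[str, str]]) -> dict[str, int]:
    """Map each model name to its position in a topological order."""
    indegree = {model: 0 for model in models}
    successors: dict[str, list[str]] = {model: [] for model in models}
    for source, target in routes:
        successors[source].append(target)
        indegree[target] += 1

    # Models without incoming routes (the EMIMs) are processed first.
    queue = deque(model for model in models if indegree[model] == 0)
    order: dict[str, int] = {}
    while queue:
        model = queue.popleft()
        order[model] = len(order)
        for succ in successors[model]:
            indegree[succ] -= 1
            if indegree[succ] == 0:
                queue.append(succ)

    # Any model left unprocessed is part of a cycle.
    if len(order) != len(models):
        raise ValueError("The transformation graph contains a cycle")
    return order
```

The returned positions could be written to the `order` field of each model before the configuration is stored.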
|
|
||
|
|
||
| The validation (including the model derivation) should be protected by a global lock that would prevent other instances of the service from running the validation and model derivation in parallel, and would also stop the processing of AnnotatedEMPack transformations while the lock is active. | ||
|
|
||
| This lock could be implemented via a special "lock" collection in the database that would contain a certain document while the lock is active. | ||
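As a rough illustration of the lock-document idea, the sketch below uses an in-memory class in place of a real database collection. `LockCollectionSketch`, `global_lock` and the lock ID `config_validation` are all hypothetical names; a real implementation would rely on the database's own uniqueness guarantee for the document ID to make acquisition atomic across service instances.

```python
from contextlib import contextmanager
from datetime import datetime, timezone


class LockCollectionSketch:
    """In-memory stand-in for a database collection with unique document IDs."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def insert(self, doc: dict) -> None:
        if doc["_id"] in self._docs:
            raise RuntimeError("Lock is already held")
        self._docs[doc["_id"]] = doc

    def delete(self, doc_id: str) -> None:
        self._docs.pop(doc_id, None)

    def is_locked(self, doc_id: str) -> bool:
        return doc_id in self._docs


@contextmanager
def global_lock(collection: LockCollectionSketch, lock_id: str = "config_validation"):
    """Hold the lock document for the duration of the `with` block."""
    collection.insert({"_id": lock_id, "acquired_at": datetime.now(timezone.utc)})
    try:
        yield
    finally:
        # Always release the lock, even if validation or derivation fails.
        collection.delete(lock_id)
```

Transformation handlers could check `is_locked` before processing an AnnotatedEMPack and defer the work while the lock document exists.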
|
|
||
|
||
|
|
||
| #### User Journeys: Service Consumer Transforms An Original AnnotatedEMPack | ||
|
|
||
| The transformation operation is triggered when the service receives either an “original AnnotatedEMPack upsert” event or a “re-transform AnnotatedEMPack” event. | ||
|
|
||
| 1. Build the dirty map: Query the AnnotatedEMPacks collection for all data where `original_id` matches the incoming original data's ID. Extract each data's ID and model name, then create a mapping from model names to data IDs. This "dirty map" tracks data that need to be either re-created or deleted. | ||
|
|
||
| 2. Initialize the transformed map: Create a mapping from model names to data. This map will accumulate all data generated during this transformation operation. Initialize it with a single entry: the original model name mapped to the incoming original data. | ||
|
|
||
| 3. Traverse the transformation graph: Starting from the EMIM and following the topological order, perform the following for each model: | ||
|
||
| 1. Get the route that has the current model as its output. Due to the unique-path property, exactly one such route must exist for every derived model; raise an error otherwise. | ||
| 2. Retrieve the data from the transformed map using the route’s input model. This is the “input data” for this step. It should always exist at this point if the topological order was computed correctly, and we can raise an error at this point if this is not the case. | ||
| 3. Compute the transformed data using the input data and the route's workflow. | ||
| 4. If the current model exists in the dirty map, remove its entry from the dirty map and add the newly transformed data to the transformed map, reusing the ID of the existing data. | ||
| 5. Otherwise, add the transformed data to the transformed map with a newly generated ID, setting its `model_name` to the current model name and its `original_id` to the ID of the original data. | ||
|
|
||
| 4. Apply database updates: Upsert all transformed data that should be published, and delete any remaining data listed in the dirty map. | ||
|
|
||
| This approach avoids deleting "dirty" data during recreation, preventing resources from temporarily disappearing mid-transformation. It also keeps intermediate data in memory, reducing unnecessary database operations. | ||
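The journey above can be condensed into the following sketch. It is not the service implementation: packs are plain dicts, workflows are plain callables, and `transform_original` with all its parameters is hypothetical. It assumes that `ordered_models` lists only the models reachable from the original's EMIM in topological order, and that `incoming_route` maps each derived model to the route producing it, given as `(input_model_name, workflow_name)`.

```python
from typing import Any, Callable
from uuid import uuid4


def transform_original(
    original: dict[str, Any],
    stored: list[dict[str, Any]],
    ordered_models: list[str],
    incoming_route: dict[str, tuple[str, str]],
    workflows: dict[str, Callable[[Any], Any]],
    published: set[str],
) -> tuple[list[dict[str, Any]], list[str]]:
    """Return the packs to upsert and the IDs of stale packs to delete."""
    # 1. Dirty map: model name -> ID of data previously derived from this original.
    dirty = {p["model_name"]: p["id"] for p in stored if p["original_id"] == original["id"]}
    # 2. Transformed map: model name -> data produced in this run.
    transformed = {original["model_name"]: original}
    # 3. Visit the derived models in topological order.
    for model in ordered_models:
        if model == original["model_name"]:
            continue  # the original is already in the transformed map
        if model not in incoming_route:
            raise ValueError(f"No route produces model '{model}'")
        input_model, workflow_name = incoming_route[model]
        if input_model not in transformed:
            raise ValueError(f"Input data for model '{model}' is missing")
        data = workflows[workflow_name](transformed[input_model]["data"])
        # Reuse the ID of existing derived data, otherwise generate a new one.
        pack_id = dirty.pop(model, None) or str(uuid4())
        transformed[model] = {
            "id": pack_id,
            "model_name": model,
            "original_id": original["id"],
            "data": data,
        }
    # 4. Only published models are persisted; leftover dirty entries are deleted.
    to_upsert = [p for name, p in transformed.items() if name in published]
    return to_upsert, list(dirty.values())
```

Returning the two lists instead of writing to the database directly keeps the sketch free of side effects; the caller would apply the upserts and deletions in one final step.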
|
|
||
| #### User Journeys: Configuration Change | ||
|
|
||
| This operation is triggered when a change to the configuration YAML file is detected when the service is started. | ||
|
|
||
| The service detects configuration changes by comparing the configuration in the database with the configuration stored in the config YAML. | ||
|
|
||
| When comparing the configuration, the objects should be compared recursively for equality. This is done automatically in Python when the config is deserialized as a dict or Pydantic object. However, care must be taken to not compare fields that do not exist in the raw configuration (order and derived schemas). We could implement a custom equality method to properly compare Models with RawModels in that regard. | ||
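A comparison that ignores the derived fields might look like the sketch below. It works on configurations deserialized as plain dicts; `strip_derived`, `normalize` and `config_changed` are hypothetical helpers, and sorting the entries by name (so that list order does not count as a change) is an additional assumption.

```python
from typing import Any


def strip_derived(model: dict[str, Any]) -> dict[str, Any]:
    """Reduce a stored model to the fields present in the raw configuration."""
    raw = {key: value for key, value in model.items() if key != "order"}
    if not raw.get("is_ingress"):
        raw["schema"] = None  # derived schemas are not part of the raw config
    return raw


def normalize(config: dict[str, Any]) -> dict[str, Any]:
    """Bring a raw or stored configuration into a comparable form."""
    return {
        "models": sorted((strip_derived(m) for m in config["models"]), key=lambda m: m["name"]),
        "workflows": sorted(config["workflows"], key=lambda w: w["name"]),
        "routes": sorted(config["routes"], key=lambda r: r["name"]),
    }


def config_changed(raw_config: dict[str, Any], stored_config: dict[str, Any]) -> bool:
    """Return True if the YAML configuration differs from the stored one."""
    return normalize(raw_config) != normalize(stored_config)
```

Python compares the nested dicts and lists recursively, so no further traversal code is needed once the derived fields are stripped.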
|
|
||
| When they differ, the service adopts the new configuration and performs: | ||
|
|
||
| 1. Re-derivation of all transformed schemas (as described in "Model Derivation") | ||
| 2. Re-transformation of all original AnnotatedEMPacks (as described in "Service Consumer Transforms An Original AnnotatedEMPack") | ||
|
|
||
|
||
|
|
||
| ### Not included (but possible future extensions): | ||
|
|
||
| - A REST API to retrieve the currently used public schemas, which would be useful for frontend developers for inspection purposes. | ||
| - A REST API to push configuration data instead of loading it from YAML files at startup. | ||
| - Configuration versioning to maintain a full history of changes and enable rollbacks to previous versions. | ||