Merged
84 commits
3ae0d3d
transformation service epic specs
sbilge Oct 9, 2025
dc627fc
transformation service image
sbilge Oct 9, 2025
a360e37
question
sbilge Oct 9, 2025
43fb6b7
removed open question
sbilge Oct 21, 2025
03f79bc
detailed
sbilge Oct 22, 2025
5239e62
minor
sbilge Oct 22, 2025
35e2f95
more specific types
sbilge Oct 22, 2025
b290675
examples added to user journeys
sbilge Oct 24, 2025
72fbd6d
ets revisited
sbilge Nov 17, 2025
ffa654e
deleted outdated image
sbilge Nov 17, 2025
12fe137
collection naming convention
sbilge Nov 18, 2025
64c372f
stricter graph rules
sbilge Nov 18, 2025
a993831
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
a9be559
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
c42aebd
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
9f85310
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
362b677
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
7509573
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
1279660
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
1311b2b
removed mongodb specific terms
sbilge Nov 18, 2025
f3d6d7b
upper case on objects
sbilge Nov 18, 2025
cdfe689
minor Schema model change
sbilge Nov 18, 2025
6afb1ea
minor workflow model change
sbilge Nov 18, 2025
721f3e6
workflow naming schema
sbilge Nov 18, 2025
b5ae4c8
removed type
sbilge Nov 18, 2025
b3006eb
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
7492204
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
2b23a26
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
6f0c0ea
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
c789cd7
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
b101625
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 18, 2025
2269b12
validation steps detailed
sbilge Nov 19, 2025
2ce8a21
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
e1240a1
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
f21dd38
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
26bd754
attribute naming
sbilge Nov 19, 2025
bfa95c9
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
b799edb
obsolete sentence from previous version removed
sbilge Nov 19, 2025
6f4d74f
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
5c89da3
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
2088a69
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
91b802c
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 19, 2025
ac6e5bd
configuration change logic
sbilge Nov 20, 2025
4658982
removed manually populating collections
sbilge Nov 20, 2025
9febd3b
attribute naming
sbilge Nov 20, 2025
5b0e6f6
wording
sbilge Nov 20, 2025
d9ac88f
cross reference
sbilge Nov 20, 2025
a26ddc3
naming, plus configuration logic
sbilge Nov 21, 2025
776be06
edited configs definition
sbilge Nov 21, 2025
0fd0a9f
more explanation on created attribute
sbilge Nov 21, 2025
9c83cde
future extensions are added
sbilge Nov 21, 2025
23cf0e7
remove collections
sbilge Nov 25, 2025
0c3dbb4
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
e3ef19b
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
77905d0
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
7322dad
attribute renamed
sbilge Nov 25, 2025
d952859
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
391f864
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
8d56804
suggested renaming applied to entire text
sbilge Nov 25, 2025
7496fcb
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
ab54e9e
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
f49024e
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 25, 2025
42a4ac4
updated not included section
sbilge Nov 25, 2025
2e6c337
re-wording on validation
sbilge Nov 25, 2025
a2c8255
editing the entire document
sbilge Nov 26, 2025
54d5b3e
minor points added
sbilge Nov 26, 2025
8090173
db in memory
sbilge Nov 26, 2025
385cf2c
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 26, 2025
5c7b67b
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 26, 2025
06d7879
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 26, 2025
c3f6abb
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 26, 2025
8cf5e57
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 26, 2025
855a86d
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 26, 2025
2ae0133
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 27, 2025
f48f066
Model class separation
sbilge Nov 28, 2025
d65312c
Config model renamed
sbilge Nov 28, 2025
6480fa0
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 28, 2025
dc62fe7
merged user journeys
sbilge Nov 28, 2025
fe21c97
BaseModel clarification
sbilge Nov 28, 2025
6728af3
final step on the first journey
sbilge Nov 28, 2025
0c2bb9d
Update 81-question-mark-chrysalis/technical_specification.md
sbilge Nov 28, 2025
127af7e
epic number bump
sbilge Dec 1, 2025
d10cade
index update
sbilge Dec 1, 2025
90ad67e
Merge branch 'main' into question_mark_chrysalis
sbilge Dec 1, 2025
254 changes: 254 additions & 0 deletions 83-question-mark-chrysalis/technical_specification.md
@@ -0,0 +1,254 @@
# Preliminary Experimental Metadata (EM) Transformation Service (Question Mark Chrysalis)
**Epic Type:** Implementation Epic

Epic planning and implementation follow the
[Epic Planning and Marathon SOP](https://ghga.pages.hzdr.de/internal.ghga.de/main/sops/development/epic_planning/).

## Scope

### Outline:

The goal of this epic is to implement a service for the transformation of Experimental Metadata (EM) from one representation/model to another.
This service shall provide functionality around the configurable workflow concept from the `metldata` library to enable these transformations.
To this end, it needs to keep track of the transformation workflows, the original and derived data and schemas, and the workflow routes in its database.


### Terminology

`Experimental Metadata (EM)`: Data describing the experimental process of generating the Research Data archived in GHGA.

`Experimental Metadata Ingress Models (EMIM)`: Schemas that define the structure and format of experimental metadata as it enters the GHGA system from external sources. They follow the SchemaPack format. Currently, GHGA supports only a single data ingress model for experimental metadata, which is the [ghga-metadata-schema](https://github.com/ghga-de/ghga-metadata-schema), but this epic is part of the effort to lift that restriction.

`Universal Discovery Model`: A common target representation to which all the EMs are transformed, serving as a basis for data queries and data display via the GHGA Data Portal.

`EMPack`: Experimental metadata in datapack format. It follows a relational data schema corresponding to one of the EMIMs.

`AnnotatedEMPack`: A datapack enriched with additional information such as accessions that help integrate the EMPack into the archive.

`Workflow`: Instructions on how to produce a certain datapack/schemapack representation from another datapack/schemapack. The workflows are defined by the `metldata` library on which the transformation service will be built.

`Route`: Information on which specific workflow is used to produce a datapack compliant with the output model when presented with a datapack compliant with the input model.

`Model`: An object that holds all information the transformation service needs about an EMIM or a transformed model.

### Included/Required:

- Implement a service that persists the `AnnotatedEMPack`, `Model`, `Workflow`, and `Route` entities.
- Implement logic to process incoming AnnotatedEMPacks, execute the corresponding workflows and store the resulting AnnotatedEMPacks if required.
- Implement code to manage changes in models, workflows and routes, e.g. by creating universal schema descriptions from the ingress schema descriptions and re-transforming the AnnotatedEMPacks.


### Core entities

#### `Model`

- Purpose: describes how models are represented in the service.

data structure:
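```python
from pydantic import BaseModel
from schemapack.spec.schemapack import SchemaPack  # SchemaPack model from the schemapack library


class RawModel(BaseModel):
    """A model as read from the YAML configuration."""

    name: str
    description: str | None
    is_ingress: bool
    version: str | None
    schema: SchemaPack | None  # None for non-EMIMs until derived
    publish: bool


class Model(RawModel):
    """A validated model as stored in the database."""

    schema: SchemaPack  # always defined after derivation
    order: int
```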
- RawModel.name: Unique identifier / human-readable name of the model.
- RawModel.description: Human-readable description.
- RawModel.is_ingress: Boolean, true for EMIMs.
- RawModel.version: Schema version, None if the model is not an EMIM.
- RawModel.schema: Schema in SchemaPack format, None if the model is not an EMIM and its schema has not yet been computed.
- RawModel.publish: Boolean indicating whether the AnnotatedEMPacks conforming to the schema should be published.
- Model.order: Order in a topological ordering of the schemas in the transformation graph
- Model.schema: Schema in SchemaPack format, always defined after derivation.

`RawModel` is used when deserializing the configuration from YAML files, as it does not include the `order` field that is derived after configuration validation. `RawModel` must include a validator to ensure that EMIMs (`is_ingress = true`) always have a `schema` defined. For non-EMIMs, the `schema` is None when the configuration is deserialized.

`Model` is used for the DAO. Since the models are stored in the database only after the config is validated, the `Model` class dictates that `order` and `schema` are not null.
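
The validator rule for `RawModel` could look roughly like the sketch below. It is only an illustration: a standard-library dataclass stands in for the Pydantic model, a plain `dict` stands in for `SchemaPack`, and the name `RawModelSketch` is hypothetical. Rejecting a schema on a non-EMIM is an assumption derived from the statement that such schemas are None at deserialization time.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class RawModelSketch:
    """Stand-in for RawModel; `schema` is a plain dict instead of a SchemaPack."""

    name: str
    is_ingress: bool
    publish: bool
    description: str | None = None
    version: str | None = None
    schema: dict | None = None

    def __post_init__(self) -> None:
        # EMIMs must bring their own schema.
        if self.is_ingress and self.schema is None:
            raise ValueError(f"EMIM {self.name!r} must define a schema")
        # Assumption: schemas of non-EMIMs are derived later, so the raw
        # configuration must not set them.
        if not self.is_ingress and self.schema is not None:
            raise ValueError(f"Model {self.name!r} is not an EMIM and must not define a schema")
```

In the service itself, the same two checks would presumably be expressed as a Pydantic validator on `RawModel`.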

#### `Workflow`

- Purpose: holds a metldata-compatible workflow definition used to transform schema/data.

data structure:
```python
import metldata.workflow.base
from pydantic import BaseModel


class Workflow(BaseModel):
    """A named metldata workflow definition."""

    name: str
    description: str | None
    workflow: metldata.workflow.base.Workflow
```

- name: Unique identifier / human-readable name for the workflow. Indicates the purpose of the workflow for easier debugging and understanding of the operations.
- description: Optional longer description of the workflow.
- workflow: Workflow definition in metldata format.


#### `Route`

- Purpose: describe the routes for transforming models and their corresponding data by referencing the workflow, the input and output models involved in each transformation by name.

data structure:

```python
from pydantic import BaseModel


class Route(BaseModel):
    """A directed edge of the transformation graph."""

    name: str
    input_model_name: str
    output_model_name: str
    workflow_name: str
```

- name: Unique identifier / human-readable name for the workflow route. Follows the format `{input_model_name}:{workflow_name}:{output_model_name}`.

  The model must include a validator to ensure name consistency. Only one representation of the name should be provided in the transformation configuration. The validator automatically derives any missing parts: if only the composite name is given, it extracts the individual names; if only the individual names are provided, it constructs the composite name.

- input_model_name: Name of the input model accepted by the route.
- output_model_name: Name of the output model produced by the route.
- workflow_name: Name of the workflow to apply on the route.
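
The name derivation performed by the `Route` validator can be sketched as a plain function. This is a minimal illustration, not the service's validator: the function name `resolve_route_names` is hypothetical, it assumes that model and workflow names never contain a colon, and it interprets "only one representation" strictly by rejecting input that supplies both forms.

```python
from __future__ import annotations


def resolve_route_names(
    name: str | None = None,
    input_model_name: str | None = None,
    output_model_name: str | None = None,
    workflow_name: str | None = None,
) -> dict[str, str]:
    """Return all four route fields, deriving whichever representation is missing."""
    parts = (input_model_name, workflow_name, output_model_name)
    if name is not None and any(part is not None for part in parts):
        raise ValueError("Give either the composite name or the individual names")
    if name is not None:
        # Only the composite name was given: split it into its three parts.
        pieces = name.split(":")
        if len(pieces) != 3 or not all(pieces):
            raise ValueError(f"Route name {name!r} must be 'input:workflow:output'")
        input_model_name, workflow_name, output_model_name = pieces
    elif all(part is not None for part in parts):
        # Only the individual names were given: build the composite name.
        name = f"{input_model_name}:{workflow_name}:{output_model_name}"
    else:
        raise ValueError("Composite name or all three individual names required")
    return {
        "name": name,
        "input_model_name": input_model_name,
        "output_model_name": output_model_name,
        "workflow_name": workflow_name,
    }
```

Both `resolve_route_names(name="emim:embed:public")` and `resolve_route_names(input_model_name="emim", workflow_name="embed", output_model_name="public")` yield the same four fields; the model and workflow names here are made up for the example.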


#### `RawConfig`

- Purpose: describes a new transformation configuration (without derived information)

data structure:

```python
from pydantic import BaseModel


class RawConfig(BaseModel):
    """Transformation configuration as read from YAML, without derived data."""

    models: list[RawModel]
    workflows: list[Workflow]
    routes: list[Route]
```

- RawConfig.models: List of RawModel objects defining the transformation graph.
- RawConfig.workflows: List of Workflow objects available.
- RawConfig.routes: List of Route objects composing the graph.

#### `AnnotatedEMPack`

- Purpose: holds incoming and derived annotated EM datapacks that are to be processed or published.

data structure:

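```python
from uuid import UUID

from pydantic import BaseModel
from schemapack.spec.datapack import DataPack  # DataPack model from the schemapack library


class AnnotatedEMPack(BaseModel):
    """An EM datapack plus the annotations needed to integrate it into the archive."""

    id: UUID  # `uuid` is a module; the type is uuid.UUID
    model_name: str
    original_id: str | None
    data: DataPack
    annotation: dict
```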
- id: Unique identifier for the AnnotatedEMPack.
- model_name: Unique name of the model the EMPack conforms to.
- original_id: ID of the original incoming EMPack it was derived from. None if it is an original EMPack.
- data: EMPack conforming to the model identified by `model_name`.
- annotation: Object with information from other models held by the service.


#### Database Layer

The em-transformation-service interacts with a database containing the core entities described above.

On startup, the service reads all Models, Workflows and Routes from a config YAML and from the database. These objects should be kept completely in memory and they are only written back to the database if they are in a consistent, validated state with all necessary information (derived schemas, ordering) already computed.

If the content of the config YAML file corresponds to what is stored in the database, the service continues to use the known-valid and pre-computed config from the database. If there are any changes in the YAML config, it is validated and the missing information (derived schemas, ordering) is re-computed. If there are any errors, the YAML config is rejected; otherwise it is stored in the database as the new configuration.

AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory only as long as it is needed for running the full transformation graph for a single original piece of data, as explained in a later section.

#### Transformation Configuration

The transformations are configured through the models, workflows and routes.

Routes define how to transform an input model into an output model by specifying the workflow to apply and the expected output model.

Routes also define a graph where models are nodes and routes are directed edges. The source nodes of this graph are the EMIMs. The graph must not contain any "diamonds", i.e. there must be at most one directed path between any two models in the graph. This also implies that the graph does not contain any cycles, i.e. is a directed acyclic graph (DAG).

We enforce this stronger unique-path property because "diamond" shapes indicate unnecessary redundancy that should be avoided, and because it eliminates any ambiguity in how data are transformed.

This graph structure is used to validate the transformation configuration and to determine the processing order of schemas during model derivation and configuration changes.
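
One way to check the unique-path property is sketched below, under the assumption that routes are given as `(input_model_name, output_model_name)` pairs; the function name `has_unique_paths` is hypothetical. The sketch walks the graph from every model and reports a violation as soon as any model is reached a second time, which happens for diamonds, duplicate routes, and cycles alike.

```python
def has_unique_paths(edges: list[tuple[str, str]]) -> bool:
    """Return True if there is at most one directed path between any two models."""
    successors: dict[str, list[str]] = {}
    for source, target in edges:
        successors.setdefault(source, []).append(target)
        successors.setdefault(target, [])

    for start in successors:
        seen = {start}
        stack = [start]
        while stack:
            node = stack.pop()
            for nxt in successors[node]:
                if nxt in seen:
                    # Either a second path from `start` to `nxt`,
                    # or a cycle leading back to an already visited model.
                    return False
                seen.add(nxt)
                stack.append(nxt)
    return True
```

The check runs one traversal per model, which is quadratic in the worst case but should be negligible for the handful of models a configuration is expected to contain. Two EMIMs feeding the same derived model pass the check, since there is still only one path from each EMIM to that model.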

#### User Journeys: Transformation Configuration Validation & Model Derivation

When triggered manually, or when the configuration changes, the service validates the configuration and derives the output schemas for all routes.

1. Validate the transformation configuration:

   1. Verify that all workflows and EMIMs (i.e., models with `is_ingress = true`) referenced by the routes exist in the configuration.
   2. Ensure that EMIMs do not appear as the output models of any routes.
   3. Validate schemas using the SchemaPack library and workflows using the metldata library.
   4. Confirm that the graph meets the unique-path requirement and is therefore acyclic.
   5. Compute a topological order of the graph’s nodes (e.g. using Kahn’s algorithm) so that each model is processed only after all its dependencies when visiting the models in that order.
   6. Update the `order` field of each model based on the computed topological order.
   7. If the validation fails:
      1. Reject the configuration.
      2. Re-load the previous valid configuration from the database.
      3. If there is no configuration stored in the database yet, stop the service.

2. Traverse the transformation graph starting from the EMIMs, following the topological order. For each route:
   1. Use metldata to run the workflow identified by `workflow_name` on the input model’s schema referenced by `input_model_name` and compute the derived schema.
   2. Update the output model’s schema accordingly.
   3. If any errors or conflicts occur, abort the operation and report them.

3. Store the validated configuration with the derived schemas and ordering in the database.
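
The ordering step can be illustrated with Kahn's algorithm. The sketch below is a minimal version under the assumption that models are given by name and routes as `(input_model_name, output_model_name)` pairs whose names have already been checked against the model list; the function name `topological_order` is hypothetical. It returns the value that would be written to each model's `order` field.

```python
from collections import deque


def topological_order(model_names: list[str], edges: list[tuple[str, str]]) -> dict[str, int]:
    """Map each model name to its position in a topological order (Kahn's algorithm)."""
    in_degree = {name: 0 for name in model_names}
    successors: dict[str, list[str]] = {name: [] for name in model_names}
    for source, target in edges:
        successors[source].append(target)
        in_degree[target] += 1

    # Models without incoming routes (the EMIMs) can be processed first.
    queue = deque(name for name in model_names if in_degree[name] == 0)
    order: dict[str, int] = {}
    while queue:
        node = queue.popleft()
        order[node] = len(order)
        for nxt in successors[node]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                queue.append(nxt)

    if len(order) != len(in_degree):
        # Some models were never released, so the graph contains a cycle.
        raise ValueError("The transformation graph contains a cycle")
    return order
```

Visiting the models by ascending `order` then guarantees that the input schema of every route has been derived before the route itself is processed.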


The validation (including the model derivation) should be protected by a global lock that would prevent other instances of the service from running the validation and model derivation in parallel, and would also stop the processing of AnnotatedEMPack transformations while the lock is active.

This lock could be implemented via a special "lock" collection in the database that would contain a certain document while the lock is active.
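
The lock-document idea can be sketched as a context manager. Everything in this example is an assumption made for illustration: `InMemoryLockCollection` is a stand-in for a database collection that enforces unique document IDs, and `LOCK_ID` and `global_lock` are hypothetical names. In a real deployment the atomicity would have to come from the database's uniqueness guarantee, not from a Python dict.

```python
from contextlib import contextmanager

LOCK_ID = "config-validation-lock"  # hypothetical ID of the lock document


class InMemoryLockCollection:
    """Stand-in for a database collection with unique document IDs."""

    def __init__(self) -> None:
        self._docs: dict[str, dict] = {}

    def insert(self, doc_id: str, doc: dict) -> None:
        if doc_id in self._docs:
            raise KeyError(doc_id)  # mimics a duplicate-ID error
        self._docs[doc_id] = doc

    def delete(self, doc_id: str) -> None:
        self._docs.pop(doc_id, None)

    def exists(self, doc_id: str) -> bool:
        return doc_id in self._docs


@contextmanager
def global_lock(collection: InMemoryLockCollection, owner: str):
    """Hold the lock document for the duration of the `with` block."""
    try:
        collection.insert(LOCK_ID, {"owner": owner})
    except KeyError as error:
        raise RuntimeError("Validation is already running elsewhere") from error
    try:
        yield
    finally:
        # Always release the lock, even if validation fails.
        collection.delete(LOCK_ID)
```

Code that transforms AnnotatedEMPacks could call `collection.exists(LOCK_ID)` before starting work and postpone processing while the document is present.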


#### User Journeys: Service Consumer Transforms An Original AnnotatedEMPack

The transformation operation is triggered when the service receives either an “original AnnotatedEMPack upsert” event or a “re-transform AnnotatedEMPack” event.

1. Build the dirty map: Query the AnnotatedEMPacks collection for all data where `original_id` matches the incoming original data's ID. Extract each data's ID and model name, then create a mapping from model names to data IDs. This "dirty map" tracks data that need to be either re-created or deleted.

2. Initialize the transformed map: Create a mapping from model names to data. This map will accumulate all data generated during this transformation operation. Initialize it with a single entry: the original model name mapped to the incoming original data.

3. Traverse the transformation graph: Starting from the EMIM and following the topological order, perform the following for each model:
   1. Get the route that has the current model as its output. The EMIM of the original data has no such route and is already in the transformed map; for every other model, exactly one route must exist due to the unique-path property; raise an error otherwise.
   2. Retrieve the data from the transformed map using the route’s input model. This is the “input data” for this step. It should always exist at this point if the topological order was computed correctly, and we can raise an error if this is not the case.
   3. Compute the transformed data using the input data and the route's workflow.
   4. If the current model exists in the dirty map, remove it from there and update the transformed map with the newly transformed data, keeping the existing data ID.
   5. Otherwise, add the transformed data to the “transformed map” with a newly generated ID, setting its `model_name` to the current model name and its `original_id` field to the original data ID.

4. Apply database updates: Upsert all transformed data that should be published, and delete any remaining data listed in the dirty map.

This approach avoids deleting "dirty" data during recreation, preventing resources from temporarily disappearing mid-transformation. It also keeps intermediate data in memory, reducing unnecessary database operations.
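
The four steps above can be sketched as follows. This is an illustration under several assumptions: AnnotatedEMPacks are plain dicts, routes are `(input_model_name, workflow_name, output_model_name)` tuples, workflows are ordinary callables standing in for metldata workflows, and `ordered_models` contains only the models reachable from the original's EMIM. The sketch looks up the route that produces the current model, which is how it interprets the sub-steps that assign the current model name to the transformed data. The function name `transform_original` is hypothetical.

```python
from uuid import uuid4


def transform_original(original, routes, workflows, ordered_models, stored):
    """Return the transformed map and the remaining dirty map for one original pack."""
    # Step 1: dirty map from model name to the ID of previously derived data.
    dirty = {
        pack["model_name"]: pack["id"]
        for pack in stored
        if pack["original_id"] == original["id"]
    }
    # Step 2: transformed map, seeded with the original data.
    transformed = {original["model_name"]: original}

    # Step 3: visit the models in topological order.
    incoming = {output: (source, workflow) for source, workflow, output in routes}
    for model in ordered_models:
        if model == original["model_name"]:
            continue  # the original data needs no transformation
        if model not in incoming:
            raise ValueError(f"No route produces model {model!r}")
        source, workflow_name = incoming[model]
        if source not in transformed:
            raise ValueError(f"Input data for model {model!r} is missing")
        data = workflows[workflow_name](transformed[source]["data"])
        # Reuse the ID of dirty data if present, otherwise generate a new one.
        pack_id = dirty.pop(model, None) or str(uuid4())
        transformed[model] = {
            "id": pack_id,
            "model_name": model,
            "original_id": original["id"],
            "data": data,
        }
    return transformed, dirty
```

Step 4 would then upsert those entries of the transformed map whose model is marked for publishing and delete the data whose IDs remain in the dirty map.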

#### User Journeys: Configuration Change

This operation is triggered when the service detects a change to the configuration YAML file at startup.

The service detects configuration changes by comparing the configuration in the database with the configuration stored in the config YAML.

When comparing the configuration, the objects should be compared recursively for equality. This is done automatically in Python when the config is deserialized as a dict or Pydantic object. However, care must be taken to not compare fields that do not exist in the raw configuration (order and derived schemas). We could implement a custom equality method to properly compare Models with RawModels in that regard.
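
A comparison that ignores the derived fields could look like the sketch below. It assumes both configurations are available as plain dicts (for example after dumping the Pydantic objects) and sorts each list by name so that a mere reordering in the YAML file does not count as a change; that sorting is a design choice of this sketch, not something the specification requires. The function names are hypothetical.

```python
def strip_derived(stored_model: dict) -> dict:
    """Reduce a stored Model to the fields a RawModel would carry."""
    raw = {key: value for key, value in stored_model.items() if key != "order"}
    if not raw["is_ingress"]:
        raw["schema"] = None  # derived schemas are not part of the raw config
    return raw


def _by_name(items: list[dict]) -> list[dict]:
    """Sort config entries by their unique name."""
    return sorted(items, key=lambda item: item["name"])


def config_changed(raw_config: dict, stored_config: dict) -> bool:
    """Compare the YAML config with the stored config, ignoring derived fields."""
    stored_as_raw = {
        "models": _by_name([strip_derived(m) for m in stored_config["models"]]),
        "workflows": _by_name(stored_config["workflows"]),
        "routes": _by_name(stored_config["routes"]),
    }
    raw_sorted = {
        key: _by_name(raw_config[key]) for key in ("models", "workflows", "routes")
    }
    return raw_sorted != stored_as_raw
```

The same effect could be achieved with a custom equality method on the model classes; the function-based variant merely keeps the example short.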

When they differ, the service adopts the new configuration and performs:

1. Re-derivation of all transformed schemas (as described in "Transformation Configuration Validation & Model Derivation")
2. Re-transformation of all original AnnotatedEMPacks (as described in "Service Consumer Transforms An Original AnnotatedEMPack")


### Not included (but possible future extensions):

- A REST API to retrieve the currently used public schemas, which would be useful for frontend developers for inspection purposes.
- A REST API to push configuration data instead of loading it from YAML files at startup.
- Configuration versioning to maintain a full history of changes and enable rollbacks to previous versions.
4 changes: 3 additions & 1 deletion README.md
@@ -97,4 +97,6 @@ This repository is part of the [Epic Planning and Marathon SOP](https://ghga.pag
- (78) [Royal Angelfish](./78-royal-angelfish/technical_specification.md): Basic User and Access Management
- (79) [Miniature Horse](./79-miniature-horse/technical_specification.md): Data Portal Accessibility, Responsiveness, and Semantics Overhaul and SOPs
- (80) [Hedgehog Seahorse](./80-hedgehog-seahorse/technical_specification.md): GHGA Connector refactoring/rewrite and upload path implementation
- (81) [Laotian Rock Rat](./81-laotian-rock-rat/technical_specification.md): Client Retry Logic Refactoring for Ratelimiting
- (83) [Question Mark Chrysalis](./83-question-mark-chrysalis/technical_specification.md): Preliminary Experimental Metadata (EM) Transformation Service