
Conversation

@sbilge
Member

@sbilge sbilge commented Oct 9, 2025

No description provided.

@sbilge sbilge requested a review from mephenor October 9, 2025 13:34
Member

@mephenor mephenor left a comment


This was more of a structural level review and will need a second round that will focus more on the actual content.
In the meantime, I'll have another look at the backing documents to have the correct mental model for the content review.

@sbilge sbilge requested a review from mephenor October 22, 2025 19:01
@sbilge sbilge marked this pull request as ready for review October 22, 2025 19:01
@sbilge sbilge requested a review from lkuchenb October 28, 2025 08:36
@sbilge sbilge requested a review from Cito November 5, 2025 13:32
@sbilge sbilge requested a review from lkuchenb November 17, 2025 20:28
@sbilge
Member Author

sbilge commented Nov 17, 2025

Ready for another round. I updated the document based on Christoph's design/suggestions.

Member

@Cito Cito left a comment


Thank you, looks pretty clear already, but I have made a couple of suggestions below.

And we still need to define the mechanism that detects/triggers config changes.

@Cito Cito self-requested a review November 18, 2025 14:42
Member

@Cito Cito left a comment


Sorry, wanted to request the above changes.

Member

@Cito Cito left a comment


Minor suggestions to make the wording more consistent.

@sbilge sbilge requested a review from Cito November 26, 2025 11:23
Member

@Cito Cito left a comment


I like your simplifications and improved wording.

Added a few more suggestions below, primarily concerning our new idea of storing only known-valid model/route/workflow configs in the database.

- schema: Schema in SchemaPack format, None if it is not an EMIM
- publish: Boolean indicating whether the AnnotatedEMPacks conforming to the schema should be published
- order: Order in a topological ordering of the schemas in the transformation graph, None if not yet computed

Member

@Cito Cito Nov 26, 2025


Note: We decided to store models (and workflows and routes) in the database only after the config has been validated and order and derived schemas have been computed.

This means we can actually require schema and order to be not null when the Pydantic model is used for the database. When it is used to read and validate the config, the schema should be None (default) when is_ingress is false, and we don't need the order.

Therefore, should we create two Model classes? E.g. one `RawModel` (used for reading the config) which would not have `order` and where `schema` can be null, and a validator that checks that `is_ingress` is true if and only if `schema` is not null. The `Model` (used for the DAO) could then inherit from `RawModel`, with a stricter `schema` field (not null), and add the `order` field.

Member


Maybe add a note that "BaseModel" is just the name of the base class in Pydantic, and does not have anything to do with our use of the term "Model".

- Config.version: Configuration identifier (always 0 in the first implementation).
- Config.created: Datetime when this config was validated and activated.
- Config.config: Transformation configuration data.

Member


Since we decided to store only valid configs in the database, and do the validation and processing in memory, we actually don't need these classes any more, only a single class to validate the config YAML.

class Config(BaseModel):
   models: list[RawModel]
   workflows: list[Workflow]
   routes: list[Route]

Note that the models here should be RawModels that do not have derived schemapacks or the order field (see above). We could also call the class RawConfig to make this clear.
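Sketched with the suggested `RawConfig` naming (assuming Pydantic v2; the item entries are kept as plain dicts here for brevity, in the service they would be the RawModel, Workflow, and Route classes):

```python
from pydantic import BaseModel


class RawConfig(BaseModel):
    """Top-level shape of the transformation config YAML, before any
    derived schemas or ordering have been computed."""

    models: list[dict]
    workflows: list[dict]
    routes: list[dict]


# The YAML file would be parsed first (e.g. with yaml.safe_load) and the
# resulting mapping validated in one step:
raw = RawConfig.model_validate(
    {
        "models": [{"name": "emim1", "is_ingress": True}],
        "workflows": [{"name": "wf1"}],
        "routes": [{"name": "emim1:wf1:public"}],
    }
)
```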


#### Database Layer

The em-transformation-service interacts with a database containing core entities. Models, workflows, and routes are loaded from a configuration YAML file and kept fully in memory. The configuration YAML file is also stored in the database after a successful validation along with the re-derived models if there is a change. AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. They are loaded into memory on demand. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as it is needed for running a full transformation graph corresponding to a single original piece of data.
Member


Suggested change
The em-transformation-service interacts with a database containing the core entities described above.
On startup, the service reads all Models, Workflows and Routes from a config YAML and from the database. These objects should be kept completely in memory and they are only written back to the database if they are in a consistent, validated state with all necessary information (derived schemas, ordering) already computed.
If the content of the config YAML file corresponds to what is stored in the database, the service continues to use the known-valid and pre-computed config from the database. If there are any changes in the YAML config, it will be validated, and the missing information (derived schemas, ordering) re-computed. If there are any errors, the YAML config will be rejected; otherwise it will be stored in the database as the new configuration.
AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as it is needed for running a full transformation graph corresponding to a single original piece of data, as explained in a later section.

Member Author


On startup, the service reads all Models, Workflows and Routes from a config YAML and from the database.

Do we:

  1. read the YAML into the service, validate it, and then save it to the database when the database is not populated at all; and
  2. when the configuration changes, load the YAML again, compare it with the global config, and if different, validate it and save it to the database?

Or do we:

  1. save the YAML to the database first and then read it into memory from the database when the initial database is empty; and
  2. on configuration changes, update the YAML in the database and then load it from there?

How can we read it from the database if we do not write the YAML to the database before doing the validation?
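For comparison, the first option could be made concrete roughly like this (all callables are hypothetical stand-ins; the point is that the database is only ever written after in-memory validation succeeds, so it never has to hold an unvalidated YAML):

```python
def startup(read_yaml, read_db_config, validate_and_derive, store_config):
    """Return the active config, preferring the pre-computed DB copy.

    read_yaml: returns the raw YAML config content
    read_db_config: returns the stored, known-valid config, or None if
        the database is not populated yet
    validate_and_derive: validates the raw config in memory, computing
        derived schemas and ordering; raises on any error
    store_config: persists a validated config to the database
    """
    yaml_config = read_yaml()
    db_config = read_db_config()
    if db_config is not None and db_config.source == yaml_config:
        # Unchanged: reuse the known-valid, pre-computed config.
        return db_config
    validated = validate_and_derive(yaml_config)  # raises if invalid
    store_config(validated)  # only valid configs ever reach the database
    return validated
```

Under this reading, the database is never consulted for an unvalidated YAML; it only ever serves as the cache of the last known-valid, pre-computed configuration.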

1. Use metldata to run the workflow identified by `workflow_name` on the input model’s schema referenced by `input_model_name` and compute the derived schema.
2. Update the output model’s schema accordingly.
3. If any errors or conflicts occur, abort the operation and report them.

Member


The configuration with the updated order and derived schemapacks should be only written to the database if it was successfully validated and there were no errors during model derivation.

If there are any errors, the configuration in memory is re-loaded from the database. If there is no configuration stored in the database yet, the service should stop at this point.

Member

@Cito Cito Nov 26, 2025


Consider including "model derivation" into the "validation" step, since a failure during this step should be handled exactly like a failure in the validation step above. Since we already pre-compute the order in the "validation" step, it feels natural to also pre-compute the derived models in that step. A separation of these steps feels awkward to me.

Member

@Cito Cito Nov 26, 2025


Add a paragraph here that mentions that the validation (including the model derivation) should be protected by some kind of global lock that would prevent other instances of the service from running the validation and model derivation in parallel, and would also stop the processing of AnnotatedEMPack transformations while the lock is active.

This lock could be implemented via a special "lock" collection in the database that would contain a certain document while the lock is active.
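The "lock collection" idea can be sketched as follows: acquiring the lock is an atomic insert of a document with a fixed `_id`, which fails while another instance holds it. Collection and field names are hypothetical, and a minimal in-memory stand-in replaces the real MongoDB client (which would raise `DuplicateKeyError` on the conflicting insert instead):

```python
class LockHeld(Exception):
    """Raised when the global config lock is already taken."""


class LockCollection:
    """In-memory stand-in for a MongoDB collection keyed by _id."""

    def __init__(self):
        self._docs = {}

    def insert_one(self, doc):
        # Mimics the atomic insert against a unique _id index.
        if doc["_id"] in self._docs:
            raise LockHeld("config validation lock is already held")
        self._docs[doc["_id"]] = doc

    def delete_one(self, query):
        self._docs.pop(query["_id"], None)


def with_config_lock(locks, run_validation):
    """Run validation (incl. model derivation) only while holding the lock."""
    locks.insert_one({"_id": "config_validation"})  # acquire (atomic)
    try:
        return run_validation()
    finally:
        locks.delete_one({"_id": "config_validation"})  # always release
```

The same check would gate AnnotatedEMPack processing: while the lock document exists, transformation consumers pause.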


- name: Unique identifier / human-readable name for the workflow route. Follows the format of `{input_model_name}:{workflow_name}:{output_model_name}`.

The model must include a validator to ensure name consistency. Only one representation of the name should be provided in the transformation configuration. The validator automatically derives any missing parts: if only the composite name is given, it extracts the individual names; if only the individual names are provided, it constructs the composite name.
Member


I get the format validation part, but am a bit confused about the derivation part:
From the model, all parts need to be provided, so there's no need to derive missing parts, only to check if the names and the composite names are consistent, which also seems redundant.
Couldn't the composite name always be derived given all three other names?


#### Database Layer

The em-transformation-service interacts with a database containing core entities. Models, workflows, and routes are loaded from a configuration YAML file and kept fully in memory. The configuration YAML file is also stored in the database after a successful validation along with the re-derived models if there is a change. AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. They are loaded into memory on demand. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as it is needed for running a full transformation graph corresponding to a single original piece of data.
Member


Suggested change
The em-transformation-service interacts with a database containing core entities. Models, workflows, and routes are loaded from a configuration YAML file and kept fully in memory. The configuration YAML file is also stored in the database after a successful validation along with the re-derived models if there is a change. AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation output. They are loaded into memory on demand. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as needed to run a full transformation graph corresponding to a single original data.

The last sentence here means what exactly, i.e. when is the intermediate data dropped?
Phrasing is also a bit weird.
If I understand correctly, intermediate data is only kept along the path induced by one AnnotatedEMPack?
Is there even a case where this could be shared across different induced paths/graphs?

Comment on lines +175 to +179
1. Verify that all workflows and EMIMs (i.e., models with is_ingest = true) referenced by the routes exist in the configuration.
2. Ensure that EMIMs do not appear as the output models of any routes.
3. Validate schemas using the SchemaPack library and workflows using the metldata library.
4. Confirm that the graph meets the unique-path requirement and is therefore acyclic.
5. Compute a topological order of the graph’s nodes, e.g. using Kahn’s algorithm, so that when visiting the models in that order, each model is processed only after all of its dependencies.
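Steps 4 and 5 can be combined in one pass; this sketch assumes the unique-path requirement means every model is the output of at most one route, so the in-degree check catches the diamond case, and Kahn's algorithm itself detects cycles (some nodes are never emitted):

```python
from collections import deque


def topological_order(nodes: set[str], edges: list[tuple[str, str]]) -> list[str]:
    """Return a topological order of the route graph, or raise ValueError."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    if any(d > 1 for d in indegree.values()):
        # Diamond shape: some model is the output of more than one route.
        raise ValueError("a model is the output of more than one route")
    queue = deque(n for n in nodes if indegree[n] == 0)  # EMIMs and isolated nodes
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        # Nodes stuck with positive in-degree lie on a cycle.
        raise ValueError("transformation graph contains a cycle")
    return order
```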
Member


It's been a while, but isn't a topological sorting via Kahn's only possible if the graph is a DAG, so failing to apply Kahn's should mean the graph has cycles?

Don't know if it also deals with the "diamond" shape problem.
How would we check for that?


1. Verify that all workflows and EMIMs (i.e., models with is_ingest = true) referenced by the routes exist in the configuration.
2. Ensure that EMIMs do not appear as the output models of any routes.
3. Validate schemas using the SchemaPack library and workflows using the metldata library.
Member


What's the level of validation that we can check for workflows with metldata, parsability?


2. Initialize the transformed map: Create a mapping from model names to data. This map will accumulate all data generated during this transformation operation. Initialize it with a single entry: the original model name mapped to the incoming original data.

3. Traverse the transformation graph: Starting from the EMIM and following the topological order, perform the following for each model:
Member


Following the topological order means skipping over entries in that order that don't match any data accumulated so far?
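That is how I would read it too; sketched here with hypothetical stand-ins (`run_workflow`, a `routes` mapping from each output model to its single inbound route, which the unique-path requirement guarantees exists):

```python
def traverse(order, routes, original_model, original_data, run_workflow):
    """Walk the topological order, skipping models not fed by this EMIM.

    routes maps output_model -> (input_model, workflow_name).
    Returns the accumulated "transformed" map of model name -> data.
    """
    transformed = {original_model: original_data}
    for model in order:
        if model == original_model:
            continue
        route = routes.get(model)
        if route is None or route[0] not in transformed:
            # No inbound route, or its input was never produced in this
            # run: the model is not reachable from this EMIM, so skip it.
            continue
        input_model, workflow = route
        transformed[model] = run_workflow(workflow, transformed[input_model])
    return transformed
```

Skipped entries never accumulate data, so anything downstream of them is skipped transitively as well.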
