
Conversation

@sbilge
Member

@sbilge sbilge commented Oct 9, 2025

No description provided.

@sbilge sbilge requested a review from mephenor October 9, 2025 13:34
Member

@mephenor mephenor left a comment


This was more of a structural level review and will need a second round that will focus more on the actual content.
In the meantime, I'll have another look at the backing documents to have the correct mental model for the content review.

@sbilge sbilge requested a review from mephenor October 22, 2025 19:01
@sbilge sbilge marked this pull request as ready for review October 22, 2025 19:01
@sbilge sbilge requested a review from lkuchenb October 28, 2025 08:36
@sbilge sbilge requested a review from Cito November 5, 2025 13:32
@sbilge sbilge requested a review from lkuchenb November 17, 2025 20:28
@sbilge
Member Author

sbilge commented Nov 17, 2025

Ready for another round. I updated the document based on Christoph's design/suggestions.

Member

@Cito Cito left a comment


Thank you, looks pretty clear already, but I have made a couple of suggestions below.

And we still need to define the mechanism that detects/triggers config changes.

@Cito Cito self-requested a review November 18, 2025 14:42
Member

@Cito Cito left a comment


Sorry, wanted to request the above changes.

Member

@Cito Cito left a comment


Minor suggestions to make the wording more consistent.

@sbilge sbilge requested a review from Cito November 26, 2025 11:23
Member

@Cito Cito left a comment


I like your simplifications and improved wording.

Added a few more suggestions below, primarily concerning our new idea of storing only known-valid model/route/workflow configs in the database.

- schema: Schema in SchemaPack format, None if it is not an EMIM
- publish: Boolean indicating whether the AnnotatedEMPacks conforming to the schema should be published
- order: Order in a topological ordering of the schemas in the transformation graph, None if not yet computed

Member

@Cito Cito Nov 26, 2025


Note: We decided to store models (and workflows and routes) in the database only after the config has been validated and order and derived schemas have been computed.

This means we can actually require schema and order to be not null when the Pydantic model is used for the database. When it is used to read and validate the config, the schema should be None (default) when is_ingress is false, and we don't need the order.

Therefore, should we create two Model classes? E.g. one `RawModel` (used for reading the config) which would not have `order` and where `schema` can be null, and a validator that checks that `is_ingress` is true if and only if `schema` is not null. The `Model` (used for the DAO) could then inherit from `RawModel`, with a stricter `schema` field (not null), and add the `order` field.

Member


Maybe add a note that "BaseModel" is just the name of the base class in Pydantic, and does not have anything to do with our use of the term "Model".

- Config.version: Configuration identifier (always 0 in the first implementation).
- Config.created: Datetime when this config was validated and activated.
- Config.config: Transformation configuration data.

Member


Since we decided to store only valid configs in the database, and do the validation and processing in memory, we actually don't need these classes any more, only a single class to validate the config YAML.

class Config(BaseModel):
   models: list[RawModel]
   workflows: list[Workflow]
   routes: list[Route]

Note that the models here should be RawModels that do not have derived schemapacks or the order field (see above). We could also call the class RawConfig to make this clear.
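Sketched with the suggested `RawConfig` naming (assuming Pydantic v2; the item entries are kept as plain dicts here for brevity, in the service they would be the RawModel, Workflow, and Route classes):

```python
from pydantic import BaseModel


class RawConfig(BaseModel):
    """Top-level shape of the transformation config YAML, before any
    derived schemas or ordering have been computed."""

    models: list[dict]
    workflows: list[dict]
    routes: list[dict]


# The YAML file would be parsed first (e.g. with yaml.safe_load) and the
# resulting mapping validated in one step:
raw = RawConfig.model_validate(
    {
        "models": [{"name": "emim1", "is_ingress": True}],
        "workflows": [{"name": "wf1"}],
        "routes": [{"name": "emim1:wf1:public"}],
    }
)
```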


#### Database Layer

The em-transformation-service interacts with a database containing core entities. Models, workflows, and routes are loaded from a configuration YAML file and kept fully in memory. The configuration YAML file is also stored in the database after a successful validation along with the re-derived models if there is a change. AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. They are loaded into memory on demand. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as it is needed for running a full transformation graph corresponding to a single original piece of data.
Member


Suggested change
The em-transformation-service interacts with a database containing the core entities described above.
On startup, the service reads all Models, Workflows and Routes from a config YAML and from the database. These objects should be kept completely in memory and they are only written back to the database if they are in a consistent, validated state with all necessary information (derived schemas, ordering) already computed.
If the content of the config YAML file corresponds to what is stored in the database, the service continues to use the known-valid and pre-computed config from the database. If there are any changes in the YAML config, it will be validated, and the missing information (derived schemas, ordering) re-computed. If there are any errors, the YAML config will be rejected; otherwise it will be stored in the database as the new configuration.
AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as it is needed for running a full transformation graph corresponding to a single original piece of data, as explained in a later section.

Member Author


On startup, the service reads all Models, Workflows and Routes from a config YAML and from the database.

Do we:

  1. read the YAML into the service, validate it, and then save it to the database when the database is not populated at all; and
  2. when the configuration changes, load the YAML again, compare it with the global config, and if different, validate it and save it to the database?

Or do we:

  1. save the YAML to the database first and then read it into memory from the database when the initial database is empty; and
  2. on configuration changes, update the YAML in the database and then load it from there?

How can we read it from the database if we do not write the YAML to the database before doing the validation?
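For comparison, the first option could be made concrete roughly like this (all callables are hypothetical stand-ins; the point is that the database is only ever written after in-memory validation succeeds, so it never has to hold an unvalidated YAML):

```python
def startup(read_yaml, read_db_config, validate_and_derive, store_config):
    """Return the active config, preferring the pre-computed DB copy.

    read_yaml: returns the raw YAML config content
    read_db_config: returns the stored, known-valid config, or None if
        the database is not populated yet
    validate_and_derive: validates the raw config in memory, computing
        derived schemas and ordering; raises on any error
    store_config: persists a validated config to the database
    """
    yaml_config = read_yaml()
    db_config = read_db_config()
    if db_config is not None and db_config.source == yaml_config:
        # Unchanged: reuse the known-valid, pre-computed config.
        return db_config
    validated = validate_and_derive(yaml_config)  # raises if invalid
    store_config(validated)  # only valid configs ever reach the database
    return validated
```

Under this reading, the database is never consulted for an unvalidated YAML; it only ever serves as the cache of the last known-valid, pre-computed configuration.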

1. Use metldata to run the workflow identified by `workflow_name` on the input model’s schema referenced by `input_model_name` and compute the derived schema.
2. Update the output model’s schema accordingly.
3. If any errors or conflicts occur, abort the operation and report them.

Member


The configuration with the updated order and derived schemapacks should be only written to the database if it was successfully validated and there were no errors during model derivation.

If there are any errors, the configuration in memory is re-loaded from the database. If there is no configuration stored in the database yet, the service should stop at this point.

Member

@Cito Cito Nov 26, 2025


Consider including "model derivation" into the "validation" step, since a failure during this step should be handled exactly like a failure in the validation step above. Since we already pre-compute the order in the "validation" step, it feels natural to also pre-compute the derived models in that step. A separation of these steps feels awkward to me.

Member

@Cito Cito Nov 26, 2025


Add a paragraph here that mentions that the validation (including the model derivation) should be protected by some kind of global lock that would prevent other instances of the service from running the validation and model derivation in parallel, and would also stop the processing of AnnotatedEMPack transformations while the lock is active.

This lock could be implemented via a special "lock" collection in the database that would contain a certain document while the lock is active.
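The "lock collection" idea can be sketched as follows: acquiring the lock is an atomic insert of a document with a fixed `_id`, which fails while another instance holds it. Collection and field names are hypothetical, and a minimal in-memory stand-in replaces the real MongoDB client (which would raise `DuplicateKeyError` on the conflicting insert instead):

```python
class LockHeld(Exception):
    """Raised when the global config lock is already taken."""


class LockCollection:
    """In-memory stand-in for a MongoDB collection keyed by _id."""

    def __init__(self):
        self._docs = {}

    def insert_one(self, doc):
        # Mimics the atomic insert against a unique _id index.
        if doc["_id"] in self._docs:
            raise LockHeld("config validation lock is already held")
        self._docs[doc["_id"]] = doc

    def delete_one(self, query):
        self._docs.pop(query["_id"], None)


def with_config_lock(locks, run_validation):
    """Run validation (incl. model derivation) only while holding the lock."""
    locks.insert_one({"_id": "config_validation"})  # acquire (atomic)
    try:
        return run_validation()
    finally:
        locks.delete_one({"_id": "config_validation"})  # always release
```

The same check would gate AnnotatedEMPack processing: while the lock document exists, transformation consumers pause.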


- name: Unique identifier / human-readable name for the workflow route. Follows the format of `{input_model_name}:{workflow_name}:{output_model_name}`.

The model must include a validator to ensure name consistency. Only one representation of the name should be provided in the transformation configuration. The validator automatically derives any missing parts: if only the composite name is given, it extracts the individual names; if only the individual names are provided, it constructs the composite name.
Member


I get the format validation part, but am a bit confused about the derivation part:
From the model, all parts need to be provided, so there's no need to derive missing parts, only to check if the names and the composite names are consistent, which also seems redundant.
Couldn't the composite name always be derived given all three other names?


#### Database Layer

The em-transformation-service interacts with a database containing core entities. Models, workflows, and routes are loaded from a configuration YAML file and kept fully in memory. The configuration YAML file is also stored in the database after a successful validation along with the re-derived models if there is a change. AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation outputs. They are loaded into memory on demand. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as it is needed for running a full transformation graph corresponding to a single original piece of data.
Member


Suggested change
The em-transformation-service interacts with a database containing core entities. Models, workflows, and routes are loaded from a configuration YAML file and kept fully in memory. The configuration YAML file is also stored in the database after a successful validation along with the re-derived models if there is a change. AnnotatedEMPacks are populated from incoming events (from the GHGA Study Repository) and transformation output. They are loaded into memory on demand. Only published data is stored persistently (using an outbox DAO). Intermediate transformed data is kept in memory as long as needed to run a full transformation graph corresponding to a single original data.

The last sentence here means what exactly, i.e. when is the intermediate data dropped?
Phrasing is also a bit weird.
If I understand correctly, intermediate data is only kept along the path induced by one AnnotatedEMPack?
Is there even a case where this could be shared across different induced paths/graphs?

Comment on lines +175 to +179
1. Verify that all workflows and EMIMs (i.e., models with is_ingest = true) referenced by the routes exist in the configuration.
2. Ensure that EMIMs do not appear as the output models of any routes.
3. Validate schemas using the SchemaPack library and workflows using the metldata library.
4. Confirm that the graph meets the unique-path requirement and is therefore acyclic.
5. Compute a topological order of the graph’s nodes, e.g. using Kahn’s algorithm, so that when visiting the models in that order, each model is processed only after all of its dependencies.
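Steps 4 and 5 can be combined in one pass; this sketch assumes the unique-path requirement means every model is the output of at most one route, so the in-degree check catches the diamond case, and Kahn's algorithm itself detects cycles (some nodes are never emitted):

```python
from collections import deque


def topological_order(nodes: set[str], edges: list[tuple[str, str]]) -> list[str]:
    """Return a topological order of the route graph, or raise ValueError."""
    indegree = {n: 0 for n in nodes}
    successors = {n: [] for n in nodes}
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
    if any(d > 1 for d in indegree.values()):
        # Diamond shape: some model is the output of more than one route.
        raise ValueError("a model is the output of more than one route")
    queue = deque(n for n in nodes if indegree[n] == 0)  # EMIMs and isolated nodes
    order = []
    while queue:
        node = queue.popleft()
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                queue.append(nxt)
    if len(order) != len(nodes):
        # Nodes stuck with positive in-degree lie on a cycle.
        raise ValueError("transformation graph contains a cycle")
    return order
```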
Member


It's been a while, but isn't a topological sorting via Kahn's only possible if the graph is a DAG, so failing to apply Kahn's should mean the graph has cycles?

Don't know if it also deals with the "diamond" shape problem.
How would we check for that?


1. Verify that all workflows and EMIMs (i.e., models with is_ingest = true) referenced by the routes exist in the configuration.
2. Ensure that EMIMs do not appear as the output models of any routes.
3. Validate schemas using the SchemaPack library and workflows using the metldata library.
Member


What's the level of validation that we can check for workflows with metldata, parsability?


2. Initialize the transformed map: Create a mapping from model names to data. This map will accumulate all data generated during this transformation operation. Initialize it with a single entry: the original model name mapped to the incoming original data.

3. Traverse the transformation graph: Starting from the EMIM and following the topological order, perform the following for each model:
Member


Following the topological order means skipping over entries in that order that don't match any data accumulated so far?
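That is how I would read it too; sketched here with hypothetical stand-ins (`run_workflow`, a `routes` mapping from each output model to its single inbound route, which the unique-path requirement guarantees exists):

```python
def traverse(order, routes, original_model, original_data, run_workflow):
    """Walk the topological order, skipping models not fed by this EMIM.

    routes maps output_model -> (input_model, workflow_name).
    Returns the accumulated "transformed" map of model name -> data.
    """
    transformed = {original_model: original_data}
    for model in order:
        if model == original_model:
            continue
        route = routes.get(model)
        if route is None or route[0] not in transformed:
            # No inbound route, or its input was never produced in this
            # run: the model is not reachable from this EMIM, so skip it.
            continue
        input_model, workflow = route
        transformed[model] = run_workflow(workflow, transformed[input_model])
    return transformed
```

Skipped entries never accumulate data, so anything downstream of them is skipped transitively as well.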
