Skip to content

Conversation

edan-bainglass
Copy link
Member

@edan-bainglass edan-bainglass commented Aug 24, 2025

Some parts of the ORM module do not pass a serialization round trip. This PR addresses this, but further discussion is needed regarding what should actually be (de)serialized. If interested, good to read aiidateam/AEP#40 and discussion on #6255. Some discussion regarding mismatch between an object's constructor args and its Model fields can be found here.

Good to also discuss if the use of pydantic should be extended to ORM instance creation in general, not only by way of from_serialized. This is not implemented in this PR (and is out of scope) but can be addressed in a follow-up PR if deemed "correct".

Open questions

  • There is still the matter of repository_content. The current implementation uses base64 encoding/decoding, but this is clearly not ideal for large files. Some discussion was had w.r.t switching to links to some online storage of files. TBD
  • File serialization is TBD in general, for arrays also

Updates

The PR now also introduces the following:

  • Escape hatch for Data plugins via attributes
  • A guard against unhandled attributes in Data plugins (not allowed in the backend node)
  • A dedicated InputModel - a derived view of the defined entity Model suitable for Entity creation

Note for PR reviewers

There are a few changes that are likely out of scope. These will move to dedicated PRs prior to merge of this PR.

@edan-bainglass edan-bainglass marked this pull request as draft August 24, 2025 08:03
Copy link

codecov bot commented Aug 24, 2025

Codecov Report

❌ Patch coverage is 63.12057% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 27.22%. Comparing base (cf690e4) to head (7778e78).

Files with missing lines Patch % Lines
src/aiida/orm/nodes/data/array/kpoints.py 25.65% 29 Missing ⚠️
src/aiida/orm/entities.py 44.19% 24 Missing ⚠️
src/aiida/orm/nodes/data/array/trajectory.py 62.50% 9 Missing ⚠️
src/aiida/orm/nodes/node.py 65.22% 8 Missing ⚠️
src/aiida/orm/nodes/data/data.py 30.00% 7 Missing ⚠️
src/aiida/orm/nodes/data/singlefile.py 53.34% 7 Missing ⚠️
src/aiida/cmdline/groups/dynamic.py 0.00% 4 Missing ⚠️
src/aiida/orm/nodes/data/array/array.py 66.67% 4 Missing ⚠️
src/aiida/cmdline/commands/cmd_code.py 57.15% 3 Missing ⚠️
src/aiida/common/datastructures.py 0.00% 2 Missing ⚠️
... and 5 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6990      +/-   ##
==========================================
+ Coverage   27.16%   27.22%   +0.07%     
==========================================
  Files         566      566              
  Lines       43807    43965     +158     
==========================================
+ Hits        11896    11967      +71     
- Misses      31911    31998      +87     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@edan-bainglass edan-bainglass force-pushed the fix-node-serialization branch 4 times, most recently from 9b7cfbd to d83ced6 Compare August 24, 2025 11:33
@edan-bainglass
Copy link
Member Author

@khsrali regarding one of the failed tests...

tests/orm/nodes/data/test_remote_stash.py::test_constructor_invalid_folder[stash_mode-copy] - ValueError: `RemoteStashFolderData` can only be used with `stash_mode == StashMode.COPY`

In this PR, I lift some validation to the top of the constructor (of stash data classes - see 91e4ad5), to avoid any operations if the object is bound to fail. However, this seems to introduce failures in testing them. It suggests that perhaps there is some order to the operations, though I don't see it. Can you please comment?

@edan-bainglass
Copy link
Member Author

@khsrali regarding one of the failed tests...

tests/orm/nodes/data/test_remote_stash.py::test_constructor_invalid_folder[stash_mode-copy] - ValueError: `RemoteStashFolderData` can only be used with `stash_mode == StashMode.COPY`

In this PR, I lift some validation to the top of the constructor (of stash data classes - see 91e4ad5), to avoid any operations if the object is bound to fail. However, this seems to introduce failures in testing them. It suggests that perhaps there is some order to the operations, though I don't see it. Can you please comment?

It was checking for TypeError, which is what happens when you delay the stash_mode check and provide a non-Enum type for it. But there is no need for it, as the stash_mode check covers that issue as well. I updated the test to parameterize also over error, checking for ValueError in the case of the invalid stash_mode.

@edan-bainglass
Copy link
Member Author

@GeigerJ2 okay, this is ready for others to inspect. I did my best to isolate the commits and provided comments on each. Happy to discuss further.

Pinging also @agoscinski @superstar54 if interested.

Pinging @sphuber for input/feedback, if he has time.

@edan-bainglass
Copy link
Member Author

It may be possible to rely on the post models of aiida-restapi as a reference for defining ORM constructor parameters, as the post models are intended to represent serialized objects passed to the REST API for object construction. Looking into this.

@edan-bainglass edan-bainglass force-pushed the fix-node-serialization branch from 476e144 to 7c191fa Compare October 8, 2025 15:20
@edan-bainglass
Copy link
Member Author

edan-bainglass commented Oct 8, 2025

@danielhollas what is this about?

Critical: potential `verdi` speed problem: `aiida.orm.fields` module is imported which is not in: ('aiida.brokers', 'aiida.cmdline', 'aiida.common', 'aiida.manage', 'aiida.plugins', 'aiida.restapi')

Nevermind. I see that importing orm in the cmdline raises alarm bells. Removed...

@edan-bainglass edan-bainglass force-pushed the fix-node-serialization branch from 516a6df to 2c3e5df Compare October 8, 2025 16:54
@danielhollas
Copy link
Collaborator

Nevermind. I see that importing orm in the cmdline raises alarm bells. Removed...

Nice, the system works. :-) Feel free to improve the error message here to make it more obvious that this is about the aiida.cmdline module.

@edan-bainglass edan-bainglass force-pushed the fix-node-serialization branch from d6e44ff to 0b37ef6 Compare October 8, 2025 18:17
@edan-bainglass
Copy link
Member Author

Nevermind. I see that importing orm in the cmdline raises alarm bells. Removed...

Nice, the system works. :-) Feel free to improve the error message here to make it more obvious that this is about the aiida.cmdline module.

#7059

@edan-bainglass
Copy link
Member Author

@danielhollas what do you think about ignoring Ruff N806 - "~variable should be lowercase"? See case below. Model is not an instance, but a class. Naming it model would be misleading, hence I went with Model.

Model = cls.Model.as_input_model()
^^^^^ N806

@danielhollas
Copy link
Collaborator

@danielhollas what do you think about ignoring Ruff N806 - "~variable should be lowercase"?

Do you mean ignoring locally (fine) or globally?

@edan-bainglass
Copy link
Member Author

edan-bainglass commented Oct 8, 2025

@danielhollas what do you think about ignoring Ruff N806 - "~variable should be lowercase"?

Do you mean ignoring locally (fine) or globally?

Global. There are many cases when the variable is a class, not an instance. I've pushed this in my last commit just to verify that it works. Okay with removing it in favor of local handling, but would like to hear the reason against a global N806 rule.

Update

Nice. Ignoring N806 globally raised a whole lot of RUF100 due to the codebase being littered with local N806 rules. I'd say that supports my case 🙂

Comment on lines 26 to 31
class Model(NumericType.Model):
value: bool = MetadataField(
title='Boolean value.',
description='The value of the boolean',
)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding these on all BaseType subclasses to provide validation, because value: Any in BaseType.Model is pointless.

Comment on lines +25 to +35
class UnhandledDataAttributesError(Exception):
"""Exception raised when any data attributes are not handled prior to the Data constructor."""

def __init__(self, attributes: Iterable[str], class_name: str) -> None:
bullet_list = '\n'.join(f' • {attr}' for attr in attributes)
message = (
f'\nThe following attributes must be handled in a constructor prior to the Data class:\n'
f'{bullet_list}\n\n'
f'Consider implementing a constructor in {class_name} to handle the listed attributes.'
)
super().__init__(message)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See usage below

Comment on lines +75 to +80
# We verify here that all attributes of Data plugins are handled in a constructor prior to the root
# Data class (here), gracefully rejecting them otherwise.
node_keys = set(Node.Model.model_fields.keys()) | {'backend'}
unhandled_keys = {key for key in kwargs if key not in node_keys}
if unhandled_keys:
raise UnhandledDataAttributesError(unhandled_keys, self.__class__.__name__)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This catches cases where a Data subclass defines a Model without an accompanying constructor. In these cases, unhandled attributes defined on the model would bubble up to the backend node constructor, which is strict (no kwargs allowed). This change acts as a gate at the root Data class, informing users of the new pydantic system of the requirements.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes here are not final and depend on the outcome of a discussion regarding file serialization.

Comment on lines +708 to +719
kinds: t.List[dict] = MetadataField(description='The kinds of atoms')
sites: t.List[dict] = MetadataField(description='The atomic sites')

@field_validator('kinds', mode='before')
@classmethod
def _validate_kinds(cls, value: t.List[Kind | dict[str, t.Any]]) -> t.List[t.Dict]:
return [kind.get_raw() if isinstance(kind, Kind) else kind for kind in value]

@field_validator('sites', mode='before')
@classmethod
def _validate_sites(cls, value: t.List[Site | dict[str, t.Any]]) -> t.List[t.Dict]:
return [site.get_raw() if isinstance(site, Site) else site for site in value]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is a structure without kinds and sites?

Comment on lines +279 to 290
repository_content: dict[str, bytes] = MetadataField(
default_factory=dict,
description='Dictionary of file repository content. Keys are relative filepaths and values are binary file '
'contents encoded as base64.',
is_attribute=False,
orm_to_model=lambda node, _: {
key: base64.encodebytes(content)
for key, content in node.base.repository.serialize_content().items() # type: ignore[attr-defined]
for key, content in cast('Node', node).base.repository.serialize_content().items()
},
exclude_from_cli=True,
exclude_to_orm=True,
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving this as is for now. Final implementation will depend on the upcoming discussion.

orm_to_model=lambda node, _: cast('Node', node).base.attributes.all,
is_subscriptable=True,
exclude_from_cli=True,
exclude_to_orm=True,
Copy link
Member Author

@edan-bainglass edan-bainglass Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Allowing attributes as an escape hatch for Data plugins that do not specify a model. This allows them to post via the attributes dictionary. Final decision on this will depend on a discussion (see aiidateam/AEP#40 (comment) for reference)

Comment on lines +262 to 270
computer: Optional[str] = MetadataField(
None,
description='The PK of the computer',
description='The label of the computer',
is_attribute=False,
orm_to_model=lambda node, _: node.computer.pk if node.computer else None, # type: ignore[attr-defined]
orm_class=Computer,
orm_to_model=lambda node, _: cast('Node', node).computer.label if cast('Node', node).computer else None, # type: ignore[union-attr]
model_to_orm=lambda model: cast('Node.Model', model).load_computer(),
exclude_from_cli=True,
exclude_to_orm=True,
)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is unclear if computer on the Node level represents anything different than what it does on RemoteData and InstalledCode. InstalledCode already "overwrites" this field in its model. RemoteData does not, but maybe should? Good to understand this better. Pinging @giovannipizzi

Comment on lines +324 to +333
attributes = kwargs.pop('attributes', {})

backend_entity = backend.nodes.create(
node_type=self.class_node_type, user=user.backend_entity, computer=backend_computer, **kwargs
)
super().__init__(backend_entity)

if attributes:
self.base.attributes.set_many(attributes)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The escape hatch discussed in a previous comment


@classmethod
def _from_model(cls, model: Model) -> 'Node': # type: ignore[override]
def from_model(cls, model: Model) -> 'Node': # type: ignore[override]
Copy link
Member Author

@edan-bainglass edan-bainglass Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to/from_model were originally "public" and were made "private". Restoring their "public" status here, as they are used by at least the REST API.

self._entity_type = entity_class

def __call__(self: CollectionType, backend: 'StorageBackend') -> CollectionType:
def __call__(self, backend: 'StorageBackend') -> Self:
Copy link
Member Author

@edan-bainglass edan-bainglass Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielhollas typing of Collection in this module is off. This is a minimal "fix", but the whole thing is not great. Happy to remove it from here if you think we should do it in another dedicated PR that addresses the wider issue.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've opened #7064 with this and other fixes. Feel free to push to that PR any fixes that you had in mind. Also I have pushed that branch to origin so you can rebase this PR on top of it if you want.

Comment on lines 151 to +164
order_by: Optional['OrderByType'] = None,
limit: Optional[int] = None,
offset: Optional[int] = None,
) -> List[EntityType]:
"""Find collection entries matching the filter criteria.
:param filters: the keyword value pair filters to match
:param order_by: a list of (key, direction) pairs specifying the sort order
:param limit: the maximum number of results to return
:param offset: number of initial results to be skipped
:return: a list of resulting matches
"""
query = self.query(filters=filters, order_by=order_by, limit=limit)
query = self.query(filters=filters, order_by=order_by, limit=limit, offset=offset)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding offset support to Collection.find(). Happy to move this to another PR.

pk: Optional[int] = MetadataField(
None,
description='The primary key of the entity. Can be `None` if the entity is not yet stored.',
model_config = ConfigDict(extra='forbid')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would not be a good idea to add whatever we want to a model I think, given that they are meant to represent a clear schema of an entity.

Comment on lines +200 to +208
@classmethod
def __pydantic_init_subclass__(cls, **kwargs: Any) -> None:
"""Sets the JSON schema title of the model.
The qualified name of the class is used, with dots removed. For example, `Node.Model` becomes `NodeModel`
in the JSON schema.
"""
super().__pydantic_init_subclass__(**kwargs)
cls.model_config['title'] = cls.__qualname__.replace('.', '')
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better JSON schema representation

Comment on lines +210 to +266
@classmethod
def as_input_model(cls: Type[EntityModelType]) -> Type[EntityModelType]:
"""Return a derived model class with read-only fields removed.
This also removes any serializers/validators defined on those fields.
:return: The derived input model class.
"""

# Derive the input model from the original model
new_name = cls.__qualname__.replace('.Model', 'InputModel')
InputModel = create_model( # noqa: N806
new_name,
__base__=cls,
__doc__=f'Input version of {cls.__name__}.',
)
InputModel.__qualname__ = new_name
InputModel.__module__ = cls.__module__

# Identify read-only fields
readonly_fields = [
name
for name, field in InputModel.model_fields.items()
if hasattr(field, 'json_schema_extra')
and isinstance(field.json_schema_extra, dict)
and field.json_schema_extra.get('readOnly')
]

# Remove read-only fields
for name in readonly_fields:
InputModel.model_fields.pop(name, None)
if hasattr(InputModel, name):
delattr(InputModel, name)

# Prune field validators/serializers referring to read-only fields
decorators = InputModel.__pydantic_decorators__

def _prune_field_decorators(field_decorators: dict[str, Any]) -> dict[str, Any]:
return {
name: decorator
for name, decorator in field_decorators.items()
if all(field not in readonly_fields for field in decorator.info.fields)
}

decorators.field_validators = _prune_field_decorators(decorators.field_validators)
decorators.field_serializers = _prune_field_decorators(decorators.field_serializers)

return InputModel

@classproperty
def InputModel(cls) -> Type[Model]: # noqa: N802, N805
"""Return the input version of the model class for this entity.
:return: The input model class, with read-only fields removed.
"""
return cls.Model.as_input_model()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The InputModel is a view of the Model used in entity creation via from_model.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@danielhollas it is unclear how to best type InputModel, It is derived from Model but is only partially Model, so type[Model] is not strictly correct and will lead to errors if one tries to do, for example, Entity.InputModel.pk (pk is read-only). The other extreme is to type it as BaseModel, exposing no part of Model to static analysis. I think this "might" be more correct though. Not sure.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I looked at how you are dynamicall constructing InputModel and it's no wonder that static analysis will not be great here. :-)

Without having enough context, I'd go with type[Model] for now, as this is also what mypy suggests and will not generate false positives, but would be great to add a comment about this in the code so we can adjust in the future.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, just wasn't sure how to improve. Happy to get some input on this 🙏

Comment on lines +304 to 333
def to_model(self, repository_path: Optional[pathlib.Path] = None, unstored: bool = False) -> Model:
"""Return the entity instance as an instance of its model.
:param repository_path: If the orm node has files in the repository, this path is used to read the repository
files from. If no path is specified a temporary path is created using the entities pk.
:param unstored: If True, the input version of the model is used, which strips read-only fields, i.e., fields
with `exclude_to_orm=True`.
:return: An instance of the entity's model class.
"""
fields = {}

for key, field in self.Model.model_fields.items():
Model = self.InputModel if unstored else self.Model # noqa: N806

for key, field in Model.model_fields.items():
if orm_to_model := get_metadata(field, 'orm_to_model'):
fields[key] = orm_to_model(self, repository_path)
else:
fields[key] = getattr(self, key)

return self.Model(**fields)
return Model(**fields)

@classmethod
def _from_model(cls, model: Model) -> 'Entity':
"""Return an entity instance from an instance of its model."""
def from_model(cls, model: Model) -> 'Entity':
"""Return an entity instance from an instance of its model.
:param model: An instance of the entity's model class.
:return: An instance of the entity class.
"""
fields = cls.model_to_orm_field_values(model)
return cls(**fields)
Copy link
Member Author

@edan-bainglass edan-bainglass Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As already mentioned, these are now public, as they are used also externally (e.g., REST API). The new unstored argument of to_model also of serialize) allows serialization of unstored nodes (which may have a use case beyond testing).

def serialize(
self,
repository_path: Optional[pathlib.Path] = None,
mode: Literal['json', 'python'] = 'json',
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The mode parameter removes the need to explicitly define various serializers manually. Here we default to JSON, but one can serialize including Python objects if working within a Python-exclusive system.

fields[key] = add_field(
key,
alias=get_metadata(field, 'alias', None),
alias=field.alias,
Copy link
Member Author

@edan-bainglass edan-bainglass Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Original use actually misses alias, as it is not stored in the metadata.

Comment on lines +357 to +359
def keys(self) -> list[str]:
"""Return the field keys, prefixed with 'attribute.' if field is an attribute."""
return [field.backend_key for field in self._fields.values()]
Copy link
Member Author

@edan-bainglass edan-bainglass Oct 12, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Useful, but unrelated. May move to another PR if requested.

Comment on lines +54 to +56
'db_dbuser': {
'pk': 'id',
},
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missed in #6245

Copy link
Member Author

@edan-bainglass edan-bainglass left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hope my comments help reviewers 🙏

Also some that are no longer required
@edan-bainglass edan-bainglass marked this pull request as ready for review October 12, 2025 09:16
@edan-bainglass edan-bainglass changed the title Update node serialization/deserialization Update node serialization/deserialization and other Pydantic issues Oct 12, 2025
@edan-bainglass
Copy link
Member Author

@sphuber @chrisjsewell if you have time 🙏 You've discussed this quite a bit on the referenced AEP to warrant a review of the current state. You might also want to have a look at the use case of the REST API in this draft PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants