-
Notifications
You must be signed in to change notification settings - Fork 230
Update node serialization/deserialization and other Pydantic issues #6990
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Update node serialization/deserialization and other Pydantic issues #6990
Conversation
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #6990 +/- ##
==========================================
+ Coverage 27.16% 27.22% +0.07%
==========================================
Files 566 566
Lines 43807 43965 +158
==========================================
+ Hits 11896 11967 +71
- Misses 31911 31998 +87 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
9b7cfbd
to
d83ced6
Compare
@khsrali regarding one of the failed tests...
In this PR, I lift some validation to the top of the constructor (of stash data classes - see 91e4ad5), to avoid any operations if the object is bound to fail. However, this seems to introduce failures in testing them. It suggests that perhaps there is some order to the operations, though I don't see it. Can you please comment? |
It was checking for |
@GeigerJ2 okay, this is ready for others to inspect. I did my best to isolate the commits and provided comments on each. Happy to discuss further. Pinging also @agoscinski @superstar54 if interested. Pinging @sphuber for input/feedback, if he has time. |
It may be possible to rely on the post models of aiida-restapi as a reference for defining ORM constructor parameters, as the post models are intended to represent serialized objects passed to the REST API for object construction. Looking into this. |
c8ded01
to
476e144
Compare
476e144
to
7c191fa
Compare
@danielhollas what is this about?
Nevermind. I see that importing |
516a6df
to
2c3e5df
Compare
Nice, the system works. :-) Feel free to improve the error message here to make it more obvious that this is about the |
d6e44ff
to
0b37ef6
Compare
|
@danielhollas what do you think about ignoring Ruff N806 - "~variable should be lowercase"? See case below.
|
Do you mean ignoring locally (fine) or globally? |
Global. There are many cases when the variable is a class, not an instance. I've pushed this in my last commit just to verify that it works. Okay with removing it in favor of local handling, but would like to hear the reason against a global N806 rule. UpdateNice. Ignoring N806 globally raised a whole lot of RUF100 due to the codebase being littered with local N806 rules. I'd say that supports my case 🙂 |
class Model(NumericType.Model): | ||
value: bool = MetadataField( | ||
title='Boolean value.', | ||
description='The value of the boolean', | ||
) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding these on all BaseType
subclasses to provide validation, because value: Any
in BaseType.Model
is pointless.
class UnhandledDataAttributesError(Exception): | ||
"""Exception raised when any data attributes are not handled prior to the Data constructor.""" | ||
|
||
def __init__(self, attributes: Iterable[str], class_name: str) -> None: | ||
bullet_list = '\n'.join(f' • {attr}' for attr in attributes) | ||
message = ( | ||
f'\nThe following attributes must be handled in a constructor prior to the Data class:\n' | ||
f'{bullet_list}\n\n' | ||
f'Consider implementing a constructor in {class_name} to handle the listed attributes.' | ||
) | ||
super().__init__(message) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See usage below
# We verify here that all attributes of Data plugins are handled in a constructor prior to the root | ||
# Data class (here), gracefully rejecting them otherwise. | ||
node_keys = set(Node.Model.model_fields.keys()) | {'backend'} | ||
unhandled_keys = {key for key in kwargs if key not in node_keys} | ||
if unhandled_keys: | ||
raise UnhandledDataAttributesError(unhandled_keys, self.__class__.__name__) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This catches cases where a Data
subclass defines a Model
without an accompanying constructor. In these cases, unhandled attributes defined on the model would bubble up to the backend node constructor, which is strict (no kwargs
allowed). This change acts as a gate at the root Data
class, informing users of the new pydantic system of the requirements.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changes here are not final and depend on the outcome of a discussion regarding file serialization.
kinds: t.List[dict] = MetadataField(description='The kinds of atoms') | ||
sites: t.List[dict] = MetadataField(description='The atomic sites') | ||
|
||
@field_validator('kinds', mode='before') | ||
@classmethod | ||
def _validate_kinds(cls, value: t.List[Kind | dict[str, t.Any]]) -> t.List[t.Dict]: | ||
return [kind.get_raw() if isinstance(kind, Kind) else kind for kind in value] | ||
|
||
@field_validator('sites', mode='before') | ||
@classmethod | ||
def _validate_sites(cls, value: t.List[Site | dict[str, t.Any]]) -> t.List[t.Dict]: | ||
return [site.get_raw() if isinstance(site, Site) else site for site in value] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is a structure without kinds
and sites
?
repository_content: dict[str, bytes] = MetadataField( | ||
default_factory=dict, | ||
description='Dictionary of file repository content. Keys are relative filepaths and values are binary file ' | ||
'contents encoded as base64.', | ||
is_attribute=False, | ||
orm_to_model=lambda node, _: { | ||
key: base64.encodebytes(content) | ||
for key, content in node.base.repository.serialize_content().items() # type: ignore[attr-defined] | ||
for key, content in cast('Node', node).base.repository.serialize_content().items() | ||
}, | ||
exclude_from_cli=True, | ||
exclude_to_orm=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving this as is for now. Final implementation will depend on the upcoming discussion.
orm_to_model=lambda node, _: cast('Node', node).base.attributes.all, | ||
is_subscriptable=True, | ||
exclude_from_cli=True, | ||
exclude_to_orm=True, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Allowing attributes
as an escape hatch for Data
plugins that do not specify a model. This allows them to post via the attributes
dictionary. Final decision on this will depend on a discussion (see aiidateam/AEP#40 (comment) for reference)
computer: Optional[str] = MetadataField( | ||
None, | ||
description='The PK of the computer', | ||
description='The label of the computer', | ||
is_attribute=False, | ||
orm_to_model=lambda node, _: node.computer.pk if node.computer else None, # type: ignore[attr-defined] | ||
orm_class=Computer, | ||
orm_to_model=lambda node, _: cast('Node', node).computer.label if cast('Node', node).computer else None, # type: ignore[union-attr] | ||
model_to_orm=lambda model: cast('Node.Model', model).load_computer(), | ||
exclude_from_cli=True, | ||
exclude_to_orm=True, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is unclear if computer
on the Node
level represents anything different than what it does on RemoteData
and InstalledCode
. InstalledCode
already "overwrites" this field in its model. RemoteData
does not, but maybe should? Good to understand this better. Pinging @giovannipizzi
attributes = kwargs.pop('attributes', {}) | ||
|
||
backend_entity = backend.nodes.create( | ||
node_type=self.class_node_type, user=user.backend_entity, computer=backend_computer, **kwargs | ||
) | ||
super().__init__(backend_entity) | ||
|
||
if attributes: | ||
self.base.attributes.set_many(attributes) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The escape hatch discussed in a previous comment
|
||
@classmethod | ||
def _from_model(cls, model: Model) -> 'Node': # type: ignore[override] | ||
def from_model(cls, model: Model) -> 'Node': # type: ignore[override] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to/from_model
were originally "public" and were made "private". Restoring their "public" status here, as they are used by at least the REST API.
self._entity_type = entity_class | ||
|
||
def __call__(self: CollectionType, backend: 'StorageBackend') -> CollectionType: | ||
def __call__(self, backend: 'StorageBackend') -> Self: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielhollas typing of Collection
in this module is off. This is a minimal "fix", but the whole thing is not great. Happy to remove it from here if you think we should do it in another dedicated PR that addresses the wider issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've opened #7064 with this and other fixes. Feel free to push to that PR any fixes that you had in mind. Also I have pushed that branch to origin so you can rebase this PR on top of it if you want.
order_by: Optional['OrderByType'] = None, | ||
limit: Optional[int] = None, | ||
offset: Optional[int] = None, | ||
) -> List[EntityType]: | ||
"""Find collection entries matching the filter criteria. | ||
:param filters: the keyword value pair filters to match | ||
:param order_by: a list of (key, direction) pairs specifying the sort order | ||
:param limit: the maximum number of results to return | ||
:param offset: number of initial results to be skipped | ||
:return: a list of resulting matches | ||
""" | ||
query = self.query(filters=filters, order_by=order_by, limit=limit) | ||
query = self.query(filters=filters, order_by=order_by, limit=limit, offset=offset) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding offset
support to Collection.find()
. Happy to move this to another PR.
pk: Optional[int] = MetadataField( | ||
None, | ||
description='The primary key of the entity. Can be `None` if the entity is not yet stored.', | ||
model_config = ConfigDict(extra='forbid') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would not be a good idea to add whatever we want to a model I think, given that they are meant to represent a clear schema of an entity.
@classmethod | ||
def __pydantic_init_subclass__(cls, **kwargs: Any) -> None: | ||
"""Sets the JSON schema title of the model. | ||
The qualified name of the class is used, with dots removed. For example, `Node.Model` becomes `NodeModel` | ||
in the JSON schema. | ||
""" | ||
super().__pydantic_init_subclass__(**kwargs) | ||
cls.model_config['title'] = cls.__qualname__.replace('.', '') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better JSON schema representation
@classmethod | ||
def as_input_model(cls: Type[EntityModelType]) -> Type[EntityModelType]: | ||
"""Return a derived model class with read-only fields removed. | ||
This also removes any serializers/validators defined on those fields. | ||
:return: The derived input model class. | ||
""" | ||
|
||
# Derive the input model from the original model | ||
new_name = cls.__qualname__.replace('.Model', 'InputModel') | ||
InputModel = create_model( # noqa: N806 | ||
new_name, | ||
__base__=cls, | ||
__doc__=f'Input version of {cls.__name__}.', | ||
) | ||
InputModel.__qualname__ = new_name | ||
InputModel.__module__ = cls.__module__ | ||
|
||
# Identify read-only fields | ||
readonly_fields = [ | ||
name | ||
for name, field in InputModel.model_fields.items() | ||
if hasattr(field, 'json_schema_extra') | ||
and isinstance(field.json_schema_extra, dict) | ||
and field.json_schema_extra.get('readOnly') | ||
] | ||
|
||
# Remove read-only fields | ||
for name in readonly_fields: | ||
InputModel.model_fields.pop(name, None) | ||
if hasattr(InputModel, name): | ||
delattr(InputModel, name) | ||
|
||
# Prune field validators/serializers referring to read-only fields | ||
decorators = InputModel.__pydantic_decorators__ | ||
|
||
def _prune_field_decorators(field_decorators: dict[str, Any]) -> dict[str, Any]: | ||
return { | ||
name: decorator | ||
for name, decorator in field_decorators.items() | ||
if all(field not in readonly_fields for field in decorator.info.fields) | ||
} | ||
|
||
decorators.field_validators = _prune_field_decorators(decorators.field_validators) | ||
decorators.field_serializers = _prune_field_decorators(decorators.field_serializers) | ||
|
||
return InputModel | ||
|
||
@classproperty | ||
def InputModel(cls) -> Type[Model]: # noqa: N802, N805 | ||
"""Return the input version of the model class for this entity. | ||
:return: The input model class, with read-only fields removed. | ||
""" | ||
return cls.Model.as_input_model() | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The InputModel
is a view of the Model
used in entity creation via from_model
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@danielhollas it is unclear how to best type InputModel
, It is derived from Model
but is only partially Model
, so type[Model]
is not strictly correct and will lead to errors if one tries to do, for example, Entity.InputModel.pk
(pk
is read-only). The other extreme is to type it as BaseModel
, exposing no part of Model
to static analysis. I think this "might" be more correct though. Not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I looked at how you are dynamicall constructing InputModel and it's no wonder that static analysis will not be great here. :-)
Without having enough context, I'd go with type[Model]
for now, as this is also what mypy
suggests and will not generate false positives, but would be great to add a comment about this in the code so we can adjust in the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, just wasn't sure how to improve. Happy to get some input on this 🙏
def to_model(self, repository_path: Optional[pathlib.Path] = None, unstored: bool = False) -> Model: | ||
"""Return the entity instance as an instance of its model. | ||
:param repository_path: If the orm node has files in the repository, this path is used to read the repository | ||
files from. If no path is specified a temporary path is created using the entities pk. | ||
:param unstored: If True, the input version of the model is used, which strips read-only fields, i.e., fields | ||
with `exclude_to_orm=True`. | ||
:return: An instance of the entity's model class. | ||
""" | ||
fields = {} | ||
|
||
for key, field in self.Model.model_fields.items(): | ||
Model = self.InputModel if unstored else self.Model # noqa: N806 | ||
|
||
for key, field in Model.model_fields.items(): | ||
if orm_to_model := get_metadata(field, 'orm_to_model'): | ||
fields[key] = orm_to_model(self, repository_path) | ||
else: | ||
fields[key] = getattr(self, key) | ||
|
||
return self.Model(**fields) | ||
return Model(**fields) | ||
|
||
@classmethod | ||
def _from_model(cls, model: Model) -> 'Entity': | ||
"""Return an entity instance from an instance of its model.""" | ||
def from_model(cls, model: Model) -> 'Entity': | ||
"""Return an entity instance from an instance of its model. | ||
:param model: An instance of the entity's model class. | ||
:return: An instance of the entity class. | ||
""" | ||
fields = cls.model_to_orm_field_values(model) | ||
return cls(**fields) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As already mentioned, these are now public, as they are used also externally (e.g., REST API). The new unstored
argument of to_model
also of serialize
) allows serialization of unstored nodes (which may have a use case beyond testing).
def serialize( | ||
self, | ||
repository_path: Optional[pathlib.Path] = None, | ||
mode: Literal['json', 'python'] = 'json', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mode
parameter removes the need to explicitly define various serializers manually. Here we default to JSON, but one can serialize including Python objects if working within a Python-exclusive system.
fields[key] = add_field( | ||
key, | ||
alias=get_metadata(field, 'alias', None), | ||
alias=field.alias, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Original use actually misses alias
, as it is not stored in the metadata.
def keys(self) -> list[str]: | ||
"""Return the field keys, prefixed with 'attribute.' if field is an attribute.""" | ||
return [field.backend_key for field in self._fields.values()] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Useful, but unrelated. May move to another PR if requested.
'db_dbuser': { | ||
'pk': 'id', | ||
}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Missed in #6245
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope my comments help reviewers 🙏
Also some that are no longer required
@sphuber @chrisjsewell if you have time 🙏 You've discussed this quite a bit on the referenced AEP to warrant a review of the current state. You might also want to have a look at the use case of the REST API in this draft PR. |
Some parts of the ORM module do not pass a serialization round trip. This PR addresses this, but further discussion is needed regarding what should actually be (de)serialized. If interested, good to read aiidateam/AEP#40 and discussion on #6255. Some discussion regarding mismatch between an object's constructor args and its Model fields can be found here.
Good to also discuss if the use of pydantic should be extended to ORM instance creation in general, not only by way of
from_serialized
. This is not implemented in this PR (and is out of scope) but can be addressed in a follow-up PR if deemed "correct".Open questions
repository_content
. The current implementation usesbase64
encoding/decoding, but this is clearly not ideal for large files. Some discussion was had w.r.t switching to links to some online storage of files. TBDUpdates
The PR now also introduces the following:
Data
plugins viaattributes
attributes
inData
plugins (not allowed in the backend node)InputModel
- a derived view of the defined entityModel
suitable forEntity
creationNote for PR reviewers
There are a few changes that are likely out of scope. These will move to dedicated PRs prior to merge of this PR.