-
Notifications
You must be signed in to change notification settings - Fork 1.3k
Use model class names as tags in format_as_xml
and add option to include field titles and descriptions as attributes
#2313
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…sted fields attributes
# Conflicts: # pydantic_ai_slim/pydantic_ai/format_prompt.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giacbrd Thanks Giacomo, it's a nice feature!
tests/test_format_as_xml.py
Outdated
<location title="Location">null</location> | ||
</ExamplePydanticFields> | ||
<ExamplePydanticFields> | ||
<name description="The person's name">Alice</name> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As you suggested in the description, I'd really like to include these attributes only the first time the field is seen, so we don't unnecessarily flood the LLM context.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK I have added a parameter: I would like to leave the option of adding attributes at each object occurrence. I imagine cases where I have a complex object A, with many fields and deep structure. In this deep structure an object B can occur in “distant” spots. For an LLM could be tricky to recognize the semantic of the object B at every occurrence, given it would be described only at the first occurrence (where "distance" is in terms of tokens)
@DouweM thanks for the review, I am currently on vacation, I will reply to your comments, and make the changes, next week |
# before serializing the model and losing all the metadata of other data structures contained in it, | ||
# we extract all the fields info and class names | ||
self._init_fields_info() | ||
self._init_element_names() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These 2 calls end up calling _parse_data_structures
twice, could we do it just once?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Combined with my suggestion to always initialize _fields
and _element_names
as empty dicts, I think we can call self._parse_data_structures(self.data)
when we see a BaseModel
or dataclass
and handle which (or both) of the two to populate in there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have committed a solution for calling _parse_data_structures
once. Before, I initialized these data structures with None for treating them as singletons, they must be created once. After they are populated they could be empty dictionaries. There are cases where not having a value that means "no initialization" could be tricky. E.g., a long list of models where fields have not attributes filled. We would call _parse_data_structures
for each model and _fields
would always remain an empty dictionary.
Now I use a flag _is_info_extracted
so I make sure _parse_data_structures
is called once and for all. We now call it for fields info even if we only have dataclasses, so no attributes to extract from any Pydantic Field. I have relaxed these checks because I expect to extract also dataclasses' field metadata in future developments.
The solution of an explicit method for the logics of initialization, even if trivial, looks clear to me. Moreover, ruff would complain of the code complexity if I keep these logics in _to_xml
or in _parse_data_structures
.
# a map of Pydantic Field paths to their metadata: a field unique string representation and its class | ||
_fields: dict[str, tuple[str, FieldInfo | ComputedFieldInfo]] | None = None | ||
# keep track of fields we have extracted attributes from | ||
_parsed_fields: set[str] = field(default_factory=set) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This more like included_fields
right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed
for k, v in value.items(): # pyright: ignore[reportUnknownVariableType] | ||
cls._parse_data_structures(v, element_names, fields_map, f'{path}.{k}' if path else f'{k}') | ||
elif is_dataclass(value) and not isinstance(value, type): | ||
if element_names is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we give self._element_names
a default value of {}
and always wriet directly into that instead of checking for None
and passing element_names
around as an arg?
Same for fields_map
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see comment below
item_el = self.to_xml(item, None) | ||
element.append(item_el) | ||
for n, item in enumerate(value): # pyright: ignore[reportUnknownVariableType,reportUnknownArgumentType] | ||
element.append(self._to_xml(item, None, f'{path}.[{n}]' if path else f'[{n}]')) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since _to_xml
tag
can be None
, can we make that a default value so we can skip passing None
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
def to_xml(self, tag: str | None) -> ElementTree.Element: | ||
return self._to_xml(self.data, tag) | ||
|
||
def _to_xml(self, value: Any, tag: str | None, path: str = '') -> ElementTree.Element: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If path
should only be omitted for the root node, I think we should make it required and pass ''
explicitly there
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@giacbrd Sorry for the delay in reviewing, thanks for the changes, we're almost there!
Co-authored-by: Douwe Maan <[email protected]>
Co-authored-by: Douwe Maan <[email protected]>
Co-authored-by: Douwe Maan <[email protected]>
format_as_xml
and add option to include field titles and descriptions as attributes
@giacbrd Thanks a lot Giacomo! |
@DouweM you're welcome! |
The current helper
format_as_xml
allows to transform any Python object into a XML string, which is a preferable format for ingesting structured data into LLMs.This PR adds an optional parameter to this helper for exploiting Pydantic Field metadata: attributes like
title
,description
oralias
. These can be serialized in the XML as element attributes.This is an easy approach for the developer in order to help the LLM to understand the structured data fields, beyond their names.
Basic example:
person
becomesFuture developments could be: