add XML attributes when formatting Pydantic models in prompts #2313

Open
wants to merge 20 commits into
base: main

Conversation

Contributor

giacbrd commented Jul 25, 2025

The current helper format_as_xml transforms any Python object into an XML string, a format that is often preferable for feeding structured data to LLMs.

This PR adds an optional parameter to this helper that exploits Pydantic Field metadata: attributes such as title, description, or alias. These can be serialized in the XML as element attributes.

This gives developers an easy way to help the LLM understand the structured data fields beyond their names.

Basic example:

from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description='Years', title='Age', default=18)

person = Person(name="John", age=42)

person becomes

<name description="The person's name">John</name>
<age title="Age" description="Years">42</age>
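
As a rough illustration of the idea (not the code in this PR), the mapping from Field metadata to XML attributes could be sketched as follows. It assumes Pydantic v2, and the names `FIELD_ATTRIBUTES` and `model_to_xml_lines` are hypothetical, chosen only for this example.

```python
from xml.etree import ElementTree

from pydantic import BaseModel, Field

# Hypothetical: which FieldInfo attributes get copied onto the XML element.
FIELD_ATTRIBUTES = ('title', 'description')


class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description='Years', title='Age', default=18)


def model_to_xml_lines(model: BaseModel) -> list[str]:
    """Render each top-level field as one XML element with metadata attributes."""
    lines: list[str] = []
    # model_fields is read from the class, not the instance.
    for field_name, info in type(model).model_fields.items():
        element = ElementTree.Element(field_name)
        for attr in FIELD_ATTRIBUTES:
            value = getattr(info, attr, None)
            if value is not None:
                element.set(attr, str(value))
        element.text = str(getattr(model, field_name))
        lines.append(ElementTree.tostring(element, encoding='unicode'))
    return lines


for line in model_to_xml_lines(Person(name='John', age=42)):
    print(line)
```

This sketch handles only flat models; nested models, lists, and aliases are left out.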

Future developments could be:

  • Setting attributes only down to a specific level in nested objects or avoiding repeating attributes in lists of objects.
  • Creating a general natural-language description of a model based on its definition (e.g. "A person is made of a name and ...")

Collaborator

@DouweM DouweM left a comment

@giacbrd Thanks Giacomo, it's a nice feature!

<location title="Location">null</location>
</ExamplePydanticFields>
<ExamplePydanticFields>
<name description="The person's name">Alice</name>
Collaborator

As you suggested in the description, I'd really like to include these attributes only the first time the field is seen, so we don't unnecessarily flood the LLM context.

Contributor Author

OK, I have added a parameter: I would like to keep the option of adding attributes at each object occurrence. I imagine cases where I have a complex object A, with many fields and a deep structure. In this deep structure, an object B can occur in "distant" spots (where "distance" is measured in tokens). It could be tricky for an LLM to recognize the semantics of object B at every occurrence if it is described only at the first one.
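
The trade-off discussed here can be sketched with the standard library alone. This is a minimal illustration, not the PR's implementation: `DESCRIPTIONS` stands in for Field metadata, and `items_to_xml` and its `repeat_attributes` parameter are hypothetical names.

```python
from xml.etree import ElementTree

# Stand-in for the metadata that would come from Pydantic fields.
DESCRIPTIONS = {'name': "The person's name", 'age': 'Years'}


def items_to_xml(items: list[dict], item_tag: str = 'item', repeat_attributes: bool = False) -> str:
    """Serialize a list of dicts, describing each field once or at every occurrence."""
    root = ElementTree.Element('items')
    seen: set[str] = set()  # field names that already received their attributes
    for item in items:
        item_el = ElementTree.SubElement(root, item_tag)
        for key, value in item.items():
            child = ElementTree.SubElement(item_el, key)
            child.text = str(value)
            if key in DESCRIPTIONS and (repeat_attributes or key not in seen):
                child.set('description', DESCRIPTIONS[key])
                seen.add(key)
    return ElementTree.tostring(root, encoding='unicode')


people = [{'name': 'John', 'age': 42}, {'name': 'Alice', 'age': 30}]
print(items_to_xml(people))                          # attributes on the first item only
print(items_to_xml(people, repeat_attributes=True))  # attributes on every item
```

With the default, the context stays small; with `repeat_attributes=True`, every occurrence is self-describing at the cost of more tokens.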

Contributor Author

giacbrd commented Aug 4, 2025

@DouweM thanks for the review. I am currently on vacation; I will reply to your comments and make the changes next week.

# before serializing the model and losing all the metadata of other data structures contained in it,
# we extract all the fields info and class names
self._init_fields_info()
self._init_element_names()
Collaborator

These 2 calls end up calling _parse_data_structures twice. Could we do it just once?

Collaborator

Combined with my suggestion to always initialize _fields and _element_names as empty dicts, I think we can call self._parse_data_structures(self.data) when we see a BaseModel or dataclass and handle which (or both) of the two to populate in there
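
One possible shape for this suggestion, sketched with standard-library dataclasses only: both maps start as empty dicts and a single traversal fills them. The class `StructureIndex` and its attribute names are hypothetical, and only `field(metadata=...)` is read here; Pydantic FieldInfo handling is left out.

```python
from dataclasses import dataclass, field, fields, is_dataclass
from typing import Any


@dataclass
class StructureIndex:
    # Both maps default to empty dicts, so no None checks are needed.
    element_names: dict[str, str] = field(default_factory=dict)
    field_metadata: dict[str, dict[str, Any]] = field(default_factory=dict)

    def parse(self, value: Any, path: str = '') -> None:
        """Walk the data once, recording class names and field metadata by path."""
        if isinstance(value, dict):
            for k, v in value.items():
                self.parse(v, f'{path}.{k}' if path else f'{k}')
        elif isinstance(value, (list, tuple)):
            for n, item in enumerate(value):
                self.parse(item, f'{path}.[{n}]' if path else f'[{n}]')
        elif is_dataclass(value) and not isinstance(value, type):
            self.element_names[path] = type(value).__name__
            for f in fields(value):
                child_path = f'{path}.{f.name}' if path else f.name
                if f.metadata:
                    self.field_metadata[child_path] = dict(f.metadata)
                self.parse(getattr(value, f.name), child_path)
```

Writing directly into the instance dicts removes the need to pass the maps around as arguments.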

return self._to_xml(self.data, tag)

def _to_xml(self, value: Any, tag: str | None, path: str = '') -> ElementTree.Element:
    element = self._create_element(self.item_tag if tag is None else tag, path)
Collaborator

We create a new element in some cases below, can we change this to only build the element we're actually going to use?

# a map of Pydantic Field paths to their metadata: a field unique string representation and its class
_fields: dict[str, tuple[str, FieldInfo | ComputedFieldInfo]] | None = None
# keep track of fields we have extracted attributes from
_parsed_fields: set[str] = field(default_factory=set)
Collaborator

This is more like included_fields, right?

    for k, v in value.items():  # pyright: ignore[reportUnknownVariableType]
        cls._parse_data_structures(v, element_names, fields_map, f'{path}.{k}' if path else f'{k}')
elif is_dataclass(value) and not isinstance(value, type):
    if element_names is not None:
Collaborator

Could we give self._element_names a default value of {} and always write directly into that instead of checking for None and passing element_names around as an arg?

Same for fields_map

item_el = self.to_xml(item, None)
element.append(item_el)
for n, item in enumerate(value):  # pyright: ignore[reportUnknownVariableType,reportUnknownArgumentType]
    element.append(self._to_xml(item, None, f'{path}.[{n}]' if path else f'[{n}]'))
Collaborator

Since the tag argument of _to_xml can be None, can we make that the default value so we can skip passing None here?

Comment on lines +203 to +207
for attr in cls._FIELD_ATTRIBUTES:
    attr_value = getattr(info, attr, None)
    if attr_value is not None:
        attributes[attr] = str(attr_value)
return attributes
Collaborator

This can be a one-liner, so we may not need a method:

Suggested change
- for attr in cls._FIELD_ATTRIBUTES:
-     attr_value = getattr(info, attr, None)
-     if attr_value is not None:
-         attributes[attr] = str(attr_value)
- return attributes
+ return {
+     attr: str(value)
+     for attr in cls._FIELD_ATTRIBUTES
+     if (value := getattr(info, attr, None)) is not None
+ }
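
For readers who want to try the suggested comprehension in isolation, here is a small standalone sketch. The contents of `_FIELD_ATTRIBUTES` are an assumption based on the PR description, and `SimpleNamespace` is only a stand-in for a Pydantic FieldInfo object.

```python
from types import SimpleNamespace

# Assumed attribute list; the PR description mentions title, description and alias.
_FIELD_ATTRIBUTES = ('title', 'description', 'alias')


def field_attributes(info: object) -> dict[str, str]:
    """Collect the non-None metadata attributes of a field as strings."""
    return {
        attr: str(value)
        for attr in _FIELD_ATTRIBUTES
        if (value := getattr(info, attr, None)) is not None
    }


# SimpleNamespace stands in for a FieldInfo; alias is unset, so it is skipped.
info = SimpleNamespace(title='Age', description='Years', alias=None)
print(field_attributes(info))  # {'title': 'Age', 'description': 'Years'}
```

The walrus operator (Python 3.8+) lets the comprehension read each attribute once and filter on the same value.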

def to_xml(self, tag: str | None) -> ElementTree.Element:
    return self._to_xml(self.data, tag)

def _to_xml(self, value: Any, tag: str | None, path: str = '') -> ElementTree.Element:
Collaborator

If path should only be omitted for the root node, I think we should make it required and pass '' explicitly there

Comment on lines +184 to +185
cls._parse_data_structures(v, element_names, fields_map, f'{path}.{k}' if path else f'{k}')
elif isinstance(value, BaseModel):
Collaborator

Dataclass fields can also have descriptions, via field(metadata=) or Pydantic Field. See also https://docs.pydantic.dev/latest/concepts/dataclasses/. Any chance we can pull those out as well?

We may want to use TypeAdapter (as documented there) and use its JSON schema to get the values, as that already handles both dataclasses and BaseModels. Or, if we don't use it directly, we could see how it does it and whether we can reuse the same methods.

Collaborator

It may not be the worst idea to use a TypeAdapter anyway, create JSON and JSON schema, and then use those to build the XML, so we don't have to handle dataclasses and BaseModels ourselves at all. That may be complicated with $refs and $defs though...
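
A minimal sketch of the TypeAdapter idea, assuming Pydantic v2: the JSON schema exposes field descriptions the same way for a BaseModel and for a Pydantic dataclass. The helper `field_descriptions` and the two example classes are hypothetical, and the sketch reads only top-level properties, so nested types that appear as $ref entries pointing into $defs are not resolved.

```python
from pydantic import BaseModel, Field, TypeAdapter
from pydantic.dataclasses import dataclass


class PersonModel(BaseModel):
    name: str = Field(description="The person's name")


@dataclass
class PersonDataclass:
    name: str = Field(description="The person's name")


def field_descriptions(tp: type) -> dict[str, str]:
    """Read top-level field descriptions from the type's JSON schema."""
    schema = TypeAdapter(tp).json_schema()
    return {
        name: prop['description']
        for name, prop in schema.get('properties', {}).items()
        if 'description' in prop
    }


print(field_descriptions(PersonModel))
print(field_descriptions(PersonDataclass))
```

Going through the schema avoids separate code paths for the two kinds of classes, at the cost of having to follow references for nested structures.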

2 participants