Use model class names as tags in `format_as_xml` and add option to include field titles and descriptions as attributes #2313


Merged

DouweM merged 38 commits into pydantic:main from giacbrd:feature/xml-attributes

Sep 19, 2025

Contributor

giacbrd commented Jul 25, 2025

The current helper format_as_xml transforms any Python object into an XML string, which is a preferable format for feeding structured data into LLMs.

This PR adds an optional parameter to this helper for exploiting Pydantic Field metadata: attributes like title, description or alias. These can be serialized in the XML as element attributes.

This gives developers an easy way to help the LLM understand what the structured data fields mean, beyond their names.

Basic example:

from pydantic import BaseModel, Field

class Person(BaseModel):
    name: str = Field(description="The person's name")
    age: int = Field(description='Years', title='Age', default=18)

person = Person(name="John", age=42)

person becomes

<name description="The person's name">John</name>
<age title="Age" description="Years">42</age>
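
The idea can be sketched with the standard library alone. This is a minimal illustration, not the code from this PR: `to_xml_elements` and `ATTRIBUTE_KEYS` are hypothetical names, and the sketch reads `title` and `description` from dataclass field `metadata` instead of Pydantic `Field`, so that it runs without third-party packages.

```python
from dataclasses import dataclass, field, fields
from xml.etree import ElementTree

# Metadata keys copied onto the XML element (an assumption for this sketch).
ATTRIBUTE_KEYS = ('title', 'description')


@dataclass
class Person:
    name: str = field(metadata={'description': "The person's name"})
    age: int = field(default=18, metadata={'title': 'Age', 'description': 'Years'})


def to_xml_elements(obj) -> list[ElementTree.Element]:
    """Hypothetical helper: one element per dataclass field, metadata as attributes."""
    elements = []
    for f in fields(obj):
        el = ElementTree.Element(f.name)
        for key in ATTRIBUTE_KEYS:
            if key in f.metadata:
                el.set(key, str(f.metadata[key]))
        el.text = str(getattr(obj, f.name))
        elements.append(el)
    return elements


person = Person(name='John', age=42)
xml_lines = [ElementTree.tostring(el, encoding='unicode') for el in to_xml_elements(person)]
print('\n'.join(xml_lines))
```

Fields without metadata simply produce elements without attributes, so the output degrades to the plain serialization.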

Future developments could be:

Setting attributes only down to a specific level in nested objects or avoiding repeating attributes in lists of objects.
Creating a general natural language description of a model based on its definition (e.g. "A person data is made of a name and ...")

giacbrd and others added 15 commits

July 17, 2025 23:45


          add fields attributes in xml serialization

a9b5e5f


          more tests for xml serialization of prompts and partially managing nested fields attributes

c560f3c


          Merge branch 'main' into feature/xml-attributes

bab5922

# Conflicts:
#	pydantic_ai_slim/pydantic_ai/format_prompt.py


          tests fix

f99ff5d


          parse data structures for correct attributes and element names settings

3f2c5dc


          nested xml formatting tests

051aa93


          fixing deprecation warnings

490afa4


          fixing Self import

7e1ce2f


          fixing annotations for older python

ab47813


          Merge branch 'main' into feature/xml-attributes

ad43c00


          fix coverage for xml parse

f6b0cb8


          Merge branch 'main' into feature/xml-attributes

e1cbf5f


          Merge branch 'main' into feature/xml-attributes

9e6f376


          test more xml without attributes

d9a73c8


          Merge branch 'main' into feature/xml-attributes

f223496

DouweM requested changes

Collaborator

DouweM left a comment

@giacbrd Thanks Giacomo, it's a nice feature!

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

tests/test_format_as_xml.py Outdated

    
  <location title="Location">null</location>
</ExamplePydanticFields>
<ExamplePydanticFields>
  <name description="The person's name">Alice</name>

Collaborator

DouweM Jul 29, 2025

As you suggested in the description, I'd really like to include these attributes only the first time the field is seen, so we don't unnecessarily flood the LLM context.

Contributor Author

giacbrd Aug 12, 2025

OK, I have added a parameter: I would like to keep the option of adding attributes at every object occurrence. Imagine a complex object A with many fields and a deep structure, in which an object B occurs in "distant" spots (where "distance" is measured in tokens). It could be tricky for an LLM to recognize the semantics of object B at every occurrence if it is described only at the first one.
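
The trade-off in this exchange can be sketched as follows. This is an illustrative stand-in, not the PR's implementation: `render_items`, its `repeat_attributes` flag, and the `DESCRIPTIONS` dict are hypothetical names chosen for the example.

```python
from xml.etree import ElementTree

# Hypothetical field descriptions, keyed by field name.
DESCRIPTIONS = {'name': "The person's name", 'age': 'Years'}


def render_items(rows: list[dict], repeat_attributes: bool) -> str:
    """Serialize rows to XML, attaching descriptions once or on every occurrence."""
    root = ElementTree.Element('people')
    included: set[str] = set()  # fields whose attributes were already emitted
    for row in rows:
        item = ElementTree.SubElement(root, 'person')
        for key, value in row.items():
            el = ElementTree.SubElement(item, key)
            el.text = str(value)
            if key in DESCRIPTIONS and (repeat_attributes or key not in included):
                el.set('description', DESCRIPTIONS[key])
                included.add(key)
    return ElementTree.tostring(root, encoding='unicode')


rows = [{'name': 'Alice', 'age': 30}, {'name': 'Bob', 'age': 25}]
once = render_items(rows, repeat_attributes=False)
always = render_items(rows, repeat_attributes=True)
print(len(once), len(always))
```

Emitting attributes once keeps the prompt shorter for long lists, while repeating them keeps each occurrence self-describing, which is the case the author wants to preserve.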

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

pydantic_ai_slim/pydantic_ai/format_prompt.py (resolved)

DouweM self-assigned this

DouweM added the awaiting author revision label

Contributor Author

giacbrd commented Aug 4, 2025

@DouweM thanks for the review. I am currently on vacation; I will reply to your comments and make the changes next week.

giacbrd added 5 commits

August 12, 2025 12:40


          on format xml: parameter name and cleaned code

ba5c034


          parameter for avoiding repeating attributes serialization in xml format

595234f


          minor fix on create element in xml format

07d737c


          minor fix on create element in xml format

1d3473a


          xml format methods refactoring

01d3ffd

DouweM requested changes

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated

    
# before serializing the model and losing all the metadata of other data structures contained in it,
# we extract all the fields info and class names
self._init_fields_info()
self._init_element_names()

Collaborator

DouweM Aug 13, 2025

These 2 calls end up calling _parse_data_structures twice, could we do it just once?

Collaborator

DouweM Aug 13, 2025

Combined with my suggestion to always initialize _fields and _element_names as empty dicts, I think we can call self._parse_data_structures(self.data) when we see a BaseModel or dataclass and handle which (or both) of the two to populate in there

Contributor Author

giacbrd Aug 19, 2025

I have committed a solution that calls _parse_data_structures once. Previously, I initialized these data structures with None to treat them as singletons that must be created once; after being populated they could still be empty dictionaries. Without a value that means "not initialized", some cases get tricky. E.g., for a long list of models whose fields have no attributes set, we would call _parse_data_structures for each model and _fields would always remain an empty dictionary.

Now I use a flag _is_info_extracted to make sure _parse_data_structures is called once and for all. We now call it for fields info even if we only have dataclasses, so there are no attributes to extract from any Pydantic Field. I have relaxed these checks because I expect to also extract dataclasses' field metadata in future developments.

An explicit method for the initialization logic, even if trivial, looks clear to me. Moreover, ruff would complain about code complexity if I kept this logic in _to_xml or in _parse_data_structures.
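
The reasoning about the flag can be illustrated with a small sketch. It is not the PR's code: `XmlInfoCache`, `ensure_extracted`, and `parse_calls` are hypothetical names, and `_parse` is a placeholder for the real traversal. The point is only that an explicit flag records that parsing happened even when the resulting dictionary is legitimately empty.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class XmlInfoCache:
    """Hypothetical stand-in for the formatter's lazily built lookup tables."""

    data: Any
    fields_info: dict[str, str] = field(default_factory=dict)
    is_info_extracted: bool = False
    parse_calls: int = 0  # only here to make the behaviour observable

    def ensure_extracted(self) -> None:
        # The flag, not the emptiness of the dict, records that parsing happened.
        if self.is_info_extracted:
            return
        self._parse(self.data)
        self.is_info_extracted = True

    def _parse(self, data: Any) -> None:
        self.parse_calls += 1
        # A real implementation would walk `data`; here nothing carries
        # metadata, so `fields_info` stays empty.


cache = XmlInfoCache(data=[{'x': 1}] * 1000)
for _ in cache.data:
    cache.ensure_extracted()
print(cache.parse_calls, cache.fields_info)
```

Checking `if not self.fields_info` instead of the flag would re-run the parse for every item in this example, which is the case described above.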

pydantic_ai_slim/pydantic_ai/format_prompt.py (resolved)

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated

    
# a map of Pydantic Field paths to their metadata: a field unique string representation and its class
_fields: dict[str, tuple[str, FieldInfo | ComputedFieldInfo]] | None = None
# keep track of fields we have extracted attributes from
_parsed_fields: set[str] = field(default_factory=set)

Collaborator

DouweM Aug 13, 2025

This is more like included_fields, right?

Contributor Author

giacbrd Aug 19, 2025

changed

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated

    
    for k, v in value.items():  # pyright: ignore[reportUnknownVariableType]
        cls._parse_data_structures(v, element_names, fields_map, f'{path}.{k}' if path else f'{k}')
elif is_dataclass(value) and not isinstance(value, type):
    if element_names is not None:

Collaborator

DouweM Aug 13, 2025

Could we give self._element_names a default value of {} and always write directly into that instead of checking for None and passing element_names around as an arg?

Same for fields_map

Contributor Author

giacbrd Aug 19, 2025

see comment below

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated

    
    item_el = self.to_xml(item, None)
    element.append(item_el)
for n, item in enumerate(value):  # pyright: ignore[reportUnknownVariableType,reportUnknownArgumentType]
    element.append(self._to_xml(item, None, f'{path}.[{n}]' if path else f'[{n}]'))

Collaborator

DouweM Aug 13, 2025

Since _to_xml tag can be None, can we make that a default value so we can skip passing None here?

Contributor Author

giacbrd Aug 19, 2025

done

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated

    
def to_xml(self, tag: str | None) -> ElementTree.Element:
    return self._to_xml(self.data, tag)
def _to_xml(self, value: Any, tag: str | None, path: str = '') -> ElementTree.Element:

Collaborator

DouweM Aug 13, 2025

If path should only be omitted for the root node, I think we should make it required and pass '' explicitly there

Contributor Author

giacbrd Aug 19, 2025

done
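
The path convention visible in the reviewed snippets (`.key` for mapping entries, `.[n]` for sequence items, an empty string at the root) can be sketched in isolation. `collect_paths` is a hypothetical helper written for this illustration and is not part of the PR; it only shows how a required `path` argument is threaded through the recursion.

```python
from typing import Any


def collect_paths(value: Any, path: str) -> list[str]:
    """Return the path of every leaf, using `.key` and `.[n]` segments."""
    if isinstance(value, dict):
        paths = []
        for k, v in value.items():
            paths += collect_paths(v, f'{path}.{k}' if path else f'{k}')
        return paths
    if isinstance(value, (list, tuple)):
        paths = []
        for n, item in enumerate(value):
            paths += collect_paths(item, f'{path}.[{n}]' if path else f'[{n}]')
        return paths
    return [path]


# The root call passes '' explicitly, as suggested in the review.
paths = collect_paths({'people': [{'name': 'Alice'}, {'name': 'Bob'}], 'total': 2}, '')
print(paths)
```

Making `path` required means a forgotten argument in a recursive call fails loudly instead of silently resetting the path to the root.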

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

giacbrd and others added 5 commits

August 19, 2025 12:03


          minor optimization

42e5f5f

Co-authored-by: Douwe Maan


          minor refactoring

0a22655


          optimized structure info creation when format xml

0718be7


          optimized structure info creation when format xml

33ccd0e


          optimized structure info creation when format xml (minor refactoring)

0e99669

giacbrd added 6 commits

August 19, 2025 15:27


          optimized element creation when format xml

87ee7bd


          refactored arguments of format xml method

7956aff


          extract also dataclasses field metadata for xml format

a551bc8


          minor improvement in tests for xml format

05323d2


          coverage fix in tests for xml format

ce6d90f


          coverage fix in tests for xml format

6bd3617

DouweM requested changes

Collaborator

DouweM left a comment

@giacbrd Sorry for the delay in reviewing, thanks for the changes, we're almost there!

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

pydantic_ai_slim/pydantic_ai/format_prompt.py Outdated (resolved)

DouweM mentioned this pull request

Expose low level APIs of format_as_xml #2905

Closed

giacbrd and others added 7 commits

September 16, 2025 23:23


          merged parameters of xml format

2fd9ba8


          minor optimization

91a2b10

Co-authored-by: Douwe Maan


          minor optimization

2c912cd

Co-authored-by: Douwe Maan


          minor optimization

6e8f2c7

Co-authored-by: Douwe Maan


          no more alias attribute in formatted xml elements

62d7367


          Merge branch 'main' into feature/xml-attributes

6d7b17f


          UP038 fix

429da71

DouweM changed the title from "add XML attributes when formatting Pydantic models in prompts" to "Use model class names as tags in format_as_xml and add option to include field titles and descriptions as attributes"

DouweM merged commit 556bf56 into pydantic:main

34 checks passed

Collaborator

DouweM commented Sep 19, 2025

@giacbrd Thanks a lot Giacomo!

Contributor Author

giacbrd commented Sep 21, 2025

@DouweM you're welcome!

Labels

awaiting author revision