Description
Describe the bug
TEXT mode AssistantMessageItem has empty content in RealtimeHistoryUpdated/RealtimeHistoryAdded events
When using the Realtime API in TEXT mode (modalities: ["text"]), the AssistantMessageItem objects delivered in RealtimeHistoryUpdated and RealtimeHistoryAdded events have empty content arrays, even though the assistant has responded with text.
In VOICE mode (modalities: ["audio"]), the AssistantMessageItem correctly contains the audio transcript.
Root Cause
The bug is in openai_realtime.py, in the _handle_ws_event() method (around lines 550-560).
When processing response.output_item.done events, the SDK checks each content part's type:
if part.get("type") == "audio":
    converted_content.append({
        "type": "audio",
        "audio": part.get("audio"),
        "transcript": part.get("transcript"),
    })
elif part.get("type") == "text":
    converted_content.append({"type": "text", "text": part.get("text")})
Problem: the Realtime API sends TEXT mode content with type: "output_text", not type: "text".
The SDK correctly handles this conversion in _ConversionHelper.conversation_item_to_realtime_message_item() (lines 949-954):
if each.type == "output_text":
    # For backward-compatibility of assistant message items
    c["type"] = "text"
But this conversion is missing from _handle_ws_event(), so TEXT mode content is silently dropped.
The fix seems simple enough. Change:
elif part.get("type") == "text":
    converted_content.append({"type": "text", "text": part.get("text")})
to:
elif part.get("type") in ("text", "output_text"):
    converted_content.append({"type": "text", "text": part.get("text")})
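To make the effect of the proposed change easier to check in isolation, here is a minimal standalone sketch of the conversion logic. It is not the SDK's actual code: the function name convert_content_parts is hypothetical, and the shape of the input parts (plain dicts with "type", "text", "audio", "transcript" keys) is assumed from the snippets above.

```python
from typing import Any


def convert_content_parts(parts: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Hypothetical stand-in for the content conversion in _handle_ws_event()."""
    converted_content: list[dict[str, Any]] = []
    for part in parts:
        part_type = part.get("type")
        if part_type == "audio":
            converted_content.append({
                "type": "audio",
                "audio": part.get("audio"),
                "transcript": part.get("transcript"),
            })
        elif part_type in ("text", "output_text"):
            # Normalize "output_text" to "text" so TEXT mode parts are kept.
            converted_content.append({"type": "text", "text": part.get("text")})
        # Any other part type is still skipped, as in the original branch logic.
    return converted_content
```

With this version, a part such as {"type": "output_text", "text": "hello"} is converted to {"type": "text", "text": "hello"} instead of being dropped.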
Debug information
- Agents SDK version: 0.6.2
- Python version (e.g. Python 3.11)
Repro steps
- Create a RealtimeRunner with TEXT modality
- Send a user message and wait for the assistant response
- Listen for RealtimeHistoryUpdated or RealtimeHistoryAdded events
- Inspect the AssistantMessageItem - the content array will be empty
Expected behavior
The AssistantMessageItem.content should contain an AssistantText object with the response text, similar to how VOICE mode contains AssistantAudio with the transcript.
Extra (our workaround)
Subscribe to raw model events and extract text from response.output_text.done:
if isinstance(event, RealtimeRawModelEvent):
    if isinstance(event.data, RealtimeModelRawServerEvent):
        data = event.data.data
        if data.get("type") == "response.output_text.done":
            item_id = data.get("item_id")
            text = data.get("text")
(plus extra state management along the generic modality path to handle the text missing from AssistantMessageItem when the modality is TEXT)
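For anyone who needs the same workaround, the extra state management can be as small as a per-item text cache. The sketch below is only an illustration, not our exact code: TextBackfill and its methods are hypothetical names, and it operates on the raw event payload dict (the data variable from the snippet above) rather than on SDK types.

```python
from typing import Any, Optional


class TextBackfill:
    """Hypothetical cache of assistant text keyed by item_id."""

    def __init__(self) -> None:
        self._text_by_item: dict[str, str] = {}

    def observe(self, data: dict[str, Any]) -> None:
        """Record the final text from a raw response.output_text.done payload."""
        if data.get("type") != "response.output_text.done":
            return
        item_id = data.get("item_id")
        text = data.get("text")
        if item_id is not None and text is not None:
            self._text_by_item[item_id] = text

    def text_for(self, item_id: str) -> Optional[str]:
        """Return the cached text for an item, or None if none was seen."""
        return self._text_by_item.get(item_id)
```

The idea is to call observe(data) for every raw server event, then look up text_for(item_id) when a history event arrives with an AssistantMessageItem whose content is empty.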