
Conversation

seratch
Member

@seratch seratch commented Sep 3, 2025

this is still in progress but will resolve #1614

@seratch seratch requested a review from rm-openai September 3, 2025 06:13
@seratch seratch added enhancement New feature or request feature:realtime labels Sep 3, 2025
Comment on lines 48 to 55
# Disable server-side interrupt_response to avoid truncating assistant audio
session_context = await runner.run(
model_config={
"initial_model_settings": {
"turn_detection": {"type": "semantic_vad", "interrupt_response": False}
}
}
)
Collaborator

do we need to do this by default? why?

Member Author

I explored some changes to improve the audio output quality, but they're not related to the gpt-realtime migration, so I've reverted all of them. I will keep improving this example app, but that can be done in a separate pull request.


I was testing switching to the new voices; this is taken from the examples (examples/realtime/app):

    model_settings: RealtimeSessionModelSettings = {
        "model_name": "gpt-realtime",
        "modalities": ["text", "audio"],
        "voice": "marin",
        "speed": 1.0,
        "input_audio_format": "pcm16",
        "output_audio_format": "pcm16",
        "input_audio_transcription": {
            "model": "gpt-4o-mini-transcribe",
        },
        "turn_detection": {"type": "semantic_vad", "threshold": 0.5},
        # "instructions": "…",                   # optional
        # "prompt": "…",                         # optional
        # "tool_choice": "auto",                 # optional
        # "tools": [],                           # optional
        # "handoffs": [],                        # optional
        # "tracing": {"enabled": False},         # optional
    }
    config = RealtimeRunConfig(model_settings=model_settings)
    runner = RealtimeRunner(starting_agent=get_starting_agent())
    
I noticed that the voice changed, but I lost all agent handoffs, tools, etc.

I set the config via RealtimeRunConfig and RealtimeModelConfig. The same thing happened in both cases.
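One detail worth checking in the snippet above: `config` is built but never passed to the runner, so the custom model settings may be silently unused. Assuming `RealtimeRunner` accepts a `config` argument as in the examples directory, the wiring would look like this (a sketch, not a confirmed fix):

```python
config = RealtimeRunConfig(model_settings=model_settings)
# Pass the config in explicitly; get_starting_agent() is assumed to return
# an agent with its handoffs and tools already attached.
runner = RealtimeRunner(starting_agent=get_starting_agent(), config=config)
```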

@@ -93,7 +111,9 @@ async def _serialize_event(self, event: RealtimeSessionEvent) -> dict[str, Any]:
base_event["tool"] = event.tool.name
base_event["output"] = str(event.output)
elif event.type == "audio":
base_event["audio"] = base64.b64encode(event.audio.data).decode("utf-8")
# Coalesce raw PCM and flush on a steady timer for smoother playback.
Collaborator

is this just a quality improvement? would be nice to make it be a separate PR if so

Member Author

yeah, same as above (I won't repeat this for the rest)
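The coalescing approach mentioned in the diff above (buffer raw PCM, flush on a steady timer) can be sketched roughly like this; the class name, flush interval, and byte rate are illustrative, not the demo app's actual code:

```python
from typing import Optional


class PCMCoalescer:
    """Accumulate small PCM chunks and release them as larger, evenly
    timed buffers so playback doesn't stutter on tiny packets."""

    def __init__(self, flush_interval_ms: float = 40.0, byte_rate: int = 48_000) -> None:
        # byte_rate 48_000 assumes 24 kHz mono pcm16 (2 bytes per sample).
        self.min_flush_bytes = int(byte_rate * flush_interval_ms / 1000)
        self._buffer = bytearray()

    def add(self, chunk: bytes) -> None:
        self._buffer.extend(chunk)

    def flush(self) -> Optional[bytes]:
        # Called from a steady timer; emit only once enough audio accumulated.
        if len(self._buffer) < self.min_flush_bytes:
            return None
        out = bytes(self._buffer)
        self._buffer.clear()
        return out
```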

@seratch seratch force-pushed the realtime-ga branch 2 times, most recently from a4333dd to f02b096 Compare September 4, 2025 10:20
@seratch seratch marked this pull request as ready for review September 4, 2025 10:22
@KelSolaar

Hello,

Any ETA on this one? I could be using it right now. :)

Cheers,

Thomas

@na-proyectran

na-proyectran commented Sep 8, 2025

Hi @seratch, do you know if this PR is going to be merged this week? No pressure, just wondering about the ETA in these cases. Thank you very much!

By the way, class OpenAIRealtimeWebSocketModel(RealtimeModel) has "gpt-4o-realtime-preview" by default (and you can't change it). It would be nice to set it to "gpt-realtime".

@adinin

adinin commented Sep 8, 2025

Hi @seratch, do you know if this PR is going to be merged this week? No pressure, just wondering about the ETA in these cases. Thank you very much!

not to speak for @seratch, but this is probably mostly dependent on the review from @rm-openai

@KelSolaar

@seratch :

FYI, noted that with OpenAI 1.107.0, I get this import error using your branch:

  File "\.venv\Lib\site-packages\agents\realtime\__init__.py", line 84, in <module>
    from .openai_realtime import (
    ...<3 lines>...
    )
  File "\.venv\Lib\site-packages\agents\realtime\openai_realtime.py", line 32, in <module>
    from openai.types.realtime.realtime_audio_config import (
    ...<3 lines>...
    )
ImportError: cannot import name 'Input' from 'openai.types.realtime.realtime_audio_config' (\.venv\Lib\site-packages\openai\types\realtime\realtime_audio_config.py)

@seratch
Member Author

seratch commented Sep 9, 2025

@KelSolaar Thanks for letting me know this! Will resolve the conflicts.

@KelSolaar

You are very much welcome! The new model has also mostly solved the issue I reported here: #1681

@seratch seratch marked this pull request as draft September 9, 2025 06:05
@na-proyectran

@rm-openai @seratch What about changing the OpenAIRealtimeWebSocketModel(RealtimeModel) model from "gpt-4o-realtime-preview" to "gpt-realtime"? It would be nice to have it as the default, or better, to make it possible to select which realtime model to use.

@seratch
Member Author

seratch commented Sep 9, 2025

@na-proyectran This pull request already does the change. Once this is released, the default model will be changed.

Right now, we're waiting for the underlying openai package updates. So, it may take a bit more time. Thank you all for waiting for a while.

@na-proyectran

na-proyectran commented Sep 9, 2025

@seratch :

FYI, noted that with OpenAI 1.107.0 (released 16h ago), I get this import error using your branch:

  File "\.venv\Lib\site-packages\agents\realtime\__init__.py", line 84, in <module>
    from .openai_realtime import (
    ...<3 lines>...
    )
  File "\.venv\Lib\site-packages\agents\realtime\openai_realtime.py", line 32, in <module>
    from openai.types.realtime.realtime_audio_config import (
    ...<3 lines>...
    )
ImportError: cannot import name 'Input' from 'openai.types.realtime.realtime_audio_config' (\.venv\Lib\site-packages\openai\types\realtime\realtime_audio_config.py)

That's not the only one; in openai-python (release 1.107.0) they removed other things, like:

    from openai.types.realtime.realtime_tools_config_union import (
        Function as OpenAISessionFunction,
    )
    -> Function (now MCP-only)

    from openai.types.realtime.realtime_session import (
        RealtimeSession as OpenAISessionObject,
    )
    -> realtime_session (no longer there)

    from openai.types.realtime.realtime_audio_config import (
        Input as OpenAIRealtimeAudioInput,
        Output as OpenAIRealtimeAudioOutput,
        RealtimeAudioConfig as OpenAIRealtimeAudioConfig,
    )
    -> OpenAIRealtimeAudioOutput (no longer there)
    -> OpenAIRealtimeAudioInput (no longer there)
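One way to absorb this kind of churn while the SDK catches up is a guarded import that tolerates either openai release; the module path below is the one quoted above, and the `None` fallback is just a placeholder sentinel:

```python
try:
    # Present in some openai releases; moved or removed in others.
    from openai.types.realtime.realtime_audio_config import (
        RealtimeAudioConfig as OpenAIRealtimeAudioConfig,
    )
except ImportError:  # also covers openai not being installed at all
    OpenAIRealtimeAudioConfig = None
```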

@WesselBosscher

@na-proyectran This pull request already does the change. Once this is released, the default model will be changed.

Right now, we're waiting for the underlying openai package updates. So, it may take a bit more time. Thank you all for waiting for a while.

sounds great! do you have an idea when that will be? should I think of days, weeks, months?

thanks!

@KelSolaar

@na-proyectran This pull request already does the change. Once this is released, the default model will be changed.
Right now, we're waiting for the underlying openai package updates. So, it may take a bit more time. Thank you all for waiting for a while.

sounds great! do you have an idea when that will be? should I think of days, weeks, months?

thanks!

The pull request is essentially functional as is and can be tested, just make sure that you pin your requirements:

    "openai==1.106.1",
    "openai-agents @ git+https://github.com/openai/openai-agents-python@realtime-ga",

@KelSolaar

Hello,

I'm looking for image input and, unless I'm missing something, it is not supported at the moment, right?

From agents\realtime\openai_realtime.py:

    @classmethod
    def convert_user_input_to_conversation_item(
        cls, event: RealtimeModelSendUserInput
    ) -> OpenAIConversationItem:
        user_input = event.user_input

        if isinstance(user_input, dict):
            return RealtimeConversationItemUserMessage(
                type="message",
                role="user",
                content=[
                    Content(
                        type="input_text",
                        text=item.get("text"),
                    )
                    for item in user_input.get("content", [])
                ],
            )
        else:
            return RealtimeConversationItemUserMessage(
                type="message",
                role="user",
                content=[Content(type="input_text", text=user_input)],
            )

The API should look like this:

{
    "type": "conversation.item.create",
    "previous_item_id": null,
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {
                "type": "input_image",
                "image_url": "data:image/{format(example: png)};base64,{some_base64_image_bytes}"
            }
        ]
    }
}
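As a stopgap until the SDK handles it, the payload above can be assembled by hand; `make_image_item_event` is a hypothetical helper, not part of the SDK:

```python
import base64


def make_image_item_event(image_bytes: bytes, fmt: str = "png") -> dict:
    """Build a conversation.item.create event carrying an input_image,
    matching the payload shape shown above (sketch, not the SDK's API)."""
    data_url = (
        f"data:image/{fmt};base64,"
        + base64.b64encode(image_bytes).decode("ascii")
    )
    return {
        "type": "conversation.item.create",
        "previous_item_id": None,
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_image", "image_url": data_url}],
        },
    }
```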

@seratch
Member Author

seratch commented Sep 9, 2025

@KelSolaar Thanks for pointing out the gap. Image input should be supported, but it's missing here right now. I will update the code to cover that use case too.

@KelSolaar

@KelSolaar Thanks for pointing out the gap. Image input should be supported, but it's missing here right now. I will update the code to cover that use case too.

Thanks a ton and sorry for making this PR harder to push through!

@na-proyectran

@na-proyectran This pull request already does the change. Once this is released, the default model will be changed.
Right now, we're waiting for the underlying openai package updates. So, it may take a bit more time. Thank you all for waiting for a while.

sounds great! do you have an idea when that will be? should I think of days, weeks, months?
thanks!

The pull request is essentially functional as is and can be tested, just make sure that you pin your requirements:

    "openai==1.106.1",
    "openai-agents @ git+https://github.com/openai/openai-agents-python@realtime-ga",

It is; I'm just pointing to the new openai release:

  • feat(api): ship the RealtimeGA API shape
    Updates types to use the GA shape for Realtime API
  • release: 1.107.0

I mean, it would be nice to sync with the latest openai release.

@aligokalppeker

aligokalppeker commented Sep 10, 2025

Besides the default model defined in it, I think the realtime model on master also uses beta data structures defined in the OpenAI SDK package. I hope this PR can solve this issue. I don't want to press, but is there any ETA on the release? Thanks.

#1708

@@ -84,7 +84,5 @@ jobs:
enable-cache: true
- name: Install dependencies
run: make sync
- name: Install Python 3.9 dependencies
Member Author

moved to makefile

@@ -100,7 +100,8 @@ celerybeat.pid
*.sage.py

# Environments
.env
.python-version
.env*
Member Author

for local python 3.9 tests

// Audio playback queue
this.audioQueue = [];
this.isPlayingAudio = false;
this.playbackAudioContext = null;
this.currentAudioSource = null;

this.currentAudioGain = null; // per-chunk gain for smooth fades
Member Author

adjusted internals of this JS code to more smoothly play the audio chunks (less gain noise)

this.muteBtn.addEventListener('click', () => {
this.toggleMute();
});

// Image upload
Member Author

for image file inputs

@@ -4,6 +4,6 @@


def calculate_audio_length_ms(format: RealtimeAudioFormat | None, audio_bytes: bytes) -> float:
if format and format.startswith("g711"):
if format and isinstance(format, str) and format.startswith("g711"):
Member Author

the format data could be either a str or a dict/class
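For context, the duration math behind that helper looks roughly like this, assuming g711 is 8 kHz mono at 1 byte per sample and pcm16 is 24 kHz mono at 2 bytes per sample (the rates the realtime API uses); the `isinstance` guard is what keeps a GA-style structured format object from hitting `startswith`:

```python
def calculate_audio_length_ms(format, audio_bytes: bytes) -> float:
    # GA configs may carry a structured format object rather than a plain
    # string like "g711_ulaw" / "g711_alaw" / "pcm16".
    if format and isinstance(format, str) and format.startswith("g711"):
        # g711: 8000 samples/s * 1 byte/sample = 8 bytes per millisecond
        return len(audio_bytes) / 8
    # pcm16 default: 24000 samples/s * 2 bytes/sample = 48 bytes per ms
    return len(audio_bytes) / 24 / 2
```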

from ..logger import logger


def to_realtime_audio_format(
Member Author

TS SDK does the same

@@ -103,17 +130,23 @@
RealtimeModelSendUserInput,
)

# Avoid direct imports of non-exported names by referencing via module
Member Author

just for mypy warnings

_USER_AGENT = f"Agents/Python {__version__}"

DEFAULT_MODEL_SETTINGS: RealtimeSessionModelSettings = {
"voice": "ash",
"modalities": ["text", "audio"],
Member Author

The initial release of gpt-realtime does not support having both, so I changed these default settings; you can still receive a transcript in addition to audio chunks.

@@ -495,40 +519,103 @@ async def _cancel_response(self) -> None:

async def _handle_ws_event(self, event: dict[str, Any]):
await self._emit_event(RealtimeModelRawServerEvent(data=event))
# The public interface defined on this Agents SDK side (e.g., RealtimeMessageItem)
Member Author

as mentioned here, this SDK's public interface was the same as the beta API's data structures, and the GA ones are slightly different. Thus, we convert the data here to fill the gap.

elif parsed.type == "error":
await self._emit_event(RealtimeModelErrorEvent(error=parsed.error))
elif parsed.type == "conversation.item.deleted":
await self._emit_event(RealtimeModelItemDeletedEvent(item_id=parsed.item_id))
elif (
parsed.type == "conversation.item.created"
parsed.type == "conversation.item.added"
Member Author

this is necessary to detect the user input item addition
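The reason both event names appear in the diff is that the GA API emits `conversation.item.added` where the beta API used `conversation.item.created`; a toy normalizer (field names illustrative, not the SDK's actual types) makes the idea concrete:

```python
def normalize_item_event(event: dict) -> dict:
    """Map the beta 'conversation.item.created' and the GA
    'conversation.item.added' server events onto one internal shape."""
    if event["type"] not in ("conversation.item.created", "conversation.item.added"):
        raise ValueError(f"unexpected event type: {event['type']}")
    item = event["item"]
    return {
        "item_id": item["id"],
        "type": item["type"],
        "role": item.get("role"),
        "content": item.get("content", []),
    }
```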

@seratch seratch marked this pull request as ready for review September 11, 2025 06:47
@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

Codex Review: Here are some suggestions.

Reply with @codex fix comments to fix any unresolved comments.


Labels
enhancement New feature or request feature:realtime
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Support for gpt-realtime
7 participants