feat: #1614 gpt-realtime migration (Realtime API GA) #1646
Conversation
examples/realtime/app/server.py
Outdated
# Disable server-side interrupt_response to avoid truncating assistant audio
session_context = await runner.run(
    model_config={
        "initial_model_settings": {
            "turn_detection": {"type": "semantic_vad", "interrupt_response": False}
        }
    }
)
do we need to do this by default? why?
I explored some changes to improve the audio output quality, but they're not related to the gpt-realtime migration, so I've reverted all of them. I'll keep improving this example app, but that can be done in a separate pull request.
I was testing changing to the new voices; this is taken from the examples (examples/realtime/app):
model_settings: RealtimeSessionModelSettings = {
"model_name": "gpt-realtime",
"modalities": ["text", "audio"],
"voice": "marin",
"speed": 1.0,
"input_audio_format": "pcm16",
"output_audio_format": "pcm16",
"input_audio_transcription": {
"model": "gpt-4o-mini-transcribe",
},
"turn_detection": {"type": "semantic_vad", "threshold": 0.5},
# "instructions": "…", # optional
# "prompt": "…", # optional
# "tool_choice": "auto", # optional
# "tools": [], # optional
# "handoffs": [], # optional
# "tracing": {"enabled": False}, # optional
}
config = RealtimeRunConfig(model_settings=model_settings)
runner = RealtimeRunner(starting_agent=get_starting_agent(), config=config)
I noticed that the voice changed, but I lost all agent handoffs, tools, etc. I set the config via RealtimeRunConfig and RealtimeModelConfig; the same thing happened in both cases.
examples/realtime/app/server.py
Outdated
@@ -93,7 +111,9 @@ async def _serialize_event(self, event: RealtimeSessionEvent) -> dict[str, Any]:
            base_event["tool"] = event.tool.name
            base_event["output"] = str(event.output)
        elif event.type == "audio":
            base_event["audio"] = base64.b64encode(event.audio.data).decode("utf-8")
            # Coalesce raw PCM and flush on a steady timer for smoother playback.
is this just a quality improvement? would be nice to make it a separate PR if so
yeah, same as above (I won't repeat this for the rest)
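For context, the coalescing idea being discussed looks roughly like this; a minimal sketch with invented names (`PcmCoalescer`, `send`), not the PR's implementation:

```python
import asyncio


class PcmCoalescer:
    """Buffer raw PCM chunks and flush them on a fixed interval."""

    def __init__(self, flush_interval_s: float = 0.04):
        self._buffer = bytearray()
        self._flush_interval_s = flush_interval_s

    def add(self, chunk: bytes) -> None:
        self._buffer.extend(chunk)

    async def run(self, send) -> None:
        # Flush whatever has accumulated on a steady timer so the client
        # receives evenly paced pieces instead of bursty fragments.
        while True:
            await asyncio.sleep(self._flush_interval_s)
            if self._buffer:
                data, self._buffer = bytes(self._buffer), bytearray()
                await send(data)
```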
Force-pushed from a4333dd to f02b096.
Hello, any ETA on this one? I could be using it right now. :) Cheers, Thomas
Hi @seratch, do you know if this PR is going to be merged this week? No pressure, just asking to get an ETA in these cases. Thank you very much! By the way, the class OpenAIRealtimeWebSocketModel(RealtimeModel) has "gpt-4o-realtime-preview" by default (and you can't change it). It would be nice to set it to "gpt-realtime".
Not to speak for @seratch, but this mostly depends on review from @rm-openai.
@seratch: FYI, I noted that with OpenAI 1.107.0, I get this import error using your branch:

File "\.venv\Lib\site-packages\agents\realtime\__init__.py", line 84, in <module>
    from .openai_realtime import (
    ...<3 lines>...
    )
File "\.venv\Lib\site-packages\agents\realtime\openai_realtime.py", line 32, in <module>
    from openai.types.realtime.realtime_audio_config import (
    ...<3 lines>...
    )
ImportError: cannot import name 'Input' from 'openai.types.realtime.realtime_audio_config' (\.venv\Lib\site-packages\openai\types\realtime\realtime_audio_config.py)
@KelSolaar Thanks for letting me know! I'll resolve the conflicts.
You are very much welcome! The new model has also mostly solved the issue I reported here: #1681
@rm-openai @seratch What about changing the OpenAIRealtimeWebSocketModel(RealtimeModel) model from "gpt-4o-realtime-preview" to "gpt-realtime"? It would be nice to have it as the default, or better, to make it possible to select which realtime model to use.
@na-proyectran This pull request already makes that change. Once this is released, the default model will be changed. Right now, we're waiting for the underlying openai package.
Not the only thing: in openai-python (release 1.107.0) they removed other things too, such as `from openai.types.realtime.realtime_tools_config_union import (...)` and `from openai.types.realtime.realtime_audio_config import (...)`.
sounds great! do you have an idea when that will be? should I think of days, weeks, months? thanks!
The pull request is essentially functional as-is and can be tested; just make sure that you pin your requirements:
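(The exact pins weren't preserved in this thread; as an illustration only, with placeholders instead of real versions:)

```text
# requirements.txt — placeholders, not versions from this thread
openai==<pinned-version>
openai-agents @ git+https://github.com/openai/openai-agents-python@<this-pr-branch>
```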
Hello, I'm looking for image input, and unless I'm missing something, it is not supported at the moment, right? From:

@classmethod
def convert_user_input_to_conversation_item(
cls, event: RealtimeModelSendUserInput
) -> OpenAIConversationItem:
user_input = event.user_input
if isinstance(user_input, dict):
return RealtimeConversationItemUserMessage(
type="message",
role="user",
content=[
Content(
type="input_text",
text=item.get("text"),
)
for item in user_input.get("content", [])
],
)
else:
return RealtimeConversationItemUserMessage(
type="message",
role="user",
content=[Content(type="input_text", text=user_input)],
        )

The API should look like this:

{
"type": "conversation.item.create",
"previous_item_id": null,
"item": {
"type": "message",
"role": "user",
"content": [
{
"type": "input_image",
"image_url": "data:image/{format(example: png)};base64,{some_base64_image_bytes}"
}
]
}
}
@KelSolaar Thanks for pointing out the gap. Image input should be supported, but it's missing here right now. I will update the code to cover that use case too.
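A hypothetical sketch of how the converter quoted above might handle that (the `input_image`/`image_url` shape follows the API payload shown earlier; this is not the PR's final code):

```python
@classmethod
def convert_user_input_to_conversation_item(
    cls, event: RealtimeModelSendUserInput
) -> OpenAIConversationItem:
    user_input = event.user_input
    if isinstance(user_input, dict):
        content: list[Content] = []
        for item in user_input.get("content", []):
            if item.get("type") == "input_image":
                # Assumed shape: {"type": "input_image",
                #                 "image_url": "data:image/png;base64,..."}
                content.append(
                    Content(type="input_image", image_url=item.get("image_url"))
                )
            else:
                content.append(Content(type="input_text", text=item.get("text")))
        return RealtimeConversationItemUserMessage(
            type="message", role="user", content=content
        )
    return RealtimeConversationItemUserMessage(
        type="message",
        role="user",
        content=[Content(type="input_text", text=user_input)],
    )
```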
Thanks a ton, and sorry for making this PR harder to push through!
It's fine, I was just pointing out the new openai release. I mean, it would be nice to sync with the latest openai release.
Besides the default model defined in it, I think the realtime model in master also uses beta data structures defined in the OpenAI SDK package. I hope this PR can solve this issue.

Don't want to press, but is there any ETA on the release? Thanks.
Force-pushed from 30bbd8d to 7afde98.
@@ -84,7 +84,5 @@ jobs:
        enable-cache: true
    - name: Install dependencies
      run: make sync
    - name: Install Python 3.9 dependencies
moved to makefile
@@ -100,7 +100,8 @@ celerybeat.pid
*.sage.py

# Environments
.env
.python-version
.env*
for local python 3.9 tests
// Audio playback queue
this.audioQueue = [];
this.isPlayingAudio = false;
this.playbackAudioContext = null;
this.currentAudioSource = null;
this.currentAudioGain = null; // per-chunk gain for smooth fades
adjusted the internals of this JS code to play the audio chunks more smoothly (less gain noise)
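The same fade idea sketched in Python for illustration (function name and fade length are mine; the actual change is in the example app's JavaScript):

```python
import array


def apply_edge_fades(pcm16: bytes, sample_rate: int = 24000, fade_ms: float = 5.0) -> bytes:
    # Linearly ramp the first and last few milliseconds of a mono PCM16 chunk
    # so per-chunk playback doesn't click at the boundaries.
    samples = array.array("h", pcm16)
    n_fade = min(int(sample_rate * fade_ms / 1000), len(samples) // 2)
    for i in range(n_fade):
        gain = i / n_fade
        samples[i] = int(samples[i] * gain)
        samples[-1 - i] = int(samples[-1 - i] * gain)
    return samples.tobytes()
```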
this.muteBtn.addEventListener('click', () => {
    this.toggleMute();
});

// Image upload
for image file inputs
@@ -4,6 +4,6 @@

def calculate_audio_length_ms(format: RealtimeAudioFormat | None, audio_bytes: bytes) -> float:
-    if format and format.startswith("g711"):
+    if format and isinstance(format, str) and format.startswith("g711"):
the format data can now be either a str or a dict/class, hence the isinstance guard
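For context, the whole helper with the guard in place; a sketch where the 8 kHz g711 / 24 kHz PCM16 arithmetic is my reading of the defaults, not something shown in this diff:

```python
def calculate_audio_length_ms(format, audio_bytes: bytes) -> float:
    # g711 variants are 8 kHz with 1 byte per sample; anything else is assumed
    # to be 24 kHz PCM16 (2 bytes per sample). Under the GA API, `format` may be
    # a structured object rather than a string, so guard before .startswith().
    if format and isinstance(format, str) and format.startswith("g711"):
        return (len(audio_bytes) / 8000) * 1000
    return (len(audio_bytes) / 24 / 2) * 1000
```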
from ..logger import logger


def to_realtime_audio_format(
TS SDK does the same
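Roughly, the mapping looks like this; a sketch where the GA type names and module path are my best guess at the openai package's exports, so verify them against your installed version:

```python
from openai.types.realtime.realtime_audio_formats import (  # assumed module path
    AudioPCM,
    AudioPCMA,
    AudioPCMU,
)


def to_realtime_audio_format(input_audio_format):
    # Pass through None or already-structured GA objects; translate the
    # beta-era strings ("pcm16", "g711_ulaw", "g711_alaw") to GA objects.
    if input_audio_format is None or not isinstance(input_audio_format, str):
        return input_audio_format
    if input_audio_format in ("pcm16", "audio/pcm"):
        return AudioPCM(type="audio/pcm", rate=24000)
    if input_audio_format in ("g711_ulaw", "audio/pcmu"):
        return AudioPCMU(type="audio/pcmu")
    if input_audio_format in ("g711_alaw", "audio/pcma"):
        return AudioPCMA(type="audio/pcma")
    return None  # unknown format string
```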
@@ -103,17 +130,23 @@
    RealtimeModelSendUserInput,
)

# Avoid direct imports of non-exported names by referencing via module
just for mypy warnings
_USER_AGENT = f"Agents/Python {__version__}"

DEFAULT_MODEL_SETTINGS: RealtimeSessionModelSettings = {
    "voice": "ash",
    "modalities": ["text", "audio"],
The initial release of gpt-realtime does not support having both, so I changed these default settings; you can still receive transcripts in addition to audio chunks.
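In other words, the default now requests a single output modality, along these lines (a hedged sketch; the exact default is in the diff above):

```python
DEFAULT_MODEL_SETTINGS: RealtimeSessionModelSettings = {
    "voice": "ash",
    # GA gpt-realtime accepts one output modality; transcript text still
    # arrives via transcript delta events alongside the audio chunks.
    "modalities": ["audio"],
}
```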
@@ -495,40 +519,103 @@ async def _cancel_response(self) -> None:

async def _handle_ws_event(self, event: dict[str, Any]):
    await self._emit_event(RealtimeModelRawServerEvent(data=event))
    # The public interface defined on this Agents SDK side (e.g., RealtimeMessageItem)
As mentioned above, this SDK's public interface used the same data structures as the beta API, and the GA ones are slightly different. Thus, we convert the data here to fill the gap.
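As an illustration of the kind of gap-filling involved (the content-type renames are my assumption based on the GA API, not copied from this diff):

```python
# Map GA content-part types back to the beta-shaped names the SDK's
# public items (e.g. RealtimeMessageItem) were modeled on.
_GA_TO_BETA_CONTENT_TYPE = {
    "output_text": "text",
    "output_audio": "audio",
}


def _to_beta_item_dict(ga_item: dict) -> dict:
    # Return a copy of the GA item whose content parts use beta-era type names,
    # leaving unrecognized types untouched.
    item = dict(ga_item)
    item["content"] = [
        {**part, "type": _GA_TO_BETA_CONTENT_TYPE.get(part.get("type"), part.get("type"))}
        for part in ga_item.get("content", [])
    ]
    return item
```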
elif parsed.type == "error":
    await self._emit_event(RealtimeModelErrorEvent(error=parsed.error))
elif parsed.type == "conversation.item.deleted":
    await self._emit_event(RealtimeModelItemDeletedEvent(item_id=parsed.item_id))
elif (
    parsed.type == "conversation.item.created"
    or parsed.type == "conversation.item.added"
this is necessary to detect the user input item addition
this is still in progress but will resolve #1614