OpenAI-Dotnet 8.x #446
Conversation
This PR is currently able to establish a conversation with the OpenAI Realtime WebRTC endpoint. Sample program below. @StephenHodgson my main intent with the PR for now is to see if it's palatable to you? I haven't intruded too much into the existing code. It was necessary to tweak the SessionConfiguration to avoid getting an error response from OpenAI. The sample below needs the following packages and a Windows-specific target such as
Update: Minor update to add a missing response create call to get the conversation started. All working nicely now.

using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using SIPSorcery.Net;
using SIPSorceryMedia.Windows;
using OpenAI;
using OpenAI.Realtime;
using System.Collections.Generic;
namespace demo;
class Program
{
private const string OPENAIKEY_ENVVAR = "OPENAIKEY";
private const string OPENAI_MODEL = "gpt-4o-realtime-preview-2024-12-17";
private const string OPENAI_VOICE = "shimmer";
static async Task Main()
{
Console.WriteLine("WebRTC OpenAI Demo Program");
var openAIKey = Environment.GetEnvironmentVariable(OPENAIKEY_ENVVAR);
if (string.IsNullOrWhiteSpace(openAIKey))
{
Console.Error.WriteLine($"{OPENAIKEY_ENVVAR} environment variable not set, cannot continue.");
return;
}
var pcConfig = new RTCConfiguration
{
X_UseRtpFeedbackProfile = true
};
var openaiClient = new OpenAIClient(new OpenAIAuthentication(openAIKey));
var webrtcEndPoint = openaiClient.RealtimeEndpointWebRTC;
webrtcEndPoint.EnableDebug = true;
WindowsAudioEndPoint windowsAudioEP = new WindowsAudioEndPoint(webrtcEndPoint.AudioEncoder, -1, -1, false, false);
windowsAudioEP.SetAudioSinkFormat(webrtcEndPoint.AudioFormat);
windowsAudioEP.SetAudioSourceFormat(webrtcEndPoint.AudioFormat);
windowsAudioEP.OnAudioSourceEncodedSample += webrtcEndPoint.SendAudio;
webrtcEndPoint.OnRtpPacketReceived += (IPEndPoint rep, SDPMediaTypesEnum media, RTPPacket rtpPkt) =>
{
windowsAudioEP.GotAudioRtp(rep, rtpPkt.Header.SyncSource, rtpPkt.Header.SequenceNumber, rtpPkt.Header.Timestamp, rtpPkt.Header.PayloadType, rtpPkt.Header.MarkerBit == 1, rtpPkt.Payload);
};
webrtcEndPoint.OnPeerConnectionConnected += async () =>
{
await windowsAudioEP.StartAudio();
await windowsAudioEP.StartAudioSink();
};
webrtcEndPoint.OnPeerConnectionClosedOrFailed += async () => await windowsAudioEP.CloseAudio();
// This will get sent to OpenAI once the WebRTC connection is established. It updates the session
// that is automatically created by the OpenAI Realtime endpoint.
var sessionConfig = new SessionConfiguration(
OPENAI_MODEL,
voice: OPENAI_VOICE,
instructions: "Keep it snappy.",
tools: new List<Tool>());
var webrtcSession = await webrtcEndPoint.CreateSessionAsync(
sessionConfig,
rtcConfiguration: pcConfig);
// Get the conversation started.
var responseCreate = new CreateResponseRequest(new(instructions: "Say Hi."));
await webrtcSession.SendAsync(responseCreate);
Console.WriteLine("Wait for ctrl-c to indicate user exit.");
ManualResetEvent exitMre = new(false);
Console.CancelKeyPress += (_, e) =>
{
e.Cancel = true;
exitMre.Set();
};
exitMre.WaitOne();
}
}

<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="SIPSorcery" Version="8.0.14" />
prob need to remove this before publishing.
It won't be possible to use WebRTC without it.
I do understand if you'd prefer to keep dependencies down and that was what I was getting at in the previous discussion.
The alternative would be a new separate package under RageAgainstThePixel or SIPSorcery.
Let me dive into the specifics and see if there is a way to sort this out in a way that makes sense and is easy to use.
I have an existing production project that uses SIPSorcery and this library; I'll fiddle with it to see what I can come up with while upgrading it from websockets to WebRTC.
namespace OpenAI.Realtime
{
    public sealed class RealtimeSessionWebRTC : IDisposable
seems very similar to the websocket implementation, I wonder if it can be generalized more
Yes, it can, and that's the main reason the PR is in draft. There's a lot of reused code. But before refactoring I'd like to know whether adding the SIPSorcery dependency is your preferred approach or not...
Overall it looks good and clean. Mostly just curious how to handle tackling the deps without locking into a specific library.
Yes, that's the core question. There's unlikely to be much point making a generic WebRTC interface since, to the best of my knowledge, there are no other dotnet alternatives that could be plugged in. So again, the question comes down to putting the WebRTC + OpenAI-DotNet code into a completely separate package (which so far would be a total of two classes), maybe OpenAI-DotNet-WebRTC, or just taking the hit on the dependencies and adding them to OpenAI-DotNet.
var toolList = tools?.ToList();

if (string.IsNullOrWhiteSpace(toolChoice))
{
    ToolChoice = "auto";
}
This is prob a good fix to make across all the tool usages. I'll check and see if this needs to be ported to other feature areas.
Copilot suggested this was a bug and that the case was no longer handled correctly:
The refactored control flow in toolChoice handling no longer assigns 'auto' when toolChoice is whitespace, as it did previously. Consider restoring that branch to ensure consistent behavior.
That change means the ToolChoice will now always be assigned "auto" irrespective of whether there are any tool entries or not. The original approach resulted in the request getting rejected by OpenAI if no tools were specified, with an error that the ToolChoice was not set.
Copilot is prob just missing that the original code still exists but has been moved outside the conditional.
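For context, the moved branch can be sketched like this (an illustrative helper, not the library's exact code): the whitespace fallback now runs after the tools check, so "auto" is assigned even when no tools are supplied.

```csharp
// Sketch of the tool-choice handling discussed above (illustrative names,
// not the library's exact code): null/whitespace always falls back to
// "auto", which avoids OpenAI rejecting a tool-less request.
using System;
using System.Collections.Generic;

static string ResolveToolChoice(string? toolChoice, IReadOnlyList<string>? toolList)
{
    if (toolList is { Count: > 0 } && !string.IsNullOrWhiteSpace(toolChoice))
    {
        // An explicit choice is only honoured when tools are actually present.
        return toolChoice;
    }

    // Moved outside the tools conditional: runs whether or not tools exist.
    return "auto";
}

Console.WriteLine(ResolveToolChoice(null, null));
Console.WriteLine(ResolveToolChoice("required", new[] { "Add" }));
```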
Pull Request Overview
This PR adds WebRTC support for realtime sessions and related endpoint functionality.
- Updated SessionConfiguration logic for tool choice handling.
- Introduced RealtimeSessionWebRTC and RealtimeEndpointWebRTC classes with complete SDP negotiation and event handling logic.
- Added new event response classes and updated client configuration, along with necessary package references.
Reviewed Changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.
Show a summary per file
File | Description
---|---
OpenAI-DotNet/Realtime/SessionConfiguration.cs | Modified tool choice handling logic
OpenAI-DotNet/Realtime/RealtimeSessionWebRTC.cs | Added new realtime session implementation using WebRTC
OpenAI-DotNet/Realtime/RealtimeEndpointWebRTC.cs | Added endpoint logic for establishing WebRTC sessions and SDP negotiation
OpenAI-DotNet/Realtime/OutputAudioBufferStartedResponse.cs | Added new response event class
OpenAI-DotNet/OpenAIClient.cs | Added property for the new RealtimeEndpointWebRTC
OpenAI-DotNet/OpenAI-DotNet.csproj | Included new package reference to SIPSorcery
OpenAI-DotNet/Extensions/RealtimeServerEventConverter.cs | Mapped new "output_audio_buffer.started" server event
Comments suppressed due to low confidence (2)
OpenAI-DotNet/Realtime/RealtimeSessionWebRTC.cs:125
- Consider returning a Task instead of using 'async void' for the Send method to allow proper error propagation and handling.
public async void Send<T>(T @event) where T : IClientEvent => await SendAsync(@event).ConfigureAwait(false);
OpenAI-DotNet/Realtime/SessionConfiguration.cs:56
- The refactored control flow in toolChoice handling no longer assigns 'auto' when toolChoice is whitespace, as it did previously. Consider restoring that branch to ensure consistent behavior.
if (toolList is { Count: > 0 })
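The `async void` point above can be illustrated with a minimal sketch (hypothetical Session type; only the signatures matter): returning the Task lets callers await the send and observe any exception it throws, which `async void` cannot do.

```csharp
// Minimal sketch of the Send suggestion above (hypothetical types, not the
// library's API). Before: `public async void Send<T>(T @event)` - a thrown
// exception escapes to the synchronization context and cannot be awaited.
// After: hand the Task straight back to the caller.
using System;
using System.Threading.Tasks;

class Session
{
    public object? LastSent;

    // Task-returning: callers can await, continue, or log faults.
    public Task Send<T>(T @event) => SendAsync(@event);

    private Task SendAsync<T>(T @event)
    {
        LastSent = @event; // stand-in for the real network write
        return Task.CompletedTask;
    }
}

var session = new Session();
await session.Send("response.create");
Console.WriteLine(session.LastSent);
```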
Hi, I managed to get this running, and I can't wait to use it in my projects. Excellent job! However, I must ask... is there any tool / function calling support implemented? I couldn't find any, and when I defined some for my test code, they weren't called. (Worse, the AI stopped and waited for the tool's answer.) Also, I couldn't find a way to get transcription out of RealtimeSessionWebRTC (or the endpoint) - for either the user's or the AI's text. Perhaps I am missing something... Here's my code - the only actual change is the addition of the tools and the corresponding functions.

using System.Net;
using SIPSorcery.Net;
using SIPSorceryMedia.Windows;
using OpenAI;
using OpenAI.Realtime;
namespace Demo;
class Program
{
private const string OPENAIKEY_ENVVAR = "OPENAI_API_KEY";
private const string OPENAI_MODEL = "gpt-4o-realtime-preview-2024-12-17";
private const string OPENAI_VOICE = "shimmer";
static async Task Main()
{
Console.WriteLine("WebRTC OpenAI Demo Program");
var openAIKey = Environment.GetEnvironmentVariable(OPENAIKEY_ENVVAR);
if (string.IsNullOrWhiteSpace(openAIKey))
{
Console.Error.WriteLine($"{OPENAIKEY_ENVVAR} environment variable not set, cannot continue.");
return;
}
var pcConfig = new RTCConfiguration
{
X_UseRtpFeedbackProfile = true
};
var openaiClient = new OpenAIClient(new OpenAIAuthentication(openAIKey));
var webrtcEndPoint = openaiClient.RealtimeEndpointWebRTC;
webrtcEndPoint.EnableDebug = true;
WindowsAudioEndPoint windowsAudioEP = new WindowsAudioEndPoint(webrtcEndPoint.AudioEncoder, -1, -1, false, false);
windowsAudioEP.SetAudioSinkFormat(webrtcEndPoint.AudioFormat);
windowsAudioEP.SetAudioSourceFormat(webrtcEndPoint.AudioFormat);
windowsAudioEP.OnAudioSourceEncodedSample += webrtcEndPoint.SendAudio;
webrtcEndPoint.OnRtpPacketReceived += (IPEndPoint rep, SDPMediaTypesEnum media, RTPPacket rtpPkt) =>
{
windowsAudioEP.GotAudioRtp(rep, rtpPkt.Header.SyncSource, rtpPkt.Header.SequenceNumber, rtpPkt.Header.Timestamp, rtpPkt.Header.PayloadType, rtpPkt.Header.MarkerBit == 1, rtpPkt.Payload);
};
webrtcEndPoint.OnPeerConnectionConnected += async () =>
{
await windowsAudioEP.StartAudio();
await windowsAudioEP.StartAudioSink();
};
webrtcEndPoint.OnPeerConnectionClosedOrFailed += async () => await windowsAudioEP.CloseAudio();
// This will get sent to OpenAI once the WebRTC connection is established. It updates the session
// that is automatically created by the OpenAI Realtime endpoint.
var sessionConfig = new SessionConfiguration(
OPENAI_MODEL,
voice: OPENAI_VOICE,
instructions: "Keep it snappy.",
tools:
[
Tool.FromFunc("Add", (int a, int b) => Add(a, b)),
Tool.FromFunc("Random", (int min, int max) => Random(min, max))
],
toolChoice: "auto"
);
var webrtcSession = await webrtcEndPoint.CreateSessionAsync(
sessionConfig,
rtcConfiguration: pcConfig);
// Get the conversation started.
var responseCreate = new CreateResponseRequest(new(instructions: "Say Hi."));
await webrtcSession.SendAsync(responseCreate);
Console.WriteLine("Wait for ctrl-c to indicate user exit.");
ManualResetEvent exitMre = new(false);
Console.CancelKeyPress += (_, e) =>
{
e.Cancel = true;
exitMre.Set();
};
exitMre.WaitOne();
}
public static int Add(int a, int b)
{
var result = a + b;
Console.WriteLine($"Add({a}, {b}): {result}");
return result;
}
public static int Random(int min, int max)
{
Random random = new Random();
var result = random.Next(min, max);
Console.WriteLine($"Random({min}, {max}): {result}");
return result;
}
}
Yeah I think we need to use the same unit tests from the Websocket implementation as a baseline since it tests tools and bi-directional client messages.
There is nothing included in this PR (it was intended to see what the appetite for merging was rather than being fully formed). That being said, I have done some initial mucking around with local functions and it does mostly work, see demo here. There are a few challenges to deal with getting the calls right. The OpenAI docs don't deal with function calling over data channels at all, and while it seems to be mostly equivalent to the HTTP approach there are some discrepancies. The whole WebRTC peer connection will typically get dropped if a data channel message is sent to OpenAI that it doesn't understand or like. That could be what's happening in your case, although as you don't seem to be sending any new messages I'm guessing it could be a delay to the initial session update, or just something else entirely. In my testing, though, the WebRTC connections to OpenAI are nice and fast and clean. Once the format of the data channel messages is determined, my experience has been that they are very stable.
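For reference, the kind of payload involved looks like the sketch below. `response.create` is a documented Realtime client event type (the sample programs above already send it via CreateResponseRequest); the commented send call site is hypothetical.

```csharp
// Sketch of a raw Realtime client event as it would travel over the WebRTC
// data channel. Serialising and sending this JSON is what the session
// classes do under the hood; the send call in the comment is hypothetical.
using System;
using System.Text;
using System.Text.Json;

var clientEvent = new
{
    type = "response.create",
    response = new { instructions = "Say Hi." }
};

byte[] payload = Encoding.UTF8.GetBytes(JsonSerializer.Serialize(clientEvent));
Console.WriteLine(Encoding.UTF8.GetString(payload));
// In a live session: dataChannel.send(payload);
```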
Again, not wired up in this PR, and it would take a bit of re-architecting since it's based on the request/response approach used by the existing websocket implementation. Apart from that it's super easy to do and is so useful. It's just a matter of catching the required JSON message types on the data channel and doing something with them.

private void OnDataChannelMessage(RTCDataChannel dc, DataChannelPayloadProtocols protocol, byte[] data)
{
//logger.LogInformation($"Data channel {dc.label}, protocol {protocol} message length {data.Length}.");
var message = Encoding.UTF8.GetString(data);
var serverEvent = JsonSerializer.Deserialize<OpenAIServerEventBase>(message, JsonOptions.Default);
var serverEventModel = OpenAIDataChannelManager.ParseDataChannelMessage(data);
serverEventModel.IfSome(e =>
{
if (e is OpenAIResponseAudioTranscriptDone done)
{
_logger.LogInformation($"Transcript done: {done.Transcript}");
}
});
}

Here's a transcript I use to blow up the AI by asking it to count to 30 in 3 different languages, switching language with each number. It fails every time, even after I tell it where it's going wrong.