Conversation


@sipsorcery sipsorcery commented May 9, 2025

  • Added support for WebRTC Realtime Sessions

@sipsorcery
Author

sipsorcery commented May 9, 2025

This PR is currently able to establish a conversation with the OpenAI Realtime WebRTC endpoint. Sample program below.

@StephenHodgson my main intent with the PR for now is to see whether it's palatable to you. I haven't intruded too much into the existing code. It was necessary to tweak the SessionConfiguration to avoid getting an error response from OpenAI.

The sample below needs the following packages and a Windows-specific target such as <TargetFramework>net8.0-windows10.0.17763.0</TargetFramework>.

dotnet add package SIPSorcery
dotnet add package SIPSorceryMedia.Windows
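
For reference, a minimal project file for the sample might look like the sketch below. The SIPSorcery version matches the one referenced elsewhere in this PR; the SIPSorceryMedia.Windows version is an assumption, so check NuGet for the current releases.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <!-- Windows-specific target required by SIPSorceryMedia.Windows -->
    <TargetFramework>net8.0-windows10.0.17763.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- Versions are illustrative; use the latest from NuGet -->
    <PackageReference Include="SIPSorcery" Version="8.0.14" />
    <PackageReference Include="SIPSorceryMedia.Windows" Version="8.0.0" />
  </ItemGroup>
</Project>
```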

Update: Minor update to add missing response create call to get the conversation started. All working nicely now.

using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using SIPSorcery.Net;
using SIPSorceryMedia.Windows;
using OpenAI;
using OpenAI.Realtime;
using System.Collections.Generic;

namespace demo;

class Program
{
    private const string OPENAIKEY_ENVVAR = "OPENAIKEY";
    private const string OPENAI_MODEL = "gpt-4o-realtime-preview-2024-12-17";
    private const string OPENAI_VOICE = "shimmer";

    static async Task Main()
    {
        Console.WriteLine("WebRTC OpenAI Demo Program");

        var openAIKey = Environment.GetEnvironmentVariable(OPENAIKEY_ENVVAR);
        if (string.IsNullOrWhiteSpace(openAIKey))
        {
            Console.Error.WriteLine($"{OPENAIKEY_ENVVAR} environment variable not set, cannot continue.");
            return;
        }

        var pcConfig = new RTCConfiguration
        {
            X_UseRtpFeedbackProfile = true
        };

        var openaiClient = new OpenAIClient(new OpenAIAuthentication(openAIKey));
        var webrtcEndPoint = openaiClient.RealtimeEndpointWebRTC;
        webrtcEndPoint.EnableDebug = true;

        WindowsAudioEndPoint windowsAudioEP = new WindowsAudioEndPoint(webrtcEndPoint.AudioEncoder, -1, -1, false, false);
        windowsAudioEP.SetAudioSinkFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.SetAudioSourceFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.OnAudioSourceEncodedSample += webrtcEndPoint.SendAudio;

        webrtcEndPoint.OnRtpPacketReceived += (IPEndPoint rep, SDPMediaTypesEnum media, RTPPacket rtpPkt) =>
        {
            windowsAudioEP.GotAudioRtp(rep, rtpPkt.Header.SyncSource, rtpPkt.Header.SequenceNumber, rtpPkt.Header.Timestamp, rtpPkt.Header.PayloadType, rtpPkt.Header.MarkerBit == 1, rtpPkt.Payload);
        };
        webrtcEndPoint.OnPeerConnectionConnected += async () =>
        {
            await windowsAudioEP.StartAudio();
            await windowsAudioEP.StartAudioSink();
        };
        webrtcEndPoint.OnPeerConnectionClosedOrFailed += async() => await windowsAudioEP.CloseAudio();

        // This will get sent to OpenAI once the WebRTC connection is established. It updates the session
        // that is automatically created by the OpenAI Realtime endpoint.
        var sessionConfig = new SessionConfiguration(
                OPENAI_MODEL,
                voice: OPENAI_VOICE,
                instructions: "Keep it snappy.",
                tools: new List<Tool>());

        var webrtcSession = await webrtcEndPoint.CreateSessionAsync(
            sessionConfig,
            rtcConfiguration: pcConfig);

        // Get the conversation started.
        var responseCreate = new CreateResponseRequest(new(instructions: "Say Hi."));
        await webrtcSession.SendAsync(responseCreate);

        Console.WriteLine("Wait for ctrl-c to indicate user exit.");

        ManualResetEvent exitMre = new(false);
        Console.CancelKeyPress += (_, e) =>
        {
            e.Cancel = true;
            exitMre.Set();
        };
        exitMre.WaitOne();
    }
}

@StephenHodgson StephenHodgson changed the title Initial webrtc cut. Two conversation working. Session update not triggering OpenAI-Dotnet 8.7.0 May 9, 2025
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="SIPSorcery" Version="8.0.14" />
Member

prob need to remove this before publishing.

Author

It won't be possible to use WebRTC without it.

I do understand if you'd prefer to keep dependencies down and that was what I was getting at in the previous discussion.

The alternative would be a new separate package under RageAgainstThePixel or SIPSorcery.

Member

Let me dive into the specifics and see if there is a way to sort this out in a way that makes sense and is easy to use.

Member

@StephenHodgson StephenHodgson May 9, 2025

I have an existing production project that uses SIPSorcery and this library. I'll fiddle with it and see what I can come up with while upgrading it from websockets to WebRTC.


namespace OpenAI.Realtime
{
public sealed class RealtimeSessionWebRTC : IDisposable
Member

Seems very similar to the websocket implementation; I wonder if it can be generalized more.

Author

Yes, it can, and that's the main reason the PR is in draft. There's a lot of reused code. But before refactoring I'd like to know whether adding the SIPSorcery dependency is your preferred approach or not...

@StephenHodgson StephenHodgson added this to the 8.7.0 milestone May 9, 2025
@StephenHodgson StephenHodgson linked an issue May 9, 2025 that may be closed by this pull request
@StephenHodgson
Member

Overall it looks good and clean. Mostly just curious how to tackle the deps w/o locking into a specific library.

@sipsorcery
Author

Overall it looks good and clean. Mostly just curious how to tackle the deps w/o locking into a specific library.

Yes, that's the core question.

There's unlikely to be much point in making a generic WebRTC interface since, to the best of my knowledge, there are no other .NET alternatives that could be plugged in.

So again, the question comes down to either putting the WebRTC + OpenAI-DotNet code into a completely separate package (which so far would total two classes), maybe OpenAI-DotNet-WebRTC, or just taking the hit on the dependencies and adding them to OpenAI-DotNet.

Comment on lines +49 to +54
var toolList = tools?.ToList();

if (string.IsNullOrWhiteSpace(toolChoice))
{
ToolChoice = "auto";
}
Member

This is prob a good fix to make across all the tool usages. I'll check and see if this needs to be ported to other feature areas.

Member

Copilot suggested this was a bug and didn't handle the case correctly anymore:

The refactored control flow in toolChoice handling no longer assigns 'auto' when toolChoice is whitespace, as it did previously. Consider restoring that branch to ensure consistent behavior.

Author

That change means the ToolChoice will now always be assigned "auto" irrespective of whether there are any tool entries or not. The original approach resulted in the request getting rejected by OpenAI, if no tools were specified, with an error that the ToolChoice was not set.

Copilot is prob just missing that the original code still exists but has been moved outside the conditional.
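
To make the reordering concrete, here is a sketch of the control flow being discussed, based on the diff fragments quoted above; the `else` branch and the body of the tools check are assumptions about the surrounding code, not a copy of the actual implementation.

```csharp
// Sketch of the reordered SessionConfiguration tool-choice logic.
var toolList = tools?.ToList();

// Default ToolChoice to "auto" whenever the caller didn't specify one,
// regardless of whether any tools were supplied. Previously this only
// happened inside the tools branch, so a config with no tools left
// tool_choice unset and OpenAI rejected the session update.
if (string.IsNullOrWhiteSpace(toolChoice))
{
    ToolChoice = "auto";
}
else
{
    ToolChoice = toolChoice;  // assumed: caller-supplied value kept as-is
}

if (toolList is { Count: > 0 })
{
    // Tool validation/assignment happens here, as before (assumed).
}
```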

@StephenHodgson StephenHodgson requested a review from Copilot May 10, 2025 18:19
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds WebRTC support for realtime sessions and related endpoint functionality.

  • Updated SessionConfiguration logic for tool choice handling.
  • Introduced RealtimeSessionWebRTC and RealtimeEndpointWebRTC classes with complete SDP negotiation and event handling logic.
  • Added new event response classes and updated client configuration, along with necessary package references.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
OpenAI-DotNet/Realtime/SessionConfiguration.cs Modified tool choice handling logic
OpenAI-DotNet/Realtime/RealtimeSessionWebRTC.cs Added new realtime session implementation using WebRTC
OpenAI-DotNet/Realtime/RealtimeEndpointWebRTC.cs Added endpoint logic for establishing WebRTC sessions and SDP negotiation
OpenAI-DotNet/Realtime/OutputAudioBufferStartedResponse.cs Added new response event class
OpenAI-DotNet/OpenAIClient.cs Added property for the new RealtimeEndpointWebRTC
OpenAI-DotNet/OpenAI-DotNet.csproj Included new package reference to SIPSorcery
OpenAI-DotNet/Extensions/RealtimeServerEventConverter.cs Mapped new “output_audio_buffer.started” server event
Comments suppressed due to low confidence (2)

OpenAI-DotNet/Realtime/RealtimeSessionWebRTC.cs:125

  • Consider returning a Task instead of using 'async void' for the Send method to allow proper error propagation and handling.
public async void Send<T>(T @event) where T : IClientEvent => await SendAsync(@event).ConfigureAwait(false);

OpenAI-DotNet/Realtime/SessionConfiguration.cs:56

  • The refactored control flow in toolChoice handling no longer assigns 'auto' when toolChoice is whitespace, as it did previously. Consider restoring that branch to ensure consistent behavior.
if (toolList is { Count: > 0 })


vbandi commented May 10, 2025

Hi,

I managed to get this running, and I can't wait to use it in my projects. Excellent job!

However, I must ask... is there any tool / function calling support implemented? I couldn't find any, and when I defined some for my test code, they weren't called. (Worse, the AI stopped and waited for the tool's answer).

Also, I couldn't find a way to get transcription out of RealtimeSessionWebRTC (or the endpoint) - for either the user's or the AI's text.

Perhaps I am missing something...

Here's my code - the only actual change is the addition of the tools and the corresponding functions.

using System.Net;
using SIPSorcery.Net;
using SIPSorceryMedia.Windows;
using OpenAI;
using OpenAI.Realtime;

namespace Demo;

class Program
{
    private const string OPENAIKEY_ENVVAR = "OPENAI_API_KEY";
    private const string OPENAI_MODEL = "gpt-4o-realtime-preview-2024-12-17";
    private const string OPENAI_VOICE = "shimmer";

    static async Task Main()
    {
        Console.WriteLine("WebRTC OpenAI Demo Program");

        var openAIKey = Environment.GetEnvironmentVariable(OPENAIKEY_ENVVAR);
        if (string.IsNullOrWhiteSpace(openAIKey))
        {
            Console.Error.WriteLine($"{OPENAIKEY_ENVVAR} environment variable not set, cannot continue.");
            return;
        }

        var pcConfig = new RTCConfiguration
        {
            X_UseRtpFeedbackProfile = true
        };

        var openaiClient = new OpenAIClient(new OpenAIAuthentication(openAIKey));
        var webrtcEndPoint = openaiClient.RealtimeEndpointWebRTC;
        webrtcEndPoint.EnableDebug = true;

        WindowsAudioEndPoint windowsAudioEP = new WindowsAudioEndPoint(webrtcEndPoint.AudioEncoder, -1, -1, false, false);
        windowsAudioEP.SetAudioSinkFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.SetAudioSourceFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.OnAudioSourceEncodedSample += webrtcEndPoint.SendAudio;

        webrtcEndPoint.OnRtpPacketReceived += (IPEndPoint rep, SDPMediaTypesEnum media, RTPPacket rtpPkt) =>
        {
            windowsAudioEP.GotAudioRtp(rep, rtpPkt.Header.SyncSource, rtpPkt.Header.SequenceNumber, rtpPkt.Header.Timestamp, rtpPkt.Header.PayloadType, rtpPkt.Header.MarkerBit == 1, rtpPkt.Payload);
        };
        webrtcEndPoint.OnPeerConnectionConnected += async () =>
        {
            await windowsAudioEP.StartAudio();
            await windowsAudioEP.StartAudioSink();
        };
        webrtcEndPoint.OnPeerConnectionClosedOrFailed += async () => await windowsAudioEP.CloseAudio();



        // This will get sent to OpenAI once the WebRTC connection is established. It updates the session
        // that is automatically created by the OpenAI Realtime endpoint.
        var sessionConfig = new SessionConfiguration(
                OPENAI_MODEL,
                voice: OPENAI_VOICE,
                instructions: "Keep it snappy.",
                tools: 
                [
                    Tool.FromFunc("Add", (int a, int b) => Add(a, b)),
                    Tool.FromFunc("Random", (int min, int max) => Random(min, max))
                ],
                toolChoice: "auto"
                );


        var webrtcSession = await webrtcEndPoint.CreateSessionAsync(
            sessionConfig,
            rtcConfiguration: pcConfig);

        // Get the conversation started.
        var responseCreate = new CreateResponseRequest(new(instructions: "Say Hi."));
        await webrtcSession.SendAsync(responseCreate);

        Console.WriteLine("Wait for ctrl-c to indicate user exit.");

        ManualResetEvent exitMre = new(false);
        Console.CancelKeyPress += (_, e) =>
        {
            e.Cancel = true;
            exitMre.Set();
        };
        exitMre.WaitOne();
    }

    public static int Add(int a, int b)
    {
        var result = a + b;
        Console.WriteLine($"Add({a}, {b}): {result}");
        return result;
    }

    public static int Random(int min, int max)
    {
        // Compute the result first so the log line shows the drawn number,
        // not the Random object itself.
        var result = new Random().Next(min, max);
        Console.WriteLine($"Random({min}, {max}): {result}");
        return result;
    }

}

@StephenHodgson
Member

Here's my code - the only actual change is the addition of the tools and the corresponding functions.

Yeah, I think we need to use the same unit tests from the websocket implementation as a baseline, since they test tools and bi-directional client messages.

@sipsorcery
Author

sipsorcery commented May 11, 2025

However, I must ask... is there any tool / function calling support implemented? I couldn't find any, and when I defined some for my test code, they weren't called. (Worse, the AI stopped and waited for the tool's answer).

There is nothing included in this PR (it was intended to gauge the appetite for merging rather than being fully formed). That being said, I have done some initial mucking around with local functions and it does mostly work, see demo here. There are a few challenges in getting the calls right. The OpenAI docs don't cover function calling over data channels at all, and while it seems mostly equivalent to the HTTP approach, there are some discrepancies.

The WebRTC peer connection will typically get dropped if a data channel message is sent to OpenAI that it doesn't understand or like. That could be what's happening in your case, although since you don't seem to be sending any new messages, I'm guessing it could be a delay in the initial session update, or something else entirely. In my testing the WebRTC connections to OpenAI are nice and fast and clean. Once the format of the data channel messages is worked out, my experience has been that they are very stable.

Also, I couldn't find a way to get transcription out of RealtimeSessionWebRTC (or the endpoint) - for either the user's or the AI's text.

Again, not wired up in this PR, and it would take a bit of re-architecting since it's based on the request/response approach used by the existing websocket implementation.

Apart from that it's super easy to do and is so useful. It's just a matter of catching the required JSON message types on the data channel and doing something with them.

    private void OnDataChannelMessage(RTCDataChannel dc, DataChannelPayloadProtocols protocol, byte[] data)
    {
        //logger.LogInformation($"Data channel {dc.label}, protocol {protocol} message length {data.Length}.");

        var serverEventModel = OpenAIDataChannelManager.ParseDataChannelMessage(data);
        serverEventModel.IfSome(e =>
        {
            if (e is OpenAIResponseAudioTranscriptDone done)
            {
                _logger.LogInformation($"Transcript done: {done.Transcript}");
            }
        });
    }
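
Wiring that handler up is then just a matter of subscribing to the data channel's message event; a sketch, assuming `dataChannel` is the SIPSorcery `RTCDataChannel` negotiated with OpenAI:

```csharp
// The handler signature above matches SIPSorcery's onmessage delegate
// (RTCDataChannel, DataChannelPayloadProtocols, byte[]).
dataChannel.onmessage += OnDataChannelMessage;
```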

Here's a transcript where I blow up the AI by asking it to count to 30 in three different languages, switching language on each number. It fails every time, even after I tell it where it's going wrong.

[16:39:54 INF] Transcript done: Hi, I'm ChatGPT, an AI developed by OpenAI. I can assist you with information, answer questions, and help with various tasks. How can I help you today?
[16:40:14 INF] Transcript done: Un, zwei, 三, vier, 五, fünf, 六, sieben, 八, neun, 十.
[16:40:37 INF] Transcript done: Un, eins, 一, deux, zwei, 二, trois, drei, 三, quatre, vier, 四, cinq, fünf, 五, six, sechs, 六, sept, sieben, 七, huit, acht, 八, neuf, neun, 九, dix.
[16:41:09 INF] Transcript done: Un, eins, 一, deux, zwei, 二, 三, trois, drei, 三, 四, quatre, vier, 四, 五, cinq, fünf, 五, 六, six, sechs, 六, sept, sieben, 七, huit, acht, 八, neuf, neun, 九, dix, zehn, 十.
[16:41:36 INF] Transcript done: It seems like I got a bit mixed up there. Let's try that again with just one language per number:
Un, eins, 一, deux, zwei, 二, trois, drei, 三, quatre, vier, 四, cinq, fünf, 五, six, sechs, 六, sept, sieben, 七, huit, acht, 八, neuf, neun, 九, dix, zehn, 十.
[16:42:14 INF] Transcript done: Sure, let's do that:

French: Un, deux, trois, quatre, cinq, six, sept, huit, neuf, dix.

German: Eins, zwei, drei, vier, fünf, sechs, sieben, acht, neun, zehn.

Chinese: 一, 二, 三, 四, 五, 六, 七, 八, 九, 十.

@StephenHodgson StephenHodgson changed the title OpenAI-Dotnet 8.7.0 OpenAI-Dotnet 8.x May 23, 2025
@StephenHodgson StephenHodgson modified the milestones: 8.7.0, 8.x Jul 23, 2025

Development

Successfully merging this pull request may close these issues.

Add Realtime WebRTC Support
