Conversation


@sipsorcery sipsorcery commented May 9, 2025

  • Added support for WebRTC Realtime Sessions

@sipsorcery
Author

sipsorcery commented May 9, 2025

This PR is currently able to establish a conversation with the OpenAI Realtime WebRTC endpoint. Sample program below.

@StephenHodgson my main intent with the PR for now is to see whether it's palatable to you. I haven't intruded too much into the existing code. It was necessary to tweak the SessionConfiguration to avoid getting an error response from OpenAI.

The sample below needs the following packages and a Windows-specific target such as <TargetFramework>net8.0-windows10.0.17763.0</TargetFramework>.

dotnet add package SIPSorcery
dotnet add package SIPSorceryMedia.Windows
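
For reference, a minimal project file for the sample might look like the sketch below. The SIPSorcery version matches the one referenced elsewhere in this PR; the SIPSorceryMedia.Windows version is an assumption, so check NuGet for the current releases.

```xml
<Project Sdk="Microsoft.NET.Sdk">
  <PropertyGroup>
    <OutputType>Exe</OutputType>
    <!-- Windows-specific target required by SIPSorceryMedia.Windows -->
    <TargetFramework>net8.0-windows10.0.17763.0</TargetFramework>
  </PropertyGroup>
  <ItemGroup>
    <!-- Versions are illustrative; use the latest from NuGet -->
    <PackageReference Include="SIPSorcery" Version="8.0.14" />
    <PackageReference Include="SIPSorceryMedia.Windows" Version="8.0.0" />
  </ItemGroup>
</Project>
```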

Update: Minor update to add missing response create call to get the conversation started. All working nicely now.

using System;
using System.Net;
using System.Threading;
using System.Threading.Tasks;
using SIPSorcery.Net;
using SIPSorceryMedia.Windows;
using OpenAI;
using OpenAI.Realtime;
using System.Collections.Generic;

namespace demo;

class Program
{
    private const string OPENAIKEY_ENVVAR = "OPENAIKEY";
    private const string OPENAI_MODEL = "gpt-4o-realtime-preview-2024-12-17";
    private const string OPENAI_VOICE = "shimmer";

    static async Task Main()
    {
        Console.WriteLine("WebRTC OpenAI Demo Program");

        var openAIKey = Environment.GetEnvironmentVariable(OPENAIKEY_ENVVAR);
        if (string.IsNullOrWhiteSpace(openAIKey))
        {
            Console.Error.WriteLine($"{OPENAIKEY_ENVVAR} environment variable not set, cannot continue.");
            return;
        }

        var pcConfig = new RTCConfiguration
        {
            X_UseRtpFeedbackProfile = true
        };

        var openaiClient = new OpenAIClient(new OpenAIAuthentication(openAIKey));
        var webrtcEndPoint = openaiClient.RealtimeEndpointWebRTC;
        webrtcEndPoint.EnableDebug = true;

        WindowsAudioEndPoint windowsAudioEP = new WindowsAudioEndPoint(webrtcEndPoint.AudioEncoder, -1, -1, false, false);
        windowsAudioEP.SetAudioSinkFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.SetAudioSourceFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.OnAudioSourceEncodedSample += webrtcEndPoint.SendAudio;

        webrtcEndPoint.OnRtpPacketReceived += (IPEndPoint rep, SDPMediaTypesEnum media, RTPPacket rtpPkt) =>
        {
            windowsAudioEP.GotAudioRtp(rep, rtpPkt.Header.SyncSource, rtpPkt.Header.SequenceNumber, rtpPkt.Header.Timestamp, rtpPkt.Header.PayloadType, rtpPkt.Header.MarkerBit == 1, rtpPkt.Payload);
        };
        webrtcEndPoint.OnPeerConnectionConnected += async () =>
        {
            await windowsAudioEP.StartAudio();
            await windowsAudioEP.StartAudioSink();
        };
        webrtcEndPoint.OnPeerConnectionClosedOrFailed += async() => await windowsAudioEP.CloseAudio();

        // This will get sent to OpenAI once the WebRTC connection is established. It updates the session
        // that is automatically created by the OpenAI Realtime endpoint.
        var sessionConfig = new SessionConfiguration(
                OPENAI_MODEL,
                voice: OPENAI_VOICE,
                instructions: "Keep it snappy.",
                tools: new List<Tool>());

        var webrtcSession = await webrtcEndPoint.CreateSessionAsync(
            sessionConfig,
            rtcConfiguration: pcConfig);

        // Get the conversation started.
        var responseCreate = new CreateResponseRequest(new(instructions: "Say Hi."));
        await webrtcSession.SendAsync(responseCreate);

        Console.WriteLine("Wait for ctrl-c to indicate user exit.");

        ManualResetEvent exitMre = new(false);
        Console.CancelKeyPress += (_, e) =>
        {
            e.Cancel = true;
            exitMre.Set();
        };
        exitMre.WaitOne();
    }
}

@StephenHodgson StephenHodgson changed the title Initial webrtc cut. Two conversation working. Session update not triggering OpenAI-Dotnet 8.7.0 May 9, 2025
<PrivateAssets>all</PrivateAssets>
<IncludeAssets>runtime; build; native; contentfiles; analyzers; buildtransitive</IncludeAssets>
</PackageReference>
<PackageReference Include="SIPSorcery" Version="8.0.14" />
Member

prob need to remove this before publishing.

Author

It won't be possible to use WebRTC without it.

I do understand if you'd prefer to keep dependencies down and that was what I was getting at in the previous discussion.

The alternative would be a new separate package under RageAgainstThePixel or SIPSorcery.

Member

Let me dive into the specifics and see if there is a way to sort this out in a way that makes sense and is easy to use.

Member

@StephenHodgson StephenHodgson May 9, 2025

I have an existing production project that uses SIPSorcery and this library. I'll fiddle with it and see what I can come up with while upgrading it from websockets to WebRTC.


namespace OpenAI.Realtime
{
public sealed class RealtimeSessionWebRTC : IDisposable
Member

Seems very similar to the websocket implementation; I wonder if it can be generalized more.

Author

Yes, it can, and that's the main reason the PR is in draft. There's a lot of reused code. But before refactoring I'd like to know whether adding the SIPSorcery dependency is your preferred approach or not...

@StephenHodgson StephenHodgson added this to the 8.7.0 milestone May 9, 2025
@StephenHodgson StephenHodgson linked an issue May 9, 2025 that may be closed by this pull request
@StephenHodgson
Member

Overall it looks good and clean. Mostly just curious how to tackle the deps w/o locking into a specific library.

@sipsorcery
Author

Overall it looks good and clean. Mostly just curious how to tackle the deps w/o locking into a specific library.

Yes, that's the core question.

There's unlikely to be much point in making a generic WebRTC interface since, to the best of my knowledge, there are no other .NET alternatives that could be plugged in.

So again, the question comes down to either putting the WebRTC + OpenAI-DotNet code into a completely separate package (which so far would total two classes), maybe OpenAI-DotNet-WebRTC, or just taking the hit on the dependencies and adding them to OpenAI-DotNet.

Comment on lines +49 to +54
var toolList = tools?.ToList();

if (string.IsNullOrWhiteSpace(toolChoice))
{
ToolChoice = "auto";
}
Member

This is prob a good fix to make across all the tool usages. I'll check and see if this needs to be ported to other feature areas.

Member

Copilot suggested this was a bug and didn't handle the case correctly anymore:

The refactored control flow in toolChoice handling no longer assigns 'auto' when toolChoice is whitespace, as it did previously. Consider restoring that branch to ensure consistent behavior.

Author

That change means the ToolChoice will now always be assigned "auto" irrespective of whether there are any tool entries or not. The original approach resulted in the request getting rejected by OpenAI, if no tools were specified, with an error that the ToolChoice was not set.

Copilot is prob just missing that the original code still exists but has been moved outside the conditional.
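
To make the reordering concrete, here is a sketch of the control flow being discussed, based on the diff fragments quoted above; the `else` branch and the body of the tools check are assumptions about the surrounding code, not a copy of the actual implementation.

```csharp
// Sketch of the reordered SessionConfiguration tool-choice logic.
var toolList = tools?.ToList();

// Default ToolChoice to "auto" whenever the caller didn't specify one,
// regardless of whether any tools were supplied. Previously this only
// happened inside the tools branch, so a config with no tools left
// tool_choice unset and OpenAI rejected the session update.
if (string.IsNullOrWhiteSpace(toolChoice))
{
    ToolChoice = "auto";
}
else
{
    ToolChoice = toolChoice;  // assumed: caller-supplied value kept as-is
}

if (toolList is { Count: > 0 })
{
    // Tool validation/assignment happens here, as before (assumed).
}
```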

@StephenHodgson StephenHodgson requested a review from Copilot May 10, 2025 18:19
@Copilot Copilot AI left a comment

Pull Request Overview

This PR adds WebRTC support for realtime sessions and related endpoint functionality.

  • Updated SessionConfiguration logic for tool choice handling.
  • Introduced RealtimeSessionWebRTC and RealtimeEndpointWebRTC classes with complete SDP negotiation and event handling logic.
  • Added new event response classes and updated client configuration, along with necessary package references.

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated no comments.

Show a summary per file
File Description
OpenAI-DotNet/Realtime/SessionConfiguration.cs Modified tool choice handling logic
OpenAI-DotNet/Realtime/RealtimeSessionWebRTC.cs Added new realtime session implementation using WebRTC
OpenAI-DotNet/Realtime/RealtimeEndpointWebRTC.cs Added endpoint logic for establishing WebRTC sessions and SDP negotiation
OpenAI-DotNet/Realtime/OutputAudioBufferStartedResponse.cs Added new response event class
OpenAI-DotNet/OpenAIClient.cs Added property for the new RealtimeEndpointWebRTC
OpenAI-DotNet/OpenAI-DotNet.csproj Included new package reference to SIPSorcery
OpenAI-DotNet/Extensions/RealtimeServerEventConverter.cs Mapped new “output_audio_buffer.started” server event
Comments suppressed due to low confidence (2)

OpenAI-DotNet/Realtime/RealtimeSessionWebRTC.cs:125

  • Consider returning a Task instead of using 'async void' for the Send method to allow proper error propagation and handling.
public async void Send<T>(T @event) where T : IClientEvent => await SendAsync(@event).ConfigureAwait(false);

OpenAI-DotNet/Realtime/SessionConfiguration.cs:56

  • The refactored control flow in toolChoice handling no longer assigns 'auto' when toolChoice is whitespace, as it did previously. Consider restoring that branch to ensure consistent behavior.
if (toolList is { Count: > 0 })


vbandi commented May 10, 2025

Hi,

I managed to get this running, and I can't wait to use it in my projects. Excellent job!

However, I must ask... is there any tool / function calling support implemented? I couldn't find any, and when I defined some for my test code, they weren't called. (Worse, the AI stopped and waited for the tool's answer).

Also, I couldn't find a way to get transcription out of RealtimeSessionWebRTC (or the endpoint) - for either the user's or the AI's text.

Perhaps I am missing something...

Here's my code - the only actual change is the addition of the tools and the corresponding functions.

using System.Net;
using SIPSorcery.Net;
using SIPSorceryMedia.Windows;
using OpenAI;
using OpenAI.Realtime;

namespace Demo;

class Program
{
    private const string OPENAIKEY_ENVVAR = "OPENAI_API_KEY";
    private const string OPENAI_MODEL = "gpt-4o-realtime-preview-2024-12-17";
    private const string OPENAI_VOICE = "shimmer";

    static async Task Main()
    {
        Console.WriteLine("WebRTC OpenAI Demo Program");

        var openAIKey = Environment.GetEnvironmentVariable(OPENAIKEY_ENVVAR);
        if (string.IsNullOrWhiteSpace(openAIKey))
        {
            Console.Error.WriteLine($"{OPENAIKEY_ENVVAR} environment variable not set, cannot continue.");
            return;
        }

        var pcConfig = new RTCConfiguration
        {
            X_UseRtpFeedbackProfile = true
        };

        var openaiClient = new OpenAIClient(new OpenAIAuthentication(openAIKey));
        var webrtcEndPoint = openaiClient.RealtimeEndpointWebRTC;
        webrtcEndPoint.EnableDebug = true;

        WindowsAudioEndPoint windowsAudioEP = new WindowsAudioEndPoint(webrtcEndPoint.AudioEncoder, -1, -1, false, false);
        windowsAudioEP.SetAudioSinkFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.SetAudioSourceFormat(webrtcEndPoint.AudioFormat);
        windowsAudioEP.OnAudioSourceEncodedSample += webrtcEndPoint.SendAudio;

        webrtcEndPoint.OnRtpPacketReceived += (IPEndPoint rep, SDPMediaTypesEnum media, RTPPacket rtpPkt) =>
        {
            windowsAudioEP.GotAudioRtp(rep, rtpPkt.Header.SyncSource, rtpPkt.Header.SequenceNumber, rtpPkt.Header.Timestamp, rtpPkt.Header.PayloadType, rtpPkt.Header.MarkerBit == 1, rtpPkt.Payload);
        };
        webrtcEndPoint.OnPeerConnectionConnected += async () =>
        {
            await windowsAudioEP.StartAudio();
            await windowsAudioEP.StartAudioSink();
        };
        webrtcEndPoint.OnPeerConnectionClosedOrFailed += async () => await windowsAudioEP.CloseAudio();



        // This will get sent to OpenAI once the WebRTC connection is established. It updates the session
        // that is automatically created by the OpenAI Realtime endpoint.
        var sessionConfig = new SessionConfiguration(
                OPENAI_MODEL,
                voice: OPENAI_VOICE,
                instructions: "Keep it snappy.",
                tools: 
                [
                    Tool.FromFunc("Add", (int a, int b) => Add(a, b)),
                    Tool.FromFunc("Random", (int min, int max) => Random(min, max))
                ],
                toolChoice: "auto"
                );


        var webrtcSession = await webrtcEndPoint.CreateSessionAsync(
            sessionConfig,
            rtcConfiguration: pcConfig);

        // Get the conversation started.
        var responseCreate = new CreateResponseRequest(new(instructions: "Say Hi."));
        await webrtcSession.SendAsync(responseCreate);

        Console.WriteLine("Wait for ctrl-c to indicate user exit.");

        ManualResetEvent exitMre = new(false);
        Console.CancelKeyPress += (_, e) =>
        {
            e.Cancel = true;
            exitMre.Set();
        };
        exitMre.WaitOne();
    }

    public static int Add(int a, int b)
    {
        var result = a + b;
        Console.WriteLine($"Add({a}, {b}): {result}");
        return result;
    }

    public static int Random(int min, int max)
    {
        // Compute the result first so the log line shows the drawn number,
        // not the Random object itself.
        var result = new Random().Next(min, max);
        Console.WriteLine($"Random({min}, {max}): {result}");
        return result;
    }

}

@StephenHodgson
Member

Here's my code - the only actual change is the addition of the tools and the corresponding functions.

Yeah, I think we need to use the same unit tests from the websocket implementation as a baseline, since they test tools and bi-directional client messages.

@sipsorcery
Author

sipsorcery commented May 11, 2025

However, I must ask... is there any tool / function calling support implemented? I couldn't find any, and when I defined some for my test code, they weren't called. (Worse, the AI stopped and waited for the tool's answer).

There is nothing included in this PR (it was intended to gauge the appetite for merging rather than being fully formed). That being said, I have done some initial mucking around with local functions and it does mostly work, see demo here. There are a few challenges in getting the calls right. The OpenAI docs don't cover function calling over data channels at all, and while it seems mostly equivalent to the HTTP approach, there are some discrepancies.

The WebRTC peer connection will typically get dropped if a data channel message is sent to OpenAI that it doesn't understand or like. That could be what's happening in your case, although since you don't seem to be sending any new messages, I'm guessing it could be a delay in the initial session update, or something else entirely. In my testing the WebRTC connections to OpenAI are nice and fast and clean. Once the format of the data channel messages is worked out, my experience has been that they are very stable.

Also, I couldn't find a way to get transcription out of RealtimeSessionWebRTC (or the endpoint) - for either the user's or the AI's text.

Again, not wired up in this PR, and it would take a bit of re-architecting since it's based on the request/response approach used by the existing websocket implementation.

Apart from that it's super easy to do and is so useful. It's just a matter of catching the required JSON message types on the data channel and doing something with them.

    private void OnDataChannelMessage(RTCDataChannel dc, DataChannelPayloadProtocols protocol, byte[] data)
    {
        //logger.LogInformation($"Data channel {dc.label}, protocol {protocol} message length {data.Length}.");

        var serverEventModel = OpenAIDataChannelManager.ParseDataChannelMessage(data);
        serverEventModel.IfSome(e =>
        {
            if (e is OpenAIResponseAudioTranscriptDone done)
            {
                _logger.LogInformation($"Transcript done: {done.Transcript}");
            }
        });
    }
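
Wiring that handler up is then just a matter of subscribing to the data channel's message event; a sketch, assuming `dataChannel` is the SIPSorcery `RTCDataChannel` negotiated with OpenAI:

```csharp
// The handler signature above matches SIPSorcery's onmessage delegate
// (RTCDataChannel, DataChannelPayloadProtocols, byte[]).
dataChannel.onmessage += OnDataChannelMessage;
```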

Here's a transcript where I blow up the AI by asking it to count to 30 in three different languages, switching language on each number. It fails every time, even after I tell it where it's going wrong.

[16:39:54 INF] Transcript done: Hi, I'm ChatGPT, an AI developed by OpenAI. I can assist you with information, answer questions, and help with various tasks. How can I help you today?
[16:40:14 INF] Transcript done: Un, zwei, 三, vier, 五, fünf, 六, sieben, 八, neun, 十.
[16:40:37 INF] Transcript done: Un, eins, 一, deux, zwei, 二, trois, drei, 三, quatre, vier, 四, cinq, fünf, 五, six, sechs, 六, sept, sieben, 七, huit, acht, 八, neuf, neun, 九, dix.
[16:41:09 INF] Transcript done: Un, eins, 一, deux, zwei, 二, 三, trois, drei, 三, 四, quatre, vier, 四, 五, cinq, fünf, 五, 六, six, sechs, 六, sept, sieben, 七, huit, acht, 八, neuf, neun, 九, dix, zehn, 十.
[16:41:36 INF] Transcript done: It seems like I got a bit mixed up there. Let's try that again with just one language per number:
Un, eins, 一, deux, zwei, 二, trois, drei, 三, quatre, vier, 四, cinq, fünf, 五, six, sechs, 六, sept, sieben, 七, huit, acht, 八, neuf, neun, 九, dix, zehn, 十.
[16:42:14 INF] Transcript done: Sure, let's do that:

French: Un, deux, trois, quatre, cinq, six, sept, huit, neuf, dix.

German: Eins, zwei, drei, vier, fünf, sechs, sieben, acht, neun, zehn.

Chinese: 一, 二, 三, 四, 五, 六, 七, 八, 九, 十.

@StephenHodgson StephenHodgson changed the title OpenAI-Dotnet 8.7.0 OpenAI-Dotnet 8.x May 23, 2025
@StephenHodgson StephenHodgson modified the milestones: 8.7.0, 8.x Jul 23, 2025

Development

Successfully merging this pull request may close these issues.

Add Realtime WebRTC Support
