[GSoC 2026] Project #11: Hands-Free Multimodal Voice Mode (Technical Proposal & Prototype) #22470
Hey @psinha40898, in need of your guidance again for this.
Subject: GSoC 2026 Interest and Introduction - Harshit Agrawal

Hi everyone,

I’m Harshit Agrawal, a 3rd-year Computer Science Engineering student at Delhi Technological University (DTU). I’ve been following the development of the Gemini CLI and am highly interested in contributing as part of GSoC 2026. I have a strong interest in AI agentic workflows and local execution. I’ve already spent some time exploring the repository, specifically looking into how the agent handles task execution within `packages/core` and `packages/cli`.

I’m particularly drawn to Project #2 (Behavioral Evaluation) because I believe robust benchmarking is critical for moving agents from experimental tools to reliable production assistants. I’m curious whether the team has specific priorities regarding the initial benchmark categories, or whether there are any existing "help wanted" issues that would be good for a new contributor to tackle to get a better feel for the agent's reasoning logic.

Looking forward to discussing this further!

Best regards,
Harshit Agrawal
[GSoC 2026] Architecture & Prototype: Hands-Free Multimodal Voice Mode (Project #11)
Hi @bdmorgan, @jacob314, and the Gemini CLI Team,
I am Rithvick Kumar, a final-year student at NIT Kurukshetra and an active developer in the AI/ML space. Drawing on a technical foundation in distributed systems and real-time streaming, my proposal for Project #11: Hands-Free Multimodal Voice Mode focuses on creating a seamless, interruptible voice interface for the terminal. I plan to implement the orchestration layer for low-latency bidirectional PCM streaming, alongside a robust 'barge-in' capability, ensuring that the Gemini CLI can not only hear the user but engage in a true, real-time multimodal dialogue.
While exploring `packages/core` and the `@google/genai` Live API SDK documentation, it became evident that simply maintaining an open audio WebSocket is not the primary technical hurdle. The true challenge lies in concurrent stream orchestration: handling non-deterministic out-of-order events, instantaneous user interruptions (barge-in), and concurrent tool invocations without corrupting the CLI's internal state.

To validate my approach before submitting the final proposal, I have built a standalone, production-grade orchestration engine in TypeScript.
Prototype Repository: https://github.com/Rithvickkr/Gemin-CLI-voice-mode-orchestrator/tree/master/Voice-Mode-Orchestrator
Deep Technical Architecture
My orchestration layer passes 192 strict mode tests and successfully simulates the three most critical voice scenarios (normal execution, standard barge-in, and mid-tool cancellation) without requiring an active API key.
The architecture consists of five core pillars:
1. Deterministic State Management (10-State FSM)
Standard voice implementations often default to a simplistic 4-state loop (Listening, Thinking, Speaking, Executing). However, real-world networks exhibit handshake latency and dropped packets. My architecture utilizes a strict 10-state Finite State Machine (`IDLE`, `CONNECTING`, `LISTENING`, `SPEECH_DETECTED`, `PROCESSING`, `SPEAKING`, `INTERRUPTED`, `TOOL_EXECUTING`, `RECONNECTING`, `ERROR`). Crucially, the transition logic operates via a mathematically pure `resolve(state, event)` function. This guarantees that stray WebSocket events arriving out of order cannot mutate the state incorrectly.
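A condensed sketch of the pattern; the state names mirror the list above, but the event vocabulary is an illustrative placeholder rather than the prototype's exact API:

```ts
type VoiceState =
  | "IDLE" | "CONNECTING" | "LISTENING" | "SPEECH_DETECTED" | "PROCESSING"
  | "SPEAKING" | "INTERRUPTED" | "TOOL_EXECUTING" | "RECONNECTING" | "ERROR";

type VoiceEvent =
  | "CONNECT" | "SOCKET_OPEN" | "SPEECH_START" | "TURN_COMPLETE"
  | "MODEL_AUDIO" | "USER_BARGE_IN" | "PLAYBACK_FLUSHED" | "TOOL_CALL"
  | "TOOL_DONE" | "SOCKET_DROP" | "FATAL";

// Pure transition function: the same (state, event) pair always yields the
// same next state, and unrecognized pairs are no-ops.
function resolve(state: VoiceState, event: VoiceEvent): VoiceState {
  // Global transitions: socket loss and fatal errors win everywhere,
  // except when idle or already failed.
  if (event === "SOCKET_DROP" && state !== "IDLE" && state !== "ERROR") {
    return "RECONNECTING";
  }
  if (event === "FATAL") return "ERROR";
  switch (state) {
    case "IDLE":
      return event === "CONNECT" ? "CONNECTING" : state;
    case "CONNECTING":
    case "RECONNECTING":
      return event === "SOCKET_OPEN" ? "LISTENING" : state;
    case "LISTENING":
      return event === "SPEECH_START" ? "SPEECH_DETECTED" : state;
    case "SPEECH_DETECTED":
      return event === "TURN_COMPLETE" ? "PROCESSING" : state;
    case "PROCESSING":
      if (event === "MODEL_AUDIO") return "SPEAKING";
      if (event === "TOOL_CALL") return "TOOL_EXECUTING";
      return state;
    case "SPEAKING":
      return event === "USER_BARGE_IN" ? "INTERRUPTED" : state;
    case "TOOL_EXECUTING":
      if (event === "USER_BARGE_IN") return "INTERRUPTED";
      if (event === "TOOL_DONE") return "PROCESSING";
      return state;
    case "INTERRUPTED":
      return event === "PLAYBACK_FLUSHED" ? "LISTENING" : state;
    case "ERROR":
      return state;
  }
}
```

Because unrecognized pairs are no-ops, a stray `MODEL_AUDIO` frame arriving after a barge-in cannot push the machine back into `SPEAKING`.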
2. Context Synchronization (SessionBridge)

Voice mode should act as a continuous augmentation of the user's workflow, not an isolated session. The SessionBridge module facilitates bidirectional context handoffs. Prior to connecting the Live API socket, it condenses the active terminal text history (`getHistory()`) to bootstrap the voice session's memory. Conversely, when the voice session terminates, the bridge synthesizes a structured summary (e.g., "[Voice session summary - 10 turns, 3 tool calls]") and injects it back into the CLI history buffer.
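A simplified sketch of the handoff; the `Turn` shape and summarization logic are reduced for illustration:

```ts
interface Turn {
  role: "user" | "model";
  text: string;
}

class SessionBridge {
  // Condense recent terminal history into a bootstrap prompt before the
  // Live API socket is opened.
  bootstrapContext(history: Turn[], maxTurns = 20): string {
    return history
      .slice(-maxTurns)
      .map((t) => `${t.role}: ${t.text}`)
      .join("\n");
  }

  // After the voice session ends, synthesize a structured summary turn
  // for injection back into the CLI history buffer.
  summarize(voiceTurns: Turn[], toolCalls: number): Turn {
    return {
      role: "model",
      text: `[Voice session summary - ${voiceTurns.length} turns, ${toolCalls} tool calls]`,
    };
  }
}
```

Injecting a summary rather than the raw transcript keeps the text session aware of what happened by voice without flooding its context window.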
3. AbortController-Gated Interruption (Barge-In)

When a user barges in, sending a `turn_complete: false` signal to the Live API is insufficient if the client is concurrently executing a long-running tool (e.g., a massive file system search or `npm install`). The InterruptHandler instantly halts the AudioDriver's playback queue and immediately triggers an AbortController attached to all in-flight tools. This prevents "ghost executions" of tools after the user has changed their intent. The interrupted audio transcript is committed with a precise cut-off marker to maintain an accurate internal log.
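The gating mechanism, reduced to its essentials; `AudioDriver` and `runTool` here stand in for the real abstractions:

```ts
interface AudioDriver {
  stopPlayback(): void;
}

class InterruptHandler {
  private controller = new AbortController();

  // Tools grab the current signal when they start executing.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // On barge-in: halt playback first (lowest perceived latency), then
  // abort every in-flight tool so nothing "ghost executes" after the
  // user has changed their intent.
  bargeIn(driver: AudioDriver): void {
    driver.stopPlayback();
    this.controller.abort(new Error("user barge-in"));
    this.controller = new AbortController(); // fresh signal for the next turn
  }
}

// A cooperative long-running tool checks the signal between units of work.
async function runTool(signal: AbortSignal): Promise<void> {
  for (let i = 0; i < 1000; i++) {
    if (signal.aborted) throw signal.reason;
    await new Promise((resolve) => setTimeout(resolve, 10)); // one unit of work
  }
}
```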
4. Resilient Network Polling (CircuitBreaker)

To handle transient connection failures gracefully, the connection layer sits behind a standard CircuitBreaker implementation. It transitions between `CLOSED`, `OPEN`, and `HALF_OPEN` states based on consecutive failure thresholds. This prevents the CLI from spamming the Live API with reconnection requests (the thundering herd problem) when the user's local network drops out.
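A minimal version of the breaker; the threshold and cooldown values are illustrative rather than the tuned values in the prototype:

```ts
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 10_000,
  ) {}

  // Usage: const socket = await breaker.exec(() => openLiveSocket());
  async exec<T>(connect: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: refusing to reconnect yet");
      }
      this.state = "HALF_OPEN"; // cooldown elapsed: allow one probe
    }
    try {
      const result = await connect();
      this.state = "CLOSED"; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```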
5. Pluggable Audio Fallbacks

Relying strictly on Native Node-API C++ bindings (such as `naudiodon`) introduces a high point of failure during `npm install -g @google/gemini-cli` for Windows users lacking build tools. My architecture defines an AudioDriver interface. The system attempts to dynamically load low-latency native bindings first, but automatically falls back to a `SoxPipeDriver` (spawning independent shell processes) if bindings are unavailable, guaranteeing cross-platform installation success.
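The fallback path in sketch form, assuming `sox` is available on the PATH; `native-pcm-driver` is a placeholder name for the hypothetical native binding, not a real package:

```ts
import { spawn, type ChildProcess } from "node:child_process";

interface AudioDriver {
  start(onChunk: (pcm: Buffer) => void): void;
  stop(): void;
}

class SoxPipeDriver implements AudioDriver {
  private proc: ChildProcess | null = null;

  start(onChunk: (pcm: Buffer) => void): void {
    // Capture the default input device as 16 kHz mono 16-bit signed PCM
    // and stream it to stdout.
    this.proc = spawn("sox", [
      "-d", "-t", "raw", "-r", "16000", "-c", "1",
      "-b", "16", "-e", "signed-integer", "-",
    ]);
    this.proc.stdout?.on("data", onChunk);
  }

  stop(): void {
    this.proc?.kill();
  }
}

const NATIVE_MODULE: string = "native-pcm-driver"; // hypothetical binding

async function loadDriver(): Promise<AudioDriver> {
  try {
    // import() rejects cleanly if the binding was never compiled
    // (e.g., missing build tools on Windows), so install never breaks.
    const mod = await import(NATIVE_MODULE);
    return new mod.NativeDriver() as AudioDriver;
  } catch {
    return new SoxPipeDriver(); // shell-process fallback
  }
}
```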
I have mapped exactly how these modules interface with the existing `ToolRegistry` and `GeminiClient` abstractions in the `gemini-cli-integration.ts` file within the repository.

Feedback Wanted
I am looking for feedback from the maintainers on two specific architectural decisions:

- Is a full CircuitBreaker implementation appropriate for the connection layer in `packages/core`, or does the team prefer a simpler exponential backoff algorithm?

Thank you for your time and guidance. I am eager to iterate on this architecture based on your feedback.
Best,
Rithvick Kumar