[GSoC 2026] Project #11: Hands-Free Multimodal Voice Mode (Technical Proposal & Prototype) #22470
Hey @psinha40898, in need of your guidance again for this.
Subject: GSoC 2026 Interest and Introduction - Harshit Agrawal

Hi everyone,

I’m Harshit Agrawal, a 3rd-year Computer Science Engineering student at Delhi Technological University (DTU). I’ve been following the development of the Gemini CLI and am highly interested in contributing as part of GSoC 2026. I have a strong interest in AI agentic workflows and local execution. I’ve already spent some time exploring the repository, specifically looking into how the agent handles task execution within `packages/core` and `packages/cli`.

I’m particularly drawn to Project #2 (Behavioral Evaluation) because I believe robust benchmarking is critical for moving agents from experimental tools to reliable production assistants. I’m curious whether the team has specific priorities regarding the initial benchmark categories, or whether there are any existing "help wanted" issues that would be good for a new contributor to tackle to get a better feel for the agent's reasoning logic.

Looking forward to discussing this further!

Best regards,
Harshit Agrawal
[GSoC 2026] Architecture & Prototype: Hands-Free Multimodal Voice Mode (Project #11)
Hi @bdmorgan, @jacob314, and the Gemini CLI Team,
I am Rithvick Kumar, a final-year student at NIT Kurukshetra and an active developer in the AI/ML space. Drawing on a technical foundation in distributed systems and real-time streaming, my proposal for Project #11: Hands-Free Multimodal Voice Mode focuses on creating a seamless, interruptible voice interface for the terminal. I plan to implement the orchestration layer for low-latency bidirectional PCM streaming, alongside a robust 'barge-in' capability, ensuring that the Gemini CLI can not only hear the user but engage in a true, real-time multimodal dialogue.
While exploring `packages/core` and the `@google/genai` Live API SDK documentation, it became evident that simply maintaining an open audio WebSocket is not the primary technical hurdle. The true challenge lies in concurrent stream orchestration: handling non-deterministic out-of-order events, instantaneous user interruptions (barge-in), and concurrent tool invocations without corrupting the CLI's internal state.

To validate my approach before submitting the final proposal, I have built a standalone, production-grade orchestration engine in TypeScript.
Prototype Repository: https://github.com/Rithvickkr/Gemin-CLI-voice-mode-orchestrator/tree/master/Voice-Mode-Orchestrator
Deep Technical Architecture
My orchestration layer passes 192 strict mode tests and successfully simulates the three most critical voice scenarios (normal execution, standard barge-in, and mid-tool cancellation) without requiring an active API key.
The architecture consists of five core pillars:
1. Deterministic State Management (10-State FSM)
Standard voice implementations often default to a simplistic 4-state loop (Listening, Thinking, Speaking, Executing). However, real-world networks exhibit handshake latency and dropped packets. My architecture utilizes a strict 10-state Finite State Machine (`IDLE`, `CONNECTING`, `LISTENING`, `SPEECH_DETECTED`, `PROCESSING`, `SPEAKING`, `INTERRUPTED`, `TOOL_EXECUTING`, `RECONNECTING`, `ERROR`). Crucially, the transition logic operates via a mathematically pure `resolve(state, event)` function. This guarantees that stray WebSocket events arriving out of order cannot mutate the state incorrectly.
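A condensed sketch of the pattern; the state names mirror the list above, but the event vocabulary is an illustrative placeholder rather than the prototype's exact API:

```ts
type VoiceState =
  | "IDLE" | "CONNECTING" | "LISTENING" | "SPEECH_DETECTED" | "PROCESSING"
  | "SPEAKING" | "INTERRUPTED" | "TOOL_EXECUTING" | "RECONNECTING" | "ERROR";

type VoiceEvent =
  | "CONNECT" | "SOCKET_OPEN" | "SPEECH_START" | "TURN_COMPLETE"
  | "MODEL_AUDIO" | "USER_BARGE_IN" | "PLAYBACK_FLUSHED" | "TOOL_CALL"
  | "TOOL_DONE" | "SOCKET_DROP" | "FATAL";

// Pure transition function: the same (state, event) pair always yields the
// same next state, and unrecognized pairs are no-ops.
function resolve(state: VoiceState, event: VoiceEvent): VoiceState {
  // Global transitions: socket loss and fatal errors win everywhere,
  // except when idle or already failed.
  if (event === "SOCKET_DROP" && state !== "IDLE" && state !== "ERROR") {
    return "RECONNECTING";
  }
  if (event === "FATAL") return "ERROR";
  switch (state) {
    case "IDLE":
      return event === "CONNECT" ? "CONNECTING" : state;
    case "CONNECTING":
    case "RECONNECTING":
      return event === "SOCKET_OPEN" ? "LISTENING" : state;
    case "LISTENING":
      return event === "SPEECH_START" ? "SPEECH_DETECTED" : state;
    case "SPEECH_DETECTED":
      return event === "TURN_COMPLETE" ? "PROCESSING" : state;
    case "PROCESSING":
      if (event === "MODEL_AUDIO") return "SPEAKING";
      if (event === "TOOL_CALL") return "TOOL_EXECUTING";
      return state;
    case "SPEAKING":
      return event === "USER_BARGE_IN" ? "INTERRUPTED" : state;
    case "TOOL_EXECUTING":
      if (event === "USER_BARGE_IN") return "INTERRUPTED";
      if (event === "TOOL_DONE") return "PROCESSING";
      return state;
    case "INTERRUPTED":
      return event === "PLAYBACK_FLUSHED" ? "LISTENING" : state;
    case "ERROR":
      return state;
  }
}
```

Because unrecognized pairs are no-ops, a stray `MODEL_AUDIO` frame arriving after a barge-in cannot push the machine back into `SPEAKING`.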
2. Context Synchronization (SessionBridge)

Voice mode should act as a continuous augmentation of the user's workflow, not an isolated session. The SessionBridge module facilitates bidirectional context handoffs. Prior to connecting the Live API socket, it condenses the active terminal text history (`getHistory()`) to bootstrap the voice session's memory. Conversely, when the voice session terminates, the bridge synthesizes a structured summary (e.g., "[Voice session summary - 10 turns, 3 tool calls]") and injects it back into the CLI history buffer.
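A simplified sketch of the handoff; the `Turn` shape and summarization logic are reduced for illustration:

```ts
interface Turn {
  role: "user" | "model";
  text: string;
}

class SessionBridge {
  // Condense recent terminal history into a bootstrap prompt before the
  // Live API socket is opened.
  bootstrapContext(history: Turn[], maxTurns = 20): string {
    return history
      .slice(-maxTurns)
      .map((t) => `${t.role}: ${t.text}`)
      .join("\n");
  }

  // After the voice session ends, synthesize a structured summary turn
  // for injection back into the CLI history buffer.
  summarize(voiceTurns: Turn[], toolCalls: number): Turn {
    return {
      role: "model",
      text: `[Voice session summary - ${voiceTurns.length} turns, ${toolCalls} tool calls]`,
    };
  }
}
```

Injecting a summary rather than the raw transcript keeps the text session aware of what happened by voice without flooding its context window.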
3. AbortController-Gated Interruption (Barge-In)

When a user barges in, sending a `turn_complete: false` signal to the Live API is insufficient if the client is concurrently executing a long-running tool (e.g., a massive file system search or `npm install`). The InterruptHandler instantly halts the AudioDriver's playback queue and immediately triggers an AbortController attached to all in-flight tools. This prevents "ghost executions" of tools after the user has changed their intent. The interrupted audio transcript is committed with a precise cut-off marker to maintain an accurate internal log.
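The gating mechanism, reduced to its essentials; `AudioDriver` and `runTool` here stand in for the real abstractions:

```ts
interface AudioDriver {
  stopPlayback(): void;
}

class InterruptHandler {
  private controller = new AbortController();

  // Tools grab the current signal when they start executing.
  get signal(): AbortSignal {
    return this.controller.signal;
  }

  // On barge-in: halt playback first (lowest perceived latency), then
  // abort every in-flight tool so nothing "ghost executes" after the
  // user has changed their intent.
  bargeIn(driver: AudioDriver): void {
    driver.stopPlayback();
    this.controller.abort(new Error("user barge-in"));
    this.controller = new AbortController(); // fresh signal for the next turn
  }
}

// A cooperative long-running tool checks the signal between units of work.
async function runTool(signal: AbortSignal): Promise<void> {
  for (let i = 0; i < 1000; i++) {
    if (signal.aborted) throw signal.reason;
    await new Promise((resolve) => setTimeout(resolve, 10)); // one unit of work
  }
}
```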
4. Resilient Network Polling (CircuitBreaker)

To handle transient connection failures gracefully, the connection layer sits behind a standard CircuitBreaker implementation. It transitions between `CLOSED`, `OPEN`, and `HALF_OPEN` states based on consecutive failure thresholds. This prevents the CLI from spamming the Live API with reconnection requests (the thundering herd problem) when the user's local network drops out.
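A minimal version of the breaker; the threshold and cooldown values are illustrative rather than the tuned values in the prototype:

```ts
type BreakerState = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: BreakerState = "CLOSED";
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 10_000,
  ) {}

  // Usage: const socket = await breaker.exec(() => openLiveSocket());
  async exec<T>(connect: () => Promise<T>): Promise<T> {
    if (this.state === "OPEN") {
      if (Date.now() - this.openedAt < this.cooldownMs) {
        throw new Error("circuit open: refusing to reconnect yet");
      }
      this.state = "HALF_OPEN"; // cooldown elapsed: allow one probe
    }
    try {
      const result = await connect();
      this.state = "CLOSED"; // probe (or normal call) succeeded
      this.failures = 0;
      return result;
    } catch (err) {
      this.failures++;
      if (this.state === "HALF_OPEN" || this.failures >= this.failureThreshold) {
        this.state = "OPEN";
        this.openedAt = Date.now();
      }
      throw err;
    }
  }
}
```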
5. Pluggable Audio Fallbacks

Relying strictly on Native Node-API C++ bindings (such as `naudiodon`) introduces a high point of failure during `npm install -g @google/gemini-cli` for Windows users lacking build tools. My architecture defines an AudioDriver interface. The system attempts to dynamically load low-latency native bindings first, but automatically falls back to a `SoxPipeDriver` (spawning independent shell processes) if bindings are unavailable, guaranteeing cross-platform installation success.
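The fallback path in sketch form, assuming `sox` is available on the PATH; `native-pcm-driver` is a placeholder name for the hypothetical native binding, not a real package:

```ts
import { spawn, type ChildProcess } from "node:child_process";

interface AudioDriver {
  start(onChunk: (pcm: Buffer) => void): void;
  stop(): void;
}

class SoxPipeDriver implements AudioDriver {
  private proc: ChildProcess | null = null;

  start(onChunk: (pcm: Buffer) => void): void {
    // Capture the default input device as 16 kHz mono 16-bit signed PCM
    // and stream it to stdout.
    this.proc = spawn("sox", [
      "-d", "-t", "raw", "-r", "16000", "-c", "1",
      "-b", "16", "-e", "signed-integer", "-",
    ]);
    this.proc.stdout?.on("data", onChunk);
  }

  stop(): void {
    this.proc?.kill();
  }
}

const NATIVE_MODULE: string = "native-pcm-driver"; // hypothetical binding

async function loadDriver(): Promise<AudioDriver> {
  try {
    // import() rejects cleanly if the binding was never compiled
    // (e.g., missing build tools on Windows), so install never breaks.
    const mod = await import(NATIVE_MODULE);
    return new mod.NativeDriver() as AudioDriver;
  } catch {
    return new SoxPipeDriver(); // shell-process fallback
  }
}
```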
I have mapped exactly how these modules interface with the existing `ToolRegistry` and `GeminiClient` abstractions in the `gemini-cli-integration.ts` file within the repository.

Feedback Wanted
I am looking for feedback from the maintainers on two specific architectural decisions:

- Is a full CircuitBreaker implementation appropriate for the connection layer in `packages/core`, or does the team prefer a simpler exponential backoff algorithm?

Thank you for your time and guidance. I am eager to iterate on this architecture based on your feedback.
Best,
Rithvick Kumar