feat: local silero vad and pre-speech ring buffer for telephony input #656

Open
Gmin2 wants to merge 1 commit into bolna-ai:master from Gmin2:feature/local-vad

Gmin2 commented Apr 21, 2026

Bolna has no in-process VAD today: every "user is speaking" decision waits on the ASR provider's speech_started event over the network, so first syllables get clipped and barge-in pays a full round trip. This PR adds an opt-in local Silero VAD with a 500 ms pre-speech ring buffer in TelephonyInputHandler. When enabled, audio is gated locally and speech events feed interruption_manager directly. Set vad_config on the input section of the agent JSON to turn it on. It is off by default, and behavior is byte-identical to main when it is not set.
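For reviewers who want the gating idea in isolation, here is a minimal sketch of a pre-speech ring buffer driven by a per-frame VAD decision. It is not the code in this PR: the class name `PreSpeechGate`, the 20 ms frame size, and the `hangover_ms` parameter are assumptions made for illustration, and the VAD result is passed in as a plain boolean rather than calling Silero.

```python
from collections import deque
from typing import Deque, List


class PreSpeechGate:
    """Hypothetical helper: forward audio only around speech, with pre-roll.

    While idle, frames are held in a bounded ring buffer. On speech onset the
    buffered frames are flushed so the first syllable is not clipped.
    """

    def __init__(self, frame_ms: int = 20, pre_speech_ms: int = 500,
                 hangover_ms: int = 300) -> None:
        # Ring capacity in frames; 500 ms / 20 ms = 25 frames (assumed sizes).
        self._ring: Deque[bytes] = deque(maxlen=pre_speech_ms // frame_ms)
        # Number of consecutive silent frames that ends a speech segment.
        self._hangover_frames = hangover_ms // frame_ms
        self._silent_run = 0
        self.in_speech = False

    def push(self, frame: bytes, is_speech: bool) -> List[bytes]:
        """Return the frames that should be sent upstream for this input frame."""
        if not self.in_speech:
            # Idle: remember the frame; the deque drops the oldest when full.
            self._ring.append(frame)
            if not is_speech:
                return []
            # Speech onset: this is where a speech-start event would be raised.
            self.in_speech = True
            self._silent_run = 0
            flushed = list(self._ring)  # pre-roll plus the onset frame
            self._ring.clear()
            return flushed
        # In speech: always forward, and count trailing silence.
        self._silent_run = 0 if is_speech else self._silent_run + 1
        if self._silent_run >= self._hangover_frames:
            self.in_speech = False
            self._silent_run = 0
        return [frame]
```

With these assumed sizes the flush on onset carries at most 25 frames, which is 500 ms of audio including the frame that triggered the onset. The hangover keeps short pauses inside one speech segment instead of reopening the gate on every word.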

I tested this end-to-end against a real Deepgram session on a 6-second speech clip. With the local VAD on, we send 31.7% fewer bytes to Deepgram (48000 → 32800), the transcript and WER (0.083) come out identical, and Bolna learns the user is speaking at 1522 ms instead of waiting for Deepgram's SpeechStarted at 2050 ms, about 527 ms faster barge-in. The full harness and the small benchmarks are in 3b7ac0a.
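The headline numbers can be re-derived from the figures quoted above. The sketch below is only arithmetic on those figures. The 8000 bytes per second rate is an assumption (8 kHz, 8-bit telephony audio), chosen because it makes 48000 bytes equal the 6-second clip; it is not stated in the PR. The rounded timestamps differ by 528 ms, which is consistent with the "about 527 ms" quoted if that figure came from unrounded values.

```python
# Figures quoted in the PR description.
BYTES_BASELINE = 48_000   # bytes sent to Deepgram without local VAD
BYTES_WITH_VAD = 32_800   # bytes sent with local VAD enabled
LOCAL_VAD_MS = 1522       # local VAD reports speech
DEEPGRAM_MS = 2050        # Deepgram SpeechStarted arrives

# Assumption, not from the PR: 8 kHz, 8-bit audio, i.e. 8000 bytes per second.
ASSUMED_BYTES_PER_SECOND = 8000

reduction = (BYTES_BASELINE - BYTES_WITH_VAD) / BYTES_BASELINE
lead_ms = DEEPGRAM_MS - LOCAL_VAD_MS
clip_seconds = BYTES_BASELINE / ASSUMED_BYTES_PER_SECOND
sent_seconds = BYTES_WITH_VAD / ASSUMED_BYTES_PER_SECOND

print(f"byte reduction: {reduction:.1%}")         # 31.7%
print(f"barge-in lead:  {lead_ms} ms")            # 528 ms
print(f"clip length:    {clip_seconds:.1f} s")    # 6.0 s under the assumption
print(f"audio sent:     {sent_seconds:.1f} s")    # 4.1 s under the assumption
```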
