feat: local silero vad and pre-speech ring buffer for telephony input by Gmin2 · Pull Request #656 · bolna-ai/bolna

Gmin2 · 2026-04-21T09:30:17Z

Bolna has no in-process VAD today: every "user is speaking" decision waits on the ASR provider's speech_started event over the network, so first syllables get clipped and barge-in pays a full round-trip. This PR adds an opt-in local Silero VAD with a 500ms pre-speech ring buffer in TelephonyInputHandler. When enabled, audio is gated locally and speech events feed interruption_manager directly. Set vad_config on the input section of the agent JSON to turn it on; it is off by default, and behavior is byte-identical to main otherwise.
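
For readers unfamiliar with the idea, here is a minimal sketch of a pre-speech ring buffer gate. It is not the code in this PR: the class name `PreSpeechGate`, the `is_speech` callback (a stand-in for a Silero VAD decision), and the 20ms frame size are all assumptions made for illustration. Only the 500ms pre-roll figure comes from the description above.

```python
from collections import deque
from typing import Callable, List


class PreSpeechGate:
    """Hypothetical gate: hold back silence, release it as pre-roll on speech onset."""

    def __init__(self, is_speech: Callable[[bytes], bool],
                 frame_ms: int = 20, pre_speech_ms: int = 500) -> None:
        self.is_speech = is_speech  # assumed stand-in for the local VAD decision
        # Ring buffer sized to hold pre_speech_ms worth of frames; old frames fall off.
        self.ring = deque(maxlen=pre_speech_ms // frame_ms)
        self.speaking = False

    def push(self, frame: bytes) -> List[bytes]:
        """Return the frames to forward to the ASR for this incoming frame."""
        if self.is_speech(frame):
            if not self.speaking:
                # Speech onset: flush buffered pre-roll so first syllables survive.
                self.speaking = True
                out = list(self.ring) + [frame]
                self.ring.clear()
                return out
            return [frame]
        # Non-speech: keep the frame locally instead of sending it upstream.
        self.speaking = False
        self.ring.append(frame)
        return []
```

In this sketch the onset call to `push` is also the natural place to notify an interruption handler, since it fires locally without a network round-trip. A production gate would likely add a hangover period before dropping back to silence, which is omitted here.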

I tested this end-to-end against a real Deepgram session on a 6-second speech clip. With the local VAD on, we send 31.7% fewer bytes to Deepgram (48000 → 32800), the transcript and WER (0.083) come out identical, and Bolna learns the user is speaking at 1522ms instead of waiting until Deepgram's SpeechStarted at 2050ms, about 527ms faster barge-in. The full harness and benchmarks are in 3b7ac0a.
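
The headline numbers can be rechecked from the figures quoted above. This is only arithmetic on the reported values, not part of the benchmark harness, and the variable names are illustrative.

```python
# Values as reported in the PR description.
bytes_without_vad = 48000
bytes_with_vad = 32800
reduction_pct = 100 * (bytes_without_vad - bytes_with_vad) / bytes_without_vad

local_vad_ms = 1522                 # local VAD reports speech
deepgram_speech_started_ms = 2050   # Deepgram SpeechStarted arrives
lead_ms = deepgram_speech_started_ms - local_vad_ms

print(f"{reduction_pct:.1f}% fewer bytes, {lead_ms}ms earlier")
```

The rounded timestamps give a 528ms lead, in line with the roughly 527ms quoted, which presumably comes from unrounded measurements.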

feat: local silero vad and pre-speech ring buffer for telephony input

28f0ec6

Gmin2 wants to merge 1 commit into bolna-ai:master from Gmin2:feature/local-vad
