
feat: WebSocket streaming with VAD, backpressure and Multilingual Support#29

Open
OmPradnya wants to merge 1 commit into susiai:master from OmPradnya:feature/realtime-stt-websocket

Conversation


@OmPradnya OmPradnya commented Mar 29, 2026

Summary:

This PR adds a Speech-to-Text WebSocket streaming endpoint (/stt/stream) and makes production-readiness improvements to the existing HTTP pipeline.

This PR adds:

1) WebSocket streaming endpoint - /stt/stream
- the client opens a WebSocket connection and receives transcript events as Whisper completes each job (no polling needed)

Protocol:

Server → Client (on connect):
  {"type": "session", "session_id": "...", "tenant_id": "..."}

Client → Server:
  {"type": "set_chunk", "chunk_id": "<id>"}
  {"chunk_id": "<id>", "audio_b64": "<base64 PCM>"}   ← audio
  {"type": "finalize_chunk", "chunk_id": "<id>"}       ← request final

Server → Client (transcript events):
  {"session_id":…, "chunk_id":…, "text":…, "is_final": false}  ← interim
  {"session_id":…, "chunk_id":…, "text":…, "is_final": true}   ← final

Both JSON (base64) and raw binary (int16 LE PCM) audio are supported.
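For illustration, a minimal Python client following the protocol above could look like this (a sketch using the websocket-client package the PR adds as a test dependency; the helper names and URL are my own, not from the PR):

```python
import base64
import json


def make_audio_msg(chunk_id: str, pcm_bytes: bytes) -> str:
    """Build the JSON audio message from raw int16 LE PCM."""
    return json.dumps({
        "chunk_id": chunk_id,
        "audio_b64": base64.b64encode(pcm_bytes).decode("ascii"),
    })


def stream_chunk(url: str, chunk_id: str, pcm_bytes: bytes) -> list:
    """Send one chunk over the /stt/stream protocol and collect events."""
    import websocket  # from the websocket-client package

    ws = websocket.create_connection(url)
    events = [json.loads(ws.recv())]  # first event: {"type": "session", ...}
    ws.send(json.dumps({"type": "set_chunk", "chunk_id": chunk_id}))
    ws.send(make_audio_msg(chunk_id, pcm_bytes))
    ws.send(json.dumps({"type": "finalize_chunk", "chunk_id": chunk_id}))
    while True:  # read interim events until the final transcript arrives
        ev = json.loads(ws.recv())
        events.append(ev)
        if ev.get("is_final"):
            break
    ws.close()
    return events
```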

  2. Server-side voice activity detection (VAD)

Before enqueueing audio to Whisper, energy-based silence detection was added in streaming_stt_ws.py. This reduces Whisper hallucinations on silence and saves CPU on noisy live-event audio.
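The actual detector lives in streaming_stt_ws.py; as a sketch, energy-based silence detection over int16 LE PCM might look like this (assuming a mean-absolute-amplitude measure and the threshold of 200 mentioned at the end of this PR):

```python
import struct

VAD_ENERGY_THRESHOLD = 200  # mean absolute amplitude; PR's current default


def is_silence(pcm_bytes: bytes, threshold: int = VAD_ENERGY_THRESHOLD) -> bool:
    """Return True if an int16 LE PCM buffer is below the energy threshold."""
    if not pcm_bytes:
        return True
    # unpack the buffer as little-endian signed 16-bit samples
    samples = struct.unpack("<%dh" % (len(pcm_bytes) // 2), pcm_bytes)
    energy = sum(abs(s) for s in samples) / len(samples)
    return energy < threshold
```

Chunks classified as silence would be skipped instead of being enqueued to Whisper.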

  3. Multi-worker STT pool with thread safety:
  • stt_config.py — STT_NUM_WORKERS (default 1 for CPU safety),
    STT_MAX_QUEUE_SIZE (default 1000), STT_QUEUE_OVERFLOW_POLICY
    (reject or drop_oldest)
  • stt_ingest.py — try_enqueue() with bounded queue and overflow
    handling; monotonic timestamp stamped at enqueue for latency tracking
  • transcribe_server.py — global _transcript_lock and
    _dequeue_coalesce_lock replace per-call threading.Lock()
    (previous code created a new lock per call, providing no protection)

NOTE FOR CPU: STT_NUM_WORKERS=1 is kept as the default for local use, to avoid KV cache thread collisions. It can be set to 2+ when running on GPU.

  4. Backpressure:
  • HTTP POST /transcribe: rejected jobs now return
    "status": "rejected" with "error": "queue_full" instead of
    silently dropping
  • WebSocket: {"type": "error", "code": "queue_overload"} is sent to
    client when queue is full
  • drop_oldest policy is available via env var for live streaming
    scenarios where recency matters more than completeness
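A sketch of how try_enqueue() with the two overflow policies might behave (the function name, the policies, and the monotonic enqueue timestamp follow the PR's description, but the body is illustrative, not the actual stt_ingest.py code):

```python
import queue
import time


def try_enqueue(q: queue.Queue, job: dict, policy: str = "reject"):
    """Attempt to enqueue a job on a bounded queue.

    Returns (ok, error_code). On overflow, either rejects the new job
    or evicts the oldest queued job, depending on the policy.
    """
    job["enqueued_at"] = time.monotonic()  # stamped for latency tracking
    try:
        q.put_nowait(job)
        return True, None
    except queue.Full:
        if policy == "drop_oldest":
            try:
                q.get_nowait()  # evict oldest: recency over completeness
            except queue.Empty:
                pass
            q.put_nowait(job)  # note: can still race under heavy concurrency
            return True, None
        return False, "queue_full"
```

The "queue_full" code maps onto the HTTP "error": "queue_full" response and the WebSocket "code": "queue_overload" event described above.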
  5. is_valid() - accuracy and multilingual fix

The previous filter blocked legitimate speech such as "you" (exact match) and "thank you" (substring).
Now only known Whisper hallucinations are blocked:
{"thanks for watching", "click, click", "click click", "cough cough"}
Exact-match forbidden strings reduced to: {"eh.", "bye.", "it's fine"}
The ASCII check was removed — multilingual transcription now works correctly.
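As a sketch, the tightened filter described above could look like this (illustrative only; the blocklists are taken from the PR text, the structure is my assumption):

```python
# Substrings that are known Whisper hallucinations on silence/noise.
SUBSTRING_BLOCKLIST = {
    "thanks for watching", "click, click", "click click", "cough cough",
}
# Strings rejected only on exact match, after lowercasing and stripping.
EXACT_BLOCKLIST = {"eh.", "bye.", "it's fine"}


def is_valid(text: str) -> bool:
    """Keep transcripts unless they match known Whisper hallucinations.

    No ASCII check, so non-Latin scripts pass through unchanged.
    """
    t = text.strip().lower()
    if not t:
        return False
    if t in EXACT_BLOCKLIST:
        return False
    return not any(phrase in t for phrase in SUBSTRING_BLOCKLIST)
```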

  6. Latency measurement:
    Every Whisper job logs:
    stt_latency_ms= chunk_id= session_id= qsize= workers=

What was NOT changed

  • All existing HTTP endpoints (/transcribe, /get_transcript,
    /get_first_transcript, /pop_first_transcript,
    /get_latest_transcript, /pop_latest_transcript,
    /delete_transcript, /list_transcripts, /transcripts_size)
  • All response shapes and query parameter names
  • Queue semantics and chunk_id / silence boundary contract
  • Django app — untouched
  • Whisper model loading and backend selection (WHISPER_SERVER_USE,
    WHISPER_MODEL, WHISPER_SERVER)

New dependencies

flask-sock
simple-websocket
websocket-client  # test client only, not server dependency

Added to requirements.txt.

Environment variables (new)

| Variable | Default | Description |
|---|---|---|
| STT_NUM_WORKERS | 1 | Worker threads for Whisper |
| STT_MAX_QUEUE_SIZE | 1000 | Max queued jobs |
| STT_QUEUE_OVERFLOW_POLICY | reject | reject or drop_oldest |

Detailed changes in transcribe_server.py

This was the only file modified. All other files in this PR are new.

Imports added

  • flask_sock.Sock — WebSocket support
  • from websocket_manager import emit_stream_update
  • from stt_config import STT_NUM_WORKERS, STT_MAX_QUEUE_SIZE, STT_QUEUE_OVERFLOW_POLICY
  • from stt_ingest import try_enqueue

Thread safety — fixed critical bug

Before:

with threading.Lock():  # line 141 — creates a NEW lock every call

This created a brand new lock object on every call to process_audio,
meaning it never actually synchronized anything across threads.

After:

_transcript_lock = threading.Lock()   # one global lock
_dequeue_coalesce_lock = threading.Lock()  # protects queue coalescing
# all reads/writes to transcriptd now use: with _transcript_lock:

Queue — unbounded → bounded with overflow policy

Before:

audio_stack = queue.Queue()  # unbounded, no overflow handling

After:

audio_stack = queue.Queue(maxsize=STT_MAX_QUEUE_SIZE)  # bounded
# ingestion via try_enqueue() which handles reject / drop_oldest

Worker pool — single hardcoded thread → configurable

Before:

# __main__ only:
threading.Thread(target=process_audio).start()

After:

# ensure_process_audio_thread() starts STT_NUM_WORKERS daemon threads
# also hooked into @app.before_request so it works with flask run
# default STT_NUM_WORKERS=1 (safe for CPU / shared Whisper models)

Whisper server branch — fixed silent crash

Before: When use_whisper_server=true and qsize > 20, the code
posted to HTTP but never set result, then immediately tried
result['text'], causing an UnboundLocalError crash.

After: HTTP server path always builds result with a text field
before accessing it, for both load regimes.
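A sketch of the shape of the fix: build result with a text field on every branch before it is read (illustrative only; post_fn stands in for the actual HTTP call to the Whisper server, and the qsize threshold is taken from the PR text):

```python
def transcribe_via_server(audio: bytes, qsize: int, post_fn) -> dict:
    """Always produce a result dict with a 'text' key, under any load.

    post_fn is a placeholder for the HTTP POST to the Whisper server;
    the high-load branch (qsize > 20) previously forgot to set `result`.
    """
    if qsize > 20:
        resp = post_fn(audio, fast=True)   # degraded/high-load path
    else:
        resp = post_fn(audio, fast=False)  # normal path
    # 'text' is guaranteed present regardless of which branch ran
    return {"text": resp.get("text", "")}
```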

is_valid() — multilingual and accuracy fix

Before: Blocked "you" (exact), "thank you" (substring),
all non-Latin scripts (ASCII-only check).

After: Only blocks known Whisper hallucinations. ASCII check
removed. Multilingual transcription works correctly.

WebSocket endpoint added

sock = Sock(app)

@sock.route("/stt/stream")
def stt_stream(ws):
    from streaming_stt_ws import run_stt_stream
    run_stt_stream(ws, request, ...)

Latency logging added

logger.info("stt_latency_ms=%.1f chunk_id=%s session_id=%s qsize=%s workers=%s", ...)

Timestamp captured at enqueue in stt_ingest.py so queue wait
time is included in the measurement.
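A sketch of the latency computation (illustrative; the job field names are assumed from the enqueue description above, and the log format matches the line shown):

```python
import logging
import time

logger = logging.getLogger("stt")


def log_latency(job: dict, qsize: int, workers: int) -> float:
    """Compute queue-wait-inclusive latency from the monotonic timestamp
    stamped at enqueue, log it, and return it in milliseconds."""
    latency_ms = (time.monotonic() - job["enqueued_at"]) * 1000.0
    logger.info(
        "stt_latency_ms=%.1f chunk_id=%s session_id=%s qsize=%s workers=%s",
        latency_ms, job.get("chunk_id"), job.get("session_id"), qsize, workers,
    )
    return latency_ms
```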

debug mode and port

Before: app.run(host='0.0.0.0', port=5055, debug=True)
After: app.run(host='0.0.0.0', port=5055, debug=False)

Real-Time STT Architecture

A complete WebSocket-based real-time STT pipeline is implemented.
Audio is streamed from client to server continuously over a persistent
WebSocket connection, processed by Whisper, and transcripts are pushed
back to the client instantly — no polling required.

Current status:

  • Transport layer: Real-time (WebSocket streaming, fully working)
  • Inference layer: Hardware-constrained

The system is real-time capable; inference latency is the only bottleneck, and it is purely a hardware constraint, not an architectural one.

PIPELINE BUILT:
Client streams audio → Server receives instantly →
Whisper processes → Result pushed back to client immediately

TESTS

I have checked all the endpoints; nothing has broken the core logic, and all HTTP endpoints are working. I have also created an additional test file, run_tests.py, which has 9 test cases.
The approach: the test file was run before the feature was added, where the 8th test (for the new WebSocket feature) failed; the same test file was run after the feature was added, and all 9 test cases passed.

Another Test Method:
To reproduce: python susi_translator/test/test_stt_stream_client.py --url ws://127.0.0.1:5055/stt/stream --wav test_audio.wav
where test_audio.wav is your audio file.
Requirements: 16kHz mono 16-bit WAV file.
Convert any audio with:
ffmpeg -i input.wav -ar 16000 -ac 1 -sample_fmt s16 output.wav
[PORT MAY DIFFER]

Noise Conditions

Tested against real-world noise conditions:
- sudden loud sounds (a sneeze)
- rustling a wrapper in front of the mic
- natural speech in the presence of background noise

In all three conditions the pipeline produced accurate transcripts, with no degradation in output quality. Noise robustness confirmed.

Multilingual Support - Actually Verified

Removing the ASCII-only check in is_valid() unblocks Whisper's
native multilingual capability. Tested and verified:

The output below is from actual server logs:
For Hindi audio, output was: अपका नाम क्या है?
For Marathi audio, output was: Hello तुज़ा नाओ काय आहे?

Both transcribed correctly with VALID status in server logs.
The previous ASCII filter was silently dropping all non-Latin output.

@Orbiter I would appreciate your review, especially on:
- the VAD threshold (currently 200)
- the STT_NUM_WORKERS default for production.
Should I proceed with documentation next?

@OmPradnya OmPradnya changed the title feat: WebSocket streaming with VAD, backpressure and Multilingual Supportw feat: WebSocket streaming with VAD, backpressure and Multilingual Support Mar 29, 2026
@mariobehling mariobehling requested a review from Copilot April 21, 2026 22:02

Copilot AI left a comment


Pull request overview

This PR introduces a real-time Speech-to-Text (STT) streaming capability via a new WebSocket endpoint (/stt/stream) in the Flask service, alongside queue backpressure controls, worker-pool configurability, and updates intended to improve transcript validity filtering (including multilingual support).

Changes:

  • Add /stt/stream WebSocket streaming flow with session tracking and server→client transcript events (plus basic VAD and queue-overload signaling).
  • Introduce bounded queue ingestion (try_enqueue) with overflow policies and configurable worker pool sizing via environment variables.
  • Add local regression / manual test scripts for HTTP endpoints and WebSocket streaming.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 21 comments.

Show a summary per file
File Description
flask/transcribe_server.py Adds worker-pool startup, bounded queue ingestion/backpressure, WS route wiring, locking changes, validity filter updates, and latency logging.
flask/streaming_stt_ws.py Implements the /stt/stream WebSocket handler, protocol parsing, VAD for JSON audio, and queue-overload signaling.
flask/websocket_manager.py Adds session registry and per-session outbound queues used to decouple worker threads from WS I/O.
flask/stt_ingest.py Adds normalized enqueue tuples with monotonic timestamps and bounded-queue overflow handling.
flask/stt_config.py Adds env-driven configuration for workers, queue size, and overflow policy.
flask/stress_ingest_test.py Adds a small stress helper to exercise queue overflow behavior.
test/run_tests.py Adds a regression test runner covering HTTP endpoints and the WS streaming endpoint.
test/test_stt_stream_client.py Adds a manual WS client for streaming audio (fake or WAV-based).
requirements.txt Adds server-side WebSocket dependencies (flask-sock, simple-websocket).


Comment on lines +24 to +26
p = argparse.ArgumentParser()
p.add_argument("--base", default="http://127.0.0.1:5040")
p.add_argument("--requests", type=int, default=30)

Copilot AI Apr 21, 2026


This helper defaults to port 5040 (--base default), but transcribe_server.py now runs on 5055 by default. As written, running it without --base will hit the wrong port. Update the default (and the example in the docstring) to match the server port or read it from env.

Comment on lines 450 to +452
sentences = request.args.get('sentences', default='false') == 'true'
if sentences == 'true': t = merge_and_split_transcripts(t)
if sentences == 'true':
t = merge_and_split_transcripts(t)

Copilot AI Apr 21, 2026


Same sentences flag bug as above: sentences is a boolean but is compared to 'true', so sentence merging never runs. Use if sentences: (and ensure merge_and_split_transcripts handles the dict shape in transcriptd).

Comment on lines 528 to +530
sentences = request.args.get('sentences', default='false') == 'true'
if sentences == 'true': t = merge_and_split_transcripts(t)
if sentences == 'true':
t = merge_and_split_transcripts(t)

Copilot AI Apr 21, 2026


Same sentences flag bug: sentences is computed as a boolean then compared to 'true', so merge/split is never applied. Use if sentences:.

Comment on lines +582 to +583
list = {k: v for k, v in t.items() if int(fromid) <= int(k) <= int(untilid)}
return jsonify(list)

Copilot AI Apr 21, 2026


Avoid using list as a variable name since it shadows Python’s built-in list type, making debugging and type tooling harder. Rename to something like transcripts_filtered or items.

Suggested change
list = {k: v for k, v in t.items() if int(fromid) <= int(k) <= int(untilid)}
return jsonify(list)
transcripts_filtered = {k: v for k, v in t.items() if int(fromid) <= int(k) <= int(untilid)}
return jsonify(transcripts_filtered)

Comment thread flask/streaming_stt_ws.py
if msg_type == "ping":
out = session_out_queue(session_id)
if out:
out.put_nowait(json.dumps({"type": "pong", "session_id": session_id}))

Copilot AI Apr 21, 2026


out.put_nowait(...) can raise queue.Full when the per-session outbound queue is saturated (e.g., client spams pings or is slow to read). Right now that exception will bubble up and terminate the WS handler loop. Catch queue_mod.Full here (similar to _notify_queue_overload) to avoid disconnecting clients due to backpressure.

Suggested change
out.put_nowait(json.dumps({"type": "pong", "session_id": session_id}))
try:
out.put_nowait(json.dumps({"type": "pong", "session_id": session_id}))
except queue_mod.Full:
pass

Comment on lines 502 to +504
sentences = request.args.get('sentences', default='false') == 'true'
if sentences == 'true': t = merge_and_split_transcripts(t)
if sentences == 'true':
t = merge_and_split_transcripts(t)

Copilot AI Apr 21, 2026


Same sentences flag bug: sentences is computed as a boolean then compared to 'true'. This prevents sentence mode from running. Use if sentences:.

Comment on lines +551 to +553
sentences = request.args.get('sentences', default='false') == 'true'
if sentences == 'true':
t = merge_and_split_transcripts(t)

Copilot AI Apr 21, 2026


Same sentences flag bug: sentences is a boolean but compared to 'true', so sentence mode never triggers. Use if sentences:.

Comment on lines +577 to +579
sentences = request.args.get('sentences', default='false') == 'true'
if sentences == 'true':
t = merge_and_split_transcripts(t)

Copilot AI Apr 21, 2026


Same sentences flag bug: sentences is a boolean but compared to 'true', so the flag is ignored. Use if sentences:.

transcripts = transcriptd[tenant_id]
to_delete = [chunk_id for chunk_id in transcripts if int(chunk_id) < two_hours_ago]
for chunk_id in to_delete:
old_chunks = [chunk_id for chunk_id in transcripts if int(chunk_id) < two_hours_ago]

Copilot AI Apr 21, 2026


clean_old_transcripts() assumes chunk_id is always numeric (int(chunk_id)), but chunk IDs come from client input (including WebSocket) and can be non-numeric. A non-numeric key will raise ValueError and break cleanup (and potentially other endpoints that do int(k)). Consider validating chunk_id at ingest time, or filtering with chunk_id.isdigit() before converting.

Suggested change
old_chunks = [chunk_id for chunk_id in transcripts if int(chunk_id) < two_hours_ago]
old_chunks = [
chunk_id
for chunk_id in transcripts
if (isinstance(chunk_id, int) or (isinstance(chunk_id, str) and chunk_id.isdigit()))
and int(chunk_id) < two_hours_ago
]

Comment thread test/run_tests.py
Comment on lines +205 to +217
# Step 2 — send set_chunk control message
await ws.send(json.dumps({
"type": "set_chunk",
"chunk_id": WS_CHUNK_ID,
"tenant_id": TENANT_ID
}))
await asyncio.sleep(0.2)

# Step 3 — send binary PCM chunks
for i in range(0, len(pcm), chunk_size):
await ws.send(pcm[i:i + chunk_size])
await asyncio.sleep(0.1)


Copilot AI Apr 21, 2026


The WS test never sends a finalize_chunk control message. Since the server only guarantees a final transcript on finalize, this test can be flaky (it relies on an interim emit_stream_update being produced and not dropped). Consider sending {"type":"finalize_chunk","chunk_id":...} after audio is sent and waiting specifically for is_final=true for that chunk.
