Docs/fix dotenv and hardware detection #36
Pull request overview
This PR aims to improve local configurability/portability for the transcription server by formalizing dependencies (requirements + pyproject.toml) and updating the Flask transcription pipeline and clients.
Changes:
- Replaced the minimal `requirements.txt` with a fully specified dependency list and added a `pyproject.toml` for `uv`/packaging workflows.
- Updated `flask/transcribe_server.py` to handle JSON POSTs (and a streaming fallback), adjusted whisper-server request logic, and cleaned up transcript endpoints.
- Updated audio grabber client(s) UI and Python sender behavior and refreshed README setup instructions.
Reviewed changes
Copilot reviewed 6 out of 7 changed files in this pull request and generated 11 comments.
| File | Description |
|---|---|
| requirements.txt | Replaces legacy dependency list with pinned/minimum versions and adds python-dotenv. |
| pyproject.toml | Introduces PEP 621 project metadata + dependencies for uv sync/packaging. |
| flask/transcribe_server.py | Refactors transcription request handling and transcript retrieval logic; adjusts server startup. |
| flask/transcribe_listener.html | Minor HTML formatting change. |
| flask/audio_grabber.py | Changes sending behavior and updates the transcribe endpoint port. |
| flask/audio_grabber.html | Adds backend selector + CSRF handling; UI restructure. |
| README.md | Updates setup instructions and documents pip vs uv workflows. |
```diff
 if __name__ == '__main__':
     threading.Thread(target=process_audio, daemon=True).start()
-    app.run(host='0.0.0.0', port=5040, debug=True)
+    app.run(host='0.0.0.0', port=5040, debug=False)
```
The Flask server is still hardcoded to run on port 5040, but the PR description says the default should be 5000 / configurable via FLASK_PORT, and audio_grabber.py now posts to 5000. This will prevent the client from reaching the server unless the ports are aligned. Make the port configurable (e.g., read FLASK_PORT) and ensure all clients/default UI values match the server default.
Suggested change:

```diff
-    app.run(host='0.0.0.0', port=5040, debug=False)
+    port_value = os.getenv('FLASK_PORT', '5000')
+    try:
+        flask_port = int(port_value)
+    except ValueError:
+        logger.warning("Invalid FLASK_PORT '%s'; falling back to 5000", port_value)
+        flask_port = 5000
+    app.run(host='0.0.0.0', port=flask_port, debug=False)
```
```diff
 # we either use a local in-code model or access a whisper.cpp server
 use_whisper_server = os.getenv('WHISPER_SERVER_USE', 'false') == 'true'
 #model_name = os.getenv('WHISPER_MODEL', 'tiny')     # 39M
 #model_name = os.getenv('WHISPER_MODEL', 'base')     # 74M
-model_fast_name = os.getenv('WHISPER_MODEL', 'small')   # 244M
-model_smart_name = os.getenv('WHISPER_MODEL', 'medium') # 769M
 #model_name = os.getenv('WHISPER_MODEL', 'large-v3') # 1550M
+model_fast_name = os.getenv('WHISPER_MODEL_FAST', 'small')    # 244M
+model_smart_name = os.getenv('WHISPER_MODEL_SMART', 'medium') # 769M
```
The PR description mentions loading a .env file (python-dotenv) and GPU auto-detection/logging, but transcribe_server.py does not call load_dotenv() and there is no device selection/logging (no torch.cuda checks). Either implement these startup behaviors or adjust the PR description/docs to match what the code actually does.
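A minimal sketch of the missing startup behavior, assuming `python-dotenv` from the new requirements.txt; the `torch` import is guarded so the sketch still runs on a machine without it (function and logger names here are illustrative, not from the PR):

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

try:
    from dotenv import load_dotenv
    load_dotenv()  # pick up FLASK_PORT, WHISPER_* etc. from a local .env, if present
except ImportError:
    logger.warning("python-dotenv not installed; relying on the process environment")

def pick_device():
    """Return 'cuda' when torch sees a GPU, else 'cpu', logging the choice."""
    try:
        import torch
        if torch.cuda.is_available():
            logger.info("GPU detected: %s", torch.cuda.get_device_name(0))
            return "cuda"
    except ImportError:
        logger.warning("torch not importable; defaulting to CPU")
    logger.info("no GPU detected; using CPU")
    return "cpu"

device = pick_device()
```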
```diff
 def clean_old_transcripts():
-    current_time = int(time.time() * 1000)  # Current time in milliseconds
-    two_hours_ago = current_time - (2 * 60 * 60 * 1000)  # Two hours ago in milliseconds
+    current_time = int(time.time() * 1000)
+    two_hours_ago = current_time - (2 * 60 * 60 * 1000)
     with threading.Lock():
         # make a list of tenant_ids to delete
         to_delete = []
         # iterate over all dictionaries in transcriptd
         for tenant_id in transcriptd.keys():
             transcripts = transcriptd[tenant_id]
             to_delete = [chunk_id for chunk_id in transcripts if int(chunk_id) < two_hours_ago]
             for chunk_id in to_delete:
                 del transcripts[chunk_id]
             # its possible that the tenant_id has no more transcripts
             if len(transcripts) == 0:
                 to_delete.append(tenant_id)

         # delete the tenant_ids
         for tenant_id in to_delete:
             del transcriptd[tenant_id]
```
clean_old_transcripts() has two issues: (1) it uses `with threading.Lock():`, which creates a brand-new lock instance on each call and therefore synchronizes nothing; and (2) `to_delete` is reused for chunk_ids and then later treated as tenant_ids, so `del transcriptd[tenant_id]` can attempt to delete non-existent keys (chunk IDs) and raise KeyError. Use separate collections (e.g., chunks_to_delete per tenant and tenants_to_delete overall) and delete tenants based on that second list while holding a shared lock.
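A possible rewrite along those lines; the module-level lock name `transcript_lock` is an assumption, not something the PR defines:

```python
import threading
import time

transcript_lock = threading.Lock()  # one shared lock for all transcriptd access
transcriptd = {}                    # tenant_id -> {chunk_id: {'transcript': ...}}

def clean_old_transcripts():
    cutoff = int(time.time() * 1000) - 2 * 60 * 60 * 1000  # two hours ago, in ms
    with transcript_lock:
        tenants_to_delete = []
        for tenant_id, transcripts in transcriptd.items():
            # chunk IDs are millisecond timestamps; collect expired ones per tenant
            chunks_to_delete = [c for c in transcripts if int(c) < cutoff]
            for chunk_id in chunks_to_delete:
                del transcripts[chunk_id]
            if not transcripts:
                tenants_to_delete.append(tenant_id)
        for tenant_id in tenants_to_delete:  # only tenant IDs, never chunk IDs
            del transcriptd[tenant_id]
```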
```python
def merge_and_split_transcripts(transcripts):
    # Iterate through the sorted transcript keys.
    sec = ".!?"
    merged_transcripts = ""
    result = {}
    for key in transcripts.keys():
        if not merged_transcripts:
            # If merged_transcripts is empty, start with the first transcript.
            merged_transcripts += transcripts[key].strip()
        else:
            # Append the transcript to the merged string with a space and lowercase the following first character.
            t = transcripts[key].strip()
            if len(t) > 1:
                merged_transcripts += " " + t[0].lower() + t[1:]
            else:
                merged_transcripts += " " + t
```
merge_and_split_transcripts() calls transcripts[key].strip(), but transcriptd stores entries as dicts like {'transcript': ...}. With the current storage format, this function will raise an AttributeError when sentences=true. Update it to read/write the nested transcript field (or change transcriptd to store plain strings consistently).
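One way to make the function match the storage format — reading the nested `'transcript'` field and returning the same value shape — sketched here with the sentence-splitting on `.!?` deliberately omitted:

```python
def merge_and_split_transcripts(transcripts):
    """Merge chunk entries shaped {'transcript': ...} into one running string.

    Returns the merged text keyed by the first chunk_id, preserving the
    nested {'transcript': ...} value shape the endpoints expect.
    (Re-splitting the merged text into sentences is omitted in this sketch.)
    """
    merged = ""
    for key in sorted(transcripts, key=int):
        t = transcripts[key]['transcript'].strip()  # read the nested field, not a str
        if not merged:
            merged = t
        elif len(t) > 1:
            merged += " " + t[0].lower() + t[1:]
        else:
            merged += " " + t
    if not transcripts:
        return {}
    first_key = min(transcripts, key=int)
    return {first_key: {'transcript': merged}}
```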
```python
# Fix: Try regular JSON first (sent by audio_grabber.py)
data = request.get_json(silent=True)
if data:
    audio_b64 = data.get('audio_b64')
    chunk_id = data.get('chunk_id')
    tenant_id = data.get('tenant_id', '0000')

    # Fix: validate required fields and return 400 if missing
    if not audio_b64 or not chunk_id:
        return jsonify({'error': 'Missing required fields: audio_b64 and chunk_id'}), 400
```
data = request.get_json(silent=True) followed by if data: treats an empty JSON object ({}) as "no JSON" and falls through to the streaming branch. Use an explicit is not None check (and then validate required fields) so malformed/empty JSON requests reliably return 400 instead of switching behavior based on truthiness.
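A framework-agnostic sketch of the distinction; `json_body` stands in for the return value of `request.get_json(silent=True)`, and the function name is illustrative:

```python
def route_transcribe_body(json_body):
    """Decide how to handle the request body.

    json_body is None when the body was not JSON at all, and a dict otherwise.
    An empty dict {} is falsy but still valid JSON, so compare against None.
    """
    if json_body is not None:          # explicit check, not `if json_body:`
        if not json_body.get('audio_b64') or not json_body.get('chunk_id'):
            return 'error', 400        # empty/malformed JSON reliably gets 400
        return 'json', 200
    return 'stream', None              # only non-JSON bodies hit the fallback
```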
```html
<button id="stopBtn" disabled>Stop</button>
<div id="controls">
    <input type="text" id="serverhost" placeholder="Transcribe Host" value="localhost">
    <input type="text" id="serverport" placeholder="Transcribe Port" value="5040">
```
The HTML client still defaults to port 5040, while audio_grabber.py now posts to 5000 and the PR description says the Flask default should be 5000. Align this default with the actual server port (or surface the configured FLASK_PORT) to avoid confusing connection failures for users.
Suggested change:

```diff
-<input type="text" id="serverport" placeholder="Transcribe Port" value="5040">
+<input type="text" id="serverport" placeholder="Transcribe Port" value="5000">
```
```python
logger.info(f"VALID transcript for chunk_id {chunk_id}: {transcript}")
with threading.Lock():  # Ensure thread-safe access to shared resources
    # we must distinguish between the case where the chunk_id is already in the transcripts
    # this can happen quite often because the client will generate a new chunk_id only when
    # the recorded audio has silence. So all chunks are those pieces with speech without a pause.

    # get the current transcripts for the tenant_id
    with threading.Lock():
        transcripts = transcriptd.get(tenant_id, None)
```
with threading.Lock(): creates a brand-new lock each time, so it does not actually synchronize access to transcriptd across threads. Use a single module-level lock (e.g., transcript_lock = threading.Lock()) and acquire that same lock for all reads/writes that need to be thread-safe (including in request handlers).
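A sketch of the shared-lock pattern the comment describes; the function names are illustrative placeholders, not code from the PR:

```python
import threading

transcript_lock = threading.Lock()  # created once at module import time
transcriptd = {}

def store_transcript(tenant_id, chunk_id, transcript):
    with transcript_lock:  # every writer contends on this one lock object
        transcriptd.setdefault(tenant_id, {})[chunk_id] = {'transcript': transcript}

def get_tenant_transcripts(tenant_id):
    with transcript_lock:  # readers take the SAME lock, so they see consistent state
        return dict(transcriptd.get(tenant_id, {}))
```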
```python
sentences = request.args.get('sentences', default='false') == 'true'
if sentences: t = merge_and_split_transcripts(t)
chunk_id = request.args.get('chunk_id')
if chunk_id in t:
    return jsonify({'chunk_id': chunk_id, 'transcript': t[chunk_id]['transcript']})
else:
```
When sentences=true, t is replaced with the return value of merge_and_split_transcripts(t), but the handler still assumes t[chunk_id] is a dict containing ['transcript']. This will break once sentence merging is enabled unless the merge function preserves the same value shape. Ensure the "sentences" mode returns the same structure expected by these handlers (this affects several endpoints, not just /get_transcript).
```diff
 if use_whisper_server:
     # Fix: properly call whisper.cpp /inference endpoint and parse response
     try:
         files = {'file': ('audio.wav', audio_array.tobytes(), 'audio/wav')}
         data = {'response_format': 'json'}
-        response = requests.post(whisper_server, files=files, data=data)
+        response = requests.post(f"{whisper_server}/inference", files=files, data=data)
         response.raise_for_status()
         result_json = response.json()
         transcript = result_json.get('text', '').strip()
 else:
     result = model_fast.transcribe(audio_tensor, temperature=0)
```
In whisper-server mode, this uploads audio_array.tobytes() after audio_array has been converted to normalized float32 samples, but the multipart part is labeled audio/wav and named audio.wav. This is not a valid WAV file (no header, wrong sample format), so whisper.cpp is unlikely to decode it correctly. Send a real WAV payload (e.g., wrap the original int16 PCM at 16kHz into a WAV container) or use the server's expected raw-audio format if supported.
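One way to build a valid payload with only the standard library, assuming the original capture is 16-bit mono PCM at 16 kHz (the function name is a suggestion, not existing code):

```python
import io
import struct
import wave

def pcm16_to_wav_bytes(samples, sample_rate=16000):
    """Wrap raw int16 PCM samples (mono) in a WAV container so whisper.cpp
    receives a file it can actually decode, header included."""
    buf = io.BytesIO()
    with wave.open(buf, 'wb') as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)       # 2 bytes per sample = int16
        wf.setframerate(sample_rate)
        wf.writeframes(struct.pack('<%dh' % len(samples), *samples))
    return buf.getvalue()

# usage sketch, before any float32 normalization of the samples:
# files = {'file': ('audio.wav', pcm16_to_wav_bytes(int16_samples), 'audio/wav')}
```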
```python
def start(self):
    self.stream.start_stream()
    self.send_thread = threading.Thread(target=self.send_chunk)
    self.send_thread.start()
```
start() spawns a thread that calls send_chunk() exactly once, while send_chunk() is also called synchronously from audio_callback(). This extra thread is unnecessary and increases the risk of concurrent access to self.buffer/self.chunk_id without any locking. Either remove the thread and keep sending in the callback, or move all network sending into a dedicated loop/thread and protect shared state appropriately.
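The second option could look like this sketch, where `post_fn` stands in for the actual `requests.post` call and the class name is hypothetical:

```python
import queue
import threading

class ChunkSender:
    """A single dedicated sender thread consuming a queue, so the audio
    callback only enqueues and never touches the network or shared buffers."""

    def __init__(self, post_fn):
        self.post_fn = post_fn  # e.g. lambda chunk: requests.post(url, json=chunk)
        self._q = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def submit(self, chunk):
        """Called from audio_callback; Queue provides the locking."""
        self._q.put(chunk)

    def stop(self):
        self._q.put(None)  # sentinel ends the loop after pending chunks drain
        self._thread.join()

    def _run(self):
        while True:
            chunk = self._q.get()
            if chunk is None:
                break
            self.post_fn(chunk)
```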
Fixes #19
Changes made:
Tested locally: chunks received, FP32 fallback working on CPU, POST /transcribe returning 200.