Fix/issue 40 dict keys type error#42
Conversation
…models and a background processing queue.
There was a problem hiding this comment.
Pull request overview
Fixes a Python 3 dict_keys indexing crash in transcript sentence-merging, and updates the Django app to use a real merge/split implementation instead of an identity placeholder, along with documentation cleanups.
Changes:
- Fix Python 3
dict_keysnon-subscriptable access by converting keys view to a list before indexing. - Replace Django transcription merge placeholder with the merge/split implementation.
- Expand/modernize project and API documentation; add standalone reproduction/verification scripts.
Reviewed changes
Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
verify_fix.py |
Adds an isolation script to validate merge/split behavior (currently has a failing expectation and data-shape mismatch). |
reproduce_bug.py |
Adds a minimal reproduction script for the dict_keys indexing error. |
flask/transcribe_server.py |
Applies the list(transcripts.keys())[-1] fix and updates docstrings (merge code still has a data-shape issue). |
django/transcribe_app/views.py |
Improves endpoint docstrings and parameter documentation. |
django/transcribe_app/transcribe_utils.py |
Switches Django to use merge/split logic instead of identity placeholder (current implementation returns inconsistent value types). |
django/docs/index.md |
Rewrites docs landing page content (drops disable_toc: true directive used elsewhere). |
README.md |
Updates project overview and quick-start instructions. |
.gitignore |
Adds venv ignore pattern. |
Comments suppressed due to low confidence (1)
django/transcribe_app/transcribe_utils.py:303
merge_and_split_transcriptsis documented as iterating over sorted chunk IDs, but it currently loops overtranscripts.keys()without sorting. Since chunk IDs are timestamp-like strings, this can merge/split sentences in the wrong order if inserts arrive out of order. Consider iterating withfor chunk_id in sorted(transcripts.keys(), key=int):(or otherwise guaranteeing chronological order).
def merge_and_split_transcripts(transcripts):
# Iterate through the sorted transcript keys.
sec = ".!?"
merged_transcripts = ""
result = {}
for chunk_id in transcripts.keys():
transcript_event = transcripts[chunk_id]
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| # SUSI Translator Documentation | ||
|
|
||
| # Transcribe and Translate every audio stream | ||
| Welcome to the SUSI Translator documentation. This project provides a robust, real-time audio transcription and translation system. |
There was a problem hiding this comment.
The initial comment front-matter line disable_toc: true was removed from index.md, but other pages in django/docs/ still use it. If the docs renderer relies on this directive, the index page may start showing an unwanted/incorrect table of contents or be parsed differently than the other docs pages. Consider keeping disable_toc: true at the top for consistency.
| assert res["1700000001000"] == "welcome to the event" | ||
| print("Test passed: Unpunctuated tail handled correctly.") |
There was a problem hiding this comment.
merge_and_split_transcripts_logic() merges all provided chunks into merged_transcripts and, when there is no punctuation at all, assigns the entire merged string to the last key. With the current input, res['1700000001000'] will be "Hello everyone welcome to the event", so this assertion will fail. Either adjust the expected output or change the logic/test case to reflect the intended behavior.
| assert res["1700000001000"] == "welcome to the event" | |
| print("Test passed: Unpunctuated tail handled correctly.") | |
| assert res["1700000001000"] == "Hello everyone welcome to the event" | |
| print("Test passed: Unpunctuated content is merged and assigned to the last key.") |
| for key in transcripts.keys(): | ||
| if not merged_transcripts: | ||
| merged_transcripts += transcripts[key].strip() | ||
| else: | ||
| t = transcripts[key].strip() | ||
| if len(t) > 1: | ||
| merged_transcripts += " " + t[0].lower() + t[1:] | ||
| else: | ||
| merged_transcripts += " " + t | ||
|
|
There was a problem hiding this comment.
This verification script claims to be a copy of the fixed production logic, but it treats transcripts[key] as a string. In the Flask/Django servers, transcript collections are dicts of events (e.g., {'transcript': ...}), so this script isn’t exercising the same data shape as production. Consider updating the fixture inputs and access pattern to match production, otherwise the script can give a false sense of correctness.
| def merge_and_split_transcripts(transcripts): | ||
| # Iterate through the sorted transcript keys. | ||
| sec = ".!?" | ||
| merged_transcripts = "" |
There was a problem hiding this comment.
By renaming/enabling this implementation as merge_and_split_transcripts, the Django API will now use it when sentences=true. However, the current implementation builds result entries as plain strings during sentence splitting but later treats result[last_chunk_id] as a dict with a transcript field. This produces inconsistent return types and will break callers that expect t[chunk_id]['transcript'] for every entry. Update the function so it always returns the same event dict shape as the input (e.g., each value is a dict containing at least {'transcript': str} plus any other metadata).
| keys = list(transcripts.keys()) | ||
| if keys: | ||
| last_key = keys[-1] | ||
| p = result.get(last_key) | ||
| if p: | ||
| result[last_key] = p + " " + merged_transcripts | ||
| else: | ||
| result[last_key] = merged_transcripts |
There was a problem hiding this comment.
In merge_and_split_transcripts, transcripts[key] appears to be treated as a string (calling .strip()), but this server stores transcript entries as dicts like {'transcript': transcript}. If this function is ever invoked (e.g., when sentences=true is fixed/used), it will raise an AttributeError and the merge logic won’t work. Update the merge logic to read/write transcripts[key]['transcript'] (and ensure the returned result matches the dict structure the endpoints expect).
|
Hi, I saw that a PR is already linked to this issue. |
|
I noticed the Copilot review flagged a few additional issues — the function I've opened PR #45 which covers these cases along with the original fix |
This PR fixes the TypeError: 'dict_keys' object is not subscriptable that occurs in Python 3 when merging transcripts without ending punctuation. It also standardizes the transcription utility logic in the Django app by replacing the identity placeholder with the actual merging logic.
#40