Skip to content

Fix/issue 40 dict keys type error#42

Open
prakashwaddar628 wants to merge 3 commits intosusiai:masterfrom
prakashwaddar628:fix/issue-40-dict-keys-TypeError
Open

Fix/issue 40 dict keys type error#42
prakashwaddar628 wants to merge 3 commits intosusiai:masterfrom
prakashwaddar628:fix/issue-40-dict-keys-TypeError

Conversation

@prakashwaddar628
Copy link
Copy Markdown

This PR fixes the TypeError: 'dict_keys' object is not subscriptable that occurs in Python 3 when merging transcripts without ending punctuation. It also standardizes the transcription utility logic in the Django app by replacing the identity placeholder with the actual merging logic.

#40

Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a Python 3 dict_keys indexing crash in transcript sentence-merging, and updates the Django app to use a real merge/split implementation instead of an identity placeholder, along with documentation cleanups.

Changes:

  • Fix Python 3 dict_keys non-subscriptable access by converting keys view to a list before indexing.
  • Replace Django transcription merge placeholder with the merge/split implementation.
  • Expand/modernize project and API documentation; add standalone reproduction/verification scripts.

Reviewed changes

Copilot reviewed 7 out of 8 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
verify_fix.py Adds an isolation script to validate merge/split behavior (currently has a failing expectation and data-shape mismatch).
reproduce_bug.py Adds a minimal reproduction script for the dict_keys indexing error.
flask/transcribe_server.py Applies the list(transcripts.keys())[-1] fix and updates docstrings (merge code still has a data-shape issue).
django/transcribe_app/views.py Improves endpoint docstrings and parameter documentation.
django/transcribe_app/transcribe_utils.py Switches Django to use merge/split logic instead of identity placeholder (current implementation returns inconsistent value types).
django/docs/index.md Rewrites docs landing page content (drops disable_toc: true directive used elsewhere).
README.md Updates project overview and quick-start instructions.
.gitignore Adds venv ignore pattern.
Comments suppressed due to low confidence (1)

django/transcribe_app/transcribe_utils.py:303

  • merge_and_split_transcripts is documented as iterating over sorted chunk IDs, but it currently loops over transcripts.keys() without sorting. Since chunk IDs are timestamp-like strings, this can merge/split sentences in the wrong order if inserts arrive out of order. Consider iterating with for chunk_id in sorted(transcripts.keys(), key=int): (or otherwise guaranteeing chronological order).
def merge_and_split_transcripts(transcripts):
    # Iterate through the sorted transcript keys.
    sec = ".!?"
    merged_transcripts = ""
    result = {}
    for chunk_id in transcripts.keys():
        transcript_event = transcripts[chunk_id]

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread django/docs/index.md
Comment on lines +1 to +3
# SUSI Translator Documentation

# Transcribe and Translate every audio stream
Welcome to the SUSI Translator documentation. This project provides a robust, real-time audio transcription and translation system.
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The initial comment front-matter line disable_toc: true was removed from index.md, but other pages in django/docs/ still use it. If the docs renderer relies on this directive, the index page may start showing an unwanted/incorrect table of contents or be parsed differently than the other docs pages. Consider keeping disable_toc: true at the top for consistency.

Copilot uses AI. Check for mistakes.
Comment thread verify_fix.py
Comment on lines +53 to +54
assert res["1700000001000"] == "welcome to the event"
print("Test passed: Unpunctuated tail handled correctly.")
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

merge_and_split_transcripts_logic() merges all provided chunks into merged_transcripts and, when there is no punctuation at all, assigns the entire merged string to the last key. With the current input, res['1700000001000'] will be "Hello everyone welcome to the event", so this assertion will fail. Either adjust the expected output or change the logic/test case to reflect the intended behavior.

Suggested change
assert res["1700000001000"] == "welcome to the event"
print("Test passed: Unpunctuated tail handled correctly.")
assert res["1700000001000"] == "Hello everyone welcome to the event"
print("Test passed: Unpunctuated content is merged and assigned to the last key.")

Copilot uses AI. Check for mistakes.
Comment thread verify_fix.py
Comment on lines +7 to +16
for key in transcripts.keys():
if not merged_transcripts:
merged_transcripts += transcripts[key].strip()
else:
t = transcripts[key].strip()
if len(t) > 1:
merged_transcripts += " " + t[0].lower() + t[1:]
else:
merged_transcripts += " " + t

Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This verification script claims to be a copy of the fixed production logic, but it treats transcripts[key] as a string. In the Flask/Django servers, transcript collections are dicts of events (e.g., {'transcript': ...}), so this script isn’t exercising the same data shape as production. Consider updating the fixture inputs and access pattern to match production, otherwise the script can give a false sense of correctness.

Copilot uses AI. Check for mistakes.
Comment on lines +297 to 300
def merge_and_split_transcripts(transcripts):
# Iterate through the sorted transcript keys.
sec = ".!?"
merged_transcripts = ""
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By renaming/enabling this implementation as merge_and_split_transcripts, the Django API will now use it when sentences=true. However, the current implementation builds result entries as plain strings during sentence splitting but later treats result[last_chunk_id] as a dict with a transcript field. This produces inconsistent return types and will break callers that expect t[chunk_id]['transcript'] for every entry. Update the function so it always returns the same event dict shape as the input (e.g., each value is a dict containing at least {'transcript': str} plus any other metadata).

Copilot uses AI. Check for mistakes.
Comment on lines +258 to +265
keys = list(transcripts.keys())
if keys:
last_key = keys[-1]
p = result.get(last_key)
if p:
result[last_key] = p + " " + merged_transcripts
else:
result[last_key] = merged_transcripts
Copy link

Copilot AI Apr 21, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In merge_and_split_transcripts, transcripts[key] appears to be treated as a string (calling .strip()), but this server stores transcript entries as dicts like {'transcript': transcript}. If this function is ever invoked (e.g., when sentences=true is fixed/used), it will raise an AttributeError and the merge logic won’t work. Update the merge logic to read/write transcripts[key]['transcript'] (and ensure the returned result matches the dict structure the endpoints expect).

Copilot uses AI. Check for mistakes.
@kirtisrivastava22
Copy link
Copy Markdown
Contributor

Hi, I saw that a PR is already linked to this issue.
I’d be happy to help review, test, or contribute improvements if needed. I’ve recently contributed and have the project set up locally.
Let me know if I can assist!

@kirtisrivastava22
Copy link
Copy Markdown
Contributor

I noticed the Copilot review flagged a few additional issues — the function
also needs to access transcripts[key]['transcript'] instead of treating
values as plain strings, and return consistent dict shapes so callers
like get_transcript don't break.

I've opened PR #45 which covers these cases along with the original fix
and includes tests. Tagging @mariobehling in case it's helpful for review!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

3 participants