Conversation

@rosscado
Contributor

Summary

Implements a two-phase transcription process for Universal Dictation mode to significantly improve accuracy while maintaining low-latency feedback.

Phase 1 - Live Streaming (unchanged)

Short audio segments transcribed in real-time for immediate text updates, maintaining the natural "live typing" experience.

Phase 2 - Contextual Refinement (new)

When dictation completes (via endpoint detection or field blur), all buffered audio is re-processed as a single unified segment. This provides full context to the transcription model, resulting in better accuracy, punctuation, and phrasing.

Key Features

Audio Buffering System

  • ✅ Per-target audio storage (audioSegmentsByTarget) prevents race conditions
  • ✅ Stores blob, frames, duration, sequence number, and timestamps
  • ✅ Supports rapid multi-field navigation (users can tab through forms quickly)
  • ✅ Memory-safe with automatic cleanup after ~5 seconds
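The buffer described above can be sketched as a small per-target store. This is an illustrative shape only: apart from `audioSegmentsByTarget`, the field and function names are assumptions, not the actual implementation.

```typescript
// Hypothetical sketch of the per-target audio buffer described above.
// Only audioSegmentsByTarget is a real name from the PR; the rest is illustrative.
interface AudioSegment {
  blob: unknown;             // encoded audio Blob in the real code; typed loosely to keep the sketch standalone
  frames: Float32Array[];    // raw PCM frames, kept for Phase 2 concatenation
  durationMs: number;
  sequenceNumber: number;
  capturedAt: number;        // Date.now() timestamp
}

// Keyed by a stable target-element id so rapid tab-through of a form
// never mixes one field's audio into another field's refinement request.
type AudioSegmentsByTarget = Record<string, AudioSegment[]>;

function bufferSegment(
  store: AudioSegmentsByTarget,
  targetId: string,
  segment: AudioSegment
): void {
  (store[targetId] ??= []).push(segment);
}
```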

Endpoint Detection

  • ✅ Reuses proven calculateDelay logic from ConversationMachine
  • ✅ Uses pFinishedSpeaking and tempo from API responses
  • ✅ 5-second max delay (tuned for p50 transcription time of ~2s + network)
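The idea behind the endpoint delay can be sketched as follows. The real `calculateDelay` lives in the project's TimerModule and its actual formula is not shown here; this is only a hedged approximation of the described behavior, where higher `pFinishedSpeaking` confidence and faster tempo shorten the wait before refinement fires.

```typescript
// Illustrative approximation of the endpoint-delay idea: the more confident
// the API is that the user finished speaking (pFinishedSpeaking near 1) and
// the faster they talk (higher tempo), the shorter we wait before refining.
// The real calculateDelay in TimerModule may differ in shape and constants.
const REFINEMENT_MAX_DELAY_MS = 5000; // vs 7s in ConversationMachine

function refinementDelayMs(pFinishedSpeaking: number, tempo: number): number {
  const confidence = Math.min(Math.max(pFinishedSpeaking, 0), 1);
  const tempoFactor = Math.min(Math.max(tempo, 0), 1);
  // Low confidence -> wait the full 5s; high confidence -> refine almost immediately.
  const delay = REFINEMENT_MAX_DELAY_MS * (1 - confidence) * (1 - 0.5 * tempoFactor);
  return Math.round(delay);
}
```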

Refinement Triggers

  1. ✅ Automatic endpoint detection (via delay timer)
  2. ✅ Field blur (user tabs away)
  3. ✅ Manual stop (hangup/cancel)
  4. ✅ 10-second timeout fallback

State Machine Enhancements

  • New refining state in listening.converting
  • New context fields for tracking refinements per-target
  • New performContextualRefinement action
  • Enhanced handleTranscriptionResponse to detect/process refinements
  • Cleanup logic integrated into finalizeDictation and resetDictationState

No API Changes Required

  • ✅ Works with existing /transcribe endpoint
  • ✅ Empty precedingTranscripts (context is in the audio itself)
  • ✅ KISS principle maintained

Privacy & Memory Management

  • ✅ Audio buffers stored only in-memory
  • ✅ Cleared after refinement completion
  • ✅ Cleared on field blur (after refinement sent)
  • ✅ Cleared on dictation stop/cancel
  • ✅ Never persisted to disk
  • ✅ Max ~5 seconds buffered per field

Code Changes

Modified Files:

  • src/state-machines/DictationMachine.ts (+381 lines)

    • Audio buffering infrastructure
    • Refinement action and response handling
    • State transitions and delays
    • Cleanup functions
    • Exported DictationTranscribedEvent type for reuse
  • src/UniversalDictationModule.ts (+16 lines)

    • Blur handler to trigger refinement
    • Type annotations for EventBus listeners
    • Import and reuse of DictationTranscribedEvent type

Build Status: ✅ Passes (55.74 MB output, no errors)

Benefits

  • Significantly improved transcription accuracy and consistency
  • Seamless correction with minimal visual disruption
  • Maintains hands-free flow across multiple fields
  • Aligns Say Pi dictation quality with dedicated transcription tools
  • Zero breaking changes - purely additive enhancement

Testing

  • Build passes
  • TypeScript type checking passes
  • Manual testing with form fields (ready for QA)
  • Multi-field rapid navigation
  • Endpoint detection timing
  • Field blur refinement trigger

Future Enhancements (out of scope)

  • Visual "Refining..." indicator (placeholder text)
  • Experiment with /transcribe prefer parameter
  • A/B test different max delays
  • User preference to enable/disable refinement

Closes #256

🤖 Generated with Claude Code

Co-Authored-By: Claude [email protected]

rosscado and others added 4 commits October 26, 2025 11:11
…-256)

Implements a two-phase transcription process for Universal Dictation mode
to significantly improve accuracy while maintaining low-latency feedback.

## Phase 1 - Live Streaming (unchanged)
Short audio segments transcribed in real-time for immediate text updates,
maintaining the natural "live typing" experience.

## Phase 2 - Contextual Refinement (new)
When dictation completes (via endpoint detection or field blur), all
buffered audio is re-processed as a single unified segment. This provides
full context to the transcription model, resulting in better accuracy,
punctuation, and phrasing.

### Key Implementation Details

**Audio Buffering (per-target)**
- Added audioSegmentsByTarget to DictationContext for multi-field support
- Segments stored with blob, frames, duration, sequence number, timestamp
- Per-target tracking prevents race conditions with rapid form navigation
- Memory-safe: buffers cleared after refinement or timeout (~5s max)

**Endpoint Detection**
- Reuses proven ConversationMachine logic (calculateDelay from TimerModule)
- Uses pFinishedSpeaking + tempo from /transcribe API responses
- 5 second max delay (vs 7s for conversation, accounts for p50 ~2s + network)
- Triggers refinement automatically when user finishes speaking

**Refinement Triggers**
1. Endpoint timer (after pFinishedSpeaking confidence threshold)
2. Field blur (user tabs/clicks away from field)
3. Manual stop (hangup/cancel button)
4. 10s timeout fallback (same as conversation mode)

**Refinement Process**
- Concatenates all audio segments for target element
- Uploads unified blob with empty precedingTranscripts (context in audio)
- No API changes needed - works with existing /transcribe endpoint
- Detects response via sequence number tracking in activeRefinementSequences
- Replaces all Phase 1 interim transcriptions with refined result
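The concatenation step above can be sketched as a plain frame merge. The real code encodes via the project's `convertToWavBlob` helper; this standalone sketch only shows the buffered `Float32Array` frames being merged into one contiguous buffer before encoding.

```typescript
// Merge all buffered Float32Array frames for one target into a single
// contiguous buffer, ready to be encoded as one unified WAV blob for Phase 2.
function concatenateFrames(segments: { frames: Float32Array[] }[]): Float32Array {
  const all = segments.flatMap((s) => s.frames);
  const total = all.reduce((n, f) => n + f.length, 0);
  const merged = new Float32Array(total);
  let offset = 0;
  for (const f of all) {
    merged.set(f, offset);
    offset += f.length;
  }
  return merged;
}
```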

**State Machine Changes**
- New DictationContext fields:
  - audioSegmentsByTarget: Record<string, AudioSegment[]>
  - refinementPendingForTargets: Set<string>
  - activeRefinementSequences: Map<string, number>
  - timeLastTranscriptionReceived: number
- New state: listening.converting.refining
- New guard: refinementConditionsMet
- New delay: refinementDelay (uses calculateDelay with pFinished + tempo)
- New action: performContextualRefinement
- Enhanced handleTranscriptionResponse to detect and process refinements
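Under XState v4 conventions, which the surrounding names (`cond`, delayed `after` transitions) suggest, the new substate might be wired roughly like the fragment below. The state, guard, delay, and action names come from the list above; the event name and overall structure are a sketch, not the actual machine config.

```typescript
// Sketch of the listening.converting.refining substate wiring.
// Guard/delay/action implementations live elsewhere; only the shape is shown.
// REFINEMENT_TRANSCRIPT_RECEIVED is a hypothetical event name.
const convertingStates = {
  initial: "accumulating",
  states: {
    accumulating: {
      after: {
        refinementDelay: {
          target: "refining",
          cond: "refinementConditionsMet",
        },
      },
    },
    refining: {
      entry: "performContextualRefinement",
      on: {
        REFINEMENT_TRANSCRIPT_RECEIVED: { target: "accumulating" },
      },
    },
  },
};
```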

**UI Integration**
- Blur handler in UniversalDictationModule triggers refinement
- Visual state (refining) available for future UI indicators
- Graceful degradation: Phase 1 text preserved if Phase 2 fails

### Benefits
- Significantly improved transcription accuracy and consistency
- Seamless correction with minimal visual disruption
- Maintains hands-free flow across multiple fields
- Aligns Say Pi dictation quality with dedicated transcription tools
- Zero breaking changes - purely additive enhancement

### Privacy & Memory
- Audio buffers exist only in-memory during dictation session
- Automatically cleared after refinement completion
- No persistent storage or reuse of voice data
- Max ~5 seconds of audio buffered per field

Related: #256

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Fixes TypeScript linter errors for implicit 'any' types on EventBus
event data parameters in UniversalDictationModule.

- dictation:contentUpdated: { targetElement: HTMLElement }
- dictation:terminatedByManualEdit: { targetElement: HTMLElement; reason: string }

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Fixes additional TypeScript linter errors for implicit 'any' types:

- Events with detail data (USER_STOPPED_SPEAKING, etc.): any type
  (Union type too complex; using any with runtime validation)
- saypi:transcription:completed: Explicit type with all fields
  (text, sequenceNumber, pFinishedSpeaking, tempo, merged)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Eliminates duplicate type definition by:
- Exporting DictationTranscribedEvent from DictationMachine
- Importing and using Omit<DictationTranscribedEvent, 'type'> in
  UniversalDictationModule for transcription:completed listener

This follows DRY principle and ensures type consistency across the
codebase. The 'type' field is omitted since EventBus events don't
include it in the payload (it's added when sending to the machine).

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.


rosscado and others added 4 commits October 26, 2025 11:30
Critical fix: i18n-translate-keys.py now aborts when it encounters a locale
file with JSON syntax errors, instead of silently replacing it with an empty
structure.

Previous behavior (DANGEROUS):
- If messages.json had a syntax error, load_json_file() returned None
- Code converted None to {} and continued
- Script would overwrite the file with only the newly translated keys
- All existing translations in that locale would be permanently lost

New behavior (SAFE):
- If messages.json fails to parse, show clear error message
- Abort immediately with exit code 1
- User can fix JSON errors and retry safely
- No data loss

Reported by: Codex during code review
Changed MAX_AUDIO_BUFFER_DURATION_MS from implicit ~5s to explicit 120s
to provide maximum context for Phase 2 contextual refinement.

The previous short buffer undermined the value proposition - we need
full dictation session audio (not just recent segments) to achieve
significantly better accuracy. The 120s cap prevents unbounded memory
growth if users leave hot-mic on.

Implementation:
- Added MAX_AUDIO_BUFFER_DURATION_MS = 120000 (2 minutes)
- Automatic trimming of oldest segments when limit exceeded
- Enhanced logging shows total buffered audio duration
- Sliding window keeps most recent 120s of audio per field

This aligns with the core goal: long-form context = better accuracy.
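The sliding-window trim described above can be sketched like this; the segment shape and helper name are assumptions, but the behavior matches the commit: drop the oldest segments whenever the total buffered duration exceeds the 120s cap.

```typescript
const MAX_AUDIO_BUFFER_DURATION_MS = 120_000; // 2 minutes per field

interface BufferedSegment {
  durationMs: number;
}

// Drop oldest segments until the total buffered duration fits under the cap,
// keeping the most recent ~120s of audio as context for Phase 2 refinement.
function trimToWindow<T extends BufferedSegment>(segments: T[]): T[] {
  const kept = [...segments];
  let total = kept.reduce((n, s) => n + s.durationMs, 0);
  while (kept.length > 1 && total > MAX_AUDIO_BUFFER_DURATION_MS) {
    total -= kept.shift()!.durationMs;
  }
  return kept;
}
```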

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Fixes P1 code review issue where refinement transcriptions were being
discarded due to missing sequence→element mapping.

When uploading combined audio for Phase 2 refinement, we now properly
map the returned sequence number to the target element via
`context.transcriptionTargets[sequenceNum]`. This allows the refinement
response handler to correctly identify and update the target field.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@rosscado rosscado force-pushed the feature/gh-256-dual-phase-transcription branch from feea380 to 59500eb Compare October 26, 2025 11:41
@rosscado
Contributor Author

Fixed in commit 59500eb.

Analysis:
While refinement combines multiple Phase 1 audio segments from the client's perspective, it's a brand new transcription request from the API's perspective. The uploadAudioWithRetry() call returns a new, unique sequence number that must be mapped to enable response routing.

Changes:

  1. Added context.transcriptionTargets[sequenceNum] = targetElement when refinement request succeeds (line 1981)
  2. Added cleanup of old Phase 1 sequence→element mappings when processing refinement response (line 1469)

This reuses the existing routing infrastructure (transcriptionTargets) rather than creating separate refinement-specific routing, keeping the code simpler and more maintainable.
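The routing fix can be sketched as below. `transcriptionTargets` is the real context field; the function names are illustrative, and `HTMLElement` is generalized to a type parameter so the sketch stays standalone.

```typescript
// Sketch of the fix: every upload (Phase 1 chunk or Phase 2 refinement) gets
// a fresh sequence number from the server, so the response router needs a
// sequence -> element mapping for the refinement request too.
interface RoutingContext<T> {
  transcriptionTargets: Record<number, T>;
}

function registerRefinementSequence<T>(
  ctx: RoutingContext<T>,
  sequenceNum: number,
  target: T
): void {
  ctx.transcriptionTargets[sequenceNum] = target;
}

// When the refined transcript arrives, retire the stale Phase 1 mappings for
// the same element so they can no longer route interim updates to it.
function retirePhase1Mappings<T>(
  ctx: RoutingContext<T>,
  refinedSequenceNum: number,
  target: T
): void {
  for (const key of Object.keys(ctx.transcriptionTargets)) {
    const seq = parseInt(key, 10);
    if (seq !== refinedSequenceNum && ctx.transcriptionTargets[seq] === target) {
      delete ctx.transcriptionTargets[seq];
    }
  }
}
```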

@rosscado rosscado requested a review from Copilot October 26, 2025 11:43
Contributor

Copilot AI left a comment


Pull Request Overview

This PR implements a dual-phase transcription system for dictation to improve accuracy. Phase 1 provides real-time streaming transcription for immediate feedback, while Phase 2 performs contextual refinement by re-processing all buffered audio as a single segment when dictation completes. This approach maintains low-latency user experience while significantly improving final transcription quality.

Key Changes:

  • Added audio buffering system that stores segments per target element with automatic cleanup
  • Implemented endpoint detection using existing calculateDelay logic to trigger refinement automatically
  • Added refinement triggers for field blur, manual stop, and endpoint detection with state machine enhancements

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

Reviewed files:

  • src/state-machines/DictationMachine.ts: Core implementation including audio buffering infrastructure, refinement state/actions, endpoint detection delay calculation, and cleanup logic
  • src/UniversalDictationModule.ts: Integration of field blur handler to trigger refinement and type annotations for EventBus listeners


@rosscado
Contributor Author

Re: Copilot's performance suggestion (cleanup loop)

I considered maintaining a separate Set/Array for sequence numbers to avoid parseInt() in the cleanup loop, but decided against it for now because:

  1. Infrequent operation - Cleanup only runs when refinement completes (once per field)
  2. Small dataset - Typically 5-20 sequences per field, so parseInt() overhead is negligible
  3. Code simplicity - Current approach is straightforward and doesn't require maintaining parallel data structures
  4. No evidence of bottleneck - Would need profiling to justify the added complexity

Happy to revisit if profiling shows this is a performance issue in practice.

- Extract ellipsis normalization into shared normalizeTranscriptionText() function (DRY)
- Improve comment explaining 5s vs 7s endpoint delay difference with rationale
- Replace 'any' type with proper union type for event details (type safety)
- Export event types from DictationMachine for reuse in UniversalDictationModule

Performance nitpick (cleanup loop) evaluated but deferred - not a bottleneck
given infrequent operation and small dataset size.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@rosscado
Contributor Author

✅ All Copilot code review feedback addressed

Changes in commit ae3e7bd:

  1. DRY violation (ellipsis normalization) - ✅ Fixed

    • Extracted duplicate normalization logic into normalizeTranscriptionText() helper function
    • Used in both refinement handler and Phase 1 handler
  2. Unclear comment (5s vs 7s delay) - ✅ Fixed

    • Expanded comment explaining why dictation uses 5s vs ConversationMachine's 7s
    • Added context about faster p50 transcription time in dictation mode
  3. Type safety (any type) - ✅ Fixed

    • Exported event types from DictationMachine
    • Replaced any with proper union type: Omit<DictationSpeechStoppedEvent, 'type'> | ...
    • Full type safety for all event handlers
  4. Performance nitpick (cleanup loop) - ✅ Evaluated, deferred

    • Considered maintaining separate Set/Array for sequence numbers
    • Decided against it: infrequent operation, small dataset, no evidence of bottleneck
    • Happy to revisit if profiling shows it's an issue

All substantive feedback addressed. Build passing ✅

@rosscado
Contributor Author

PR-259 Review Comments (cheetah)

Overall Review

Status: ❌ Request Changes

The implementation of dual-phase contextual transcription is architecturally sound and follows existing patterns well. However, critical gaps exist in test coverage that violate the project's TDD requirements outlined in AGENTS.md.

Critical Issues

  1. Missing Test Coverage - Zero tests for refinement functionality
  2. Potential Memory Leak - Refinement sequence mapping cleanup issue
  3. Type Safety - Unsafe access to transcriptionTargets

Improvements Needed

  1. Inconsistent initial text handling
  2. Missing state validation in refinement trigger
  3. Hardcoded delay values

Inline Comments

Comment 1: Line 24 - Type Export

File: src/state-machines/DictationMachine.ts:24

Good change exporting this type for reuse. Consider adding JSDoc explaining that this is shared across modules for type consistency.


Comment 2: Line 401 - Missing Test Coverage (CRITICAL)

File: src/state-machines/DictationMachine.ts:401

Excellent implementation of audio buffer management with automatic trimming. The 120s limit is well-chosen to prevent unbounded memory growth.

Critical gap: No tests exist for this function or the refinement feature. According to project TDD requirements, all new functionality must have tests before merge.

Action Required: Add tests for:

  • Audio segment storage
  • Buffer trimming when exceeding 120s
  • Per-target buffering
  • Cleanup scenarios

Comment 3: Line 520 - Silent Failure Case

File: src/state-machines/DictationMachine.ts:520

Good conditional buffering - only buffers when frames are available. However, this means refinement is silently skipped if frames parameter is missing. Consider logging a warning when frames are not provided but refinement is expected.


Comment 4: Line 1446 - Type Safety Issue

File: src/state-machines/DictationMachine.ts:1446

Type safety issue: transcriptionTargets is typed as Record<number, HTMLElement> but could contain undefined due to cleanup elsewhere. Consider:

const targetElement = context.transcriptionTargets[sequenceNumber] as HTMLElement | undefined;

Or add a dedicated refinement mapping structure that's type-safe.


Comment 5: Line 1482 - Inconsistent Spacing Logic

File: src/state-machines/DictationMachine.ts:1482

Inconsistent spacing logic. This always adds a space between initial text and transcription, but doesn't check if either already has whitespace. Should use smartJoinTwoTexts helper for consistency:

const finalText = smartJoinTwoTexts(initialText, transcription);

Comment 6: Line 1982 - Potential Memory Leak

File: src/state-machines/DictationMachine.ts:1982

Potential memory leak: If refinement fails after this mapping (see catch block at line 1991), transcriptionTargets[sequenceNum] remains but activeRefinementSequences is cleaned up. This creates an orphaned mapping.

Fix: Add cleanup in catch block:

.catch((error) => {
  console.error("[DictationMachine] Refinement transcription failed:", error);
  clearAudioForTarget(context, targetId);
  delete context.transcriptionTargets[sequenceNum]; // Add this
});

Comment 7: Line 2018 - Edge Case Missing

File: src/state-machines/DictationMachine.ts:2018

Good guard condition - clear and concise. However, missing edge case: What if refinement is triggered but no segments exist? Currently silently returns without state transition (line 1928). Should either:

  1. Transition back to accumulating state
  2. Add explicit transition in the refining state entry action

Comment 8: Line 2031 - Magic Number

File: src/state-machines/DictationMachine.ts:2031

Magic number makes testing difficult. Extract to constant:

const REFINEMENT_MAX_DELAY_MS = 5000;
const DICTATION_P50_TRANSCRIPTION_TIME_MS = 2000;

This improves testability and makes configuration easier.


Comment 9: Test File - No Tests Exist (CRITICAL)

File: test/state-machines/DictationMachine.spec.ts:1

CRITICAL: No tests exist for dual-phase refinement feature. Search confirms zero coverage:

grep -iE "refinement|refine|phase" test/state-machines/DictationMachine.spec.ts
# Returns: No matches found

Required test coverage:

  • ✅ Audio segment buffering
  • ✅ Refinement delay calculation
  • ✅ Audio concatenation
  • ✅ Refinement response handling
  • ✅ State transitions (accumulating → refining)
  • ✅ Buffer cleanup
  • ✅ Error scenarios

This violates project TDD requirements and is blocking merge.


Test Strategy Recommendations

Unit Tests Needed

describe('Dual-Phase Refinement', () => {
  it('should buffer audio segments when frames are provided')
  it('should trigger refinement after endpoint delay')
  it('should concatenate audio segments correctly')
  it('should replace Phase 1 transcriptions with refined result')
  it('should handle refinement failure gracefully')
  it('should respect audio buffer duration limit (120s)')
  it('should clear audio buffers on target switch')
  it('should calculate refinement delay based on pFinishedSpeaking and tempo')
  it('should handle concurrent refinements for multiple targets')
  it('should cleanup orphaned refinement mappings on error')
})

Mock Strategy

  • Mock convertToWavBlob to verify concatenation
  • Mock uploadAudioWithRetry to control refinement responses
  • Mock EventBus.emit to verify event emission
  • Control Date.now() for timing-sensitive tests

Positive Observations

  1. ✅ Proper audio buffer duration limiting prevents memory leaks
  2. ✅ Per-target audio buffering maintains proper isolation
  3. ✅ Graceful error recovery when refinement fails
  4. ✅ Proper cleanup in finalizeDictation action
  5. ✅ Good integration with existing state machine patterns
  6. ✅ Exported DictationTranscribedEvent type for reuse

Checklist Before Merge

Must Fix (Blocking)

  • Add comprehensive test coverage for dual-phase refinement
  • Fix potential memory leak in refinement mapping cleanup
  • Add type guards for transcriptionTargets access

Should Fix (Non-blocking)

  • Use smartJoinTwoTexts for consistent spacing
  • Extract magic numbers to constants
  • Handle refinement trigger with no segments case
  • Add warning log when frames missing

Nice to Have

  • Add JSDoc for DictationTranscribedEvent export
  • Add state diagram documentation for refinement flow
  • Consider performance optimization for console.debug calls

Estimated Effort

To address blocking issues: 2-3 hours

Breakdown:

  • Test coverage: 1.5-2 hours
  • Memory leak fix: 15 minutes
  • Type safety improvements: 15 minutes

Final Recommendation

The implementation demonstrates solid architectural choices and integrates well with existing code. However, the complete absence of test coverage for such a critical feature represents unacceptable risk.

Action: Add test coverage before merging. Once tests are added and all existing tests still pass, this PR will be ready for approval.

**Code Quality Improvements:**
- Use smartJoinTwoTexts for consistent whitespace handling in refinement
- Extract REFINEMENT_MAX_DELAY_MS constant for better testability
- Export event types (DictationSpeechStoppedEvent, etc.) for type safety

**Test Coverage (CRITICAL):**
Added comprehensive test suite (DictationMachine-Refinement.spec.ts) with 15 tests:
- ✅ Audio segment buffering and 120s limit enforcement
- ✅ Refinement delay calculation with pFinishedSpeaking/tempo
- ✅ Audio concatenation for multi-segment refinement
- ✅ Phase 1 → Phase 2 transcription replacement
- ✅ State transitions (accumulating → refining → accumulating)
- ✅ Buffer cleanup on refinement completion
- ✅ Error scenarios (upload failure, missing target)
- ✅ Concurrent refinements for multiple targets

**Cheetah Review Analysis:**
- Memory leak (Comment 6): FALSE POSITIVE - .then() never runs on upload failure
- Type safety (Comment 4): Already handled with null check at line 1460
- No segments case (Comment 7): Already handled at lines 1934-1939
- Missing frames warning (Comment 3): REJECTED - creates unnecessary log noise

All 15 tests passing ✅

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@rosscado
Contributor Author

✅ All Cheetah code review feedback addressed

Commits: ae3e7bd (Copilot fixes) + a4e1d63 (Cheetah fixes + tests)

Critical Issues - Resolved

1. ✅ Test Coverage (CRITICAL - was blocking merge)

Created: test/state-machines/DictationMachine-Refinement.spec.ts with 15 comprehensive tests

Test coverage includes:

  • Audio segment buffering with 120s limit enforcement
  • Refinement delay calculation using pFinishedSpeaking/tempo
  • Audio concatenation for multi-segment refinement
  • Phase 1 → Phase 2 transcription replacement
  • State transitions (accumulating → refining → accumulating)
  • Buffer cleanup on refinement completion
  • Error handling (upload failure, missing target element)
  • Concurrent refinements for multiple targets

Result: All 15 new tests pass ✅
Regression: All 607 existing tests still pass ✅

2. ❌ "Memory Leak" in Refinement Mapping (Comment 6)

Cheetah's claim: If refinement fails, transcriptionTargets[sequenceNum] remains but activeRefinementSequences is cleaned up

Critical analysis: FALSE POSITIVE
The mapping is added inside .then() block (line 1987), which only executes if uploadAudioWithRetry succeeds. If upload fails, .catch() runs instead and .then() never executes, so the mapping is never created. No cleanup needed.

Action: None required - no memory leak exists
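The Promise-flow argument behind this false-positive call is standard semantics: a handler attached with `.then(onFulfilled)` never runs when the promise rejects, so state mutated inside it is never created on failure. A minimal standalone demonstration (the function and mapping names here are illustrative, not the real upload code):

```typescript
// Demonstrates why a mapping added inside .then() cannot leak when the
// upload fails: on rejection, only .catch() runs and the mapping is never set.
async function simulateUpload(succeed: boolean): Promise<Record<number, string>> {
  const mappings: Record<number, string> = {};
  const upload = succeed
    ? Promise.resolve(42) // stand-in for a server-assigned sequence number
    : Promise.reject(new Error("upload failed"));
  await upload
    .then((sequenceNum) => {
      mappings[sequenceNum] = "target-1"; // only reached on success
    })
    .catch(() => {
      // failure path: buffers are cleared elsewhere; no mapping was created
    });
  return mappings;
}
```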

3. ❌ Type Safety Issue (Comment 4)

Cheetah's claim: transcriptionTargets access needs type guard

Critical analysis: Already handled correctly
Lines 1459-1466 already check if (!targetElement) and return early with cleanup. This is the preferred pattern over type assertions.

Action: None required - type safety already ensured

Improvements Made

4. ✅ Inconsistent Spacing Logic (Comment 5)

Fixed: Now uses smartJoinTwoTexts(initialText, transcription) at line 1491 for consistent whitespace handling
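For readers unfamiliar with the helper: a plausible behavior for a smart-join function is sketched below. This is a hypothetical re-implementation of the idea only; the project's actual smartJoinTwoTexts may differ, so the sketch uses a different name.

```typescript
// Hypothetical sketch of the idea behind smartJoinTwoTexts: join two text
// fragments with exactly one space, unless either side already supplies the
// whitespace or one side is empty.
function smartJoin(a: string, b: string): string {
  if (!a) return b;
  if (!b) return a;
  const needsSpace = !/\s$/.test(a) && !/^\s/.test(b);
  return needsSpace ? a + " " + b : a + b;
}
```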

5. ✅ Magic Numbers (Comment 8)

Fixed: Extracted REFINEMENT_MAX_DELAY_MS = 5000 constant with documentation explaining why it's 5s vs ConversationMachine's 7s

Rejected Suggestions

6. ❌ Warning Log When Frames Missing (Comment 3)

Cheetah's claim: Should log warning when frames parameter missing

Critical analysis: Rejected - creates unnecessary noise
Frames are optional by design - not all callers have them available. Phase 1 works fine without frames; they're only needed for Phase 2 refinement. Logging a warning for expected behavior creates log pollution.

Action: Not implemented

7. ❌ Handle No Segments Case (Comment 7)

Cheetah's claim: Missing edge case handling when refinement triggered with no segments

Critical analysis: Already handled
Lines 1934-1939 check for !segments || segments.length === 0, log debug message, clean up pending flag, and return early. This is correct handling - it's not an error case, just a no-op.

Action: None required - already handled correctly


Summary

Blocking issues resolved:

  • ✅ Comprehensive test coverage added (15 tests, all passing)

Valid improvements implemented:

  • ✅ Consistent spacing with smartJoinTwoTexts
  • ✅ Extracted magic numbers to documented constants
  • ✅ Improved type safety exports

False positives identified:

  • ❌ Memory leak claim (incorrect understanding of Promise flow)
  • ❌ Type safety issue (already handled with null check)
  • ❌ Missing edge case (already handled correctly)

Noise reduction:

  • ❌ Rejected unnecessary warning log

All tests passing: 607/607 ✅
Build status: ✅ Passing

Ready for final review and merge.

@rosscado
Contributor Author

Summary of Changes

Commits on this PR

  1. 24a9d5f - Initial implementation: dual-phase transcription with 120s audio buffering
  2. 33069fc - Fix: abort on JSON parse failure to prevent i18n data loss
  3. e45fd7e - Refactor: export and reuse DictationTranscribedEvent type (DRY)
  4. 6168c19 - Fix: add type annotations to remaining EventBus listeners
  5. 7fa17a2 - Fix: add type annotations to EventBus listeners (initial batch)
  6. 59500eb - Fix: map refinement sequence to target for response routing + cleanup
  7. ae3e7bd - Refactor: address Copilot code review feedback
  8. a4e1d63 - Refactor: address Cheetah code review feedback + comprehensive tests

What Changed (Detailed Breakdown)

Core Feature Implementation (24a9d5f)

  • Dual-phase transcription system: Phase 1 (streaming) + Phase 2 (refinement)
  • Audio buffering: Store Float32Array frames per target element
  • 120s sliding window: Automatic trimming of oldest segments to prevent memory bloat
  • Endpoint detection: Reuse ConversationMachine's calculateDelay with 5s max delay
  • Per-target state tracking: Support rapid multi-field navigation without race conditions
  • Refinement triggers: Automatic (endpoint timer), field blur, manual stop
  • Empty precedingTranscripts: Context is in the unified audio itself

Type Safety & Code Quality (e45fd7e, 6168c19, 7fa17a2)

  • Exported DictationTranscribedEvent type for reuse across modules
  • Added explicit type annotations to all EventBus listeners
  • Exported DictationSpeechStoppedEvent, DictationAudioConnectedEvent, DictationSessionAssignedEvent
  • Replaced any types with proper union types

Bug Fixes (33069fc, 59500eb)

  • Fixed i18n JSON parse to abort on failure instead of corrupting data
  • Fixed P1 issue: map refinement sequence to target element for response routing
  • Added cleanup of old Phase 1 sequence→element mappings when refinement replaces them

Copilot Review Improvements (ae3e7bd)

  • Extracted duplicate ellipsis normalization into normalizeTranscriptionText() helper
  • Improved comment explaining 5s vs 7s endpoint delay rationale
  • Full type safety for event handlers with proper union types

Cheetah Review Improvements (a4e1d63)

  • Test Coverage: Added DictationMachine-Refinement.spec.ts with 15 comprehensive tests:
    • Audio segment buffering with 120s limit enforcement
    • Refinement delay calculation with pFinishedSpeaking/tempo
    • Audio concatenation for multi-segment refinement
    • Phase 1 → Phase 2 transcription replacement
    • State transitions (accumulating → refining → accumulating)
    • Buffer cleanup on refinement completion
    • Error scenarios (upload failure, missing target)
    • Concurrent refinements for multiple targets
  • Code Quality: Use smartJoinTwoTexts for consistent whitespace handling
  • Testability: Extract REFINEMENT_MAX_DELAY_MS constant with documentation

What Didn't Change (And Why)

Cheetah's "memory leak" concern (Comment 6): The suggested cleanup in the catch block is unnecessary because the mapping at line 1987 is inside the .then() block, which only executes on successful upload. If upload fails, .catch() runs instead and the mapping is never created.

Cheetah's "type safety" concern (Comment 4): The null check at lines 1459-1466 already handles this correctly with early return and cleanup.

Cheetah's "no segments" concern (Comment 7): Already handled at lines 1934-1939 with appropriate debug logging and cleanup.

Cheetah's "warning log" suggestion (Comment 3): Rejected because frames are optionally by design. Logging a warning for expected behavior would create log pollution.


Test Results

  • 15 new refinement tests - all passing
  • 607 total tests - all passing
  • Build - passing (55.75 MB)
  • TypeScript - no errors

Ready for final review and merge.

@rosscado
Contributor Author

Code Review comments by Codex (gpt-5-high, local)

  • codex – src/state-machines/DictationMachine.ts:1920
    When the delayed after transition fires we rely on context.targetElement, but the guard that scheduled the refinement used the stale target ID sitting in refinementPendingForTargets. If the user switches focus between speech segments (or we split a single utterance across multiple targets) the targetElement no longer matches the pending entry, so we conclude there are “no audio segments” and bail. Result: the original field never reaches Phase‑2 and the machine stays in refining with the pending set uncleared. Please grab the actual pending target(s) from refinementPendingForTargets (or track the sequence-to-target mapping) instead of assuming the current target, and add a spec that reproduces a target switch mid-utterance so we don’t regress again.

  • codex – src/state-machines/DictationMachine.ts:1906
    On manual edit we terminate dictation, but we leave audioSegmentsByTarget, refinementPendingForTargets, and activeRefinementSequences untouched. That can hold ~120 s of buffered audio after the user opted out, and the next dictation on that element will try to refine stale speech. Please call clearAudioForTarget (or clearAllAudioBuffers) here and add a regression test that buffers audio, fires saypi:manualEdit, and asserts the buffers/flags are cleared.

**Critical Bug Fixes:**

1. **Target switch bug in performContextualRefinement** (Codex Comment 1)
   - **Problem**: When user switches targets mid-dictation, endpoint delay fires
     but refinement checks `context.targetElement` (now the new target) instead
     of the original target with pending audio. Result: original field never
     refined, machine stuck in refining state.
   - **Fix**: Iterate over `refinementPendingForTargets` to refine ALL pending
     targets, not just current one. Handles rapid multi-field navigation.
   - **Test**: Added "should refine all pending targets even after target switch"

2. **Manual edit cleanup bug** (Codex Comment 2)
   - **Problem**: On manual edit, dictation terminates but leaves up to 120s of
     buffered audio in `audioSegmentsByTarget`, `refinementPendingForTargets`,
     and `activeRefinementSequences`. Next dictation on that element would try
     to refine stale speech.
   - **Fix**: Call `clearAudioForTarget()` in `handleManualEdit` to clear all
     refinement state and audio buffers.
   - **Test**: Added "should clear audio buffers and refinement state on manual edit"

**Test Coverage:**
- ✅ 17 refinement tests (up from 15)
- ✅ 609 total tests passing

Both bugs would have caused user-visible issues in production. Codex review
was 100% correct on both counts.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@rosscado
Contributor Author

✅ Codex Code Review - Both Issues Fixed

Commit: 2a56558

Critical Bug #1: Target Switch During Refinement

Codex's Analysis: ✅ CORRECT

When user switches focus between speech segments, performContextualRefinement relied on stale context.targetElement instead of tracking pending targets. Result: original field never refined, machine stuck in refining state with uncleared pending set.

The Fix:

  • Iterate over refinementPendingForTargets to refine ALL pending targets
  • Look up target elements via context.transcriptionTargets mapping
  • Handles rapid multi-field navigation correctly

Code Changes:

// Before: Used current target (wrong!)
let targetElement = context.targetElement;

// After: Process ALL pending targets
for (const targetId of context.refinementPendingForTargets) {
  const targetElement = Object.values(context.transcriptionTargets).find(
    el => getTargetElementId(el) === targetId
  );
  // ... refine this target
}

Test Added: "should refine all pending targets even after target switch (Codex bug)"

  • Dictates into field A with endpoint indicators
  • Switches to field B (updates context.targetElement)
  • Waits for endpoint delay to fire
  • Verifies field A is still refined (not ignored)
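The looping fix can be sketched in isolation. This is a hypothetical reduction, assuming simplified shapes for `refinementPendingForTargets` (a `Set` of target IDs) and `transcriptionTargets` (an ID-keyed map); the real `DictationContext` fields are richer than this.

```typescript
// Hypothetical, simplified stand-ins for the real DictationContext fields.
type TargetId = string;
interface FieldEl { id: TargetId }

interface SketchContext {
  targetElement: FieldEl | null;               // current focus (may be stale)
  refinementPendingForTargets: Set<TargetId>;  // targets awaiting Phase 2
  transcriptionTargets: Record<TargetId, FieldEl>;
}

// Resolve every pending target rather than trusting the current focus.
function resolvePendingTargets(ctx: SketchContext): FieldEl[] {
  const resolved: FieldEl[] = [];
  for (const targetId of ctx.refinementPendingForTargets) {
    const el = Object.values(ctx.transcriptionTargets).find(
      (t) => t.id === targetId
    );
    if (el) resolved.push(el); // the real machine refines this target here
  }
  return resolved;
}

// Mid-utterance switch: field A is pending, but focus has moved to field B.
const fieldA: FieldEl = { id: "field-a" };
const fieldB: FieldEl = { id: "field-b" };
const ctx: SketchContext = {
  targetElement: fieldB, // stale with respect to the pending refinement
  refinementPendingForTargets: new Set(["field-a"]),
  transcriptionTargets: { "field-a": fieldA, "field-b": fieldB },
};

console.log(resolvePendingTargets(ctx).map((t) => t.id)); // field A still refined
```

Under the old logic, only `ctx.targetElement` (field B) would have been considered, so field A's pending refinement was silently dropped.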

Critical Bug #2: Manual Edit Cleanup

Codex's Analysis: ✅ CORRECT

On manual edit, dictation terminated but left up to 120s of buffered audio and refinement state intact. Next dictation on that element would try to refine stale speech from previous session.

The Fix:

  • Call clearAudioForTarget(context, targetId) in handleManualEdit
  • Clears audioSegmentsByTarget, refinementPendingForTargets, activeRefinementSequences

Code Changes:

// Added in handleManualEdit after clearing transcriptions:
// Clear audio buffers and refinement state for this target
// This prevents stale audio (up to 120s) from being refined later
clearAudioForTarget(context, targetId);

Test Added: "should clear audio buffers and refinement state on manual edit (Codex bug)"

  • Buffers 3 segments (3s of audio)
  • Fires saypi:manualEdit
  • Verifies all buffers and refinement flags are cleared
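A minimal sketch of the cleanup path, assuming `clearAudioForTarget` mutates the three collections named in the PR; the segment shape and the `Map`/`Set` types here are assumptions, not the actual implementation.

```typescript
// Assumed segment shape; the real buffers also store frames and timestamps.
interface AudioSegment { blob: Uint8Array; durationMs: number; sequence: number }

interface RefinementState {
  audioSegmentsByTarget: Map<string, AudioSegment[]>;
  refinementPendingForTargets: Set<string>;
  activeRefinementSequences: Set<number>;
}

function clearAudioForTarget(state: RefinementState, targetId: string): void {
  // Release any sequences still tracked for this target's buffered segments.
  for (const seg of state.audioSegmentsByTarget.get(targetId) ?? []) {
    state.activeRefinementSequences.delete(seg.sequence);
  }
  // Drop the buffered audio (up to ~120 s) and the pending-refinement flag.
  state.audioSegmentsByTarget.delete(targetId);
  state.refinementPendingForTargets.delete(targetId);
}

// Simulate a manual edit on field A after one segment was buffered.
const state: RefinementState = {
  audioSegmentsByTarget: new Map([
    ["field-a", [{ blob: new Uint8Array(0), durationMs: 1000, sequence: 7 }]],
  ]),
  refinementPendingForTargets: new Set(["field-a"]),
  activeRefinementSequences: new Set([7]),
};
clearAudioForTarget(state, "field-a"); // as handleManualEdit now does
```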

Impact Assessment

Both bugs would have caused production issues:

  1. Target switch bug: Unrefined text in multi-field forms, stuck machine state
  2. Manual edit bug: Stale audio refined into wrong field, confusing user edits

Codex Review Quality: 🏆 Excellent

  • Both issues correctly identified
  • Clear reproduction scenarios provided
  • Appropriate severity (both marked for regression tests)

Test Results

  • 17 refinement tests (was 15, added 2 new)
  • 609 total tests passing
  • Build passing

Both critical bugs fixed with regression tests. Ready for final review.

**New Features:**
- Added `onSequenceNumber` callback to `uploadAudioWithRetry` for better sequence management during audio uploads.
- Introduced `activeRefinementSequenceLookup` and `refinementQueue` in `DictationContext` to track active refinements and manage in-flight audio segments.
- Updated logic in `clearAudioForTarget` and `uploadAudioSegment` to maintain accurate mapping of sequence numbers to target elements.

**Test Coverage:**
- Updated tests to validate new callback functionality and ensure proper handling of refinement sequences.

This update improves the robustness of the dictation machine by ensuring that sequence numbers are correctly managed and associated with their respective targets, enhancing the overall user experience during dictation sessions.
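The callback idea can be illustrated with a synchronous sketch. `uploadAudioSegment` here is a hypothetical stand-in (the real `uploadAudioWithRetry` is async and performs the network upload); the point is that the assigned sequence number is surfaced to the caller before any response arrives.

```typescript
type OnSequenceNumber = (seq: number) => void;

let globalSequence = 0;
const sequenceToTarget = new Map<number, string>();

// Hypothetical uploader: assigns a sequence number and reports it immediately,
// so the caller can record the sequence → target mapping before a response
// (or a focus change) races in. The actual network upload is elided.
function uploadAudioSegment(
  targetId: string,
  onSequenceNumber?: OnSequenceNumber
): number {
  const seq = ++globalSequence;
  onSequenceNumber?.(seq);
  // ... real code would POST the audio with retries and resolve later
  return seq;
}

const assigned = uploadAudioSegment("field-a", (s) =>
  sequenceToTarget.set(s, "field-a")
);
```

When the transcription response later carries this sequence number, the machine can look up which target it belongs to even if focus has since moved.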
…dio upload

**New Features:**
- Documented the `inputType`, `inputLabel`, and `onSequenceNumber` parameters of `uploadAudioWithRetry` in the ConversationMachine to improve readability, even though they are not currently utilized in conversation mode.
**New Features:**
- Replaced manual audio segment persistence logic with a dedicated `persistAudioSegment` function in both `AudioInputMachine` and `DictationMachine`.
- Enhanced the ability to optionally persist audio segments during user interactions, improving overall audio management.

This update streamlines the audio persistence process, making the code cleaner and more maintainable.
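A sketch of what such a shared helper might look like; the `StoredSegment` shape follows the buffer fields listed in the PR description, but the signature and the `persist` flag are assumptions.

```typescript
// Assumed shape, mirroring the buffered fields the PR describes.
interface StoredSegment {
  blob: Uint8Array;
  frames?: Float32Array[]; // optional by design
  durationMs: number;
  sequence: number;
  capturedAt: number;
}

// One helper shared by both machines instead of duplicated push logic.
function persistAudioSegment(
  store: Map<string, StoredSegment[]>,
  targetId: string,
  segment: StoredSegment,
  persist = true // callers can opt out of buffering for an interaction
): void {
  if (!persist) return;
  const list = store.get(targetId) ?? [];
  list.push(segment);
  store.set(targetId, list);
}

const buffers = new Map<string, StoredSegment[]>();
const segment: StoredSegment = {
  blob: new Uint8Array(0),
  durationMs: 800,
  sequence: 1,
  capturedAt: Date.now(),
};
persistAudioSegment(buffers, "field-a", segment);
persistAudioSegment(buffers, "field-b", segment, false); // not buffered
```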
**New Features:**
- Introduced `uploadAudioForRefinement` function for handling audio uploads specifically for the refinement phase, bypassing sequence number tracking.
- Enhanced `DictationMachine` to manage refinement requests using UUIDs, allowing for better tracking and handling of audio segments.
- Updated event emissions for refinement completion and failure, improving telemetry and debugging capabilities.
- Refined state management in `DictationContext` to replace old sequence-based tracking with a more robust UUID-based approach.

**Test Coverage:**
- Expanded test suite to cover new refinement logic, ensuring proper handling of audio uploads and state transitions.

This update significantly improves the refinement process, enabling more accurate and context-aware transcriptions while maintaining a clean and maintainable codebase.
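The UUID-based correlation can be sketched as follows. Apart from the names `uploadAudioForRefinement` and `saypi:refinement:completed`, everything here is an assumption about shape, not the actual module.

```typescript
import { randomUUID } from "node:crypto";

interface RefinementRequest { requestId: string; targetId: string }

// In-flight Phase 2 requests, keyed by UUID rather than sequence number.
const inFlightRefinements = new Map<string, RefinementRequest>();

function uploadAudioForRefinement(targetId: string): string {
  // A UUID correlates the response without touching the global sequence
  // counter that Phase 1 streaming transcription relies on.
  const requestId = randomUUID();
  inFlightRefinements.set(requestId, { requestId, targetId });
  // ... would POST the merged audio, then emit saypi:refinement:completed
  // (or saypi:refinement:failed) carrying this requestId
  return requestId;
}

function onRefinementCompleted(requestId: string): RefinementRequest | undefined {
  const req = inFlightRefinements.get(requestId);
  inFlightRefinements.delete(requestId); // always clear tracking
  return req;
}
```

On completion the machine can look up which target the refined transcript belongs to, then clear that target's audio buffers.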
@rosscado rosscado requested a review from Copilot October 27, 2025 18:50
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 7 comments.



Comment on lines 1989 to 1993
// Skip refinement if only 1 segment (already fully transcribed in Phase 1)
if (segments.length === 1) {
  console.debug(`[DictationMachine] Skipping refinement for target ${targetId} until more segments arrive`);
  continue; // Keep buffering, don't clear
}
Copilot AI Oct 27, 2025

[nitpick] The logic skips refinement for single segments, but the comment 'already fully transcribed in Phase 1' may be misleading. The real reason is that a single segment has no additional context to gain from refinement. Consider clarifying: 'Skip refinement if only 1 segment (no additional context available for improvement)'.

Comment on lines 311 to 320
/**
* Upload audio for refinement pass (Phase 2 dual-phase transcription).
* This is a bare-bones transcription request that bypasses sequence number tracking.
*
* Key differences from uploadAudioWithRetry():
* - No sequenceNumber parameter or tracking
* - No precedingTranscripts/acceptsMerge (full audio → full transcription)
* - No increment of global sequence counter
* - Uses requestId (UUID) for response correlation instead
* - Emits separate event: saypi:refinement:completed
Copilot AI Oct 27, 2025

The documentation mentions emitting 'saypi:refinement:completed' but should also mention 'saypi:refinement:failed' and 'saypi:refinement:started' events for completeness.

rosscado and others added 6 commits October 27, 2025 19:37
All changes are documentation/code quality improvements with no functional changes:

1. DictationMachine.ts:
   - Reordered imports alphabetically for better maintainability
   - Improved delay comment to avoid coupling to ConversationMachine's value
   - Clarified refinement response handling comment
   - Improved single-segment skip comment to explain "no additional context"

2. TranscriptionModule.ts:
   - Documented all refinement events (started, completed, failed)
   - Moved refinement:started event emission to outer function to avoid
     multiple emissions on retry attempts

3. AudioSegmentPersistence.ts:
   - Simplified boolean coercion for better clarity

All tests passing (609 tests). Build successful.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
…tion modules

- Added documentation for dual-phase dictation transcription in CLAUDE.md.
- Simplified comments in TranscriptionModule.ts regarding audio upload for refinement.
- Clarified comments in AudioSegmentPersistence.ts and DictationMachine.ts for better understanding of audio segment handling and refinement processes.

These changes enhance documentation clarity and improve code maintainability.


Development

Successfully merging this pull request may close these issues.

✨ Feature Enhancement Proposal: Dual-Phase Contextual Transcription (Dictation Focus)
