251505c
feat: implement dual-phase contextual transcription for dictation (GH…
rosscado Oct 26, 2025
7fa17a2
fix: add type annotations to EventBus listeners
rosscado Oct 26, 2025
6168c19
fix: add type annotations to remaining EventBus listeners
rosscado Oct 26, 2025
e45fd7e
refactor: export and reuse DictationTranscribedEvent type (DRY)
rosscado Oct 26, 2025
33069fc
fix(i18n): abort on JSON parse failure to prevent data loss
rosscado Oct 26, 2025
24a9d5f
fix: increase audio buffer limit to 120s for full-context refinement
rosscado Oct 26, 2025
63d1dd2
Merge branch 'main' into feature/gh-256-dual-phase-transcription
rosscado Oct 26, 2025
59500eb
fix: map refinement sequence to target for response routing
rosscado Oct 26, 2025
ae3e7bd
refactor: address Copilot code review feedback
rosscado Oct 26, 2025
a4e1d63
refactor: address Cheetah code review feedback
rosscado Oct 26, 2025
2a56558
fix: address Codex code review - target switch and manual edit bugs
rosscado Oct 26, 2025
d02529f
feat: enhance dictation machine with refinement sequence tracking
rosscado Oct 26, 2025
bf0f3d9
feat: document conversation machine with additional parameters for au…
rosscado Oct 26, 2025
e38bd7a
feat: integrate audio segment persistence in state machines
rosscado Oct 27, 2025
c205943
feat: implement dual-phase transcription refinement process
rosscado Oct 27, 2025
37b4ca6
Update src/state-machines/DictationMachine.ts
rosscado Oct 27, 2025
f2d0644
Update src/state-machines/DictationMachine.ts
rosscado Oct 27, 2025
3b036d2
Update src/state-machines/DictationMachine.ts
rosscado Oct 27, 2025
357fcf0
Update src/audio/AudioSegmentPersistence.ts
rosscado Oct 27, 2025
e3fa8dc
Update src/TranscriptionModule.ts
rosscado Oct 27, 2025
4372dee
docs: address Copilot code review feedback on PR-259
rosscado Oct 27, 2025
49361ec
docs: update CLAUDE.md and refine comments in transcription and dicta…
rosscado Oct 27, 2025
1 change: 1 addition & 0 deletions CLAUDE.md
@@ -59,6 +59,7 @@ For detailed browser and feature compatibility across different chatbot sites, s
- `AudioModule.js` - Main audio coordination and state management
- `OffscreenAudioBridge.js` - Communication bridge between content script and offscreen audio processing
- `AudioInputMachine.ts`, `AudioOutputMachine.ts` - State machines for audio input/output flow
- **Dictation transcription**: Uses dual-phase approach (live streaming + refinement) - see [doc/DUAL_PHASE_TRANSCRIPTION.md](doc/DUAL_PHASE_TRANSCRIPTION.md)

3. **Voice Activity Detection** (`src/vad/`)
- `OffscreenVADClient.ts` - Content script client for VAD communication
298 changes: 298 additions & 0 deletions doc/DUAL_PHASE_TRANSCRIPTION.md
@@ -0,0 +1,298 @@
# Dual-Phase Contextual Transcription for Dictation

This document describes the two-phase transcription system used in dictation mode to balance real-time responsiveness with high accuracy.

## Overview

In dictation mode, each dictation target (form field or input element) receives transcriptions through two distinct phases:

1. **Phase 1 (Live Streaming)**: Fast, incremental transcription of speech as it's captured
2. **Phase 2 (Refinement)**: High-accuracy re-transcription of accumulated audio with full context

## Phase 1: Live Streaming

### Purpose
Provide immediate visual feedback to the user as they speak, creating a responsive real-time experience.

### Characteristics
- **Speed**: Low-latency transcription (typically < 1 second from speech to display)
- **Accuracy**: Lower accuracy due to limited audio and contextual information
- **Audio**: Short bursts (typically 1-3 seconds per segment)
- **Context Sent**: Each request includes:
  - Text transcripts of preceding segments (for continuity)
  - Target field's label and input type (for domain context)
  - Sequence number (for ordering and intelligent merging)

### Sequence Tracking
Each live segment is assigned an **incremental sequence number** (positive integers starting from 1). This allows:
- **Ordering**: Segments can be stitched together in the correct order even if responses arrive out-of-sequence
- **Merging**: The API server can merge consecutive segments intelligently using their sequence numbers
- **Target Mapping**: Each sequence number is associated with the specific input element it was dictated to
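
The ordering property above can be illustrated with a minimal sketch. It assumes live transcripts are held in a `Record<number, string>` keyed by sequence number, as in the `transcriptions` context field; `stitchTranscripts` is a hypothetical helper for illustration, not a function from the codebase.

```typescript
// Hypothetical helper: joins live transcripts in sequence order,
// regardless of the order in which the responses arrived.
function stitchTranscripts(transcriptions: Record<number, string>): string {
  return Object.keys(transcriptions)
    .map(Number)
    .sort((a, b) => a - b)
    .map((seq) => (transcriptions[seq] ?? "").trim())
    .filter((text) => text.length > 0)
    .join(" ");
}

// Responses arrive out of order: 2, then 3, then 1.
const received: Record<number, string> = {};
received[2] = "world";
received[3] = "how are you";
received[1] = "Hello";

const stitched = stitchTranscripts(received);
console.log(stitched);
```

Sorting explicitly by the numeric key keeps the result independent of arrival order and of how the object happens to enumerate its keys.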

### Implementation
See [DictationMachine.ts:1246-1303](../src/state-machines/DictationMachine.ts#L1246-L1303) for the `userSpeaking` state handling, and [TranscriptionModule.ts:203-309](../src/TranscriptionModule.ts#L203-L309) for the `uploadAudioWithRetry` function that sends live segments.

---

## Phase 2: Refinement

### Purpose
Re-transcribe accumulated audio with maximum context to achieve significantly higher accuracy.

### Characteristics
- **Speed**: Higher latency (3-10 seconds depending on accumulated audio length)
- **Accuracy**: Significantly higher accuracy due to full audio context
- **Audio**: All audio buffered for this target, including audio already covered by an earlier refinement pass (up to the 120-second buffer limit)
- **Context Sent**:
  - **Complete audio only** (no text transcripts, no sequence numbers)
  - This is a standalone transcription request to the stateless `/transcribe` API
  - The audio itself contains all necessary context

### Request Tracking
Refinement requests use **UUID-based tracking** (not sequence numbers):
- Each refinement gets a unique `requestId` (UUID v4)
- Tracked separately via `context.pendingRefinements` Map
- No global sequence counter involvement
- Responses are handled directly in the Promise callbacks of the upload call (not routed via the event bus)
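
As a rough sketch of this tracking scheme, the snippet below keeps pending refinements in a `Map` keyed by request ID. The function names (`beginRefinement`, `completeRefinement`) are hypothetical, the `targetElement` field is omitted so the example runs outside a browser, and the ID generator is injected; in the extension it would presumably be something like `crypto.randomUUID()`.

```typescript
interface RefinementMeta {
  targetId: string;
  segmentCount: number;
  timestamp: number;
}

type PendingRefinements = Map<string, RefinementMeta>;

// Registers a refinement request and returns its unique request ID.
function beginRefinement(
  pending: PendingRefinements,
  targetId: string,
  segmentCount: number,
  newId: () => string, // assumption: a UUID v4 generator in production
): string {
  const requestId = newId();
  pending.set(requestId, { targetId, segmentCount, timestamp: Date.now() });
  return requestId;
}

// Looks up and removes the metadata for a finished (or failed) request.
function completeRefinement(
  pending: PendingRefinements,
  requestId: string,
): RefinementMeta | undefined {
  const meta = pending.get(requestId);
  pending.delete(requestId);
  return meta;
}

// Usage with a deterministic ID generator for illustration.
let counter = 0;
const pending: PendingRefinements = new Map();
const requestId = beginRefinement(pending, "field-a", 3, () => `req-${++counter}`);
const meta = completeRefinement(pending, requestId);
console.log(requestId, meta?.targetId, pending.size);
```

Because the map is keyed by request ID rather than by sequence number, a refinement response can be matched to its target without touching the global sequence counter.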

### Refinement Triggers
A refinement request is sent when **ALL** of the following conditions are met:

1. **Minimum segments**: Two or more unrefined live segments have accumulated in the target field's buffer, AND
2. **Endpoint detection**: One of these events occurs:
   - **EOS (End-of-Speech)**: The app and transcription API implicitly determine the user has probably finished speaking
   - **Field Switch**: User tabs or clicks to a different target field
   - **Session End**: User ends the dictation session ("hang up")
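
The two conditions can be expressed as a small predicate. This is a sketch only: `shouldRefine`, `EndpointEvent`, and `MIN_SEGMENTS_FOR_REFINEMENT` are illustrative names, and the real check lives in the state machine's guards.

```typescript
// Hypothetical event labels for the three endpoint triggers listed above.
type EndpointEvent = "eos" | "fieldSwitch" | "sessionEnd";

// Assumed constant name; the threshold of two segments comes from the text above.
const MIN_SEGMENTS_FOR_REFINEMENT = 2;

// Returns true only when both conditions hold: enough unrefined
// segments have accumulated AND an endpoint event has occurred.
function shouldRefine(
  unrefinedSegmentCount: number,
  event: EndpointEvent | null,
): boolean {
  return unrefinedSegmentCount >= MIN_SEGMENTS_FOR_REFINEMENT && event !== null;
}

console.log(shouldRefine(1, "eos"));         // one segment: live transcript is kept as-is
console.log(shouldRefine(2, null));          // no endpoint yet: keep accumulating
console.log(shouldRefine(3, "fieldSwitch")); // both conditions met
```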

### Refinement Targets
The refinement response:
- **Replaces** all previously transcribed text from live segments in that target field
- **Preserves** any pre-existing text that was in the field before the dictation session started
- Only affects the specific target field that was active when the refined audio was captured
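
To make the replace-but-preserve behaviour concrete, here is a minimal sketch. It assumes the field's pre-existing text is captured when dictation starts and that dictated text is appended after it with a single space; both the helper name `composeFieldText` and the joining rule are assumptions for illustration.

```typescript
// Hypothetical helper: rebuilds the field value from the text that was
// present before dictation plus the current dictated text.
function composeFieldText(initialText: string, dictatedText: string): string {
  if (initialText.length === 0) return dictatedText;
  if (dictatedText.length === 0) return initialText;
  const needsSpace = !/\s$/.test(initialText);
  return initialText + (needsSpace ? " " : "") + dictatedText;
}

const initialText = "Notes:";
const afterLive = composeFieldText(initialText, "Hello world");
const afterRefinement = composeFieldText(initialText, "Hello, world!");
console.log(afterLive);
console.log(afterRefinement);
```

Rebuilding the value from the captured initial text each time means a refinement never has to diff against what the live phase wrote.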

### Multiple Refinement Passes
**Important**: A given dictation target may receive **multiple refinement passes** before field switch or session end.

**Why?** Because EOS is an implicit prediction:
- If EOS is detected but the user resumes speaking (false positive), another EOS event will eventually occur
- Each EOS event triggers a refinement request (if ≥2 unrefined segments exist)
- Each successive refinement includes **more audio** than the previous one
- Each refinement still **replaces all prior live segment transcripts** (and may also replace a previous refinement)

**Example Timeline:**
```
User dictates → EOS detected → Refinement #1 (segments 1-3)
User resumes → EOS detected → Refinement #2 (segments 1-6, includes previous + new)
User switches field → Final refinement (if needed), then no further refinements for this target
```

### Audio Buffering
- Audio segments are buffered per target in `context.audioSegmentsByTarget`
- Maximum buffer size: **120 seconds** (2 minutes) per target to prevent unbounded memory growth
- When limit is reached, oldest segments are automatically trimmed
- Buffers persist across multiple EOS events (enabling multiple refinement passes)
- Buffers are cleared when:
  - User switches to a different target field
  - Dictation session ends
  - Manual edit is detected (triggers session termination)
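
The trimming rule can be sketched as follows. This is an illustration, not the actual implementation: `trimToLimit` is a hypothetical name, and keeping the newest segment even when it alone exceeds the limit is an assumption.

```typescript
const MAX_AUDIO_BUFFER_DURATION_MS = 120000;

// Keeps the newest run of segments whose total duration fits the limit,
// which is equivalent to dropping the oldest segments first.
function trimToLimit<T extends { duration: number }>(
  segments: T[],
  maxMs: number = MAX_AUDIO_BUFFER_DURATION_MS,
): T[] {
  const kept: T[] = [];
  let total = 0;
  for (let i = segments.length - 1; i >= 0; i--) {
    const segment = segments[i];
    if (segment === undefined) continue;
    // Assumption: always keep the newest segment, even if it is oversized.
    if (kept.length > 0 && total + segment.duration > maxMs) break;
    total += segment.duration;
    kept.unshift(segment);
  }
  return kept;
}

// Three 50-second segments exceed the 120-second limit, so the oldest is dropped.
const trimmed = trimToLimit([
  { sequenceNumber: 1, duration: 50000 },
  { sequenceNumber: 2, duration: 50000 },
  { sequenceNumber: 3, duration: 50000 },
]);
console.log(trimmed.map((s) => s.sequenceNumber));
```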

### Implementation
See:
- [DictationMachine.ts:1943-2063](../src/state-machines/DictationMachine.ts#L1943-L2063) for `performContextualRefinement` action
- [DictationMachine.ts:375-450](../src/state-machines/DictationMachine.ts#L375-L450) for `handleRefinementComplete` function
- [TranscriptionModule.ts:329-403](../src/TranscriptionModule.ts#L329-L403) for `uploadAudioForRefinement` function

---

## Endpoint Detection

### EOS (End-of-Speech) Detection
The system uses a **probability-based endpoint detection** mechanism:

- After each transcription, the API returns `pFinishedSpeaking` (probability user finished speaking) and `tempo` (speech pace)
- A dynamic delay is calculated using these signals (see [DictationMachine.ts:2104-2146](../src/state-machines/DictationMachine.ts#L2104-L2146))
- Maximum delay for dictation: **8 seconds** (`REFINEMENT_MAX_DELAY_MS`)
  - Longer than the delay used for prompt-based interactions, since no AI is waiting for input
  - Reduces premature refinement from brief pauses during continuous dictation
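
The exact formula is defined in the referenced source lines and is not reproduced here. Purely as an illustration of how such signals could shape a delay, the sketch below assumes both `pFinishedSpeaking` and `tempo` lie in the range 0 to 1, and that a higher value of either shortens the wait; the function `refinementDelayMs` and its weighting are hypothetical.

```typescript
const REFINEMENT_MAX_DELAY_MS = 8000;

function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Hypothetical formula (NOT the project's actual calculation):
// the delay shrinks as confidence that the user has finished grows,
// and shrinks by up to half again for fast speech.
function refinementDelayMs(
  pFinishedSpeaking: number,
  tempo: number,
  maxDelayMs: number = REFINEMENT_MAX_DELAY_MS,
): number {
  const p = clamp01(pFinishedSpeaking);
  const t = clamp01(tempo);
  return Math.round(maxDelayMs * (1 - p) * (1 - t / 2));
}

console.log(refinementDelayMs(0, 0));   // no evidence of completion: full 8000 ms wait
console.log(refinementDelayMs(0.5, 0)); // 4000
console.log(refinementDelayMs(0.5, 1)); // 2000
console.log(refinementDelayMs(1, 0.5)); // 0
```

Whatever the real weighting is, the result is capped by `REFINEMENT_MAX_DELAY_MS`, so the wait for refinement is bounded.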

### State Machine Integration
The refinement trigger is managed by XState:
- State: `listening.converting.accumulating` ([DictationMachine.ts:1346-1377](../src/state-machines/DictationMachine.ts#L1346-L1377))
- After `refinementDelay` timeout, transitions to `refining` state if `refinementConditionsMet` guard passes
- Guard checks: `context.refinementPendingForTargets.size > 0 && !context.isTranscribing`

---

## Data Structures

### Context Fields

```typescript
// Phase 1 (Live Streaming) - per sequence number
transcriptions: Record<number, string> // Global transcriptions (all targets)
transcriptionsByTarget: Record<string, Record<number, string>> // Grouped by target ID
transcriptionTargets: Record<number, HTMLElement> // Maps sequence → target element
provisionalTranscriptionTarget?: { // Pre-upload target mapping
  sequenceNumber: number;
  element: HTMLElement;
}

// Phase 2 (Refinement) - per target ID
audioSegmentsByTarget: Record<string, AudioSegment[]> // Audio buffers by target
refinementPendingForTargets: Set<string> // Target IDs awaiting refinement
pendingRefinements: Map<string, { // UUID → metadata
  targetId: string;
  targetElement: HTMLElement;
  segmentCount: number;
  timestamp: number;
}>
```

### AudioSegment Structure
```typescript
interface AudioSegment {
  blob: Blob;              // WAV audio blob
  frames: Float32Array;    // Raw PCM audio data (for concatenation)
  duration: number;        // Milliseconds
  sequenceNumber: number;  // Original Phase 1 sequence number
  captureTimestamp?: number; // When captured by VAD
}
```
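
Since `frames` is kept for concatenation, a refinement upload presumably joins the buffered PCM data into one array before encoding. A minimal sketch of that step follows; `concatenateFrames` and the reduced `FrameSegment` type are hypothetical, and WAV encoding is out of scope here.

```typescript
// Reduced view of AudioSegment: only the fields needed for concatenation.
interface FrameSegment {
  frames: Float32Array;
  sequenceNumber: number;
}

// Joins the PCM frames of all segments into one array, in sequence order.
function concatenateFrames(segments: FrameSegment[]): Float32Array {
  const ordered = segments
    .slice()
    .sort((a, b) => a.sequenceNumber - b.sequenceNumber);
  const totalLength = ordered.reduce((n, s) => n + s.frames.length, 0);
  const combined = new Float32Array(totalLength);
  let offset = 0;
  for (const segment of ordered) {
    combined.set(segment.frames, offset); // copy at the running offset
    offset += segment.frames.length;
  }
  return combined;
}

const merged = concatenateFrames([
  { sequenceNumber: 2, frames: new Float32Array([0.5, 0.25]) },
  { sequenceNumber: 1, frames: new Float32Array([1, -1, 0]) },
]);
console.log(Array.from(merged)); // [1, -1, 0, 0.5, 0.25]
```

Allocating the output once and copying with `set` avoids repeated reallocation as the buffer approaches its 120-second limit.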

---

## Key Distinctions

| Aspect | Phase 1 (Live) | Phase 2 (Refinement) |
|--------|---------------|---------------------|
| **Purpose** | Real-time feedback | High accuracy |
| **Audio Length** | 1-3 seconds | Up to 120 seconds |
| **Context** | Preceding transcripts + field metadata | Audio only |
| **Tracking** | Sequence numbers (integers) | Request IDs (UUIDs) |
| **API Fields** | `sequenceNumber`, `messages`, `inputType`, `inputLabel` | `requestId` only |
| **Response Route** | Event bus → state machine | Promise callback → direct handler |
| **Frequency** | After each VAD segment | After EOS/field switch/session end |
| **Multiple Passes** | One per segment | Potentially multiple per target |

---

## Error Handling

### Phase 1 Failures
- Retry logic with exponential backoff (up to 3 attempts)
- On terminal failure, emit `saypi:transcribeFailed` event
- State machine transitions to error state, then returns to listening after 3 seconds

### Phase 2 Failures
- Same retry logic (up to 3 attempts)
- On terminal failure:
  - Emit `saypi:refinement:failed` event
  - Clean up refinement metadata
  - **Audio buffers are preserved** (may retry on next EOS)
  - Phase 1 transcripts remain visible to user (graceful degradation)
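
The retry behaviour shared by both phases can be sketched generically. The three-attempt limit comes from the text above; the 1-second base delay, the doubling factor, and the names `backoffDelaysMs` and `withRetry` are assumptions for illustration rather than the values used by `uploadAudioWithRetry`.

```typescript
// Delays between attempts: base, 2x base, 4x base, ... (one fewer than attempts).
function backoffDelaysMs(maxAttempts: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < maxAttempts; attempt++) {
    delays.push(baseDelayMs * Math.pow(2, attempt - 1));
  }
  return delays;
}

// Runs the task, retrying on failure; rethrows the last error once
// all attempts are exhausted (the "terminal failure" case).
async function withRetry<T>(
  task: () => Promise<T>,
  maxAttempts: number = 3,
  baseDelayMs: number = 1000, // assumed base delay
): Promise<T> {
  const delays = backoffDelaysMs(maxAttempts, baseDelayMs);
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      const delay = delays[attempt];
      if (delay !== undefined) {
        await new Promise<void>((resolve) => setTimeout(resolve, delay));
      }
    }
  }
  throw lastError;
}

console.log(backoffDelaysMs(3, 1000)); // [1000, 2000]
```

A caller would catch the rethrown error and emit the phase-specific failure event, leaving audio buffers untouched in the Phase 2 case.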

---

## Example Flow

```
1. User starts dictating into Field A
   → [Phase 1] Segment 1 → "Hello" (seq 1)
   → [Phase 1] Segment 2 → "world" (seq 2)

2. Brief pause (EOS detected)
   → [Phase 2] Refinement #1 (segments 1-2) → "Hello, world!"
   → Replaces "Hello world" with "Hello, world!"

3. User resumes dictating
   → [Phase 1] Segment 3 → "how are" (seq 3)
   → [Phase 1] Segment 4 → "you" (seq 4)

4. Another pause (EOS detected)
   → [Phase 2] Refinement #2 (segments 1-4) → "Hello, world! How are you?"
   → Replaces entire field text

5. User tabs to Field B
   → Final refinement for Field A completes (if needed)
   → Capture initial text for Field B
   → Continue with new Phase 1 segments
```

---

## Related Files

### Core Implementation
- [src/state-machines/DictationMachine.ts](../src/state-machines/DictationMachine.ts) - State machine orchestration
- [src/TranscriptionModule.ts](../src/TranscriptionModule.ts) - Upload logic for both phases
- [src/audio/AudioSegmentPersistence.ts](../src/audio/AudioSegmentPersistence.ts) - Audio segment storage utilities

### Supporting Modules
- [src/TranscriptMergeService.ts](../src/TranscriptMergeService.ts) - Local transcript merging
- [src/text-insertion/TextInsertionManager.ts](../src/text-insertion/TextInsertionManager.ts) - DOM text insertion
- [src/TimerModule.ts](../src/TimerModule.ts) - Endpoint delay calculation

---

## Configuration

### Constants (DictationMachine.ts)
- `MAX_AUDIO_BUFFER_DURATION_MS = 120000` - Maximum audio buffer per target (2 minutes)
- `REFINEMENT_MAX_DELAY_MS = 8000` - Maximum delay for EOS detection (8 seconds)

### User Preferences
- `transcriptionMode` - STT model preference (passed to both Phase 1 and Phase 2)
- `removeFillerWords` - Filter filler words (applied in both phases)
- `keepSegments` - Debug option to save audio files to disk

---

## Testing Considerations

When testing dual-phase transcription:

1. **Phase 1 Accuracy**: Test with short phrases to verify live streaming responsiveness
2. **Phase 2 Accuracy**: Test with longer utterances and verify refinement improves accuracy
3. **Multiple Refinements**: Test false-positive EOS scenarios (brief pauses mid-sentence)
4. **Field Switching**: Verify refinements complete for previous field when switching
5. **Buffer Limits**: Test 120-second limit with extended dictation
6. **Error Recovery**: Test network failures during each phase
7. **Manual Edits**: Verify manual edits terminate dictation and clear buffers

### Mock Requirements
- Mock Chrome extension APIs (`chrome.runtime.sendMessage`)
- Mock EventBus for Phase 1 events
- Mock TranscriptionModule functions for API responses
- Use JSDOM for DOM manipulation testing

---

## Performance Notes

### Memory Management
- Audio buffers automatically trim when exceeding 120s per target
- Refinement metadata cleaned up after completion/failure
- Phase 1 transcripts cleared when replaced by Phase 2

### Network Optimization
- Phase 1: Many small requests (optimized for latency)
- Phase 2: Fewer large requests (optimized for accuracy)
- No duplicate audio uploads (Phase 2 uses buffered segments)

### User Experience
- Live streaming provides immediate feedback (no "dead air")
- Refinements improve accuracy without user intervention
- Multiple refinement passes handle natural speech pauses
- Pre-existing text preserved across refinements

---

## Future Enhancements

Potential improvements to the dual-phase system:

1. **Incremental Refinement**: Only re-transcribe new segments since last refinement
2. **Adaptive Buffering**: Adjust 120s limit based on available memory
3. **Confidence Scoring**: Display visual indicators for Phase 1 vs Phase 2 text
4. **Smart EOS**: Improve endpoint detection using linguistic features
5. **Batch Refinement**: Refine multiple targets in a single request