diff --git a/CLAUDE.md b/CLAUDE.md index dfd77b179d..0901d35bb1 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -59,6 +59,7 @@ For detailed browser and feature compatibility across different chatbot sites, s - `AudioModule.js` - Main audio coordination and state management - `OffscreenAudioBridge.js` - Communication bridge between content script and offscreen audio processing - `AudioInputMachine.ts`, `AudioOutputMachine.ts` - State machines for audio input/output flow + - **Dictation transcription**: Uses dual-phase approach (live streaming + refinement) - see [doc/DUAL_PHASE_TRANSCRIPTION.md](doc/DUAL_PHASE_TRANSCRIPTION.md) 3. **Voice Activity Detection** (`src/vad/`) - `OffscreenVADClient.ts` - Content script client for VAD communication diff --git a/doc/DUAL_PHASE_TRANSCRIPTION.md b/doc/DUAL_PHASE_TRANSCRIPTION.md new file mode 100644 index 0000000000..070e45510c --- /dev/null +++ b/doc/DUAL_PHASE_TRANSCRIPTION.md @@ -0,0 +1,298 @@ +# Dual-Phase Contextual Transcription for Dictation + +This document describes the two-phase transcription system used in dictation mode to balance real-time responsiveness with high accuracy. + +## Overview + +In dictation mode, each dictation target (form field or input element) receives transcriptions through two distinct phases: + +1. **Phase 1 (Live Streaming)**: Fast, incremental transcription of speech as it's captured +2. **Phase 2 (Refinement)**: High-accuracy re-transcription of accumulated audio with full context + +## Phase 1: Live Streaming + +### Purpose +Provide immediate visual feedback to the user as they speak, creating a responsive real-time experience. 
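Because live responses can arrive out of order, the displayed text is stitched together by sequence number. A minimal sketch of that feedback loop (names like `LiveSegment` and `renderLiveText` are illustrative, not the extension's actual API):

```typescript
// Hypothetical sketch of Phase 1 live streaming: each short audio segment is
// assigned a sequence number at capture time, and its transcript is rendered
// as soon as it arrives, stitched in sequence order.

interface LiveSegment {
  sequenceNumber: number; // positive integer, assigned in capture order
  text?: string;          // filled in when the transcription response arrives
}

// Responses may arrive out of order; render completed segments in sequence order.
function renderLiveText(segments: LiveSegment[]): string {
  return segments
    .filter((s) => s.text !== undefined)
    .sort((a, b) => a.sequenceNumber - b.sequenceNumber)
    .map((s) => s.text)
    .join(" ");
}

// Segment 2's response happens to arrive before segment 1's:
const segments: LiveSegment[] = [
  { sequenceNumber: 2, text: "world" },
  { sequenceNumber: 1, text: "Hello" },
  { sequenceNumber: 3 }, // still in flight — simply not rendered yet
];
console.log(renderLiveText(segments)); // "Hello world"
```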
+ +### Characteristics +- **Speed**: Low-latency transcription (typically < 1 second from speech to display) +- **Accuracy**: Lower accuracy due to limited audio and contextual information +- **Audio**: Short bursts (typically 1-3 seconds per segment) +- **Context Sent**: Each request includes: + - Text transcripts of preceding segments (for continuity) + - Target field's label and input type (for domain context) + - Sequence number (for ordering and intelligent merging) + +### Sequence Tracking +Each live segment is assigned an **incremental sequence number** (positive integers starting from 1). This allows: +- **Ordering**: Segments can be stitched together in the correct order even if responses arrive out-of-sequence +- **Merging**: The API server can merge consecutive segments intelligently using their sequence numbers +- **Target Mapping**: Each sequence number is associated with the specific input element it was dictated to + +### Implementation +See [DictationMachine.ts:1246-1303](../src/state-machines/DictationMachine.ts#L1246-L1303) for the `userSpeaking` state handling, and [TranscriptionModule.ts:203-309](../src/TranscriptionModule.ts#L203-L309) for the `uploadAudioWithRetry` function that sends live segments. + +--- + +## Phase 2: Refinement + +### Purpose +Re-transcribe accumulated audio with maximum context to achieve significantly higher accuracy. 
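Since Phase 2 re-transcribes the accumulated audio as a whole, the buffered per-segment PCM frames are combined into one contiguous buffer before upload. A sketch under that assumption (`concatFrames` is illustrative, not the extension's actual helper; the `AudioSegment` shape mirrors the structure documented in the Data Structures section):

```typescript
// Hypothetical sketch: join all unrefined Phase 1 segments for a target into
// a single PCM buffer, oldest first, for one refinement request.

interface AudioSegment {
  frames: Float32Array;   // raw PCM audio (kept for concatenation)
  duration: number;       // milliseconds
  sequenceNumber: number; // original Phase 1 sequence number
}

function concatFrames(segments: AudioSegment[]): Float32Array {
  const ordered = [...segments].sort((a, b) => a.sequenceNumber - b.sequenceNumber);
  const total = ordered.reduce((sum, s) => sum + s.frames.length, 0);
  const out = new Float32Array(total);
  let offset = 0;
  for (const s of ordered) {
    out.set(s.frames, offset); // copy each segment's samples contiguously
    offset += s.frames.length;
  }
  return out;
}

const combined = concatFrames([
  { frames: new Float32Array([0.1, 0.2]), duration: 1000, sequenceNumber: 1 },
  { frames: new Float32Array([0.3]), duration: 500, sequenceNumber: 2 },
]);
console.log(combined.length); // 3
```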
+ +### Characteristics +- **Speed**: Higher latency (3-10 seconds depending on accumulated audio length) +- **Accuracy**: Significantly higher accuracy due to full audio context +- **Audio**: All unrefined audio captured for this target since last refinement +- **Context Sent**: + - **Complete audio only** (no text transcripts, no sequence numbers) + - This is a standalone transcription request to the stateless `/transcribe` API + - The audio itself contains all necessary context + +### Request Tracking +Refinement requests use **UUID-based tracking** (not sequence numbers): +- Each refinement gets a unique `requestId` (UUID v4) +- Tracked separately via `context.pendingRefinements` Map +- No global sequence counter involvement +- Responses are handled synchronously in Promise callbacks (not via event bus) + +### Refinement Triggers +A refinement request is sent when **ALL** of the following conditions are met: + +1. **Minimum segments**: Two or more unrefined live segments have accumulated in the target field's buffer, AND +2. **Endpoint detection**: One of these events occurs: + - **EOS (End-of-Speech)**: The app and transcription API implicitly determine the user has probably finished speaking + - **Field Switch**: User tabs or clicks to a different target field + - **Session End**: User ends the dictation session ("hang up") + +### Refinement Targets +The refinement response: +- **Replaces** all previously transcribed text from live segments in that target field +- **Preserves** any pre-existing text that was in the field before the dictation session started +- Only affects the specific target field that was active when the refined audio was captured + +### Multiple Refinement Passes +**Important**: A given dictation target may receive **multiple refinement passes** before field switch or session end. 
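The trigger conditions above can be sketched as a simple guard: at least two unrefined live segments must have accumulated, AND one of the endpoint events must have fired. The names below (`shouldRefine`, `EndpointEvent`) are illustrative, not the actual `DictationMachine` guard:

```typescript
// Hypothetical sketch of the refinement trigger check described above.

type EndpointEvent = "eos" | "fieldSwitch" | "sessionEnd";

function shouldRefine(
  unrefinedSegmentCount: number,
  event?: EndpointEvent
): boolean {
  // Both conditions must hold: enough accumulated segments AND an endpoint event.
  return unrefinedSegmentCount >= 2 && event !== undefined;
}

console.log(shouldRefine(3, "eos"));     // true  — enough segments, EOS fired
console.log(shouldRefine(1, "eos"));     // false — only one unrefined segment
console.log(shouldRefine(3, undefined)); // false — no endpoint event yet
```

Because the buffer persists across EOS events, the same target can pass this guard repeatedly, each time with more accumulated audio.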
+ +**Why?** Because EOS is an implicit prediction: +- If EOS is detected but the user resumes speaking (false positive), another EOS event will eventually occur +- Each EOS event triggers a refinement request (if ≥2 unrefined segments exist) +- Each successive refinement includes **more audio** than the previous one +- Each refinement still **replaces all prior live segment transcripts** (and may also replace a previous refinement) + +**Example Timeline:** +``` +User dictates → EOS detected → Refinement #1 (segments 1-3) +User resumes → EOS detected → Refinement #2 (segments 1-6, includes previous + new) +User switches field → End of refinements for this target +``` + +### Audio Buffering +- Audio segments are buffered per target in `context.audioSegmentsByTarget` +- Maximum buffer size: **120 seconds** (2 minutes) per target to prevent unbounded memory growth +- When limit is reached, oldest segments are automatically trimmed +- Buffers persist across multiple EOS events (enabling multiple refinement passes) +- Buffers are cleared when: + - User switches to a different target field + - Dictation session ends + - Manual edit is detected (triggers session termination) + +### Implementation +See: +- [DictationMachine.ts:1943-2063](../src/state-machines/DictationMachine.ts#L1943-L2063) for `performContextualRefinement` action +- [DictationMachine.ts:375-450](../src/state-machines/DictationMachine.ts#L375-L450) for `handleRefinementComplete` function +- [TranscriptionModule.ts:329-403](../src/TranscriptionModule.ts#L329-L403) for `uploadAudioForRefinement` function + +--- + +## Endpoint Detection + +### EOS (End-of-Speech) Detection +The system uses a **probability-based endpoint detection** mechanism: + +- After each transcription, the API returns `pFinishedSpeaking` (probability user finished speaking) and `tempo` (speech pace) +- A dynamic delay is calculated using these signals (see 
[DictationMachine.ts:2104-2146](../src/state-machines/DictationMachine.ts#L2104-L2146)) +- Maximum delay for dictation: **8 seconds** (`REFINEMENT_MAX_DELAY_MS`) + - Longer than prompt-based interactions (no AI waiting for input) + - Reduces premature refinement from brief pauses during continuous dictation + +### State Machine Integration +The refinement trigger is managed by XState: +- State: `listening.converting.accumulating` ([DictationMachine.ts:1346-1377](../src/state-machines/DictationMachine.ts#L1346-L1377)) +- After `refinementDelay` timeout, transitions to `refining` state if `refinementConditionsMet` guard passes +- Guard checks: `context.refinementPendingForTargets.size > 0 && !context.isTranscribing` + +--- + +## Data Structures + +### Context Fields + +```typescript +// Phase 1 (Live Streaming) - per sequence number +transcriptions: Record // Global transcriptions (all targets) +transcriptionsByTarget: Record> // Grouped by target ID +transcriptionTargets: Record // Maps sequence → target element +provisionalTranscriptionTarget?: { // Pre-upload target mapping + sequenceNumber: number; + element: HTMLElement; +} + +// Phase 2 (Refinement) - per target ID +audioSegmentsByTarget: Record // Audio buffers by target +refinementPendingForTargets: Set // Target IDs awaiting refinement +pendingRefinements: Map +``` + +### AudioSegment Structure +```typescript +interface AudioSegment { + blob: Blob; // WAV audio blob + frames: Float32Array; // Raw PCM audio data (for concatenation) + duration: number; // Milliseconds + sequenceNumber: number; // Original Phase 1 sequence number + captureTimestamp?: number; // When captured by VAD +} +``` + +--- + +## Key Distinctions + +| Aspect | Phase 1 (Live) | Phase 2 (Refinement) | +|--------|---------------|---------------------| +| **Purpose** | Real-time feedback | High accuracy | +| **Audio Length** | 1-3 seconds | Up to 120 seconds | +| **Context** | Preceding transcripts + field metadata | Audio only | +| 
**Tracking** | Sequence numbers (integers) | Request IDs (UUIDs) | +| **API Fields** | `sequenceNumber`, `messages`, `inputType`, `inputLabel` | `requestId` only | +| **Response Route** | Event bus → state machine | Promise callback → direct handler | +| **Frequency** | After each VAD segment | After EOS/field switch/session end | +| **Multiple Passes** | One per segment | Potentially multiple per target | + +--- + +## Error Handling + +### Phase 1 Failures +- Retry logic with exponential backoff (up to 3 attempts) +- On terminal failure, emit `saypi:transcribeFailed` event +- State machine transitions to error state, then returns to listening after 3 seconds + +### Phase 2 Failures +- Same retry logic (up to 3 attempts) +- On terminal failure: + - Emit `saypi:refinement:failed` event + - Clean up refinement metadata + - **Audio buffers are preserved** (may retry on next EOS) + - Phase 1 transcripts remain visible to user (graceful degradation) + +--- + +## Example Flow + +``` +1. User starts dictating into Field A + → [Phase 1] Segment 1 → "Hello" (seq 1) + → [Phase 1] Segment 2 → "world" (seq 2) + +2. Brief pause (EOS detected) + → [Phase 2] Refinement #1 (segments 1-2) → "Hello, world!" + → Replaces "Hello world" with "Hello, world!" + +3. User resumes dictating + → [Phase 1] Segment 3 → "how are" (seq 3) + → [Phase 1] Segment 4 → "you" (seq 4) + +4. Another pause (EOS detected) + → [Phase 2] Refinement #2 (segments 1-4) → "Hello, world! How are you?" + → Replaces entire field text + +5. 
User tabs to Field B + → Final refinement for Field A completes (if needed) + → Capture initial text for Field B + → Continue with new Phase 1 segments +``` + +--- + +## Related Files + +### Core Implementation +- [src/state-machines/DictationMachine.ts](../src/state-machines/DictationMachine.ts) - State machine orchestration +- [src/TranscriptionModule.ts](../src/TranscriptionModule.ts) - Upload logic for both phases +- [src/audio/AudioSegmentPersistence.ts](../src/audio/AudioSegmentPersistence.ts) - Audio segment storage utilities + +### Supporting Modules +- [src/TranscriptMergeService.ts](../src/TranscriptMergeService.ts) - Local transcript merging +- [src/text-insertion/TextInsertionManager.ts](../src/text-insertion/TextInsertionManager.ts) - DOM text insertion +- [src/TimerModule.ts](../src/TimerModule.ts) - Endpoint delay calculation + +--- + +## Configuration + +### Constants (DictationMachine.ts) +- `MAX_AUDIO_BUFFER_DURATION_MS = 120000` - Maximum audio buffer per target (2 minutes) +- `REFINEMENT_MAX_DELAY_MS = 8000` - Maximum delay for EOS detection (8 seconds) + +### User Preferences +- `transcriptionMode` - STT model preference (passed to both Phase 1 and Phase 2) +- `removeFillerWords` - Filter filler words (applied in both phases) +- `keepSegments` - Debug option to save audio files to disk + +--- + +## Testing Considerations + +When testing dual-phase transcription: + +1. **Phase 1 Accuracy**: Test with short phrases to verify live streaming responsiveness +2. **Phase 2 Accuracy**: Test with longer utterances and verify refinement improves accuracy +3. **Multiple Refinements**: Test false-positive EOS scenarios (brief pauses mid-sentence) +4. **Field Switching**: Verify refinements complete for previous field when switching +5. **Buffer Limits**: Test 120-second limit with extended dictation +6. **Error Recovery**: Test network failures during each phase +7. 
**Manual Edits**: Verify manual edits terminate dictation and clear buffers + +### Mock Requirements +- Mock Chrome extension APIs (`chrome.runtime.sendMessage`) +- Mock EventBus for Phase 1 events +- Mock TranscriptionModule functions for API responses +- Use JSDOM for DOM manipulation testing + +--- + +## Performance Notes + +### Memory Management +- Audio buffers automatically trim when exceeding 120s per target +- Refinement metadata cleaned up after completion/failure +- Phase 1 transcripts cleared when replaced by Phase 2 + +### Network Optimization +- Phase 1: Many small requests (optimized for latency) +- Phase 2: Fewer large requests (optimized for accuracy) +- No duplicate audio uploads (Phase 2 uses buffered segments) + +### User Experience +- Live streaming provides immediate feedback (no "dead air") +- Refinements improve accuracy without user intervention +- Multiple refinement passes handle natural speech pauses +- Pre-existing text preserved across refinements + +--- + +## Future Enhancements + +Potential improvements to the dual-phase system: + +1. **Incremental Refinement**: Only re-transcribe new segments since last refinement +2. **Adaptive Buffering**: Adjust 120s limit based on available memory +3. **Confidence Scoring**: Display visual indicators for Phase 1 vs Phase 2 text +4. **Smart EOS**: Improve endpoint detection using linguistic features +5. 
**Batch Refinement**: Refine multiple targets in a single request diff --git a/src/TranscriptionModule.ts b/src/TranscriptionModule.ts index 9cfbab5675..771f21d294 100644 --- a/src/TranscriptionModule.ts +++ b/src/TranscriptionModule.ts @@ -210,6 +210,7 @@ export async function uploadAudioWithRetry( clientReceiveTimestamp?: number, inputType?: string, inputLabel?: string, + onSequenceNumber?: (sequenceNumber: number) => void, ): Promise { let retryCount = 0; let delay = 1000; // initial delay of 1 second @@ -240,6 +241,16 @@ export async function uploadAudioWithRetry( while (retryCount < maxRetries) { try { usedSequenceNumber = transcriptionSent(); + if (onSequenceNumber) { + try { + onSequenceNumber(usedSequenceNumber); + } catch (callbackError) { + logger.error( + "[TranscriptionModule] onSequenceNumber callback threw an error", + callbackError + ); + } + } await uploadAudio( audioBlob, audioDurationMillis, @@ -297,6 +308,194 @@ export async function uploadAudioWithRetry( throw new Error("Max retries reached"); } +/** + * Upload audio for refinement (Phase 2). + * Uses UUID tracking instead of sequence numbers. No precedingTranscripts sent. 
+ */ +export async function uploadAudioForRefinement( + audioBlob: Blob, + audioDurationMillis: number, + requestId: string, + sessionId?: string, + maxRetries: number = 3 +): Promise { + let retryCount = 0; + let delay = 1000; // initial delay of 1 second + const transcriptionStartTimestamp = Date.now(); + + // Emit refinement started event (moved to outer function to avoid multiple emissions on retry) + EventBus.emit("saypi:refinement:started", { + requestId, + timestamp: transcriptionStartTimestamp, + audioDurationMs: audioDurationMillis, + audioBytes: audioBlob.size, + }); + + const sleep = (ms: number) => + new Promise((resolve) => setTimeout(resolve, ms)); + + while (retryCount < maxRetries) { + try { + const transcriptionText = await uploadAudioForRefinementInternal( + audioBlob, + audioDurationMillis, + requestId, + sessionId, + transcriptionStartTimestamp + ); + + // Emit refinement-specific completion event + EventBus.emit("saypi:refinement:completed", { + requestId, + text: transcriptionText, + }); + + return transcriptionText; + } catch (error) { + // check for timeout errors (30s on Heroku) + if ( + error instanceof TypeError && + knownNetworkErrorMessages.includes(error.message) + ) { + logger.info( + `[Refinement ${requestId}] Attempt ${retryCount + 1}/${maxRetries} failed. Retrying in ${ + delay / 1000 + } seconds...` + ); + await sleep(delay); + + // Exponential backoff + delay *= 2; + + retryCount++; + } else { + console.error(`[Refinement ${requestId}] Unexpected error:`, error); + // Emit refinement-specific failure event + EventBus.emit("saypi:refinement:failed", { + requestId, + error, + }); + throw error; // Re-throw non-network errors to exit the retry loop + } + } + } + + logger.error(`[Refinement ${requestId}] Max retries reached. 
Giving up.`); + EventBus.emit("saypi:refinement:failed", { + requestId, + error: new Error("Max retries reached"), + }); + throw new Error("Max retries reached"); +} + +/** + * Internal refinement upload (bare-bones request). + * No sequence numbers, precedingTranscripts, or acceptsMerge. + */ +async function uploadAudioForRefinementInternal( + audioBlob: Blob, + audioDurationMillis: number, + requestId: string, + sessionId?: string, + transcriptionStartTimestamp?: number +): Promise { + try { + const chatbot = await ChatbotService.getChatbot(); + + // Build minimal FormData (no sequence number, no messages, no acceptsMerge) + const formData = new FormData(); + let audioFilename = "audio.webm"; + if (audioBlob.type === "audio/mp4") { + audioFilename = "audio.mp4"; + } else if (audioBlob.type === "audio/wav") { + audioFilename = "audio.wav"; + } + + formData.append("audio", audioBlob, audioFilename); + formData.append("duration", (audioDurationMillis / 1000).toString()); + formData.append("requestId", requestId); // UUID for correlation + + if (sessionId) { + formData.append("sessionId", sessionId); + } + + // Add minimal usage metadata + try { + const usageMeta = await buildUsageMetadata(chatbot); + if (usageMeta.clientId) formData.append("clientId", usageMeta.clientId); + if (usageMeta.version) formData.append("version", usageMeta.version); + if (usageMeta.app) formData.append("app", usageMeta.app); + if (usageMeta.language) formData.append("language", usageMeta.language); + } catch (error) { + logger.warn(`[Refinement ${requestId}] Failed to add usage metadata:`, error); + } + + // Get user preferences for transcription + const preference = userPreferences.getCachedTranscriptionMode(); + if (preference) { + formData.append("prefer", preference); + } + + // Remove filler words if enabled + const removeFiller = userPreferences.getCachedRemoveFillerWords(); + if (removeFiller) { + formData.append("removeFillerWords", "true"); + } + + logger.debug( + `[Refinement 
${requestId}] Uploading ${(audioBlob.size / 1024).toFixed(2)}kb of audio` + ); + + const controller = new AbortController(); + const { signal } = controller; + setTimeout(() => controller.abort(), TIMEOUT_MS); + + const startTime = Date.now(); + + // Build URL params + const usageMeta = await buildUsageMetadata(chatbot); + const params = new URLSearchParams(); + if (usageMeta.app) params.set("app", usageMeta.app); + if (usageMeta.language) params.set("language", usageMeta.language); + + const response = await callApi( + `${config.apiServerUrl}/transcribe${params.toString() ? `?${params.toString()}` : ""}`, + { + method: "POST", + body: formData, + signal, + } + ); + + if (!response.ok) { + throw new Error(`HTTP ${response.status}: ${response.statusText}`); + } + + const responseJson = await response.json(); + const endTime = Date.now(); + const transcriptionDurationMillis = endTime - startTime; + const transcript = responseJson.text; + const wc = transcript.split(" ").length; + + logger.debug( + `[Refinement ${requestId}] Transcribed ${Math.round( + audioDurationMillis / 1000 + )}s of audio into ${wc} words in ${Math.round( + transcriptionDurationMillis / 1000 + )}s` + ); + + if (transcript.length === 0) { + logger.warn(`[Refinement ${requestId}] Received empty transcription`); + } + + return transcript; + } catch (error) { + logger.error(`[Refinement ${requestId}] Upload failed:`, error); + throw error; + } +} + async function uploadAudio( audioBlob: Blob, audioDurationMillis: number, diff --git a/src/UniversalDictationModule.ts b/src/UniversalDictationModule.ts index 60c684f90d..bb39e5c3e6 100644 --- a/src/UniversalDictationModule.ts +++ b/src/UniversalDictationModule.ts @@ -1,6 +1,12 @@ import { Observation } from "./dom/Observation"; import { addChild } from "./dom/DOMModule"; -import { createDictationMachine } from "./state-machines/DictationMachine"; +import { + createDictationMachine, + DictationTranscribedEvent, + DictationSpeechStoppedEvent, + 
DictationAudioConnectedEvent, + DictationSessionAssignedEvent, +} from "./state-machines/DictationMachine"; import { interpret } from "xstate"; import EventBus from "./events/EventBus.js"; import { IconModule } from "./icons/IconModule"; @@ -328,6 +334,19 @@ export class UniversalDictationModule { }; const hideButton = () => { + // Trigger refinement if dictation is active for this element + if (target.machine) { + const state = target.machine.getSnapshot(); + // Check if machine is in a state where refinement makes sense + if (state.matches("listening")) { + console.debug("[UniversalDictation] Field blur - triggering refinement for element:", element); + target.machine.send({ + type: "saypi:refineTranscription", + targetElement: element, + }); + } + } + // Use setTimeout to delay hiding so click event can fire first setTimeout(() => { if (button && !this.currentActiveTarget) { @@ -387,14 +406,14 @@ export class UniversalDictationModule { element.addEventListener("input", handleContentChange); // Listen for dictation updates to track dictated content - EventBus.on("dictation:contentUpdated", (data) => { + EventBus.on("dictation:contentUpdated", (data: { targetElement: HTMLElement }) => { if (data.targetElement === element) { markDictationUpdate(); } }); - + // Listen for dictation termination due to manual edit - EventBus.on("dictation:terminatedByManualEdit", (data) => { + EventBus.on("dictation:terminatedByManualEdit", (data: { targetElement: HTMLElement; reason: string }) => { if (data.targetElement === element && this.currentActiveTarget?.element === element) { console.debug("Dictation terminated due to manual edit on element:", element); // Clean up the active dictation state @@ -1026,7 +1045,7 @@ export class UniversalDictationModule { // Events with additional data [USER_STOPPED_SPEAKING, AUDIO_DEVICE_CONNECTED, SESSION_ASSIGNED].forEach((eventName) => { - EventBus.on(eventName, (detail) => { + EventBus.on(eventName, (detail: Omit | Omit | Omit) => { if 
(detail) { // sanitise the detail object to replace any `frames` property with `[REDACTED]` const sanitisedDetail = { ...detail }; @@ -1042,7 +1061,7 @@ export class UniversalDictationModule { }); // Listen for transcription events - EventBus.on("saypi:transcription:completed", (detail) => { + EventBus.on("saypi:transcription:completed", (detail: Omit) => { logger.debug(`[UniversalDictationModule] Forwarding transcription to dictation machine`, detail); dictationService.send({ type: "saypi:transcribed", ...detail }); }); @@ -1052,6 +1071,17 @@ export class UniversalDictationModule { dictationService.send("saypi:transcribeFailed"); }); + // Listen for refinement events (Phase 2 dual-phase transcription) + // Refinements are handled internally by DictationMachine via Promise callbacks + // These listeners are for telemetry/debugging only + EventBus.on("saypi:refinement:completed", (detail: {requestId: string, text: string}) => { + logger.debug(`[UniversalDictationModule] Refinement ${detail.requestId} completed: ${detail.text.substring(0, 50)}...`); + }); + + EventBus.on("saypi:refinement:failed", (detail: {requestId: string, error: any}) => { + logger.warn(`[UniversalDictationModule] Refinement ${detail.requestId} failed:`, detail.error); + }); + EventBus.on("saypi:transcribedEmpty", () => { logger.debug(`[UniversalDictationModule] Forwarding empty transcription to dictation machine`); dictationService.send("saypi:transcribedEmpty"); diff --git a/src/audio/AudioSegmentPersistence.ts b/src/audio/AudioSegmentPersistence.ts new file mode 100644 index 0000000000..e771798da8 --- /dev/null +++ b/src/audio/AudioSegmentPersistence.ts @@ -0,0 +1,49 @@ +/** + * Shared utility for persisting audio segments to disk for debugging. + * Used by AudioInputMachine (VAD segments) and DictationMachine (refinement chunks). + */ + +import { config } from "../ConfigModule"; + +/** + * Persists an audio blob to disk via the background script's downloads API. 
+ * Only saves if the `keepSegments` config is enabled. + * + * @param audioBlob - The audio blob to save + * @param captureTimestamp - When the audio was captured (or refinement started) + * @param duration - Duration of the audio in milliseconds + * @param prefix - Filename prefix (e.g., "saypi-segment" or "saypi-refinement") + */ +export function persistAudioSegment( + audioBlob: Blob, + captureTimestamp: number, + duration: number, + prefix: string = "saypi-segment" +): void { + try { + const keep = config.keepSegments === true || config.keepSegments === 'true'; + if (!keep || audioBlob.size === 0) { + return; + } + + // Create a unique filename with timestamps + const startedAt = captureTimestamp - Math.round(duration); + const startedIso = new Date(startedAt).toISOString().replace(/[:.]/g, "-"); + const endedIso = new Date(captureTimestamp).toISOString().replace(/[:.]/g, "-"); + const filename = `SayPiSegments/${prefix}_${startedIso}_to_${endedIso}_${Math.round(duration)}ms.wav`; + + const reader = new FileReader(); + reader.onloadend = () => { + const base64Data = (reader.result as string).split(",")[1]; + // Send to background to save via downloads API + chrome.runtime.sendMessage({ + type: "SAVE_SEGMENT_WAV", + filename, + base64: base64Data + }, () => void 0); + }; + reader.readAsDataURL(audioBlob); + } catch (e) { + console.warn(`Failed to persist ${prefix} locally:`, e); + } +} diff --git a/src/state-machines/AudioInputMachine.ts b/src/state-machines/AudioInputMachine.ts index 98c4afa7a2..c655a2d8f8 100644 --- a/src/state-machines/AudioInputMachine.ts +++ b/src/state-machines/AudioInputMachine.ts @@ -11,7 +11,7 @@ import { logger } from "../LoggingModule"; import { likelySupportsOffscreen, getBrowserInfo } from "../UserAgentModule"; import { VADPreset } from "../vad/VADConfigs"; import { ChatbotIdentifier } from "../chatbots/ChatbotIdentifier"; -import { config } from "../ConfigModule"; +import { persistAudioSegment } from 
"../audio/AudioSegmentPersistence"; setupInterceptors(); @@ -137,29 +137,7 @@ EventBus.on("saypi:userStoppedSpeaking", (data: { logger.debug(`Reconstructed Blob size: ${audioBlob.size} bytes`); // Optionally persist the segment if keepSegments is enabled - try { - const keep = String(config.keepSegments || '').toLowerCase() === 'true'; - if (keep && audioBlob.size > 0) { - // Create a unique filename with timestamps - const startedAt = data.captureTimestamp - Math.round(data.duration); - const startedIso = new Date(startedAt).toISOString().replace(/[:.]/g, "-"); - const endedIso = new Date(data.captureTimestamp).toISOString().replace(/[:.]/g, "-"); - const filename = `SayPiSegments/saypi-segment_${startedIso}_to_${endedIso}_${Math.round(data.duration)}ms.wav`; - const reader = new FileReader(); - reader.onloadend = () => { - const base64Data = (reader.result as string).split(",")[1]; - // Send to background to save via downloads API - chrome.runtime.sendMessage({ - type: "SAVE_SEGMENT_WAV", - filename, - base64: base64Data - }, () => void 0); - }; - reader.readAsDataURL(audioBlob); - } - } catch (e) { - console.warn("Failed to persist segment locally:", e); - } + persistAudioSegment(audioBlob, data.captureTimestamp, data.duration, "saypi-segment"); // Emit both blob and duration for transcription EventBus.emit("audio:dataavailable", { diff --git a/src/state-machines/ConversationMachine.ts b/src/state-machines/ConversationMachine.ts index e3e904dd63..58903d8108 100644 --- a/src/state-machines/ConversationMachine.ts +++ b/src/state-machines/ConversationMachine.ts @@ -1096,7 +1096,10 @@ const machine = createMachine; // Global transcriptions for backwards compatibility @@ -87,6 +117,7 @@ interface DictationContext { userIsSpeaking: boolean; timeUserStoppedSpeaking: number; timeUserStartedSpeaking: number; // Track when current speech started + timeLastTranscriptionReceived: number; // Track when last transcription was received (for endpoint timing) sessionId?: string; 
targetElement?: HTMLElement; // The input field being dictated to accumulatedText: string; // Text accumulated during this dictation session @@ -105,6 +136,16 @@ interface DictationContext { * that we always know which target the very first portion of audio belongs to. */ speechStartTarget?: HTMLElement; + + // Phase 2 (Refinement) - See doc/DUAL_PHASE_TRANSCRIPTION.md + audioSegmentsByTarget: Record; + refinementPendingForTargets: Set; + pendingRefinements: Map; } // Define the state schema @@ -130,6 +171,7 @@ type DictationStateSchema = { states: { transcribing: {}; accumulating: {}; + refining: {}; }; }; }; @@ -328,6 +370,84 @@ function getTranscriptionsForTarget(context: DictationContext, targetElement: HT return context.transcriptionsByTarget[targetId] || {}; } +/** + * Handle completion of a refinement request (Phase 2). + * Replaces Phase 1 transcriptions with the refined result. + * Note: Multiple passes may occur per target (false-positive EOS detection). + */ +function handleRefinementComplete( + context: DictationContext, + requestId: string, + transcription: string +): void { + const meta = context.pendingRefinements.get(requestId); + if (!meta) { + console.warn(`[DictationMachine] No metadata found for refinement ${requestId} - already cleaned up?`); + return; + } + + const { targetId, targetElement, segmentCount } = meta; + + console.debug( + `[DictationMachine] Received refinement transcription [${requestId}] for target ${targetId}: ${transcription}` + ); + + // Normalize the refined transcription + transcription = normalizeTranscriptionText(transcription); + + // Get Phase 1 sequences for this target + const oldTranscriptions = context.transcriptionsByTarget[targetId] || {}; + const phase1Sequences = Object.keys(oldTranscriptions) + .map(k => parseInt(k, 10)) + .filter(seq => seq > 0); // Only clear Phase 1 (positive sequences), not previous refinements (negative keys) + + // Clear Phase 1 transcriptions from global storage + 
phase1Sequences.forEach(seq => { + delete context.transcriptions[seq]; + delete context.transcriptionTargets[seq]; + }); + + console.debug( + `[DictationMachine] Cleared ${phase1Sequences.length} Phase 1 transcriptions: [${phase1Sequences.join(', ')}]` + ); + + // Store refinement result using negative timestamp as key (avoids collision with Phase 1 sequences) + const refinementKey = -(Date.now()); + context.transcriptionsByTarget[targetId] = { + [refinementKey]: transcription + }; + context.transcriptions[refinementKey] = transcription; + context.transcriptionTargets[refinementKey] = targetElement; + + // Calculate final text (initial + refinement) + const initialText = context.initialTextByTarget[targetId] || ""; + const finalText = smartJoinTwoTexts(initialText, transcription); + + setTextInTarget(finalText, targetElement, true); // Replace all content + + // Update accumulated text if this is the current target + if (targetElement === context.targetElement) { + context.accumulatedText = finalText; + } + + // Clean up refinement metadata + context.pendingRefinements.delete(requestId); + + // Emit refinement complete event + EventBus.emit("dictation:refined", { + targetElement, + targetId, + requestId, + refinedText: transcription, + finalText, + segmentCount + }); + + console.debug( + `[DictationMachine] Refinement ${requestId} complete for target ${targetId}. Final text: ${finalText}` + ); +} + function mapTargetForSequence( context: DictationContext, expectedSequenceNumber: number, @@ -357,6 +477,108 @@ function mapTargetForSequence( return finalTarget; } +// Maximum audio buffer per target (120s) - prevents unbounded memory growth +const MAX_AUDIO_BUFFER_DURATION_MS = 120000; + +// Maximum delay for refinement endpoint detection (8s) +// Longer than prompt-based interactions to reduce premature refinements during continuous dictation +const REFINEMENT_MAX_DELAY_MS = 8000; + +/** + * Store audio segment for later refinement (Phase 2). 
+ * Buffers accumulate up to MAX_AUDIO_BUFFER_DURATION_MS and persist across EOS events. + */ +function storeAudioSegment( + context: DictationContext, + targetElement: HTMLElement, + blob: Blob, + frames: Float32Array, + duration: number, + sequenceNumber: number, + captureTimestamp?: number +): void { + const targetId = getTargetElementId(targetElement); + + // Initialize array for this target if it doesn't exist + if (!context.audioSegmentsByTarget[targetId]) { + context.audioSegmentsByTarget[targetId] = []; + } + + const segments = context.audioSegmentsByTarget[targetId]; + + // Calculate total duration including the new segment + const currentTotalDuration = segments.reduce((sum, seg) => sum + seg.duration, 0); + const newTotalDuration = currentTotalDuration + duration; + + // If adding this segment would exceed the max buffer duration, trim old segments + if (newTotalDuration > MAX_AUDIO_BUFFER_DURATION_MS) { + let excessDuration = newTotalDuration - MAX_AUDIO_BUFFER_DURATION_MS; + let segmentsToRemove = 0; + + // Remove oldest segments until we're under the limit + for (let i = 0; i < segments.length && excessDuration > 0; i++) { + excessDuration -= segments[i].duration; + segmentsToRemove++; + } + + if (segmentsToRemove > 0) { + const removed = segments.splice(0, segmentsToRemove); + console.debug( + `Trimmed ${segmentsToRemove} old audio segments for target ${targetId} to stay under ${MAX_AUDIO_BUFFER_DURATION_MS}ms limit. ` + + `Removed ${removed.reduce((sum, seg) => sum + seg.duration, 0)}ms of audio.` + ); + } + } + + // Store the new segment + segments.push({ + blob, + frames, + duration, + sequenceNumber, + captureTimestamp, + }); + + // Mark this target as pending refinement + context.refinementPendingForTargets.add(targetId); + + const totalDuration = segments.reduce((sum, seg) => sum + seg.duration, 0); + console.debug( + `Stored audio segment ${sequenceNumber} for target ${targetId}. 
Total: ${segments.length} segments, ${(totalDuration / 1000).toFixed(1)}s of audio` + ); +} + +/** + * Clear audio buffers for a specific target element. + * @param context - The dictation context + * @param targetId - The target element ID to clear buffers for + */ +function clearAudioForTarget(context: DictationContext, targetId: string): void { + delete context.audioSegmentsByTarget[targetId]; + context.refinementPendingForTargets.delete(targetId); + + // Clear any pending refinements for this target (UUID-based tracking) + for (const [requestId, meta] of context.pendingRefinements.entries()) { + if (meta.targetId === targetId) { + context.pendingRefinements.delete(requestId); + console.debug(`Cleared pending refinement ${requestId} for target ${targetId}`); + } + } + + console.debug(`Cleared audio buffers for target ${targetId}`); +} + +/** + * Clear all audio buffers. + * @param context - The dictation context + */ +function clearAllAudioBuffers(context: DictationContext): void { + context.audioSegmentsByTarget = {}; + context.refinementPendingForTargets.clear(); + context.pendingRefinements.clear(); + console.debug('Cleared all audio buffers'); +} + /** * Common helper for preparing and uploading an audio segment. 
*/ @@ -368,7 +590,8 @@ function uploadAudioSegment( sessionId?: string, maxRetries: number = 3, captureTimestamp?: number, - clientReceiveTimestamp?: number + clientReceiveTimestamp?: number, + frames?: Float32Array ) { const expectedSequenceNumber = getCurrentSequenceNumber() + 1; const finalTarget = mapTargetForSequence( @@ -387,10 +610,23 @@ function uploadAudioSegment( )}` ); - // Extract input context for dictation mode + // Extract input context for dictation mode (Phase 1) const { inputType, inputLabel } = getInputContext(finalTarget); console.debug(`Input context for transcription: type="${inputType}", label="${inputLabel}"`); + // Store audio segment for Phase 2 refinement if frames are available + if (frames) { + storeAudioSegment( + context, + finalTarget, + audioBlob, + frames, + duration, + expectedSequenceNumber, + captureTimestamp + ); + } + uploadAudioWithRetry( audioBlob, duration, @@ -400,7 +636,14 @@ function uploadAudioSegment( captureTimestamp, clientReceiveTimestamp, inputType || undefined, - inputLabel || undefined + inputLabel || undefined, + (sequenceNum) => { + // Keep transcription target mapping in sync even if sequence numbers shift + if (sequenceNum !== expectedSequenceNumber) { + delete context.transcriptionTargets[expectedSequenceNumber]; + } + context.transcriptionTargets[sequenceNum] = finalTarget; + } ).then((sequenceNum) => { console.debug(`Sent transcription ${sequenceNum} to target`, finalTarget); if (sequenceNum !== expectedSequenceNumber) { @@ -860,11 +1103,15 @@ const machine = createMachine(), + pendingRefinements: new Map(), }, id: "dictation", initial: "idle", @@ -1086,6 +1333,13 @@ const machine = createMachine { + // Update the timestamp for endpoint detection + context.timeLastTranscriptionReceived = Date.now(); + let transcription = event.text; const sequenceNumber = event.sequenceNumber; const mergedSequences = event.merged || []; + + // NOTE: Refinement responses bypass event bus (handled in 
performContextualRefinement). + // This handler ONLY processes Phase 1 (live streaming) transcriptions. + // ---- NORMALISE ELLIPSES ---- // Convert any ellipsis—either the single Unicode "…" character or the // three-dot sequence "..." — into a single space so downstream merging // sees consistent whitespace. Then collapse *spaces or tabs* (but not // line breaks) and trim the string. const originalTranscription = transcription; - transcription = transcription - .replace(/\u2026/g, " ") // "…" → space - .replace(/\.{3}/g, " ") // "..." → space - .replace(/[ \t]{2,}/g, " ") // collapse runs of spaces/tabs but keep line-breaks - .trim(); + transcription = normalizeTranscriptionText(transcription); console.debug( `Dictation transcript [${sequenceNumber}]: ${transcription}` + @@ -1374,29 +1657,36 @@ const machine = createMachine undefined, accumulatedText: "", transcriptionTargets: () => ({}), provisionalTranscriptionTarget: () => undefined, targetSwitchesDuringSpeech: () => undefined, speechStartTarget: () => undefined, + audioSegmentsByTarget: () => ({}), + refinementPendingForTargets: () => new Set(), + pendingRefinements: () => new Map(), }), finalizeDictation: (context: DictationContext) => { // Generate final merged text from current target's transcriptions let finalText = context.accumulatedText; - + if (context.targetElement) { const targetTranscriptions = getTranscriptionsForTarget(context, context.targetElement); finalText = computeFinalText(targetTranscriptions, [], finalText, "", false); } - + + // Clear all audio buffers when dictation is finalized + clearAllAudioBuffers(context); + // Emit event that dictation is complete EventBus.emit("dictation:complete", { targetElement: context.targetElement, text: finalText, }); - + console.log("Dictation completed for target:", context.targetElement, "with text:", finalText); }, @@ -1504,7 +1794,8 @@ const machine = createMachine { + console.debug("[DictationMachine] performContextualRefinement triggered"); + + // 
Determine which target(s) to refine + let targetsToRefine: HTMLElement[] = []; + + if (event.type === "saypi:refineTranscription") { + // Explicit refinement request for a specific target + targetsToRefine = [(event as DictationRefineTranscriptionEvent).targetElement]; + } else { + // Endpoint-triggered: refine ALL pending targets (not just current one) + // This handles the case where user switched targets mid-dictation + for (const targetId of context.refinementPendingForTargets) { + // Find the target element by looking through transcription targets + const targetElement = Object.values(context.transcriptionTargets).find( + el => getTargetElementId(el) === targetId + ); + + if (targetElement) { + targetsToRefine.push(targetElement); + } else { + console.warn(`[DictationMachine] No element found for pending refinement target ${targetId}`); + context.refinementPendingForTargets.delete(targetId); + } + } + } + + if (targetsToRefine.length === 0) { + console.debug("[DictationMachine] No targets to refine"); + return; + } + + // Process each target + for (const targetElement of targetsToRefine) { + const targetId = getTargetElementId(targetElement); + const segments = context.audioSegmentsByTarget[targetId]; + + if (!segments || segments.length === 0) { + console.debug(`[DictationMachine] No audio segments to refine for target ${targetId}`); + // Clear the pending flag even if no segments (cleanup) + context.refinementPendingForTargets.delete(targetId); + continue; + } + + // Skip refinement if only 1 segment (no additional context for improvement) + if (segments.length === 1) { + console.debug(`[DictationMachine] Skipping refinement for target ${targetId} until more segments arrive`); + continue; // Keep buffering, don't clear + } + + // Remove pending flag now that refinement is in-flight to avoid duplicate submissions + context.refinementPendingForTargets.delete(targetId); + + // Always refine ALL segments for maximum context (full contextual refinement) + 
console.debug( + `[DictationMachine] Starting refinement for target ${targetId} with ${segments.length} segments (full context)` + ); + + // Concatenate ALL audio segments from session start + const totalDuration = segments.reduce((sum, seg) => sum + seg.duration, 0); + const totalFrames = segments.reduce((sum, seg) => sum + seg.frames.length, 0); + + // Combine all frames into a single Float32Array + const combinedFrames = new Float32Array(totalFrames); + let offset = 0; + for (const segment of segments) { + combinedFrames.set(segment.frames, offset); + offset += segment.frames.length; + } + + // Convert combined frames to WAV blob + const combinedBlob = convertToWavBlob(combinedFrames); + + console.debug( + `[DictationMachine] Concatenated ${segments.length} segments: ${totalDuration}ms, ${totalFrames} frames, ${combinedBlob.size} bytes` + ); + + // For logging, treat the capture timestamp as the time we initiate refinement. + // Refinement intentionally reuses the *current* timestamp so Telemetry treats the + // Phase 2 upload as new work rather than flagging the earlier capture delay. 
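The frame-concatenation step above is a plain pack-and-copy into one contiguous buffer. A self-contained sketch (the function name `concatFrames` is invented here for illustration; it is not part of the module):

```typescript
// Pack multiple buffered Float32Array segments into a single contiguous array,
// as done before WAV encoding: allocate once, then copy each segment at its
// running offset so sample order is preserved.
function concatFrames(segments: Float32Array[]): Float32Array {
  const total = segments.reduce((sum, s) => sum + s.length, 0);
  const combined = new Float32Array(total);
  let offset = 0;
  for (const s of segments) {
    combined.set(s, offset); // copy this segment's samples at the current offset
    offset += s.length;
  }
  return combined;
}
```

A single allocation plus `TypedArray.set` avoids repeated reallocation that an incremental append would incur.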
+ const refinementStartTimestamp = Date.now(); + + // Optionally persist the refinement chunk if keepSegments is enabled + persistAudioSegment(combinedBlob, refinementStartTimestamp, totalDuration, "saypi-refinement"); + + // Generate UUID for this refinement request (separate from Phase 1 sequence tracking) + const requestId = crypto.randomUUID(); + + // Track refinement metadata independently + context.pendingRefinements.set(requestId, { + targetId, + targetElement, + segmentCount: segments.length, + timestamp: refinementStartTimestamp + }); + + console.debug( + `[DictationMachine] Refinement ${requestId} initiated for target ${targetId}` + ); + + // Upload using bare-bones refinement function (no sequence number tracking) + uploadAudioForRefinement( + combinedBlob, + totalDuration, + requestId, + context.sessionId, + 3 // max retries + ).then((transcriptionText) => { + // Handle response inline (no event bus routing needed) + handleRefinementComplete(context, requestId, transcriptionText); + }).catch((error) => { + console.error(`[DictationMachine] Refinement ${requestId} failed:`, error); + // Clean up metadata on failure + context.pendingRefinements.delete(requestId); + // Note: We don't clear audio segments - they may be retried later + }); + } // end for loop over targetsToRefine + }, }, services: {}, guards: { @@ -1650,8 +2068,69 @@ const machine = createMachine { + // Check if we have pending refinements and not currently transcribing + return context.refinementPendingForTargets.size > 0 && !context.isTranscribing; + }, + hasSegmentsForRefinement: (context: DictationContext, event: DictationEvent) => { + if (event.type !== "saypi:refineTranscription") { + return false; + } + const targetElement = + (event as DictationRefineTranscriptionEvent).targetElement ?? 
context.targetElement; + if (!targetElement) { + return false; + } + const targetId = getTargetElementId(targetElement); + const segments = context.audioSegmentsByTarget[targetId]; + return Array.isArray(segments) && segments.length > 0; + }, + }, + delays: { + refinementDelay: (context: DictationContext, event: DictationEvent) => { + // Only calculate delay for transcription events + if (event.type !== "saypi:transcribed") { + return 0; + } + + const transcriptionEvent = event as DictationTranscribedEvent; + + // Use configured max delay for dictation endpoint detection + const maxDelay = REFINEMENT_MAX_DELAY_MS; + + // Use pFinishedSpeaking from API, default to 1 if not provided + let probabilityFinished = transcriptionEvent.pFinishedSpeaking ?? 1; + + // Use tempo from API, default to 0 if not provided (neutral) + let tempo = transcriptionEvent.tempo ?? 0; + // Clamp tempo to [0, 1] + tempo = Math.max(0, Math.min(1, tempo)); + + const scheduledAt = Date.now(); + const timeElapsed = scheduledAt - context.timeLastTranscriptionReceived; + const finalDelay = calculateDelay( + context.timeLastTranscriptionReceived, + probabilityFinished, + tempo, + maxDelay + ); + + console.debug( + "[DictationMachine] refinementDelay:", + JSON.stringify({ + seq: transcriptionEvent.sequenceNumber, + pFinished: probabilityFinished, + tempo, + maxDelay, + timeElapsed, + finalDelay, + scheduledAt + }) + ); + + return finalDelay; + }, }, - delays: {}, } ); @@ -1660,4 +2139,4 @@ export function createDictationMachine(targetElement?: HTMLElement) { return machine; } -export { machine as DictationMachine }; \ No newline at end of file +export { machine as DictationMachine }; diff --git a/test/state-machines/DictationMachine-Refinement.spec.ts b/test/state-machines/DictationMachine-Refinement.spec.ts new file mode 100644 index 0000000000..34dc6896a2 --- /dev/null +++ b/test/state-machines/DictationMachine-Refinement.spec.ts @@ -0,0 +1,1194 @@ +import { describe, it, expect, vi, beforeEach, 
afterEach, beforeAll } from 'vitest'; +import { interpret } from 'xstate'; +import EventBus from '../../src/events/EventBus.js'; + +// Mock dependencies +vi.mock('../../src/TranscriptionModule', () => ({ + uploadAudioWithRetry: vi.fn((...args: any[]) => { + const callback = args[9]; + if (typeof callback === 'function') { + callback(1); + } + return Promise.resolve(1); + }), + uploadAudioForRefinement: vi.fn((blob, duration, requestId) => { + // Mock refinement upload - returns full transcription of audio + return Promise.resolve('Refined transcription text'); + }), + isTranscriptionPending: vi.fn(() => false), + clearPendingTranscriptions: vi.fn(), + getCurrentSequenceNumber: vi.fn(() => 0), +})); + +vi.mock('../../src/ConfigModule', () => ({ + config: { + apiServerUrl: 'http://localhost:3000', + }, +})); + +vi.mock('../../src/prefs/PreferenceModule', () => ({ + UserPreferenceModule: { + getInstance: () => ({ + getLanguage: vi.fn(() => Promise.resolve('en')), + }), + }, +})); + +vi.mock('../../src/error-management/TranscriptionErrorManager', () => ({ + default: { + recordAttempt: vi.fn(), + }, +})); + +vi.mock('../../src/TranscriptMergeService', () => ({ + TranscriptMergeService: vi.fn().mockImplementation(() => ({ + mergeTranscriptsLocal: vi.fn((transcripts) => { + return Object.keys(transcripts) + .sort((a, b) => parseInt(a) - parseInt(b)) + .map(key => transcripts[key]) + .join(' '); + }), + })), +})); + +vi.mock('../../src/audio/AudioEncoder', () => ({ + convertToWavBlob: vi.fn((frames: Float32Array) => { + // Return a mock blob with size proportional to frame count + return new Blob([new ArrayBuffer(frames.length * 4)], { type: 'audio/wav' }); + }), +})); + +vi.mock('../../src/TimerModule', () => ({ + calculateDelay: vi.fn(() => 100), // Short delay for testing +})); + +// Mock EventBus +vi.spyOn(EventBus, 'emit'); + +// Import the machine after mocks are set up +import { createDictationMachine } from '../../src/state-machines/DictationMachine'; +import * as 
TranscriptionModule from '../../src/TranscriptionModule'; +import * as AudioEncoder from '../../src/audio/AudioEncoder'; +import * as TimerModule from '../../src/TimerModule'; + +const resolveUpload = (sequence: number) => (...args: any[]) => { + const callback = args[9] as ((seq: number) => void) | undefined; + if (typeof callback === 'function') { + callback(sequence); + } + return Promise.resolve(sequence); +}; + +/** + * NOTE: These tests were updated for UUID-based refinement tracking. + * + * Key changes from sequence-based approach: + * - Refinements no longer use sequence numbers or `saypi:transcribed` events + * - Refinements are uploaded via `uploadAudioForRefinement()` (not `uploadAudioWithRetry()`) + * - Refinement responses are handled via Promise callbacks (not event bus) + * - Refinement tracking uses `pendingRefinements` Map (requestId → metadata) + * - Refinement transcriptions use negative keys to avoid collision with Phase 1 sequences + * + * To test refinement completion: + * 1. Trigger refinement with `saypi:refineTranscription` event + * 2. Wait for `uploadAudioForRefinement` Promise to resolve + * 3. Check `pendingRefinements` Map is cleared + * 4. Check `transcriptionsByTarget` has negative key with refinement text + * 5. 
Check Phase 1 sequences (positive keys) are deleted + */ +describe('DictationMachine - Dual-Phase Refinement', () => { + let service: any; + let inputElement: HTMLInputElement; + + beforeAll(() => { + inputElement = document.createElement('input'); + inputElement.id = 'test-input'; + inputElement.name = 'testField'; + inputElement.placeholder = 'Test input'; + }); + + beforeEach(() => { + vi.clearAllMocks(); + + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockClear(); + vi.mocked(TranscriptionModule.uploadAudioForRefinement).mockClear(); + vi.mocked(TranscriptionModule.uploadAudioForRefinement).mockImplementation((blob, duration, requestId) => { + // Default: Return full refined transcription + return Promise.resolve('refined transcription'); + }); + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(0); + vi.mocked(EventBus.emit).mockClear(); + vi.mocked(AudioEncoder.convertToWavBlob).mockClear(); + vi.mocked(TimerModule.calculateDelay).mockReturnValue(100); + + inputElement.value = ''; + + const machine = createDictationMachine(); + service = interpret(machine); + }); + + afterEach(() => { + if (service) { + service.stop(); + } + }); + + describe('Audio Segment Buffering', () => { + it('should buffer audio segments when frames are provided', async () => { + service.start(); + + // Start dictation + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Speaking events with frames + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + 
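The key-space convention described in the note above (positive sequence numbers for live Phase 1 segments, a negative timestamp for the refinement) can be sketched in isolation. The names `Transcripts` and `applyRefinement` below are invented for illustration and do not appear in the module:

```typescript
// Minimal sketch of the refinement key scheme: Phase 1 entries use positive
// sequence numbers; the refined result is stored under a negative timestamp,
// so the two phases can never collide.
type Transcripts = Record<number, string>;

function applyRefinement(
  byTarget: Record<string, Transcripts>,
  allTranscripts: Transcripts,
  targetId: string,
  refinedText: string,
  now: () => number = Date.now
): number {
  // Remove only Phase 1 entries (positive keys) from the global map.
  for (const key of Object.keys(byTarget[targetId] ?? {}).map(Number)) {
    if (key > 0) delete allTranscripts[key];
  }
  const refinementKey = -now(); // negative => disjoint from sequence numbers
  byTarget[targetId] = { [refinementKey]: refinedText }; // supersede live entries
  allTranscripts[refinementKey] = refinedText;
  return refinementKey;
}
```

Injecting the clock (`now`) keeps the sketch deterministic under test, which is also why the spec below can assert on exact key values.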
vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + // Stop speaking - should buffer the audio + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + // Verify audio was uploaded for Phase 1 + expect(TranscriptionModule.uploadAudioWithRetry).toHaveBeenCalled(); + + // Check that the context has buffered audio + const state = service.getSnapshot(); + const targetId = `${inputElement.id || inputElement.name}`; + expect(state.context.audioSegmentsByTarget[targetId]).toBeDefined(); + expect(state.context.audioSegmentsByTarget[targetId].length).toBe(1); + }); + + it('should not buffer audio segments when frames are missing', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + // Stop speaking WITHOUT frames + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + // No frames parameter + }); + + // Check that no audio was buffered + const state = service.getSnapshot(); + const targetId = `${inputElement.id || inputElement.name}`; + expect(state.context.audioSegmentsByTarget[targetId]).toBeUndefined(); + }); + + it('should trim old segments when exceeding 120s buffer limit', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 
'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add 13 segments of 10 seconds each (130s total, exceeds 120s limit) + for (let i = 0; i < 13; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(10000); + const mockBlob = new Blob([new ArrayBuffer(40000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 10000, // 10 seconds + blob: mockBlob, + frames: mockFrames, + }); + + // Simulate transcription response + service.send({ + type: 'saypi:transcribed', + text: `segment ${i}`, + sequenceNumber: i + 2, + }); + } + + // Check that buffer was trimmed + const state = service.getSnapshot(); + const targetId = `${inputElement.id || inputElement.name}`; + const segments = state.context.audioSegmentsByTarget[targetId]; + + expect(segments).toBeDefined(); + // Should have trimmed the first segment (10s) to stay under 120s + expect(segments.length).toBeLessThanOrEqual(12); + + // Calculate total duration + const totalDuration = segments.reduce((sum: number, seg: any) => sum + seg.duration, 0); + expect(totalDuration).toBeLessThanOrEqual(120000); + }); + }); + + describe('Refinement Delay Calculation', () => { + it('should calculate refinement delay based on pFinishedSpeaking and tempo', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob 
= new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + // Send transcription with endpoint indicators + service.send({ + type: 'saypi:transcribed', + text: 'hello world', + sequenceNumber: 2, + pFinishedSpeaking: 0.9, + tempo: 0.5, + }); + + // Verify calculateDelay was called with correct parameters + expect(TimerModule.calculateDelay).toHaveBeenCalled(); + }); + + it('should not trigger refinement if refinementConditionsMet guard fails', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Don't add any audio segments + + // Try to trigger refinement manually + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + // Should NOT upload refinement because no segments exist + const uploadCalls = vi.mocked(TranscriptionModule.uploadAudioWithRetry).mock.calls; + const refinementCalls = uploadCalls.filter(call => { + // Refinement calls have empty precedingTranscripts + return Object.keys(call[2] || {}).length === 0; + }); + + expect(refinementCalls.length).toBe(0); + }); + }); + + describe('Audio Concatenation', () => { + it('should concatenate multiple audio segments correctly', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 
'saypi:session:assigned', session_id: 'test-session' }); + + // Add 3 segments + const segmentCount = 3; + for (let i = 0; i < segmentCount; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i * 2 + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i * 2 + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: `segment ${i}`, + sequenceNumber: i * 2 + 2, + }); + } + + // Clear the mock to track refinement upload + vi.mocked(AudioEncoder.convertToWavBlob).mockClear(); + + // Trigger refinement manually + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + // Wait for async operations + await new Promise(resolve => setTimeout(resolve, 50)); + + // Verify convertToWavBlob was called with concatenated frames + expect(AudioEncoder.convertToWavBlob).toHaveBeenCalled(); + const concatenatedFrames = vi.mocked(AudioEncoder.convertToWavBlob).mock.calls[0][0]; + + // Should have combined 3 segments of 1000 frames each = 3000 frames + expect(concatenatedFrames.length).toBe(3000); + + // Verify refinement upload was called (UUID-based, no sequence numbers) + expect(TranscriptionModule.uploadAudioForRefinement).toHaveBeenCalled(); + const uploadCall = vi.mocked(TranscriptionModule.uploadAudioForRefinement).mock.calls[0]; + + // uploadAudioForRefinement has signature: (blob, duration, requestId, sessionId, maxRetries) + // Check duration parameter (index 1) + expect(uploadCall[1]).toBe(3000); // Total duration should be 3000ms + + // Check requestId is a UUID string (index 2) + expect(typeof uploadCall[2]).toBe('string'); + + // Check blob was passed (index 0) + 
expect(uploadCall[0]).toBeInstanceOf(Blob); + }); + }); + + describe('Refinement Response Handling', () => { + it('should replace Phase 1 transcriptions with refined result (full contextual refinement)', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add 2 Phase 1 segments + for (let i = 0; i < 2; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i * 2 + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i * 2 + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: `phase1 ${i}`, + sequenceNumber: i * 2 + 2, + }); + } + + // Trigger refinement (UUID-based, no sequence number) + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + // Wait for refinement Promise to resolve + await new Promise(resolve => setTimeout(resolve, 50)); + + // Check that Phase 1 transcriptions were REPLACED (not appended) + const state = service.getSnapshot(); + const targetId = `${inputElement.id || inputElement.name}`; + + // Refinement should be stored with negative key (not sequence number) + const transcriptionKeys = Object.keys(state.context.transcriptionsByTarget[targetId] || {}).map(k => parseInt(k, 10)); + + // Should have exactly 1 transcription (refinement with negative key) + expect(transcriptionKeys.length).toBe(1); + expect(transcriptionKeys[0]).toBeLessThan(0); // Negative timestamp key + + // Phase 1 
transcriptions should be deleted from global storage + expect(state.context.transcriptions[2]).toBeUndefined(); + expect(state.context.transcriptions[4]).toBeUndefined(); + + // Refinement should remain in global storage + const refinementKey = transcriptionKeys[0]; + expect(state.context.transcriptions[refinementKey]).toBe('refined transcription'); + + // Pending refinement metadata should be cleared + expect(state.context.pendingRefinements.size).toBe(0); + }); + + it('should emit dictation:refined event on successful refinement', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add 2 segments (need >=2 for refinement) + for (let i = 0; i < 2; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i * 2 + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i * 2 + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: `phase1 segment ${i}`, + sequenceNumber: i * 2 + 2, + }); + } + + // Clear EventBus mock to track refinement event + vi.mocked(EventBus.emit).mockClear(); + + // Trigger refinement + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + // Wait for refinement Promise to resolve + await new Promise(resolve => setTimeout(resolve, 50)); + + // Verify dictation:refined event was emitted + const refinedEvents = vi.mocked(EventBus.emit).mock.calls.filter( + call => call[0] === 
'dictation:refined' + ); + + expect(refinedEvents.length).toBeGreaterThan(0); + expect(refinedEvents[0][1]).toMatchObject({ + targetElement: inputElement, + refinedText: 'refined transcription', // From default mock in beforeEach + }); + }); + }); + + describe('State Transitions', () => { + it('should transition from accumulating to refining on explicit trigger', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'hello', + sequenceNumber: 2, + }); + + // Should be in accumulating state + expect(service.getSnapshot().matches({ listening: { converting: 'accumulating' } })).toBe(true); + + // Trigger refinement + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(100)); + + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + // Should transition to refining state + expect(service.getSnapshot().matches({ listening: { converting: 'refining' } })).toBe(true); + }); + + it('should return to accumulating after refinement response', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 
'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'hello', + sequenceNumber: 2, + }); + + // Trigger refinement + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(100)); + + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + await new Promise(resolve => setTimeout(resolve, 10)); + + // Should be in refining + expect(service.getSnapshot().matches({ listening: { converting: 'refining' } })).toBe(true); + + // Send refinement response + service.send({ + type: 'saypi:transcribed', + text: 'refined', + sequenceNumber: 100, + }); + + // Should return to accumulating + expect(service.getSnapshot().matches({ listening: { converting: 'accumulating' } })).toBe(true); + }); + }); + + describe('Buffer Cleanup', () => { + it('should clear audio buffers when target switches', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + 
vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + // Verify buffer exists + const targetId1 = `${inputElement.id || inputElement.name}`; + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId1]).toBeDefined(); + + // Create new input element + const inputElement2 = document.createElement('input'); + inputElement2.id = 'test-input-2'; + + // Switch target + service.send({ type: 'saypi:switchTarget', targetElement: inputElement2 }); + + // Original target's buffer should still exist (not cleared on switch) + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId1]).toBeDefined(); + }); + + it('should keep audio buffers after refinement (for incremental refinement)', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'hello', + sequenceNumber: 2, + }); + + const targetId = `${inputElement.id || inputElement.name}`; + + // Verify buffer exists + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId]).toBeDefined(); + 
expect(service.getSnapshot().context.audioSegmentsByTarget[targetId].length).toBe(1); + + // Trigger refinement + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(100)); + + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + // Wait for refinement Promise to resolve + await new Promise(resolve => setTimeout(resolve, 50)); + + // Audio buffer should still exist (kept for future refinements) + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId]).toBeDefined(); + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId].length).toBe(1); + }); + }); + + describe('Error Scenarios', () => { + it('should handle refinement failure gracefully', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add 2 segments (need >=2 for refinement to trigger) + for (let i = 0; i < 2; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i * 2 + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i * 2 + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: `segment ${i}`, + sequenceNumber: i * 2 + 2, + }); + } + + // Clear the mock and set up rejection BEFORE triggering refinement + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockReset(); + + // Make refinement upload fail + 
vi.mocked(TranscriptionModule.uploadAudioForRefinement).mockRejectedValue(new Error('Upload failed')); + + // Manually trigger refinement + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + // Wait for async error handling + await new Promise(resolve => setTimeout(resolve, 100)); + + // With UUID-based tracking, audio buffers are NOT cleared on refinement failure + // (they may be retried later) + const targetId = `${inputElement.id || inputElement.name}`; + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId]).toBeDefined(); + + // But pending refinement metadata should be cleaned up + expect(service.getSnapshot().context.pendingRefinements.size).toBe(0); + }); + + it('should handle missing target element in refinement response', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'hello', + sequenceNumber: 2, + }); + + // Trigger refinement + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(100)); + + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + await new Promise(resolve => setTimeout(resolve, 10)); + + // Manually delete the target mapping to 
simulate the missing element case + const state = service.getSnapshot(); + delete state.context.transcriptionTargets[100]; + + // Send refinement response with missing target + service.send({ + type: 'saypi:transcribed', + text: 'refined', + sequenceNumber: 100, + }); + + // Should handle gracefully without crashing + // The refinement should be discarded + const targetId = `${inputElement.id || inputElement.name}`; + expect(service.getSnapshot().context.transcriptionsByTarget[targetId]).not.toEqual({ + 100: 'refined', + }); + }); + }); + + describe('Incremental Refinement', () => { + it('should skip refinement for single segment until more arrive', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'segment 0', + sequenceNumber: 2, + }); + + // Clear upload mock to track refinement attempts + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockClear(); + + // Try to trigger refinement with only 1 segment + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + await new Promise(resolve => setTimeout(resolve, 10)); + + // Should NOT upload refinement (only 1 segment) + expect(TranscriptionModule.uploadAudioWithRetry).not.toHaveBeenCalled(); + + // Buffer 
should still exist + const targetId = `${inputElement.id || inputElement.name}`; + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId]).toBeDefined(); + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId].length).toBe(1); + }); + + it('should refine ALL segments with full context in subsequent passes', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add 2 segments + for (let i = 0; i < 2; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i * 2 + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i * 2 + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: `segment ${i}`, + sequenceNumber: i * 2 + 2, + }); + } + + // First refinement pass + vi.mocked(AudioEncoder.convertToWavBlob).mockClear(); + + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + await new Promise(resolve => setTimeout(resolve, 50)); // Wait for refinement Promise + + // Should have concatenated 2 segments (2000 frames) + expect(AudioEncoder.convertToWavBlob).toHaveBeenCalled(); + let concatenatedFrames = vi.mocked(AudioEncoder.convertToWavBlob).mock.calls[0][0]; + expect(concatenatedFrames.length).toBe(2000); + + // Add one more segment + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames3 = new Float32Array(1000); + const mockBlob3 = new Blob([new ArrayBuffer(4000)]); + + 
vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(5); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(6)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob3, + frames: mockFrames3, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'segment 2', + sequenceNumber: 6, + }); + + // Second refinement pass - should refine ALL 3 segments (full context) + vi.mocked(AudioEncoder.convertToWavBlob).mockClear(); + + service.send({ + type: 'saypi:refineTranscription', + targetElement: inputElement, + }); + + await new Promise(resolve => setTimeout(resolve, 50)); // Wait for refinement Promise + + // Should have concatenated ALL 3 segments (3000 frames) for full context + expect(AudioEncoder.convertToWavBlob).toHaveBeenCalled(); + concatenatedFrames = vi.mocked(AudioEncoder.convertToWavBlob).mock.calls[0][0]; + expect(concatenatedFrames.length).toBe(3000); + + // Check final state after second refinement completes + const state = service.getSnapshot(); + const targetId = `${inputElement.id || inputElement.name}`; + + // Should have exactly 1 transcription (latest refinement with negative key) + const transcriptionKeys = Object.keys(state.context.transcriptionsByTarget[targetId] || {}).map(k => parseInt(k, 10)); + expect(transcriptionKeys.length).toBe(1); + expect(transcriptionKeys[0]).toBeLessThan(0); // Negative timestamp key + + // Latest refinement text should be stored (from mock default) + const refinementKey = transcriptionKeys[0]; + expect(state.context.transcriptions[refinementKey]).toBe('refined transcription'); + }); + }); + + describe('Concurrent Refinements', () => { + it('should handle concurrent refinements for multiple targets', async () => { + service.start(); + + const inputElement2 = document.createElement('input'); + inputElement2.id = 'test-input-2'; + + // Start dictation on first target + service.send({ type: 'saypi:startDictation', 
targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add segment to first target + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames1 = new Float32Array(1000); + const mockBlob1 = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob1, + frames: mockFrames1, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'target1 text', + sequenceNumber: 2, + }); + + // Switch to second target + service.send({ type: 'saypi:switchTarget', targetElement: inputElement2 }); + + // Add segment to second target + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames2 = new Float32Array(1000); + const mockBlob2 = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(3); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(4)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob2, + frames: mockFrames2, + }); + + service.send({ + type: 'saypi:transcribed', + text: 'target2 text', + sequenceNumber: 4, + }); + + // Both targets should have buffers + const targetId1 = `${inputElement.id || inputElement.name}`; + const targetId2 = `${inputElement2.id || inputElement2.name}`; + + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId1]).toBeDefined(); + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId2]).toBeDefined(); + + // Both should be pending refinement + 
expect(service.getSnapshot().context.refinementPendingForTargets.has(targetId1)).toBe(true); + expect(service.getSnapshot().context.refinementPendingForTargets.has(targetId2)).toBe(true); + }); + + it('should refine all pending targets even after target switch (Codex bug)', async () => { + service.start(); + + const inputElement2 = document.createElement('input'); + inputElement2.id = 'test-input-2'; + inputElement2.name = 'testField2'; + + // Start dictation on first target + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add TWO segments to first target (need >=2 for refinement) + for (let i = 0; i < 2; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i * 2 + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i * 2 + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: `target1 segment ${i}`, + sequenceNumber: i * 2 + 2, + pFinishedSpeaking: 0.9, // High probability - will trigger endpoint delay + tempo: 0.5, + }); + } + + const targetId1 = `${inputElement.id || inputElement.name}`; + + // Verify target1 is pending refinement + expect(service.getSnapshot().context.refinementPendingForTargets.has(targetId1)).toBe(true); + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId1]).toBeDefined(); + + // NOW SWITCH TO SECOND TARGET (this is the bug scenario) + service.send({ type: 'saypi:switchTarget', targetElement: inputElement2 }); + + // Verify current 
target is now target2 + expect(service.getSnapshot().context.targetElement).toBe(inputElement2); + + // Mock refinement upload for target1 + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockClear(); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(100)); + + // Wait for endpoint delay to trigger refinement + await new Promise(resolve => setTimeout(resolve, 150)); + + // The bug was: refinement would check context.targetElement (now target2) + // and find no segments, leaving target1 unrefined + // The fix: iterate over refinementPendingForTargets instead + + // Verify refinement was triggered for target1 (not current target) using UUID-based approach + expect(TranscriptionModule.uploadAudioForRefinement).toHaveBeenCalled(); + + const uploadCall = vi.mocked(TranscriptionModule.uploadAudioForRefinement).mock.calls[0]; + + // uploadAudioForRefinement has signature: (blob, duration, requestId, sessionId, maxRetries) + expect(uploadCall[0]).toBeInstanceOf(Blob); // blob + expect(uploadCall[1]).toBeGreaterThan(0); // duration + expect(typeof uploadCall[2]).toBe('string'); // requestId (UUID) + expect(uploadCall[3]).toBe('test-session'); // sessionId + + // Wait for refinement Promise to resolve (handled internally) + await new Promise(resolve => setTimeout(resolve, 50)); + + // Verify target1 was refined even though current target is target2 + const state = service.getSnapshot(); + + // Refinement should be stored with negative key + const transcriptionKeys = Object.keys(state.context.transcriptionsByTarget[targetId1] || {}).map(k => parseInt(k, 10)); + expect(transcriptionKeys.length).toBe(1); + expect(transcriptionKeys[0]).toBeLessThan(0); // Negative timestamp key + + // Audio buffer for target1 should still exist (kept for future refinements) + expect(state.context.audioSegmentsByTarget[targetId1]).toBeDefined(); + + // Refinement pending flag should be cleared + 
expect(state.context.refinementPendingForTargets.has(targetId1)).toBe(false); + + // Pending refinement metadata should be cleared + expect(state.context.pendingRefinements.size).toBe(0); + }); + }); + + describe('Manual Edit Cleanup', () => { + it('should clear audio buffers and refinement state on manual edit (Codex bug)', async () => { + service.start(); + + service.send({ type: 'saypi:startDictation', targetElement: inputElement }); + service.send({ type: 'saypi:callReady' }); + service.send({ type: 'saypi:audio:connected', deviceId: 'test', deviceLabel: 'Test Mic' }); + service.send({ type: 'saypi:session:assigned', session_id: 'test-session' }); + + // Add multiple segments to build up buffer + for (let i = 0; i < 3; i++) { + service.send({ type: 'saypi:userSpeaking' }); + + const mockFrames = new Float32Array(1000); + const mockBlob = new Blob([new ArrayBuffer(4000)]); + + vi.mocked(TranscriptionModule.getCurrentSequenceNumber).mockReturnValue(i * 2 + 1); + vi.mocked(TranscriptionModule.uploadAudioWithRetry).mockImplementationOnce(resolveUpload(i * 2 + 2)); + + service.send({ + type: 'saypi:userStoppedSpeaking', + duration: 1000, + blob: mockBlob, + frames: mockFrames, + }); + + service.send({ + type: 'saypi:transcribed', + text: `segment ${i}`, + sequenceNumber: i * 2 + 2, + }); + } + + const targetId = `${inputElement.id || inputElement.name}`; + + // Verify we have buffered audio and pending refinement + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId]).toBeDefined(); + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId].length).toBe(3); + expect(service.getSnapshot().context.refinementPendingForTargets.has(targetId)).toBe(true); + + // User manually edits the field + service.send({ + type: 'saypi:manualEdit', + targetElement: inputElement, + newContent: 'user typed this', + oldContent: 'segment 0 segment 1 segment 2', + }); + + // The bug was: audio buffers and refinement state were not cleared + // This could lead to 
stale audio (up to 120s) being refined later + + // Verify audio buffers are cleared + expect(service.getSnapshot().context.audioSegmentsByTarget[targetId]).toBeUndefined(); + + // Verify refinement pending flag is cleared + expect(service.getSnapshot().context.refinementPendingForTargets.has(targetId)).toBe(false); + + // Verify pending refinements are cleared (UUID-based tracking) + expect(service.getSnapshot().context.pendingRefinements.size).toBe(0); + + // Verify transcription state is also cleared (existing behavior) + expect(service.getSnapshot().context.transcriptionsByTarget[targetId]).toBeUndefined(); + + // Should transition to idle + expect(service.getSnapshot().matches('idle')).toBe(true); + }); + }); +}); diff --git a/test/state-machines/DictationMachine-TargetSwitchBreak.spec.ts b/test/state-machines/DictationMachine-TargetSwitchBreak.spec.ts index e040059e17..899844991c 100644 --- a/test/state-machines/DictationMachine-TargetSwitchBreak.spec.ts +++ b/test/state-machines/DictationMachine-TargetSwitchBreak.spec.ts @@ -187,7 +187,8 @@ describe('DictationMachine - Target Switch Audio Breaking', () => { expect.any(Number), expect.any(Number), "text", // inputType from HTML input element - "Enter your name" // inputLabel from placeholder attribute + "Enter your name", // inputLabel from placeholder attribute + expect.any(Function) ); // Verify that only one upload was called (normal processing) @@ -274,4 +275,4 @@ describe('DictationMachine - Target Switch Audio Breaking', () => { expect(service.state.context.targetSwitchesDuringSpeech).toBeUndefined(); }); }); -}); \ No newline at end of file +}); diff --git a/test/state-machines/DictationMachine.spec.ts b/test/state-machines/DictationMachine.spec.ts index 92768624a6..848244d2ab 100644 --- a/test/state-machines/DictationMachine.spec.ts +++ b/test/state-machines/DictationMachine.spec.ts @@ -506,7 +506,8 @@ describe('DictationMachine', () => { expect.any(Number), expect.any(Number), "text", // inputType 
from HTML input element - "Enter your name" // inputLabel from placeholder attribute + "Enter your name", // inputLabel from placeholder attribute + expect.any(Function) ); }); }); @@ -928,4 +929,4 @@ describe('DictationMachine', () => { }); }); }); -}); \ No newline at end of file +});