Description
This enhancement introduces a two-phase transcription process to Say Pi’s Universal Dictation mode.
It preserves low-latency live feedback while greatly improving final transcription accuracy once the user has finished dictating to a field.
Current Behaviour
Universal Dictation streams short audio segments to deliver immediate text updates inside the focused input element.
While responsive, this segmentation limits model context, which can lead to minor inaccuracies, punctuation loss, or awkward phrasing in longer entries.
Proposed Behaviour
The enhanced pipeline adds a contextual second pass triggered automatically when the system determines that dictation for a field is complete.
Phase 1 – Live Streaming
- Short audio windows are transcribed in real time and streamed into the active field.
- Interim text appears immediately, maintaining the natural “live typing” feel.
Phase 2 – Contextual Refinement
- Once the user has finished speaking or moved away from the field, the entire buffered audio is re-processed as a single unit.
- This produces a refined transcript that benefits from full contextual awareness and seamlessly replaces the interim text.
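A minimal sketch of how the two phases could fit together. All names here (`DictationBuffer`, `transcribeSegment`, `transcribeFull`) are illustrative stand-ins for Say Pi's actual streaming and batch transcription calls, not the real API:

```typescript
// Hypothetical two-phase pipeline. transcribeSegment() and transcribeFull()
// stand in for the streaming and batch STT calls; names are illustrative.
type Transcribe = (audio: Blob) => Promise<string>;

class DictationBuffer {
  private chunks: Blob[] = [];

  // Phase 1: transcribe each short window immediately and keep the raw audio
  // so the full utterance is available for the second pass.
  async onAudioSegment(
    segment: Blob,
    transcribeSegment: Transcribe,
    showInterim: (text: string) => void
  ): Promise<void> {
    this.chunks.push(segment);
    showInterim(await transcribeSegment(segment));
  }

  // Phase 2: re-transcribe the whole buffered utterance as one unit so the
  // model sees full context, then hand the refined text to the UI.
  async refine(
    transcribeFull: Transcribe,
    showFinal: (text: string) => void
  ): Promise<void> {
    // Chunks recorded in one session concatenate into a single playable blob.
    const fullAudio = new Blob(this.chunks, { type: "audio/webm" });
    showFinal(await transcribeFull(fullAudio));
    this.discard(); // audio is dropped immediately after refinement
  }

  // Drop all buffered audio without transcribing (cancel / teardown path).
  discard(): void {
    this.chunks.length = 0;
  }
}
```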
Trigger Conditions for the Second Phase
- Field-Change Trigger
  - Fired when the user moves focus away from the current form element (tabbing, clicking another field, etc.).
  - Handled through the Universal Dictation state machine's focus-change transition.
  - Guarantees that all audio relevant to that field has been captured before the final pass is launched.
- Endpoint Trigger
  - Uses Say Pi's existing endpointing mechanism, already implemented in the chat mode's conversation state machine.
  - The speech-to-text API returns a structured endpoint signal containing a model confidence score, which is combined locally with several other predictive features (acoustic energy decay, token stability, elapsed time, etc.).
  - Together, these features estimate the probability, over a short future horizon, that the speaker has finished (see the sketch after this list).
  - When the conversation state machine determines that the endpoint condition has been met, it emits a high-confidence end-of-turn event that drives submission of the user's prompt to the LLM.
  - This same logic can be reused within Universal Dictation to trigger the contextual second-phase transcription, avoiding new detection logic and maintaining consistent behaviour across modes.
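As a rough illustration of the feature combination described above, here is a toy scoring function. The weights, threshold, and feature names are invented for the sketch and are not the actual logic of the conversation state machine:

```typescript
// Toy end-of-turn estimate: blend the API's endpoint confidence with local
// features into a single probability-like score. Weights are placeholders.
interface EndpointFeatures {
  apiConfidence: number;    // structured endpoint signal from the STT API, 0..1
  energyDecay: number;      // how fast acoustic energy is falling off, 0..1
  tokenStability: number;   // fraction of recent interim tokens unchanged, 0..1
  elapsedSilenceMs: number; // time since the last detected speech
}

function endpointProbability(f: EndpointFeatures): number {
  const silence = Math.min(f.elapsedSilenceMs / 1500, 1); // saturate at 1.5 s
  return (
    0.4 * f.apiConfidence +
    0.2 * f.energyDecay +
    0.2 * f.tokenStability +
    0.2 * silence
  );
}

const END_OF_TURN_THRESHOLD = 0.85; // hypothetical cut-off for the event
```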
Both triggers converge on the same refinement workflow, with the state machine ensuring that each field receives exactly one contextual re-transcription.
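One way both triggers might funnel into a single, idempotent refinement step, reusing `END_OF_TURN_THRESHOLD` and the `DictationBuffer` from the sketches above. Every helper here (`bufferFor`, `isDictationTarget`, `onEndpointScore`, `replaceInterimText`, `transcribeFull`) is a hypothetical name for the sketch:

```typescript
// Hypothetical helpers, declared only so the sketch type-checks.
declare function bufferFor(field: HTMLElement): DictationBuffer;
declare function isDictationTarget(el: HTMLElement): boolean;
declare function onEndpointScore(
  cb: (score: number, field: HTMLElement) => void
): void;
declare function replaceInterimText(field: HTMLElement, text: string): void;
declare const transcribeFull: (audio: Blob) => Promise<string>;

// Both triggers call finishField(); a per-field guard ensures each field
// receives exactly one contextual re-transcription per dictation session.
const alreadyRefined = new WeakSet<HTMLElement>();

function finishField(field: HTMLElement): void {
  if (alreadyRefined.has(field)) return; // second trigger becomes a no-op
  alreadyRefined.add(field);
  const buffer = bufferFor(field); // per-field DictationBuffer
  void buffer.refine(transcribeFull, (text) => replaceInterimText(field, text));
}

// Field-change trigger: focus left the current form element.
document.addEventListener("focusout", (e) => {
  const el = e.target;
  if (el instanceof HTMLElement && isDictationTarget(el)) finishField(el);
});

// Endpoint trigger: reuse the end-of-turn score from chat mode's machine.
onEndpointScore((score, field) => {
  if (score >= END_OF_TURN_THRESHOLD) finishField(field);
});
```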
Goals
- Preserve the immediacy of current live dictation.
- Improve accuracy and punctuation through full-context re-analysis.
- Make completion automatic—no button presses or manual confirmation.
- Integrate cleanly with the existing Universal Dictation state machine and reuse the proven endpoint logic from chat mode.
User Experience
- Users speak naturally into any text field.
- Interim words appear instantly as before.
- When dictation ends (by focus change or endpoint signal), the system may briefly show a subtle “Refining…” cue (optional).
- The text then updates to the final, more accurate version.
- All audio is discarded immediately after refinement for privacy.
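For the final update step, one possible body for the `replaceInterimText` helper declared above. It is simplified to replace the field's whole value; a real implementation would splice in only the dictated span:

```typescript
// Swap interim text for the refined transcript and fire an input event so
// page scripts listening for changes see the update. Simplified: replaces
// the entire value rather than just the dictated span.
function replaceInterimText(field: HTMLElement, finalText: string): void {
  if (
    field instanceof HTMLInputElement ||
    field instanceof HTMLTextAreaElement
  ) {
    field.value = finalText;
  } else if (field.isContentEditable) {
    field.textContent = finalText;
  }
  field.dispatchEvent(new Event("input", { bubbles: true }));
}
```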
Benefits
- Significantly improved transcription fidelity and consistency.
- Seamless correction with minimal visual disruption.
- Maintains hands-free flow across multiple fields.
- Brings Say Pi's dictation quality in line with dedicated professional transcription tools while remaining lightweight and browser-native.
Privacy
- Audio buffers exist only in memory for the duration of the dictation session.
- Automatically cleared after final refinement.
- No persistent storage or reuse of voice data.
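A short sketch of the defensive teardown implied here, reusing the `discard()` method from the pipeline sketch above; the `activeBuffers` registry is hypothetical:

```typescript
// Drop any buffered audio when the page is hidden or unloaded, so memory is
// cleared even if a session ends without a refinement pass.
const activeBuffers = new Set<DictationBuffer>(); // hypothetical registry

window.addEventListener("pagehide", () => {
  for (const buffer of activeBuffers) buffer.discard();
  activeBuffers.clear();
});
```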