
Conversation

BrandonWeng
Contributor

Closes #120

Adds token-level timestamps to StreamingTranscriptionUpdate so diarization can operate with token-level precision.

Changes

  • Add tokenTimings property to StreamingTranscriptionUpdate struct
  • Update StreamingAsrManager to pass tokenTimings from ASRResult
  • Change finish() method to return ASRResult instead of String
  • Update CLI commands to handle new finish() return type
  • Add test for StreamingTranscriptionUpdate with tokenTimings

This enables token-level precision for diarization by exposing individual token timestamps from the ASR model.
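A rough sketch of the shape of the change from a caller's perspective; the field names besides tokenTimings and the TokenTiming layout are simplified stand-ins, not the actual FluidAudio definitions.

```swift
import Foundation

// Simplified stand-ins for illustration only; the real FluidAudio types
// carry more fields than shown here.
struct TokenTiming {
    let token: String
    let startTime: TimeInterval
    let endTime: TimeInterval
    let confidence: Float
}

struct StreamingTranscriptionUpdate {
    let text: String
    let isConfirmed: Bool
    let timestamp: Date
    // New in this PR: per-token timings from the ASR model, nil when unavailable.
    let tokenTimings: [TokenTiming]?
}
```

Because finish() now returns the full ASRResult rather than a String, callers read result.text for the transcript and result.tokenTimings for the same per-token data.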

Generated with Claude Code


Co-authored-by: Brandon Weng <[email protected]>
BrandonWeng added the enhancement (New feature or request) and speech-to-text (issues related to transcription/asr) labels on Oct 2, 2025

claude bot commented Oct 2, 2025

Claude finished @BrandonWeng's task.


PR Review Complete ✅

Reviewed PR #132: Add token-level timestamps to StreamingTranscriptionUpdate

Todo List:

  • Start pending review
  • Get PR diff information
  • Analyze code changes for issues
  • Add inline review comments
  • Submit comprehensive review

Review Summary

This PR successfully adds token-level timestamps to enable diarization precision. The implementation is solid with good backward compatibility and proper error handling.

Key Findings:

  • ✅ Maintains backward compatibility with optional tokenTimings
  • ✅ Proper API design - finish() now returns full ASRResult
  • ✅ Good test coverage for new functionality
  • 📝 Minor suggestion: Use more appropriate fallback confidence value (currently 1.0)
  • 📝 Consider extracting fallback ASRResult creation to helper method

Overall Assessment: Clean implementation that adds the requested functionality without breaking existing code. Recommend approval with minor suggestions addressed.

Full review submitted as non-blocking comments.


claude bot left a comment


Code Review Summary

This PR successfully adds token-level timestamps to StreamingTranscriptionUpdate for diarization precision. The implementation is well-structured and maintains backward compatibility. Here are my detailed findings:

✅ Strengths

  1. Backward Compatibility: Optional tokenTimings parameter ensures existing code continues to work
  2. Consistent API Design: The finish() method now returns the full ASRResult instead of just text, providing access to all timing information
  3. Proper Error Handling: Fallback logic when tokens aren't available
  4. Good Test Coverage: New test specifically validates token timing functionality

📝 Code Quality Issues

Sources/FluidAudio/ASR/Streaming/StreamingAsrManager.swift:179

  • Nit: Using confidence: 1.0 in the fallback case implies perfect confidence when we have no confidence information. Consider using 0.0 or a dedicated constant like UNKNOWN_CONFIDENCE.

Sources/FluidAudio/ASR/Streaming/StreamingAsrManager.swift:175-184

  • Nit: The fallback ASRResult creation could be extracted to a helper method for better readability and reusability (sketched below).
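Both nits sketched together. This is a standalone illustration with a placeholder result type, not the actual FluidAudio ASRResult, and the names unknownConfidence and makeFallbackASRResult are hypothetical.

```swift
import Foundation

// Placeholder result type for this sketch; the real ASRResult has more fields.
struct ASRResult {
    let text: String
    let confidence: Float
    let duration: TimeInterval
}

// A named constant makes the "no confidence data" case explicit instead of a
// hardcoded 1.0, which reads as perfect confidence.
let unknownConfidence: Float = 0.0

// Centralizing fallback construction keeps the streaming path readable and
// gives the fallback behaviour one place to change.
func makeFallbackASRResult(text: String, duration: TimeInterval) -> ASRResult {
    ASRResult(text: text, confidence: unknownConfidence, duration: duration)
}
```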

🔧 API Design Considerations

Sources/FluidAudio/ASR/Streaming/StreamingAsrManager.swift:149

  • Good: The breaking change to finish() return type is justified since it provides more complete information for diarization use cases.

Sources/FluidAudio/ASR/Streaming/StreamingAsrManager.swift:586

  • Good: Making tokenTimings optional with a default nil value maintains compatibility while enabling the new functionality.

🧪 Testing

Tests/FluidAudioTests/StreamingAsrManagerTests.swift:161-183

  • Good: Comprehensive test for the new token timing functionality
  • Suggestion: Consider adding a test that verifies the finish() method returns token timings when available (a rough sketch follows)
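A rough sketch of such a test. It assumes finish() is async-throwing and that ASRResult exposes an optional tokenTimings array; makeConfiguredManager() and streamShortAudioFixture(into:) are hypothetical helpers standing in for whatever setup the existing tests use.

```swift
func testFinishIncludesTokenTimings() async throws {
    // Hypothetical helpers; mirror the existing StreamingAsrManagerTests setup.
    let manager = try await makeConfiguredManager()
    try await streamShortAudioFixture(into: manager)

    let result = try await manager.finish()
    XCTAssertNotNil(result.tokenTimings, "finish() should carry token timings when the model emits them")
    XCTAssertFalse(result.tokenTimings?.isEmpty ?? true)
}
```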

📋 CLI Updates

Sources/FluidAudioCLI/Commands/ASR/TranscribeCommand.swift:333-334
Sources/FluidAudioCLI/Commands/MultiStreamCommand.swift:165-168

  • Good: Proper adaptation to the new finish() return type while maintaining existing behavior (illustrated below)
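For illustration, this is roughly what the adaptation looks like on the consuming side; the types below are placeholders, not the real FluidAudio definitions.

```swift
import Foundation

// Placeholder shapes for illustration; the real FluidAudio types differ.
struct TokenTiming { let token: String; let startTime: TimeInterval; let endTime: TimeInterval }
struct ASRResult { let text: String; let tokenTimings: [TokenTiming]? }

// Previously the CLI printed the String that finish() returned; with finish()
// returning an ASRResult, the transcript comes from result.text, and the token
// timings are available for downstream diarization if the model produced them.
func printFinalResult(_ result: ASRResult) {
    print("Transcript: \(result.text)")
    if let timings = result.tokenTimings {
        for timing in timings {
            print("\(timing.token): \(timing.startTime)s - \(timing.endTime)s")
        }
    }
}
```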

Overall Assessment

This is a clean implementation that adds the requested functionality without breaking existing code. The changes are minimal, focused, and well-tested. The API design decision to return the full ASRResult from finish() is good as it provides access to all available information including token timings.

Recommendation: Approve with minor suggestions addressed.


github-actions bot commented Oct 2, 2025

VAD Benchmark Results

Performance Comparison

| Dataset | Accuracy | Precision | Recall | F1-Score | RTFx | Files |
| --- | --- | --- | --- | --- | --- | --- |
| MUSAN | 92.0% | 86.2% | 100.0% | 92.6% | 391.5x faster | 50 |
| VOiCES | 92.0% | 86.2% | 100.0% | 92.6% | 380.2x faster | 50 |

Dataset Details

  • MUSAN: Music, Speech, and Noise dataset - standard VAD evaluation
  • VOiCES: Voices Obscured in Complex Environmental Settings - tests robustness in real-world conditions

✅: Average F1-Score above 70%


github-actions bot commented Oct 2, 2025

Speaker Diarization Benchmark Results

Speaker Diarization Performance

Evaluating "who spoke when" detection accuracy

| Metric | Value | Target | Status | Description |
| --- | --- | --- | --- | --- |
| DER | 15.1% | <30% | ✅ | Diarization Error Rate (lower is better) |
| JER | 24.9% | <25% | ✅ | Jaccard Error Rate |
| RTFx | 17.19x | >1.0x | ✅ | Real-Time Factor (higher is faster) |

Diarization Pipeline Timing Breakdown

Time spent in each stage of speaker diarization

| Stage | Time (s) | % | Description |
| --- | --- | --- | --- |
| Model Download | 9.637 | 15.8 | Fetching diarization models |
| Model Compile | 4.130 | 6.8 | CoreML compilation |
| Audio Load | 0.068 | 0.1 | Loading audio file |
| Segmentation | 18.296 | 30.0 | Detecting speech regions |
| Embedding | 30.493 | 49.9 | Extracting speaker voices |
| Clustering | 12.197 | 20.0 | Grouping same speakers |
| Total | 61.048 | 100 | Full pipeline |

Speaker Diarization Research Comparison

Research baselines typically achieve 18-30% DER on standard datasets

| Method | DER | Notes |
| --- | --- | --- |
| FluidAudio | 15.1% | On-device CoreML |
| Research baseline | 18-30% | Standard dataset performance |

Note: RTFx shown above is from GitHub Actions runner. On Apple Silicon with ANE:

  • M2 MacBook Air (2022): Runs at 150 RTFx real-time
  • Performance scales with Apple Neural Engine capabilities

🎯 Speaker Diarization Test • AMI Corpus ES2004a • 1049.0s meeting audio • 61.0s diarization time • Test runtime: 1m 43s • 10/02/2025, 09:24 PM EST


chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR.

Comment on lines 172 to 173
audioSamples: [], // Not needed for final text conversion
processingTime: 0


P1: Final result never includes token timings

The new finish() API advertises “The complete ASR result with token timings”, but the implementation still discards all timing information. The manager only accumulates token IDs and, when building the final ASRResult, calls processTranscriptionResult with empty timestamps and confidences, so ASRResult.tokenTimings is always empty or nil even when the streaming updates contained per-token timings. Any consumer relying on finish() to provide token-level timestamps for diarization will never receive them. Consider retaining the token timestamps/confidences as they stream and pass them into processTranscriptionResult (or aggregate the interim token timings) before returning the final result.
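A sketch of the direction this comment suggests, not the actual StreamingAsrManager code: retain per-token timestamps and confidences as chunks stream in, then feed them to the final conversion instead of empty arrays. The type and method names below are illustrative.

```swift
// Hypothetical accumulator, not the real StreamingAsrManager internals.
final class TokenMetadataAccumulator {
    private(set) var tokens: [Int] = []
    private(set) var timestamps: [Int] = []
    private(set) var confidences: [Float] = []

    // Call once per decoded chunk so the final conversion can build token
    // timings from real data instead of empty arrays.
    func append(tokens newTokens: [Int], timestamps newTimestamps: [Int], confidences newConfidences: [Float]) {
        tokens.append(contentsOf: newTokens)
        timestamps.append(contentsOf: newTimestamps)
        confidences.append(contentsOf: newConfidences)
    }
}
```

finish() would then hand these arrays to processTranscriptionResult (or whatever builds the final ASRResult) so that ASRResult.tokenTimings reflects what was actually streamed.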



github-actions bot commented Oct 2, 2025

ASR Benchmark Results ✅

Status: All benchmarks passed

Parakeet v3 (multilingual)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.57% | 0.00% | 2.70x | ✅ |
| test-other | 1.59% | 0.00% | 1.41x | ✅ |

Parakeet v2 (English-optimized)

| Dataset | WER Avg | WER Med | RTFx | Status |
| --- | --- | --- | --- | --- |
| test-clean | 0.40% | 0.00% | 1.90x | ✅ |
| test-other | 1.22% | 0.00% | 1.46x | ✅ |

Streaming (v3)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.23x | Streaming real-time factor |
| Avg Chunk Time | 3.742s | Average time to process each chunk |
| Max Chunk Time | 5.235s | Maximum chunk processing time |
| First Token | 4.650s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming (v2)

| Metric | Value | Description |
| --- | --- | --- |
| WER | 0.00% | Word Error Rate in streaming mode |
| RTFx | 0.21x | Streaming real-time factor |
| Avg Chunk Time | 4.356s | Average time to process each chunk |
| Max Chunk Time | 5.917s | Maximum chunk processing time |
| First Token | 4.035s | Latency to first transcription token |
| Total Chunks | 31 | Number of chunks processed |

Streaming tests use 5 files with 0.5s chunks to simulate real-time audio streaming

25 files per dataset • Test runtime: 11m18s • 10/02/2025, 09:35 PM EST

RTFx = Real-Time Factor (higher is better) • Calculated as: Total audio duration ÷ Total processing time
Processing time includes: Model inference on Apple Neural Engine, audio preprocessing, state resets between files, token-to-text conversion, and file I/O
Example: RTFx of 2.0x means 10 seconds of audio processed in 5 seconds (2x faster than real-time)

Expected RTFx Performance on Physical M1 Hardware:

• M1 Mac: ~28x (clean), ~25x (other)
• CI shows ~0.5-3x due to virtualization limitations

Testing methodology follows HuggingFace Open ASR Leaderboard

@BrandonWeng
Contributor Author

@codex take another pass


chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.


Comment on lines 315 to 378
)

// Update state
accumulatedTokens.append(contentsOf: tokens)
accumulateTokenMetadata(tokens: tokens, timestamps: timestamps, confidences: confidences)


P1: Offset streaming timestamps before collecting token metadata

The final tokenTimings returned by finish() will not represent real positions in the audio stream. Each call to transcribeStreamingChunk yields frame indices that are relative to the current sliding window (the call is made with contextFrameAdjustment: 0 and no global offset), yet those indices are appended verbatim via accumulateTokenMetadata. Because every window includes left context, tokens emitted after the first chunk retain timestamps near the beginning of their window rather than the absolute time in the session. When the accumulated timestamps are converted to TokenTimings, later tokens appear out of order and their start times repeat, which defeats the diarization precision this change is meant to provide. The timestamps should be shifted by the window’s absolute position (e.g. using nextWindowCenterStart or lastProcessedFrame) before being stored.
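A minimal sketch of the suggested shift, assuming the manager knows the absolute start frame of the current window (e.g. via something like nextWindowCenterStart or lastProcessedFrame); the names below are illustrative, not the real implementation.

```swift
// Shift window-relative frame indices to absolute positions in the session
// before accumulating them, so tokens from later windows don't reuse
// timestamps from the start of their window.
func absoluteTimestamps(windowRelative timestamps: [Int], windowStartFrame: Int) -> [Int] {
    timestamps.map { $0 + windowStartFrame }
}

// At the accumulation point (sketch):
//   let shifted = absoluteTimestamps(windowRelative: timestamps,
//                                    windowStartFrame: currentWindowStartFrame)
//   accumulateTokenMetadata(tokens: tokens, timestamps: shifted, confidences: confidences)
```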


BrandonWeng requested a review from Alex-Wengg on October 3, 2025
FluidInference deleted a comment from claude bot on Oct 3, 2025
let duration = TimeInterval(sampleCount) / TimeInterval(config.asrConfig.sampleRate)
return ASRResult(
text: finalTranscriptText(),
confidence: 1.0,
Contributor


Why is this hardcoded to 1.0?

maxTimestampFrame > 0
? (maxTimestampFrame + 1) * ASRConstants.samplesPerEncoderFrame : 0
let sampleCount = max(totalSamplesProcessed, derivedSampleCount)
let processingTime = sampleCount > 0 ? max(elapsedTime, minimumProcessingTime) : elapsedTime
Contributor


What's the difference between these two variables, exactly?

logger.warning(
"Final token \(label) count (\(values.count)) does not match token count (\(expectedCount)); \(omissionDetail)"
)
return []
Contributor


Why are we returning [] for a mismatched count?


#if DEBUG
extension StreamingAsrManager {
internal func setAsrManagerForTesting(_ manager: AsrManager?) {
Contributor


Should this be async?

@Alex-Wengg
Contributor

comments done

@BrandonWeng
Contributor Author

Not happy with the PR, going to rewrite

BrandonWeng closed this on Oct 8, 2025