
FluidAudio - Speaker diarization, voice activity detection, and transcription with CoreML


Fluid Audio is a Swift SDK for fully local, low-latency audio AI on Apple devices, with inference offloaded to the Apple Neural Engine (ANE), resulting in lower memory usage and generally faster inference.

The SDK includes state-of-the-art speaker diarization, transcription, and voice activity detection via open-source models (MIT/Apache 2.0) that can be integrated with just a few lines of code. Models are optimized for background processing, ambient computing, and always-on workloads by running inference on the ANE, minimizing CPU usage and avoiding GPU/MPS entirely.

For custom use cases, feedback, additional model support, or platform requests, join our Discord. We’re also bringing visual, language, and TTS models to device and will share updates there.

Below are some featured local AI apps using Fluid Audio models on macOS and iOS:

Voice Ink, Spokenly, and Slipbox

Want to convert your own model? Check out möbius.

Highlights

  • Automatic Speech Recognition (ASR): Parakeet TDT v3 (0.6b) for transcription; supports 25 European languages
  • Speaker Diarization: Speaker separation with speaker clustering via Pyannote models
  • Speaker Embedding Extraction: Generate speaker embeddings for voice comparison and clustering; useful for speaker identification
  • Voice Activity Detection (VAD): Detect speech activity with Silero models
  • Real-time Processing: Designed for near real-time workloads but also works for offline processing
  • Apple Neural Engine: Models run efficiently on Apple's ANE for maximum performance with minimal power consumption
  • Open-Source Models: All models are publicly available on HuggingFace — converted and optimized by our team; permissive licenses

Installation

Add FluidAudio to your project using Swift Package Manager:

dependencies: [
    .package(url: "https://github.com/FluidInference/FluidAudio.git", from: "0.6.1"),
],

CocoaPods: We recommend using cocoapods-spm for better SPM integration, but if needed, you can also use our podspec: pod 'FluidAudio', '~> 0.6.1'

Important: When adding FluidAudio as a package dependency, only add the library to your target (not the executable). Select FluidAudio library in the package products dialog and add it to your app target.
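In Package.swift, depend on the FluidAudio library product from your target. A minimal sketch; the target name MyApp is a placeholder:

targets: [
    .executableTarget(
        name: "MyApp",
        dependencies: [
            // Library product from the FluidAudio package
            .product(name: "FluidAudio", package: "FluidAudio")
        ]
    )
]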

Documentation

See DeepWiki for auto-generated docs for this repo.

Documentation Index

MCP Server

The repo is indexed by the DeepWiki MCP server, so your coding tool can access the docs:

{
  "mcpServers": {
    "deepwiki": {
      "url": "https://mcp.deepwiki.com/mcp"
    }
  }
}

For Claude Code:

claude mcp add -s user -t http deepwiki https://mcp.deepwiki.com/mcp

Automatic Speech Recognition (ASR) / Transcription

  • Models:
    • FluidInference/parakeet-tdt-0.6b-v3-coreml (multilingual, 25 European languages)
    • FluidInference/parakeet-tdt-0.6b-v2-coreml (English-only, highest recall)
  • Processing Mode: Batch transcription for complete audio files
  • Real-time Factor: ~190x on M4 Pro (processes 1 hour of audio in ~19 seconds)
  • Streaming Support: Coming soon — batch processing is recommended for production use
  • Backend: The same Parakeet TDT v3 model powers our backend ASR

ASR Quick Start

import FluidAudio

// Batch transcription from an audio file
Task {
    // 1) Initialize the ASR manager and load models
    let models = try await AsrModels.downloadAndLoad(version: .v3)  // Switch to .v2 for English-only work
    let asrManager = AsrManager(config: .default)
    try await asrManager.initialize(models: models)

    // 2) Transcribe 16 kHz mono samples (already converted; see the sketch below)
    let result = try await asrManager.transcribe(samples)

    // Alternatively, transcribe a file:
    // let url = URL(fileURLWithPath: sample.audioPath)

    // Or transcribe an AVAudioPCMBuffer:
    // let result = try await asrManager.transcribe(audioBuffer)
    print("Transcription: \(result.text)")
}
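If your samples are not already 16 kHz mono, the AudioConverter used in the VAD quick start below can produce them. A minimal sketch, assuming resampleAudioFile returns the Float samples that transcribe expects:

import FluidAudio

// Resample an audio file to the 16 kHz mono samples the ASR API expects
let url = URL(fileURLWithPath: "path/to/audio.wav")
let samples = try AudioConverter().resampleAudioFile(url)

The same transcription is available from the CLI: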
# Transcribe an audio file (batch)
swift run fluidaudio transcribe audio.wav

# English-only run with higher recall
swift run fluidaudio transcribe audio.wav --model-version v2

Speaker Diarization

AMI Benchmark Results (Single Distant Microphone) using a subset of the files:

  • DER: 17.7% — Competitive with Powerset BCE 2023 (18.5%)
  • JER: 28.0% — Close to EEND 2019 (25.3%) and ahead of x-vector clustering (28.7%)
  • RTF: 0.02x — Real-time processing with 50x speedup
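For reference, DER follows the standard definition (not specific to this benchmark), summing the three error types over the reference speech duration:

DER = (missed speech + false alarm + speaker confusion) / total speech duration

An RTF of 0.02 means processing takes 2% of the audio's duration, which is the 50x speedup quoted above.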

Speaker Diarization Quick Start

import FluidAudio

// Diarize an audio file
Task {
    let models = try await DiarizerModels.downloadIfNeeded()
    let diarizer = DiarizerManager()  // Uses optimal defaults (0.7 threshold = 17.7% DER)
    diarizer.initialize(models: models)

    // Prepare 16 kHz mono samples (see: Audio Conversion)
    let samples = try await loadSamples16kMono(path: "path/to/meeting.wav")

    // Run diarization
    let result = try diarizer.performCompleteDiarization(samples)
    for segment in result.segments {
        print("Speaker \(segment.speakerId): \(segment.startTimeSeconds)s - \(segment.endTimeSeconds)s")
    }
}
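Segments are plain values, so downstream aggregation needs no extra API. A minimal sketch that totals talk time per speaker, using only the fields printed above:

// Sum speaking time per speaker from the diarization result
var talkTime: [String: Double] = [:]
for segment in result.segments {
    let duration = Double(segment.endTimeSeconds) - Double(segment.startTimeSeconds)
    talkTime["\(segment.speakerId)", default: 0] += duration
}
for (speaker, seconds) in talkTime.sorted(by: { $0.value > $1.value }) {
    print(String(format: "Speaker %@ spoke for %.1fs", speaker, seconds))
}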

For streaming diarization, see Documentation/SpeakerDiarization.md.

To benchmark a single AMI file:

swift run fluidaudio diarization-benchmark --single-file ES2004a \
  --chunk-seconds 3 --overlap-seconds 2

CLI

# Process an individual file and save JSON
swift run fluidaudio process meeting.wav --output results.json --threshold 0.6

Voice Activity Detection (VAD)

Silero VAD powers our on-device detector. The latest release surfaces the same timestamp extraction and streaming heuristics as the upstream PyTorch implementation. Ping us on Discord if you need help tuning it for your environment.

VAD Quick Start (Offline Segmentation)

A single call returns chunk-level probabilities at a 256 ms hop (the manager setup is shown in the next example):

let results = try await manager.process(samples)
for (index, chunk) in results.enumerated() {
    print(
        String(
            format: "Chunk %02d: prob=%.3f, inference=%.4fs",
            index,
            chunk.probability,
            chunk.processingTime
        )
    )
}

The following higher-level APIs are better suited for integrating with other systems:

import FluidAudio

Task {
    let manager = try await VadManager(
        config: VadConfig(threshold: 0.75)
    )

    let audioURL = URL(fileURLWithPath: "path/to/audio.wav")
    let samples = try AudioConverter().resampleAudioFile(audioURL)

    var segmentation = VadSegmentationConfig.default
    segmentation.minSpeechDuration = 0.25
    segmentation.minSilenceDuration = 0.4

    let segments = try await manager.segmentSpeech(samples, config: segmentation)
    for segment in segments {
        print(
            String(format: "Speech %.2f–%.2fs", segment.startTime, segment.endTime)
        )
    }
}
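VAD segments pair naturally with ASR: slice the sample buffer at the segment boundaries and transcribe only speech. A sketch under the assumption of 16 kHz samples and an asrManager from the ASR quick start, run inside the same Task:

// Transcribe only the detected speech regions (16 kHz samples assumed)
for segment in segments {
    let start = max(0, Int(Double(segment.startTime) * 16_000))
    let end = min(samples.count, Int(Double(segment.endTime) * 16_000))
    guard start < end else { continue }
    let result = try await asrManager.transcribe(Array(samples[start..<end]))
    print(String(format: "[%.2f-%.2fs] %@", Double(segment.startTime), Double(segment.endTime), result.text))
}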

Streaming

import FluidAudio

Task {
    let manager = try await VadManager()
    var state = await manager.makeStreamState()

    for chunk in microphoneChunks {
        let result = try await manager.processStreamingChunk(
            chunk,
            state: state,
            config: .default,
            returnSeconds: true,
            timeResolution: 2
        )

        state = result.state

        // Access raw probability (0.0-1.0) for custom logic
        print(String(format: "Probability: %.3f", result.probability))

        if let event = result.event {
            let label = event.kind == .speechStart ? "Start" : "End"
            print("\(label) @ \(event.time ?? 0)s")
        }
    }
}
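Each call consumes one fixed-size chunk. A sketch that produces microphoneChunks by splitting a 16 kHz sample buffer at a 256 ms hop (4096 samples); the exact chunk size expected by processStreamingChunk is an assumption here, so verify it against the docs:

// Split 16 kHz mono samples into 256 ms chunks (4096 samples each).
// NOTE: chunk size is inferred from the 256 ms hop mentioned above, not
// confirmed against VadManager's required input size.
let chunkSize = 4096
var microphoneChunks: [[Float]] = []
var index = 0
while index < samples.count {
    let end = min(index + chunkSize, samples.count)
    microphoneChunks.append(Array(samples[index..<end]))
    index = end
}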

CLI

Start with the general-purpose process command, which runs the diarization pipeline (and therefore VAD) end-to-end on a single file:

swift run fluidaudio process path/to/audio.wav

Once you need to experiment with VAD-specific knobs directly, reach for:

# Inspect offline segments (default mode)
swift run fluidaudio vad-analyze path/to/audio.wav

# Streaming simulation only (timestamps printed in seconds by default)
swift run fluidaudio vad-analyze path/to/audio.wav --streaming

# Benchmark accuracy/precision trade-offs
swift run fluidaudio vad-benchmark --num-files 50 --threshold 0.3

swift run fluidaudio vad-analyze --help lists every tuning option, including negative-threshold overrides, max-speech splitting, padding, and chunk size. Offline mode also reports RTFx using the model's per-chunk processing time.

Text‑To‑Speech (TTS)

⚠️ Beta: The TTS system is currently in beta and only supports American English. Additional language support is planned for future releases.

  • Model: Kokoro (CoreML unified model)
  • Language: American English (beta)
  • G2P: Dictionary first, then eSpeak NG (CEspeakNG) for OOV words
  • Output: 24 kHz mono WAV

Requirements (macOS): Ensure eSpeak NG headers/libs are available via pkg-config (espeak-ng); see https://github.com/espeak-ng/espeak-ng/tree/master
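On macOS with Homebrew, the following typically provides the headers and the pkg-config entry (package names are Homebrew's, not part of this SDK):

# Install eSpeak NG and pkg-config via Homebrew
brew install espeak-ng pkg-config

# Verify the espeak-ng .pc file is discoverable
pkg-config --cflags --libs espeak-ng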

Quick Start (CLI)

# First run will download the Kokoro model and vocab
swift run fluidaudio tts "Hello from FluidAudio." --auto-download --output out.wav

# Another example with punctuation and OOV handling
swift run fluidaudio tts "Edge-cases: URLs like https://example.com and e-mail [email protected]." --output out2.wav

Notes

  • The TTS pipeline uses a word→phoneme dictionary first; unknown words are phonemized with eSpeak NG (C API) and mapped to the model’s token set.
  • OOV words are printed with their IPA and mapped tokens for visibility during synthesis.
  • We do not prepend any “language token” to avoid leading vowel artifacts.

Quick Start (Code)

import FluidAudio

Task {
  do {
    let data = try await KokoroModel.synthesize(text: "Hello from FluidAudio.")
    try data.write(to: URL(fileURLWithPath: "out.wav"))
  } catch {
    print("TTS error: \(error)")
  }
}

Troubleshooting: The build requires eSpeak NG headers/libs for the C API, discoverable via pkg-config (espeak-ng).

  • If SwiftPM cannot find headers, build with explicit paths:
    • swift build -Xcc -I/opt/homebrew/include -Xlinker -L/opt/homebrew/lib
  • Dictionary and model assets are cached under ~/.cache/fluidaudio/Models/kokoro.

Showcase

Make a PR if you want to add your app!

  • Voice Ink: Local AI for instant, private transcription with near-perfect accuracy. Uses Parakeet ASR.
  • Spokenly: Mac dictation app for fast, accurate voice-to-text; supports real-time dictation and file transcription. Uses Parakeet ASR and speaker diarization.
  • Senko: A very fast and accurate speaker diarization pipeline; a good example of how to integrate FluidAudio into a Python app.
  • Slipbox: Privacy-first meeting assistant for real-time conversation intelligence. Uses Parakeet ASR (iOS) and speaker diarization across platforms.
  • Whisper Mate: Transcribes movies and audio locally; records and transcribes in real time from speakers or system apps. Uses speaker diarization.
  • Altic/Fluid: Voice-to-text dictation app for macOS with AI enhancement.
  • Paraspeech: AI-powered voice-to-text. Fully offline. No subscriptions.
  • mac-whisper-speedtest: Comparison of different local ASR engines, including one of the first versions of FluidAudio's ASR models.

Everything Else

FAQs

  • CLI is available on macOS only. For iOS, use the library programmatically.
  • Models auto-download on first use. If your network restricts Hugging Face access, set an HTTPS proxy: export https_proxy=http://127.0.0.1:7890.
  • Windows alternative in development: fluid-server
  • If you're looking to capture system audio on a Mac, take a look at AudioCap for reference.

License

Apache 2.0 — see LICENSE for details.

Acknowledgments

This project builds upon the excellent work of the sherpa-onnx project for speaker diarization algorithms and techniques.

Pyannote: https://github.com/pyannote/pyannote-audio

WeSpeaker: https://github.com/wenet-e2e/wespeaker

Parakeet-mlx: https://github.com/senstella/parakeet-mlx

silero-vad: https://github.com/snakers4/silero-vad

Kokoro-82M: https://huggingface.co/hexgrad/Kokoro-82M

Citation

If you use FluidAudio in your work, please cite:

FluidInference Team. (2024). FluidAudio: Local Speaker Diarization, ASR, and VAD for Apple Platforms (Version 0.5.1) [Computer software]. GitHub. https://github.com/FluidInference/FluidAudio

@software{FluidInferenceTeam_FluidAudio_2024,
  author = {{FluidInference Team}},
  title = {{FluidAudio: Local Speaker Diarization, ASR, and VAD for Apple Platforms}},
  year = {2024},
  month = {12},
  version = {0.5.1},
  url = {https://github.com/FluidInference/FluidAudio},
  note = {Computer software}
}