The best audio processing library built on Apple's MLX framework, providing fast and efficient text-to-speech (TTS), speech-to-text (STT), and speech-to-speech (STS) on Apple Silicon.
- Fast inference optimized for Apple Silicon (M series chips)
- Multiple model architectures for TTS, STT, and STS
- Multilingual support across models
- Voice customization and cloning capabilities
- Adjustable speech speed control
- Interactive web interface with 3D audio visualization
- OpenAI-compatible REST API
- Quantization support (3-bit, 4-bit, 6-bit, 8-bit, and more) for optimized performance
- Swift package for iOS/macOS integration
pip install mlx-audioFor development or web interface:
git clone https://github.com/Blaizzy/mlx-audio.git
cd mlx-audio
pip install -e ".[dev]"# Basic TTS generation
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello, world!"
# With voice selection and speed adjustment
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --voice af_heart --speed 1.2
# Play audio immediately
mlx_audio.tts.generate --model mlx-community/Kokoro-82M-bf16 --text "Hello!" --playfrom mlx_audio.tts.utils import load_model
# Load model
model = load_model("mlx-community/Kokoro-82M-bf16")
# Generate speech
for result in model.generate("Hello from MLX-Audio!", voice="af_heart"):
print(f"Generated {result.audio.shape[0]} samples")
# result.audio contains the waveform as mx.array| Model | Description | Languages | Repo |
|---|---|---|---|
| Kokoro | Fast, high-quality multilingual TTS | EN, JA, ZH, FR, ES, IT, PT, HI | mlx-community/Kokoro-82M-bf16 |
| CSM | Conversational Speech Model with voice cloning | EN | mlx-community/csm-1b |
| Dia | Dialogue-focused TTS | EN | mlx-community/Dia-1.6B-bf16 |
| OuteTTS | Efficient TTS model | EN | mlx-community/OuteTTS-0.2-500M |
| Spark | SparkTTS model | EN, ZH | mlx-community/SparkTTS-0.5B-bf16 |
| Chatterbox | Expressive multilingual TTS | EN, ES, FR, DE, IT, PT, PL, TR, RU, NL, CS, AR, ZH, JA, HU, KO | mlx-community/Chatterbox-bf16 |
| Soprano | High-quality TTS | EN | mlx-community/Soprano-bf16 |
| Model | Description | Languages | Repo |
|---|---|---|---|
| Whisper | OpenAI's robust STT model | 99+ languages | mlx-community/whisper-large-v3-mlx |
| Parakeet | NVIDIA's accurate STT | EN | mlx-community/parakeet-tdt-0.6b-v2 |
| Voxtral | Mistral's speech model | Multiple | mlx-community/Voxtral-Mini-3B-2507-bf16 |
| Model | Description | Use Case | Repo |
|---|---|---|---|
| SAM-Audio | Text-guided source separation | Extract specific sounds | mlx-community/sam-audio-large |
| MossFormer2 SE | Speech enhancement | Noise removal | starkdmi/MossFormer2_SE_48K_MLX |
Kokoro is a fast, multilingual TTS model with 54 voice presets.
from mlx_audio.tts.utils import load_model
model = load_model("mlx-community/Kokoro-82M-bf16")
# Generate with different voices
for result in model.generate(
text="Welcome to MLX-Audio!",
voice="af_heart", # American female
speed=1.0,
lang_code="a" # American English
):
audio = result.audioAvailable Voices:
- American English:
af_heart,af_bella,af_nova,af_sky,am_adam,am_echo, etc. - British English:
bf_alice,bf_emma,bm_daniel,bm_george, etc. - Japanese:
jf_alpha,jm_kumo, etc. - Chinese:
zf_xiaobei,zm_yunxi, etc.
Language Codes:
| Code | Language | Note |
|---|---|---|
a |
American English | Default |
b |
British English | |
j |
Japanese | Requires pip install misaki[ja] |
z |
Mandarin Chinese | Requires pip install misaki[zh] |
e |
Spanish | |
f |
French |
Clone any voice using a reference audio sample:
mlx_audio.tts.generate \
--model mlx-community/csm-1b \
--text "Hello from Sesame." \
--ref_audio ./reference_voice.wav \
--playfrom mlx_audio.stt.utils import load_model, transcribe
model = load_model("mlx-community/whisper-large-v3-mlx")
result = transcribe("audio.wav", model=model)
print(result["text"])Separate specific sounds from audio using text prompts:
from mlx_audio.sts import SAMAudio, SAMAudioProcessor, save_audio
model = SAMAudio.from_pretrained("mlx-community/sam-audio-large")
processor = SAMAudioProcessor.from_pretrained("mlx-community/sam-audio-large")
batch = processor(
descriptions=["A person speaking"],
audios=["mixed_audio.wav"],
)
result = model.separate_long(
batch.audios,
descriptions=batch.descriptions,
anchors=batch.anchor_ids,
chunk_seconds=10.0,
overlap_seconds=3.0,
ode_opt={"method": "midpoint", "step_size": 2/32},
)
save_audio(result.target[0], "voice.wav")
save_audio(result.residual[0], "background.wav")Remove noise from speech recordings:
from mlx_audio.sts import MossFormer2SEModel, save_audio
model = MossFormer2SEModel.from_pretrained("starkdmi/MossFormer2_SE_48K_MLX")
enhanced = model.enhance("noisy_speech.wav")
save_audio(enhanced, "clean.wav", 48000)MLX-Audio includes a modern web interface and OpenAI-compatible API.
# Start API server
mlx_audio.server --host 0.0.0.0 --port 8000
# Start web UI (in another terminal)
cd mlx_audio/ui
npm install && npm run devText-to-Speech (OpenAI-compatible):
curl -X POST http://localhost:8000/v1/audio/speech \
-H "Content-Type: application/json" \
-d '{"model": "mlx-community/Kokoro-82M-bf16", "input": "Hello!", "voice": "af_heart"}' \
--output speech.wavSpeech-to-Text:
curl -X POST http://localhost:8000/v1/audio/transcriptions \
-F "[email protected]" \
-F "model=mlx-community/whisper-large-v3-mlx"Reduce model size and improve performance with quantization using the convert script:
# Convert and quantize to 4-bit
python -m mlx_audio.convert \
--hf-path prince-canuma/Kokoro-82M \
--mlx-path ./Kokoro-82M-4bit \
--quantize \
--q-bits 4 \
--upload-repo username/Kokoro-82M-4bit (optional: if you want to upload the model to Hugging Face)
# Convert with specific dtype (bfloat16)
python -m mlx_audio.convert \
--hf-path prince-canuma/Kokoro-82M \
--mlx-path ./Kokoro-82M-bf16 \
--dtype bfloat16 \
--upload-repo username/Kokoro-82M-bf16 (optional: if you want to upload the model to Hugging Face)Options:
| Flag | Description |
|---|---|
--hf-path |
Source Hugging Face model or local path |
--mlx-path |
Output directory for converted model |
-q, --quantize |
Enable quantization |
--q-bits |
Bits per weight (4, 6, or 8) |
--q-group-size |
Group size for quantization (default: 64) |
--dtype |
Weight dtype: float16, bfloat16, float32 |
--upload-repo |
Upload converted model to HF Hub |
Pre-quantized models available:
- mlx-community/Kokoro-82M-4bit
- mlx-community/Kokoro-82M-6bit
- mlx-community/Kokoro-82M-8bit
- mlx-community/Kokoro-82M-bf16
Native Swift package for iOS and macOS apps.
Add to your Package.swift:
dependencies: [
.package(url: "https://github.com/Blaizzy/mlx-audio.git", from: "0.2.5")
]import MLXAudio
// Create session (auto-downloads model)
let session = try await MarvisSession(voice: .conversationalA)
// Generate speech (auto-plays)
let result = try await session.generate(for: "Hello from Swift!")
// Streaming generation
for try await chunk in session.stream(text: "Streaming audio...") {
print("Chunk: \(chunk.sampleCount) samples")
}Supported Platforms: macOS 14.0+, iOS 16.0+
- Python 3.9+
- Apple Silicon Mac (M1/M2/M3/M4)
- MLX framework
@misc{mlx-audio,
author = {Canuma, Prince},
title = {MLX Audio},
year = {2025},
howpublished = {\url{https://github.com/Blaizzy/mlx-audio}},
note = {Audio processing library for Apple Silicon with TTS, STT, and STS capabilities.}
}- Apple MLX Team for the MLX framework