Sample iOS app for stabilityai/stable-audio-open-small — text-to-music generation (497M params). Generates up to 11.9 seconds of stereo 44.1 kHz audio from text prompts using rectified flow diffusion.
4 CoreML models: T5 text encoder, NumberEmbedder (seconds conditioning), DiT (diffusion transformer), and VAE decoder (Oobleck).
| Model | Size | Input | Output |
|---|---|---|---|
| StableAudioT5Encoder.mlpackage.zip | 105 MB | input_ids [1, 64] | text_embeddings [1, 64, 768] |
| StableAudioNumberEmbedder.mlpackage.zip | 396 KB | normalized_seconds [1] | seconds_embedding [1, 768] |
| StableAudioDiT.mlpackage.zip | 326 MB (INT8) | latent [1,64,256] + timestep + conditioning | velocity [1,64,256] |
| StableAudioDiT_FP32.mlpackage.zip | 1.3 GB (FP32) | latent [1,64,256] + timestep + conditioning | velocity [1,64,256] |
| StableAudioVAEDecoder.mlpackage.zip | 149 MB | latent [1, 64, 256] | stereo audio [1, 2, 524288] @ 44.1 kHz |
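
The table shows the DiT mapping a `[1, 64, 256]` latent plus a timestep and conditioning to a velocity of the same shape. A minimal sketch of how such a velocity model is typically integrated is shown below. It assumes plain Euler steps from t = 1 (noise) to t = 0 and the convention x_t = (1 - t) * data + t * noise. The sampler in the sample app may differ, and `VelocityModel` and `eulerSample` are hypothetical names, not part of the app.

```swift
import Foundation

// Shapes from the model table: the DiT works on a [1, 64, 256] latent.
let latentChannels = 64
let latentFrames = 256
let latentCount = latentChannels * latentFrames

/// Hypothetical stand-in for one DiT call: (latent, timestep) -> velocity.
/// In an app this closure would wrap the Core ML prediction and capture the
/// text and seconds conditioning; here it is just a function on flat arrays.
typealias VelocityModel = (_ latent: [Float], _ timestep: Float) -> [Float]

/// Plain Euler integration of a velocity field from t = 1 down to t = 0.
/// Assumption: x_t = (1 - t) * data + t * noise, so the velocity points
/// from data towards noise and we step with a negative dt.
func eulerSample(noise: [Float], steps: Int, model: VelocityModel) -> [Float] {
    precondition(steps > 0, "steps must be positive")
    var x = noise
    let dt = -1.0 / Float(steps)          // negative step: t goes 1 -> 0
    for i in 0..<steps {
        let t = 1.0 + Float(i) * dt       // current time before this step
        let v = model(x, t)
        precondition(v.count == x.count, "velocity must match latent shape")
        for j in x.indices {
            x[j] += dt * v[j]
        }
    }
    return x
}
```

The resulting latent would then be passed to the VAE decoder, which expands the 256 latent frames to 524288 samples per channel (a factor of 2048, about 11.9 seconds at 44.1 kHz).
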
- Download the four `.mlpackage.zip` files (pick either `StableAudioDiT` or `StableAudioDiT_FP32`)
- Unzip and drag them into the Xcode project
- Build and run on a physical device (iOS 17+)
| Variant | Compute units | Quality | Speed | Memory |
|---|---|---|---|---|
| `StableAudioDiT` (INT8) | `.cpuAndGPU` | Slight quantization loss | Fastest | 326 MB |
| `StableAudioDiT_FP32` | `.cpuOnly` | Best | Slower | 1.3 GB |
FP16 weights overflow inside the DiT attention on iOS GPU, so the high-quality path requires FP32 compute on CPU. The INT8 GPU path is the practical default for most devices.
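
The pairing of variant and compute units above could be expressed roughly as follows. This is a sketch rather than the app's actual code: `DiTVariant` and `loadDiT` are hypothetical names, and the URL of the compiled model (Xcode normally compiles a bundled `.mlpackage` into an `.mlmodelc` resource) is left to the caller. The Core ML parts are guarded with `canImport` so the enum still compiles where Core ML is unavailable.

```swift
import Foundation
#if canImport(CoreML)
import CoreML
#endif

/// The two DiT variants described in the table (hypothetical type).
enum DiTVariant: String, CaseIterable {
    case int8 = "StableAudioDiT"
    case fp32 = "StableAudioDiT_FP32"
}

#if canImport(CoreML)
extension DiTVariant {
    /// Compute units recommended for each variant:
    /// INT8 runs on the GPU, FP32 stays on the CPU to avoid the FP16 overflow.
    var computeUnits: MLComputeUnits {
        switch self {
        case .int8: return .cpuAndGPU
        case .fp32: return .cpuOnly
        }
    }
}

/// Loads a compiled DiT model with the compute units matching its variant.
/// `url` is assumed to point at the compiled model inside the app bundle.
func loadDiT(_ variant: DiTVariant, compiledModelAt url: URL) throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = variant.computeUnits
    return try MLModel(contentsOf: url, configuration: config)
}
#endif
```

Keeping the mapping in one place makes it harder to load the FP32 model on the GPU by accident.
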
- DiT INT8 + `.cpuAndGPU`: fastest path, slight quality loss from quantization. Runs cleanly on the GPU because INT8 dequantizes to FP16/FP32 element-wise without the overflow issue.
- DiT FP32 + `.cpuOnly`: best quality, larger and slower. Required because the FP16 attention path overflows on the iOS GPU. CPU FP32 is the fallback that gives correct results.
- T5 INT8 may emit occasional NaN values, so the sample app sanitizes the embeddings (replaces NaN with 0) before passing them to the DiT.
- VAE Decoder FP16: `weight_norm` must be removed before tracing because of the Snake activation; see the `remove_weight_norm()` call in the conversion script.
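
The NaN sanitization mentioned for the T5 embeddings can be sketched as below. The function names are hypothetical and the sketch assumes the embeddings are available as `Float` values (for example copied out of the model output, or accessed through a `Float32` buffer); the sample app's own implementation may look different.

```swift
import Foundation

/// Returns a copy of `values` with every NaN replaced by 0.
/// Other values, including infinities, are left untouched.
func sanitized(_ values: [Float]) -> [Float] {
    values.map { $0.isNaN ? 0 : $0 }
}

/// In-place variant for a raw Float buffer, e.g. one backing a model output.
func sanitizeInPlace(_ buffer: UnsafeMutableBufferPointer<Float>) {
    for i in buffer.indices where buffer[i].isNaN {
        buffer[i] = 0
    }
}
```

Running this on the `[1, 64, 768]` text embeddings before the DiT call keeps a single NaN from spreading through attention into the whole latent.
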