diff --git a/examples/models/voxtral_realtime/README.md b/examples/models/voxtral_realtime/README.md index 4a311e6fd7c..45699133ee4 100644 --- a/examples/models/voxtral_realtime/README.md +++ b/examples/models/voxtral_realtime/README.md @@ -14,9 +14,10 @@ The pipeline has two stages: **export** (Python, once) and **inference** conversion. At inference time, the C++ runner loads both `.pte` files and the Tekken tokenizer, then transcribes audio to text. -Two modes are supported: **offline** (encode full audio, then decode) -and **streaming** (process 80ms chunks in real time, including live -microphone input). +Two modes are supported: **streaming** (process 80ms chunks in real time, +including live microphone input) and **offline** (encode full audio, then +decode). The examples below use streaming mode. Omit `--streaming` from +export and run commands for offline mode. ## Demo: streaming on Metal backend with microphone input @@ -37,24 +38,22 @@ https://github.com/user-attachments/assets/6d6089fc-5feb-458b-a60b-08379855976a ## Preprocessor Export a preprocessor `.pte` to convert raw audio into the format the -model expects. `--max_audio_len 300` supports audio up to 5 minutes -(300 seconds): +model expects: ```bash python -m executorch.extension.audio.mel_spectrogram \ --feature_size 128 \ - --max_audio_len 300 \ + --streaming \ --output_file ./voxtral_rt_exports/preprocessor.pte ``` -For streaming, use a separate preprocessor with `--streaming` (no audio -length limit): +For offline mode: ```bash python -m executorch.extension.audio.mel_spectrogram \ --feature_size 128 \ - --streaming \ - --output_file ./voxtral_streaming_exports/preprocessor.pte + --max_audio_len 300 \ + --output_file ./voxtral_rt_exports/preprocessor.pte ``` ## Export @@ -65,19 +64,7 @@ and token embedding. 
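For intuition, the 80ms streaming step mentioned above corresponds to a fixed number of raw samples at the 16kHz mono input rate the runner expects. A minimal sketch of that arithmetic (the chunker is illustrative only, not the actual preprocessor logic):

```python
# Illustrative only: how much raw audio one 80 ms streaming step covers
# at the 16 kHz sample rate the runner requires.
SAMPLE_RATE_HZ = 16_000   # runner input: 16 kHz mono
CHUNK_MS = 80             # streaming step size

samples_per_chunk = SAMPLE_RATE_HZ * CHUNK_MS // 1000  # 1280 samples

def iter_chunks(pcm):
    """Yield successive 80 ms chunks; the final chunk may be shorter."""
    for start in range(0, len(pcm), samples_per_chunk):
        yield pcm[start:start + samples_per_chunk]

one_second = [0.0] * SAMPLE_RATE_HZ
chunks = list(iter_chunks(one_second))
print(samples_per_chunk, len(chunks))  # 1280 13  (12 full chunks + one 640-sample tail)
```

So one second of audio yields 12 full chunks plus a partial tail, which is why streaming latency is bounded by the 80ms step rather than the clip length.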
> [!TIP] > Mistral has already published pre-exported `.pte` files for select backends, including macOS Metal, on their [HuggingFace Hub](https://huggingface.co/mistral-labs/Voxtral-Mini-4B-Realtime-2602-Executorch). - -```bash -python export_voxtral_rt.py \ - --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \ - --backend xnnpack \ - --output-dir ./voxtral_rt_exports \ - --qlinear-encoder 8da4w \ - --qlinear 8da4w \ - --qembedding 8w -``` - -For streaming, add `--streaming` to export the encoder for incremental -processing (80ms audio chunks): +### XNNPACK (default) ```bash python export_voxtral_rt.py \ @@ -90,65 +77,8 @@ python export_voxtral_rt.py \ --qembedding 8w ``` -### Backend support - -| Backend | Offline | Streaming | Quantization | -|---------|---------|-----------|--------------| -| `xnnpack` | ✓ | ✓ | `4w`, `8w`, `8da4w`, `8da8w` | -| `metal` | ✓ | ✓ | none (fp32) or `fpa4w` (Metal-specific 4-bit) | -| `cuda` | ✓ | ✓ | `4w`, `8w` | -| `cuda-windows` | ✓ | ✓ | `4w`, `8w` | - -Metal backend provides Apple GPU acceleration. CUDA backend provides NVIDIA GPU -acceleration via AOTInductor. 
- -#### CUDA export examples - -Offline with int4 quantization: - -```bash -python export_voxtral_rt.py \ - --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \ - --backend cuda \ - --dtype bf16 \ - --output-dir ./voxtral_rt_exports \ - --qlinear-encoder 4w \ - --qlinear-encoder-packing-format tile_packed_to_4d \ - --qlinear 4w \ - --qlinear-packing-format tile_packed_to_4d \ - --qembedding 8w -``` - -Streaming with int4 quantization: - -```bash -python export_voxtral_rt.py \ - --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \ - --backend cuda \ - --dtype bf16 \ - --streaming \ - --output-dir ./voxtral_rt_exports \ - --qlinear-encoder 4w \ - --qlinear-encoder-packing-format tile_packed_to_4d \ - --qlinear 4w \ - --qlinear-packing-format tile_packed_to_4d \ - --qembedding 8w -``` - -#### Metal export examples - -Offline: - -```bash -python export_voxtral_rt.py \ - --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \ - --backend metal \ - --output-dir ./voxtral_rt_exports \ - --qlinear-encoder fpa4w \ - --qlinear fpa4w -``` - -Streaming: +
+<details>
+<summary>Metal</summary>

```bash
python export_voxtral_rt.py \
@@ -160,38 +90,27 @@ python export_voxtral_rt.py \
 --qlinear fpa4w
```

-**Note:** Metal 4-bit quantization requires torchao built with experimental MPS (Metal) ops.
+Metal 4-bit quantization (`fpa4w`) requires torchao built with experimental MPS ops:

-You can install torchao with Metal support from the `ao` repo (in third-party/ao/)
```bash
+# From the ao repo (third-party/ao/)
USE_CPP=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 pip install . --no-build-isolation
-```
-Alternatively, you can build torchao with Metal support while installing ExecuTorch from source
-```bash
+# Or while installing ExecuTorch from source
EXECUTORCH_BUILD_KERNELS_TORCHAO=1 TORCHAO_BUILD_EXPERIMENTAL_MPS=1 ./install_executorch.sh
```

-### CUDA-Windows Export
+</details>
-Before running `cuda-windows` export, make sure these requirements are set up: -- `x86_64-w64-mingw32-g++` is installed and on `PATH` (mingw-w64 cross-compiler). -- `WINDOWS_CUDA_HOME` points to the extracted Windows CUDA package directory. - -Example setup on Ubuntu (refer to [Parakeet README](../parakeet/README.md) for detailed extraction steps): - -```bash -# Ensure the WINDOWS_CUDA_HOME environment variable is set -export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart -``` - -Export the model for Windows CUDA (example with int4 quantization): +
+<details>
+<summary>CUDA</summary>

```bash
python export_voxtral_rt.py \
 --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
- --backend cuda-windows \
+ --backend cuda \
 --dtype bf16 \
+ --streaming \
 --output-dir ./voxtral_rt_exports \
 --qlinear-encoder 4w \
 --qlinear-encoder-packing-format tile_packed_to_4d \
@@ -200,9 +119,18 @@ python export_voxtral_rt.py \
 --qembedding 8w
```

-For streaming, add `--streaming`:
+</details>
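Since CUDA exports leave an `aoti_cuda_blob.ptd` delegate blob alongside `model.pte` in the output directory, a quick pre-flight check can catch a missing file before launching the runner. A hypothetical helper (the function name and default file list are illustrative, not part of the export script):

```python
from pathlib import Path
import tempfile

# Hypothetical pre-flight helper: CUDA exports are expected to produce both
# model.pte and aoti_cuda_blob.ptd in the export directory.
def missing_artifacts(export_dir, names=("model.pte", "aoti_cuda_blob.ptd")):
    export_dir = Path(export_dir)
    return [name for name in names if not (export_dir / name).is_file()]

# Example: an empty directory is missing both files.
with tempfile.TemporaryDirectory() as d:
    print(missing_artifacts(d))  # ['model.pte', 'aoti_cuda_blob.ptd']
```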
+ +
+<details>
+<summary>CUDA-Windows</summary>

Requires `x86_64-w64-mingw32-g++` on `PATH` (mingw-w64 cross-compiler) and
`WINDOWS_CUDA_HOME` pointing to the extracted Windows CUDA package directory.
See [Parakeet README](../parakeet/README.md#cuda-windows-export) for detailed extraction steps.

```bash
export WINDOWS_CUDA_HOME=/opt/cuda-windows/extracted/cuda_cudart/cudart

python export_voxtral_rt.py \
 --model-path ~/models/Voxtral-Mini-4B-Realtime-2602 \
 --backend cuda-windows \
 --dtype bf16 \
@@ -216,16 +144,18 @@ python export_voxtral_rt.py \
 --qembedding 8w
```

-Both offline and streaming exports generate:
-- `model.pte`
-- `aoti_cuda_blob.ptd`
+</details>
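The two cuda-windows prerequisites (cross-compiler on `PATH`, `WINDOWS_CUDA_HOME` set to an existing directory) can be verified programmatically before a long export run. A sketch with injected lookups so the logic is testable; the helper itself is hypothetical, only the checked names come from the requirements above:

```python
import os
import shutil

# Hypothetical pre-flight check for the cuda-windows export prerequisites.
# Lookups are injected (defaulting to the real ones) to keep it testable.
def check_cuda_windows_prereqs(which=shutil.which, env=os.environ, isdir=os.path.isdir):
    problems = []
    if which("x86_64-w64-mingw32-g++") is None:
        problems.append("x86_64-w64-mingw32-g++ not found on PATH")
    cuda_home = env.get("WINDOWS_CUDA_HOME")
    if not cuda_home:
        problems.append("WINDOWS_CUDA_HOME is not set")
    elif not isdir(cuda_home):
        problems.append(f"WINDOWS_CUDA_HOME does not exist: {cuda_home}")
    return problems

# With both prerequisites faked as present, nothing is reported.
ok = check_cuda_windows_prereqs(
    which=lambda name: "/usr/bin/" + name,
    env={"WINDOWS_CUDA_HOME": "/opt/cuda-windows"},
    isdir=lambda p: True,
)
print(ok)  # []
```

An empty result means both requirements are satisfied; otherwise each string names the missing piece.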
+ +> [!NOTE] +> Omit `--streaming` from any export command above for offline mode. +> CUDA and CUDA-Windows exports also produce an `aoti_cuda_blob.ptd` file alongside `model.pte`. ### Options | Flag | Default | Description | |------|---------|-------------| | `--model-path` | (required) | Directory with `params.json` + `consolidated.safetensors` | -| `--backend` | `xnnpack` | `xnnpack`, `metal`, `cuda`, or `portable` | +| `--backend` | `xnnpack` | `xnnpack`, `metal`, `cuda`, `cuda-windows`, or `portable` | | `--dtype` | `fp32` | Model dtype: `fp32` or `bf16` | | `--output-dir` | `./voxtral_rt_exports` | Output directory | | `--max-seq-len` | `4096` | KV cache length | @@ -250,36 +180,21 @@ ExecuTorch must be installed from source first (see [Prerequisites](#prerequisites)). The `make` targets below handle building core libraries and the runner binary. -### XNNPACK (CPU) - ```bash -make voxtral_realtime-cpu +make voxtral_realtime-cpu # XNNPACK (CPU) +make voxtral_realtime-metal # Metal (Apple GPU) +make voxtral_realtime-cuda # CUDA (NVIDIA GPU) ``` -This builds ExecuTorch core libraries with XNNPACK, then the runner binary -at `cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner`. - -### CUDA (NVIDIA GPU) - -```bash -make voxtral_realtime-cuda -``` - -This builds ExecuTorch with CUDA backend support. The runner binary is at -the same path as above. Requires NVIDIA GPU with CUDA toolkit installed. - -### Metal (Apple GPU) - -```bash -make voxtral_realtime-metal -``` - -This builds ExecuTorch with Metal backend support. The runner binary is at -the same path as above. Metal exports can only run on macOS with Apple Silicon. +All targets produce the runner binary at +`cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner`. ### CUDA-Windows -On Windows (PowerShell), use CMake workflow presets directly from the executorch root directory. 
Note that if you exported the model with 4-bit quantization, you may need to specify your GPU's compute capability (e.g., `80;86;89;90;120` for Ampere, Lovelace, Hopper, and Blackwell) to avoid "invalid device function" errors at runtime, as the `int4mm` kernels require SM 80 or newer. +On Windows (PowerShell), use CMake workflow presets from the executorch root +directory. If you exported with 4-bit quantization, specify your GPU's compute +capability to avoid "invalid device function" errors (the `int4mm` kernels +require SM 80+). ```powershell $env:CMAKE_CUDA_ARCHITECTURES="80;86;89;90;120" @@ -296,43 +211,24 @@ The runner requires: - `tekken.json` — tokenizer from the model weights directory - `preprocessor.pte` — mel spectrogram preprocessor (see [Preprocessor](#preprocessor)) - A 16kHz mono WAV audio file (or live audio via `--mic`) -- For CUDA: `aoti_cuda_blob.ptd` — delegate data file (pass via `--data_path`) -### Windows (PowerShell) +### Basic usage -```powershell -.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe ` - --model_path voxtral_rt_exports\model.pte ` - --tokenizer_path C:\path\to\tekken.json ` - --preprocessor_path voxtral_rt_exports\preprocessor.pte ` - --audio_path input.wav -``` - -For CUDA, include the `.ptd` data file: - -```powershell -.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe ` - --model_path voxtral_rt_exports\model.pte ` - --data_path voxtral_rt_exports\aoti_cuda_blob.ptd ` - --tokenizer_path C:\path\to\tekken.json ` - --preprocessor_path voxtral_rt_exports\preprocessor.pte ` - --audio_path input.wav +```bash +cmake-out/examples/models/voxtral_realtime/voxtral_realtime_runner \ + --model_path voxtral_rt_exports/model.pte \ + --tokenizer_path ~/models/Voxtral-Mini-4B-Realtime-2602/tekken.json \ + --preprocessor_path voxtral_rt_exports/preprocessor.pte \ + --audio_path input.wav \ + --streaming ``` -For streaming, add `--streaming`. 
This requires a model exported with -`--streaming`. The runner processes audio in 80ms steps, computing mel -and running the encoder+decoder incrementally. +Omit `--streaming` for offline transcription (requires an offline-exported +model and offline preprocessor). -```powershell -.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe ` - --model_path voxtral_rt_exports\model.pte ` - --tokenizer_path C:\path\to\tekken.json ` - --preprocessor_path voxtral_rt_exports\preprocessor.pte ` - --audio_path input.wav ` - --streaming -``` +For CUDA backends (Linux and Windows), add `--data_path voxtral_rt_exports/aoti_cuda_blob.ptd`. -For CUDA streaming, include the `.ptd` data file: +**Windows (PowerShell):** ```powershell .\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe ` @@ -344,9 +240,11 @@ For CUDA streaming, include the `.ptd` data file: --streaming ``` -For live microphone input, use `--mic` to read raw 16kHz float32 PCM from -stdin. This requires a model exported with `--streaming` and a streaming -preprocessor. Pipe from any audio capture tool: +### Live microphone input + +Use `--mic` to read raw 16kHz float32 PCM from stdin. Requires a +streaming-exported model and streaming preprocessor. Pipe from any audio +capture tool: ```bash # macOS @@ -360,19 +258,7 @@ ffmpeg -f avfoundation -i ":0" -ar 16000 -ac 1 -f f32le -nostats -loglevel error Ctrl+C stops recording and flushes remaining text. 
-**Windows (PowerShell):** - -```powershell -.\cmake-out\examples\models\voxtral_realtime\Release\voxtral_realtime_runner.exe ` - --model_path C:\path\to\voxtral_rt_exports\model.pte ` - --data_path C:\path\to\voxtral_rt_exports\aoti_cuda_blob.ptd ` - --tokenizer_path C:\path\to\tekken.json ` - --preprocessor_path C:\path\to\voxtral_rt_exports\preprocessor.pte ` - --audio_path C:\path\to\input.wav -``` - -**CUDA:** Add `--data_path voxtral_rt_exports/aoti_cuda_blob.ptd` to all -run commands above when using the CUDA backend. +### Options | Flag | Default | Description | |------|---------|-------------|