109 changes: 100 additions & 9 deletions demos/audio/README.md
@@ -1,25 +1,116 @@
# Audio endpoints
# How to serve audio models via OpenAI API {#ovms_demos_continuous_batching_vlm}

This demo shows how to deploy audio models with OpenVINO Model Server.
Speech generation and speech recognition use cases are exposed via the OpenAI API `audio/speech` and `audio/transcriptions` endpoints.

## Audio synthesis
## Prerequisites

python export_model.py text2speech --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan --weight-format fp16
**OVMS version 2025.4**: This demo requires OVMS version 2025.4 or newer.

docker run -p 8000:8000 -d -v $(pwd)/models/:/models openvino/model_server --model_name speecht5_tts --model_path /models/microsoft/speecht5_tts --rest_port 8000
**Model preparation**: Python 3.9 or higher with pip, and a HuggingFace account

curl http://localhost/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"speecht5_tts\", \"input\": \"The quick brown fox jumped over the lazy dog.\"}" -o audio.wav
**Model Server deployment**: Docker Engine installed, or the OVMS binary package installed according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)

**(Optional) Client**: git and Python for using the OpenAI client package

## Audio transcription

python export_model.py speech2text --source_model openai/whisper-large-v2 --weight-format fp16 --target_device GPU
## Speech generation
### Model preparation
The only model currently supported for the speech generation use case is microsoft/speecht5_tts, which is hosted outside of the OpenVINO organization and needs conversion to IR format.

A dedicated OVMS pull mode example for models outside of the OpenVINO organization is described in the section `Pulling models outside of OpenVINO organization` of the [OVMS pull mode documentation](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_hf_models.md).
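
As a rough sketch of that flow (illustrative only — flags such as `--pull`, `--source_model`, `--model_repository_path`, `--task text_to_speech` and `--vocoder` are assumed to be available in your OVMS 2025.4+ image, and pulling models from outside the OpenVINO organization additionally requires an image that includes the export dependencies):

```console
docker run --rm -u $(id -u):$(id -g) -v $(pwd)/models:/models openvino/model_server:latest \
  --pull --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan \
  --model_repository_path /models --task text_to_speech
```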

docker run -p 8000:8000 -it --device /dev/dri -u 0 -v $(pwd)/models/:/models openvino/model_server --model_name whisper --model_path /models/openai/whisper-large-v2 --rest_port 8000
Alternatively, you can use the Python `export_model.py` script described below.

Here, the original TTS model will be converted to IR format and optionally quantized.
That ensures faster initialization time, better performance and lower memory consumption.
Execution parameters will be defined inside the `graph.pbtxt` file.
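
For orientation, the TTS node options in a generated `graph.pbtxt` are expected to look roughly like the fragment below (an illustrative sketch rendered with default values; the actual file also defines the graph inputs, outputs, and the calculator node itself):

```
node_options: {
  [type.googleapis.com / mediapipe.TtsCalculatorOptions]: {
    # models_path points at the exported IR files; the other values are the CPU defaults
    models_path: "./",
    plugin_config: '{ "NUM_STREAMS": "1" }',
    target_device: "CPU"
  }
}
```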

curl http://localhost/v3/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@audio.wav" -F model="whisper"
Download the export script, install its dependencies, and create a directory for the models:
```console
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/requirements.txt
mkdir models
```

Run the `export_model.py` script to download and quantize the model:

> **Note:** Users in China need to set the environment variable `HF_ENDPOINT="https://hf-mirror.com"` before running the export script in order to connect to the HF Hub.
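
For example, in a bash shell:

```console
export HF_ENDPOINT="https://hf-mirror.com"
```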

**CPU**
```console
python export_model.py text_to_speech --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan --weight-format int4 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models
```

**GPU**
```console
python export_model.py text_to_speech --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan --weight-format int4 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --target_device GPU
```

> **Note:** Adjust the `--weight-format` parameter to export the model in `fp16`, `int8`, or `int4` precision and balance accuracy, memory consumption, and performance.


You should now have a model folder like the one below:
```
models/
├── config.json
└── speecht5_tts
    ├── added_tokens.json
    ├── config.json
    ├── generation_config.json
    ├── graph.pbtxt
    ├── openvino_decoder_model.bin
    ├── openvino_decoder_model.xml
    ├── openvino_detokenizer.bin
    ├── openvino_detokenizer.xml
    ├── openvino_encoder_model.bin
    ├── openvino_encoder_model.xml
    ├── openvino_postnet.bin
    ├── openvino_postnet.xml
    ├── openvino_tokenizer.bin
    ├── openvino_tokenizer.xml
    ├── openvino_vocoder.bin
    ├── openvino_vocoder.xml
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── spm_char.model
    └── tokenizer_config.json
```
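
### Server deployment

Before sending requests, start the model server with the generated configuration. Below is a minimal Docker sketch; it assumes the `openvino/model_server:latest` image, the `models` directory created above in the current working directory, and REST exposed on port 8000 (for GPU execution also pass `--device /dev/dri` to `docker run`). On baremetal, run the `ovms` binary with the same `--rest_port` and `--config_path` arguments.

```console
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:latest \
  --rest_port 8000 --config_path /models/config.json
```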

### Request Generation

:::{dropdown} **Unary call with curl**


```bash
curl http://localhost:8000/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"microsoft/speecht5_tts\", \"input\": \"The quick brown fox jumped over the lazy dog.\"}" -o speech.wav
```
:::

:::{dropdown} **Unary call with the OpenAI Python client**

```python
from pathlib import Path
from openai import OpenAI

prompt = "The quick brown fox jumped over the lazy dog"
filename = "speech.wav"
url = "http://localhost:8000/v3"

# Save the generated audio next to this script
speech_file_path = Path(__file__).parent / filename
client = OpenAI(base_url=url, api_key="not_used")

# Stream the synthesized speech directly into the output file
with client.audio.speech.with_streaming_response.create(
    model="microsoft/speecht5_tts",
    voice="unused",
    input=prompt
) as response:
    response.stream_to_file(speech_file_path)

print("Generation finished")
```
:::

The `speech.wav` file contains the generated speech.
4 changes: 2 additions & 2 deletions demos/common/export_models/export_model.py
@@ -109,7 +109,7 @@ def add_common_arguments(parser):
[type.googleapis.com / mediapipe.TtsCalculatorOptions]: {
models_path: "{{model_path}}",
plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
device: "{{target_device|default("CPU", true)}}"
target_device: "{{target_device|default("CPU", true)}}"
}
}
}
@@ -128,7 +128,7 @@ def add_common_arguments(parser):
[type.googleapis.com / mediapipe.SttCalculatorOptions]: {
models_path: "{{model_path}}",
plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
device: "{{target_device|default("CPU", true)}}"
target_device: "{{target_device|default("CPU", true)}}"
}
}
}
2 changes: 2 additions & 0 deletions src/BUILD
@@ -305,6 +305,8 @@ ovms_cc_library(
"//src/graph_export:graph_cli_parser",
"//src/graph_export:rerank_graph_cli_parser",
"//src/graph_export:embeddings_graph_cli_parser",
"//src/graph_export:tts_graph_cli_parser",
"//src/graph_export:stt_graph_cli_parser",
"//src/graph_export:image_generation_graph_cli_parser",
],
visibility = ["//visibility:public",],
2 changes: 1 addition & 1 deletion src/audio/speech_to_text/stt_calculator.proto
@@ -29,6 +29,6 @@ message SttCalculatorOptions {

// fields required for GenAI pipeline initialization
required string models_path = 1;
optional string device = 2;
optional string target_device = 2;
optional string plugin_config = 3;
}
2 changes: 1 addition & 1 deletion src/audio/text_to_speech/tts_calculator.proto
@@ -29,6 +29,6 @@ message TtsCalculatorOptions {

// fields required for GenAI pipeline initialization
required string models_path = 1;
optional string device = 2;
optional string target_device = 2;
optional string plugin_config = 3;
}
25 changes: 24 additions & 1 deletion src/capi_frontend/server_settings.hpp
@@ -28,6 +28,8 @@ enum GraphExportType : unsigned int {
RERANK_GRAPH,
EMBEDDINGS_GRAPH,
IMAGE_GENERATION_GRAPH,
TEXT_TO_SPEECH_GRAPH,
SPEECH_TO_TEXT_GRAPH,
UNKNOWN_GRAPH
};

@@ -43,13 +45,17 @@ const std::map<GraphExportType, std::string> typeToString = {
{RERANK_GRAPH, "rerank"},
{EMBEDDINGS_GRAPH, "embeddings"},
{IMAGE_GENERATION_GRAPH, "image_generation"},
{TEXT_TO_SPEECH_GRAPH, "text_to_speech"},
{SPEECH_TO_TEXT_GRAPH, "speech_to_text"},
{UNKNOWN_GRAPH, "unknown_graph"}};

const std::map<std::string, GraphExportType> stringToType = {
{"text_generation", TEXT_GENERATION_GRAPH},
{"rerank", RERANK_GRAPH},
{"embeddings", EMBEDDINGS_GRAPH},
{"image_generation", IMAGE_GENERATION_GRAPH},
{"text_to_speech", TEXT_TO_SPEECH_GRAPH},
{"speech_to_text", SPEECH_TO_TEXT_GRAPH},
{"unknown_graph", UNKNOWN_GRAPH}};

std::string enumToString(GraphExportType type);
@@ -120,6 +126,22 @@ struct EmbeddingsGraphSettingsImpl {
std::string pooling = "CLS";
};

struct TextToSpeechGraphSettingsImpl {
std::string modelPath = "./";
std::string targetDevice = "CPU";
std::string modelName = "";
uint32_t numStreams = 1;
};


struct SpeechToTextGraphSettingsImpl {
std::string modelPath = "./";
std::string targetDevice = "CPU";
std::string modelName = "";
uint32_t numStreams = 1;
};


struct RerankGraphSettingsImpl {
std::string modelPath = "./";
std::string targetDevice = "CPU";
@@ -146,6 +168,7 @@ struct ImageGenerationGraphSettingsImpl {
struct ExportSettings {
std::string targetDevice = "CPU";
std::optional<std::string> extraQuantizationParams;
std::optional<std::string> vocoder;
std::string precision = "int8";
};

@@ -157,7 +180,7 @@ struct HFSettingsImpl {
bool overwriteModels = false;
ModelDownlaodType downloadType = GIT_CLONE_DOWNLOAD;
GraphExportType task = TEXT_GENERATION_GRAPH;
std::variant<TextGenGraphSettingsImpl, RerankGraphSettingsImpl, EmbeddingsGraphSettingsImpl, ImageGenerationGraphSettingsImpl> graphSettings;
std::variant<TextGenGraphSettingsImpl, RerankGraphSettingsImpl, EmbeddingsGraphSettingsImpl, TextToSpeechGraphSettingsImpl, SpeechToTextGraphSettingsImpl, ImageGenerationGraphSettingsImpl> graphSettings;
};

struct ServerSettingsImpl {
40 changes: 39 additions & 1 deletion src/cli_parser.cpp
@@ -26,6 +26,8 @@
#include "graph_export/graph_cli_parser.hpp"
#include "graph_export/rerank_graph_cli_parser.hpp"
#include "graph_export/embeddings_graph_cli_parser.hpp"
#include "graph_export/tts_graph_cli_parser.hpp"
#include "graph_export/stt_graph_cli_parser.hpp"
#include "graph_export/image_generation_graph_cli_parser.hpp"
#include "ovms_exit_codes.hpp"
#include "filesystem.hpp"
@@ -209,7 +211,7 @@ void CLIParser::parse(int argc, char** argv) {
cxxopts::value<std::string>(),
"MODEL_REPOSITORY_PATH")
("task",
"Choose type of model export: text_generation - chat and completion endpoints, embeddings - embeddings endpoint, rerank - rerank endpoint, image_generation - image generation/edit/inpainting endpoints.",
"Choose type of model export: text_generation - chat and completion endpoints, embeddings - embeddings endpoint, rerank - rerank endpoint, image_generation - image generation/edit/inpainting endpoints, text_to_speech - audio/speech endpoint, speech_to_text - audio/transcriptions endpoint.",
cxxopts::value<std::string>(),
"TASK")
("weight-format",
@@ -219,6 +221,10 @@ void CLIParser::parse(int argc, char** argv) {
("extra_quantization_params",
"Model quantization parameters used in optimum-cli export with conversion for text generation models",
cxxopts::value<std::string>(),
"EXTRA_QUANTIZATION_PARAMS")
("vocoder",
"The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan",
cxxopts::value<std::string>(),
"EXTRA_QUANTIZATION_PARAMS");

options->add_options("single model")
Expand Down Expand Up @@ -334,6 +340,18 @@ void CLIParser::parse(int argc, char** argv) {
this->graphOptionsParser = std::move(cliParser);
break;
}
case TEXT_TO_SPEECH_GRAPH: {
TextToSpeechGraphCLIParser cliParser;
unmatchedOptions = cliParser.parse(result->unmatched());
this->graphOptionsParser = std::move(cliParser);
break;
}
case SPEECH_TO_TEXT_GRAPH: {
SpeechToTextGraphCLIParser cliParser;
unmatchedOptions = cliParser.parse(result->unmatched());
this->graphOptionsParser = std::move(cliParser);
break;
}
case UNKNOWN_GRAPH: {
std::cerr << "error parsing options - --task parameter unsupported value: " + result->operator[]("task").as<std::string>();
exit(OVMS_EX_USAGE);
@@ -411,6 +429,8 @@ void CLIParser::parse(int argc, char** argv) {
RerankGraphCLIParser parser2;
EmbeddingsGraphCLIParser parser3;
ImageGenerationGraphCLIParser imageGenParser;
TextToSpeechGraphCLIParser ttsParser;
SpeechToTextGraphCLIParser sttParser;
parser1.printHelp();
parser2.printHelp();
parser3.printHelp();
@@ -635,6 +655,8 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
hfSettings.exportSettings.precision = result->operator[]("weight-format").as<std::string>();
if (result->count("extra_quantization_params"))
hfSettings.exportSettings.extraQuantizationParams = result->operator[]("extra_quantization_params").as<std::string>();
if (result->count("vocoder"))
hfSettings.exportSettings.vocoder = result->operator[]("vocoder").as<std::string>();
if (result->count("model_repository_path"))
hfSettings.downloadPath = result->operator[]("model_repository_path").as<std::string>();
if (result->count("task")) {
@@ -672,6 +694,22 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
}
break;
}
case TEXT_TO_SPEECH_GRAPH: {
if (std::holds_alternative<TextToSpeechGraphCLIParser>(this->graphOptionsParser)) {
std::get<TextToSpeechGraphCLIParser>(this->graphOptionsParser).prepare(serverSettings.serverMode, hfSettings, modelName);
} else {
throw std::logic_error("Tried to prepare graph settings without graph parser initialization");
}
break;
}
case SPEECH_TO_TEXT_GRAPH: {
if (std::holds_alternative<SpeechToTextGraphCLIParser>(this->graphOptionsParser)) {
std::get<SpeechToTextGraphCLIParser>(this->graphOptionsParser).prepare(serverSettings.serverMode, hfSettings, modelName);
} else {
throw std::logic_error("Tried to prepare graph settings without graph parser initialization");
}
break;
}
case UNKNOWN_GRAPH: {
throw std::logic_error("Error: --task parameter unsupported value: " + result->operator[]("task").as<std::string>());
break;
4 changes: 3 additions & 1 deletion src/cli_parser.hpp
@@ -24,6 +24,8 @@
#include "graph_export/graph_cli_parser.hpp"
#include "graph_export/rerank_graph_cli_parser.hpp"
#include "graph_export/embeddings_graph_cli_parser.hpp"
#include "graph_export/tts_graph_cli_parser.hpp"
#include "graph_export/stt_graph_cli_parser.hpp"
#include "graph_export/image_generation_graph_cli_parser.hpp"

namespace ovms {
@@ -34,7 +36,7 @@ struct ModelsSettingsImpl;
class CLIParser {
std::unique_ptr<cxxopts::Options> options;
std::unique_ptr<cxxopts::ParseResult> result;
std::variant<GraphCLIParser, RerankGraphCLIParser, EmbeddingsGraphCLIParser, ImageGenerationGraphCLIParser> graphOptionsParser;
std::variant<GraphCLIParser, RerankGraphCLIParser, EmbeddingsGraphCLIParser, ImageGenerationGraphCLIParser, TextToSpeechGraphCLIParser, SpeechToTextGraphCLIParser> graphOptionsParser;

public:
CLIParser() = default;
28 changes: 28 additions & 0 deletions src/graph_export/BUILD
@@ -98,3 +98,31 @@ ovms_cc_library(
],
visibility = ["//visibility:public"],
)

ovms_cc_library(
name = "tts_graph_cli_parser",
srcs = ["tts_graph_cli_parser.cpp"],
hdrs = ["tts_graph_cli_parser.hpp"],
deps = [
"@ovms//src/graph_export:graph_cli_parser",
"@ovms//src:cpp_headers",
"@ovms//src:libovms_server_settings",
"@ovms//src:ovms_exit_codes",
"@com_github_jarro2783_cxxopts//:cxxopts",
],
visibility = ["//visibility:public"],
)

ovms_cc_library(
name = "stt_graph_cli_parser",
srcs = ["stt_graph_cli_parser.cpp"],
hdrs = ["stt_graph_cli_parser.hpp"],
deps = [
"@ovms//src/graph_export:graph_cli_parser",
"@ovms//src:cpp_headers",
"@ovms//src:libovms_server_settings",
"@ovms//src:ovms_exit_codes",
"@com_github_jarro2783_cxxopts//:cxxopts",
],
visibility = ["//visibility:public"],
)
Loading