109 changes: 100 additions & 9 deletions demos/audio/README.md
@@ -1,25 +1,116 @@
# Audio endpoints
# How to serve audio models via OpenAI API {#ovms_demos_continuous_batching_vlm}

This demo shows how to deploy audio models with OpenVINO Model Server.
Speech generation and speech recognition use cases are exposed via the OpenAI API `audio/speech` and `audio/transcriptions` endpoints.

## Audio synthesis
## Prerequisites

python export_model.py text2speech --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan --weight-format fp16
**OVMS version 2025.4**: This demo requires OVMS version 2025.4 or newer.

docker run -p 8000:8000 -d -v $(pwd)/models/:/models openvino/model_server --model_name speecht5_tts --model_path /models/microsoft/speecht5_tts --rest_port 8000
**Model preparation**: Python 3.9 or higher with pip, and a HuggingFace account

curl http://localhost/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"speecht5_tts\", \"input\": \"The quick brown fox jumped over the lazy dog.\"}" -o audio.wav
**Model Server deployment**: Docker Engine installed, or the OVMS binary package installed according to the [baremetal deployment guide](../../docs/deploying_server_baremetal.md)

**(Optional) Client**: git and Python for using the OpenAI client package

## Audio transcription

python export_model.py speech2text --source_model openai/whisper-large-v2 --weight-format fp16 --target_device GPU
## Speech generation
### Model preparation
The only model currently supported for the speech generation use case is microsoft/speecht5_tts, which is hosted outside of the OpenVINO organization and needs conversion to IR format.

A dedicated OVMS pull mode example for models outside of the OpenVINO organization is described in the section `Pulling models outside of OpenVINO organization` of the [OVMS pull mode documentation](https://github.com/openvinotoolkit/model_server/blob/main/docs/pull_hf_models.md).
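
As a rough sketch of that flow (illustrative only — flags such as `--pull`, `--source_model`, `--model_repository_path`, `--task text_to_speech` and `--vocoder` are assumed to be available in your OVMS 2025.4+ image, and pulling models from outside the OpenVINO organization additionally requires an image that includes the export dependencies):

```console
docker run --rm -u $(id -u):$(id -g) -v $(pwd)/models:/models openvino/model_server:latest \
  --pull --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan \
  --model_repository_path /models --task text_to_speech
```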

docker run -p 8000:8000 -it --device /dev/dri -u 0 -v $(pwd)/models/:/models openvino/model_server --model_name whisper --model_path /models/openai/whisper-large-v2 --rest_port 8000
Alternatively, you can use the Python `export_model.py` script described below.

Here, the original TTS model will be converted to IR format and optionally quantized.
That ensures faster initialization time, better performance and lower memory consumption.
Execution parameters will be defined inside the `graph.pbtxt` file.
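
For orientation, the TTS node options in a generated `graph.pbtxt` are expected to look roughly like the fragment below (an illustrative sketch rendered with default values; the actual file also defines the graph inputs, outputs, and the calculator node itself):

```
node_options: {
  [type.googleapis.com / mediapipe.TtsCalculatorOptions]: {
    # models_path points at the exported IR files; the other values are the CPU defaults
    models_path: "./",
    plugin_config: '{ "NUM_STREAMS": "1" }',
    target_device: "CPU"
  }
}
```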

curl http://localhost/v3/audio/transcriptions -H "Content-Type: multipart/form-data" -F file="@audio.wav" -F model="whisper"
Download the export script, install its dependencies, and create a directory for the models:
```console
curl https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/export_model.py -o export_model.py
pip3 install -r https://raw.githubusercontent.com/openvinotoolkit/model_server/refs/heads/releases/2025/3/demos/common/export_models/requirements.txt
mkdir models
```

Run the `export_model.py` script to download and quantize the model:

> **Note:** Users in China need to set the environment variable `HF_ENDPOINT="https://hf-mirror.com"` before running the export script in order to connect to the HF Hub.
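
For example, in a bash shell:

```console
export HF_ENDPOINT="https://hf-mirror.com"
```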

**CPU**
```console
python export_model.py text_to_speech --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan --weight-format int4 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models
```

**GPU**
```console
python export_model.py text_to_speech --source_model microsoft/speecht5_tts --vocoder microsoft/speecht5_hifigan --weight-format int4 --model_name microsoft/speecht5_tts --config_file_path models/config.json --model_repository_path models --overwrite_models --target_device GPU
```

> **Note:** Adjust the `--weight-format` parameter to export the model in `fp16`, `int8`, or `int4` precision and balance accuracy, memory consumption, and performance.


You should now have a model folder like the one below:
```
models/
├── config.json
└── speecht5_tts
    ├── added_tokens.json
    ├── config.json
    ├── generation_config.json
    ├── graph.pbtxt
    ├── openvino_decoder_model.bin
    ├── openvino_decoder_model.xml
    ├── openvino_detokenizer.bin
    ├── openvino_detokenizer.xml
    ├── openvino_encoder_model.bin
    ├── openvino_encoder_model.xml
    ├── openvino_postnet.bin
    ├── openvino_postnet.xml
    ├── openvino_tokenizer.bin
    ├── openvino_tokenizer.xml
    ├── openvino_vocoder.bin
    ├── openvino_vocoder.xml
    ├── preprocessor_config.json
    ├── special_tokens_map.json
    ├── spm_char.model
    └── tokenizer_config.json
```
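
### Server deployment

Before sending requests, start the model server with the generated configuration. Below is a minimal Docker sketch; it assumes the `openvino/model_server:latest` image, the `models` directory created above in the current working directory, and REST exposed on port 8000 (for GPU execution also pass `--device /dev/dri` to `docker run`). On baremetal, run the `ovms` binary with the same `--rest_port` and `--config_path` arguments.

```console
docker run -d --rm -p 8000:8000 -v $(pwd)/models:/models openvino/model_server:latest \
  --rest_port 8000 --config_path /models/config.json
```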

### Request Generation

:::{dropdown} **Unary call with curl**


```bash
curl http://localhost:8000/v3/audio/speech -H "Content-Type: application/json" -d "{\"model\": \"microsoft/speecht5_tts\", \"input\": \"The quick brown fox jumped over the lazy dog.\"}" -o speech.wav
```
:::

:::{dropdown} **Unary call with the OpenAI Python client**

```python
from pathlib import Path
from openai import OpenAI

prompt = "The quick brown fox jumped over the lazy dog"
filename = "speech.wav"
url = "http://localhost:8000/v3"

# Save the generated audio next to this script
speech_file_path = Path(__file__).parent / filename
client = OpenAI(base_url=url, api_key="not_used")

# Stream the synthesized speech directly into the output file
with client.audio.speech.with_streaming_response.create(
    model="microsoft/speecht5_tts",
    voice="unused",
    input=prompt
) as response:
    response.stream_to_file(speech_file_path)

print("Generation finished")
```
:::

The `speech.wav` file contains the generated speech.
4 changes: 2 additions & 2 deletions demos/common/export_models/export_model.py
@@ -109,7 +109,7 @@ def add_common_arguments(parser):
[type.googleapis.com / mediapipe.TtsCalculatorOptions]: {
models_path: "{{model_path}}",
plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
device: "{{target_device|default("CPU", true)}}"
target_device: "{{target_device|default("CPU", true)}}"
}
}
}
@@ -128,7 +128,7 @@ def add_common_arguments(parser):
[type.googleapis.com / mediapipe.SttCalculatorOptions]: {
models_path: "{{model_path}}",
plugin_config: '{ "NUM_STREAMS": "{{num_streams|default(1, true)}}" }',
device: "{{target_device|default("CPU", true)}}"
target_device: "{{target_device|default("CPU", true)}}"
}
}
}
2 changes: 2 additions & 0 deletions src/BUILD
@@ -305,6 +305,8 @@ ovms_cc_library(
"//src/graph_export:graph_cli_parser",
"//src/graph_export:rerank_graph_cli_parser",
"//src/graph_export:embeddings_graph_cli_parser",
"//src/graph_export:tts_graph_cli_parser",
"//src/graph_export:stt_graph_cli_parser",
"//src/graph_export:image_generation_graph_cli_parser",
],
visibility = ["//visibility:public",],
2 changes: 1 addition & 1 deletion src/audio/speech_to_text/stt_calculator.proto
@@ -29,6 +29,6 @@ message SttCalculatorOptions {

// fields required for GenAI pipeline initialization
required string models_path = 1;
optional string device = 2;
optional string target_device = 2;
optional string plugin_config = 3;
}
2 changes: 1 addition & 1 deletion src/audio/text_to_speech/tts_calculator.proto
@@ -29,6 +29,6 @@ message TtsCalculatorOptions {

// fields required for GenAI pipeline initialization
required string models_path = 1;
optional string device = 2;
optional string target_device = 2;
optional string plugin_config = 3;
}
25 changes: 24 additions & 1 deletion src/capi_frontend/server_settings.hpp
@@ -28,6 +28,8 @@ enum GraphExportType : unsigned int {
RERANK_GRAPH,
EMBEDDINGS_GRAPH,
IMAGE_GENERATION_GRAPH,
TEXT_TO_SPEECH_GRAPH,
SPEECH_TO_TEXT_GRAPH,
UNKNOWN_GRAPH
};

@@ -43,13 +45,17 @@ const std::map<GraphExportType, std::string> typeToString = {
{RERANK_GRAPH, "rerank"},
{EMBEDDINGS_GRAPH, "embeddings"},
{IMAGE_GENERATION_GRAPH, "image_generation"},
{TEXT_TO_SPEECH_GRAPH, "text_to_speech"},
{SPEECH_TO_TEXT_GRAPH, "speech_to_text"},
{UNKNOWN_GRAPH, "unknown_graph"}};

const std::map<std::string, GraphExportType> stringToType = {
{"text_generation", TEXT_GENERATION_GRAPH},
{"rerank", RERANK_GRAPH},
{"embeddings", EMBEDDINGS_GRAPH},
{"image_generation", IMAGE_GENERATION_GRAPH},
{"text_to_speech", TEXT_TO_SPEECH_GRAPH},
{"speech_to_text", SPEECH_TO_TEXT_GRAPH},
{"unknown_graph", UNKNOWN_GRAPH}};

std::string enumToString(GraphExportType type);
@@ -120,6 +126,22 @@ struct EmbeddingsGraphSettingsImpl {
std::string pooling = "CLS";
};

struct TextToSpeechGraphSettingsImpl {
std::string modelPath = "./";
std::string targetDevice = "CPU";
std::string modelName = "";
uint32_t numStreams = 1;
};


struct SpeechToTextGraphSettingsImpl {
std::string modelPath = "./";
std::string targetDevice = "CPU";
std::string modelName = "";
uint32_t numStreams = 1;
};


struct RerankGraphSettingsImpl {
std::string modelPath = "./";
std::string targetDevice = "CPU";
@@ -146,6 +168,7 @@ struct ImageGenerationGraphSettingsImpl {
struct ExportSettings {
std::string targetDevice = "CPU";
std::optional<std::string> extraQuantizationParams;
std::optional<std::string> vocoder;
std::string precision = "int8";
};

@@ -157,7 +180,7 @@ struct HFSettingsImpl {
bool overwriteModels = false;
ModelDownlaodType downloadType = GIT_CLONE_DOWNLOAD;
GraphExportType task = TEXT_GENERATION_GRAPH;
std::variant<TextGenGraphSettingsImpl, RerankGraphSettingsImpl, EmbeddingsGraphSettingsImpl, ImageGenerationGraphSettingsImpl> graphSettings;
std::variant<TextGenGraphSettingsImpl, RerankGraphSettingsImpl, EmbeddingsGraphSettingsImpl, TextToSpeechGraphSettingsImpl, SpeechToTextGraphSettingsImpl, ImageGenerationGraphSettingsImpl> graphSettings;
};

struct ServerSettingsImpl {
40 changes: 39 additions & 1 deletion src/cli_parser.cpp
@@ -26,6 +26,8 @@
#include "graph_export/graph_cli_parser.hpp"
#include "graph_export/rerank_graph_cli_parser.hpp"
#include "graph_export/embeddings_graph_cli_parser.hpp"
#include "graph_export/tts_graph_cli_parser.hpp"
#include "graph_export/stt_graph_cli_parser.hpp"
#include "graph_export/image_generation_graph_cli_parser.hpp"
#include "ovms_exit_codes.hpp"
#include "filesystem.hpp"
@@ -209,7 +211,7 @@ void CLIParser::parse(int argc, char** argv) {
cxxopts::value<std::string>(),
"MODEL_REPOSITORY_PATH")
("task",
"Choose type of model export: text_generation - chat and completion endpoints, embeddings - embeddings endpoint, rerank - rerank endpoint, image_generation - image generation/edit/inpainting endpoints.",
"Choose type of model export: text_generation - chat and completion endpoints, embeddings - embeddings endpoint, rerank - rerank endpoint, image_generation - image generation/edit/inpainting endpoints, text_to_speech - audio/speech endpoint, speech_to_text - audio/transcriptions endpoint.",
cxxopts::value<std::string>(),
"TASK")
("weight-format",
@@ -219,6 +221,10 @@ void CLIParser::parse(int argc, char** argv) {
("extra_quantization_params",
"Model quantization parameters used in optimum-cli export with conversion for text generation models",
cxxopts::value<std::string>(),
"EXTRA_QUANTIZATION_PARAMS")
("vocoder",
"The vocoder model to use for text2speech. For example microsoft/speecht5_hifigan",
cxxopts::value<std::string>(),
"EXTRA_QUANTIZATION_PARAMS");

options->add_options("single model")
Expand Down Expand Up @@ -334,6 +340,18 @@ void CLIParser::parse(int argc, char** argv) {
this->graphOptionsParser = std::move(cliParser);
break;
}
case TEXT_TO_SPEECH_GRAPH: {
TextToSpeechGraphCLIParser cliParser;
unmatchedOptions = cliParser.parse(result->unmatched());
this->graphOptionsParser = std::move(cliParser);
break;
}
case SPEECH_TO_TEXT_GRAPH: {
SpeechToTextGraphCLIParser cliParser;
unmatchedOptions = cliParser.parse(result->unmatched());
this->graphOptionsParser = std::move(cliParser);
break;
}
case UNKNOWN_GRAPH: {
std::cerr << "error parsing options - --task parameter unsupported value: " + result->operator[]("task").as<std::string>();
exit(OVMS_EX_USAGE);
@@ -411,6 +429,8 @@ void CLIParser::parse(int argc, char** argv) {
RerankGraphCLIParser parser2;
EmbeddingsGraphCLIParser parser3;
ImageGenerationGraphCLIParser imageGenParser;
TextToSpeechGraphCLIParser ttsParser;
SpeechToTextGraphCLIParser sttParser;
parser1.printHelp();
parser2.printHelp();
parser3.printHelp();
@@ -635,6 +655,8 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
hfSettings.exportSettings.precision = result->operator[]("weight-format").as<std::string>();
if (result->count("extra_quantization_params"))
hfSettings.exportSettings.extraQuantizationParams = result->operator[]("extra_quantization_params").as<std::string>();
if (result->count("vocoder"))
hfSettings.exportSettings.vocoder = result->operator[]("vocoder").as<std::string>();
if (result->count("model_repository_path"))
hfSettings.downloadPath = result->operator[]("model_repository_path").as<std::string>();
if (result->count("task")) {
@@ -672,6 +694,22 @@ void CLIParser::prepareGraph(ServerSettingsImpl& serverSettings, HFSettingsImpl&
}
break;
}
case TEXT_TO_SPEECH_GRAPH: {
if (std::holds_alternative<TextToSpeechGraphCLIParser>(this->graphOptionsParser)) {
std::get<TextToSpeechGraphCLIParser>(this->graphOptionsParser).prepare(serverSettings.serverMode, hfSettings, modelName);
} else {
throw std::logic_error("Tried to prepare graph settings without graph parser initialization");
}
break;
}
case SPEECH_TO_TEXT_GRAPH: {
if (std::holds_alternative<SpeechToTextGraphCLIParser>(this->graphOptionsParser)) {
std::get<SpeechToTextGraphCLIParser>(this->graphOptionsParser).prepare(serverSettings.serverMode, hfSettings, modelName);
} else {
throw std::logic_error("Tried to prepare graph settings without graph parser initialization");
}
break;
}
case UNKNOWN_GRAPH: {
throw std::logic_error("Error: --task parameter unsupported value: " + result->operator[]("task").as<std::string>());
break;
4 changes: 3 additions & 1 deletion src/cli_parser.hpp
@@ -24,6 +24,8 @@
#include "graph_export/graph_cli_parser.hpp"
#include "graph_export/rerank_graph_cli_parser.hpp"
#include "graph_export/embeddings_graph_cli_parser.hpp"
#include "graph_export/tts_graph_cli_parser.hpp"
#include "graph_export/stt_graph_cli_parser.hpp"
#include "graph_export/image_generation_graph_cli_parser.hpp"

namespace ovms {
@@ -34,7 +36,7 @@ struct ModelsSettingsImpl;
class CLIParser {
std::unique_ptr<cxxopts::Options> options;
std::unique_ptr<cxxopts::ParseResult> result;
std::variant<GraphCLIParser, RerankGraphCLIParser, EmbeddingsGraphCLIParser, ImageGenerationGraphCLIParser> graphOptionsParser;
std::variant<GraphCLIParser, RerankGraphCLIParser, EmbeddingsGraphCLIParser, ImageGenerationGraphCLIParser, TextToSpeechGraphCLIParser, SpeechToTextGraphCLIParser> graphOptionsParser;

public:
CLIParser() = default;
28 changes: 28 additions & 0 deletions src/graph_export/BUILD
@@ -98,3 +98,31 @@ ovms_cc_library(
],
visibility = ["//visibility:public"],
)

ovms_cc_library(
name = "tts_graph_cli_parser",
srcs = ["tts_graph_cli_parser.cpp"],
hdrs = ["tts_graph_cli_parser.hpp"],
deps = [
"@ovms//src/graph_export:graph_cli_parser",
"@ovms//src:cpp_headers",
"@ovms//src:libovms_server_settings",
"@ovms//src:ovms_exit_codes",
"@com_github_jarro2783_cxxopts//:cxxopts",
],
visibility = ["//visibility:public"],
)

ovms_cc_library(
name = "stt_graph_cli_parser",
srcs = ["stt_graph_cli_parser.cpp"],
hdrs = ["stt_graph_cli_parser.hpp"],
deps = [
"@ovms//src/graph_export:graph_cli_parser",
"@ovms//src:cpp_headers",
"@ovms//src:libovms_server_settings",
"@ovms//src:ovms_exit_codes",
"@com_github_jarro2783_cxxopts//:cxxopts",
],
visibility = ["//visibility:public"],
)
Loading