Transcript&speech endpoints #3719

michalkulakowski · 2025-10-22T09:51:17Z

🛠 Summary

CVS-174567
CVS-174596

POC productization #3683

🧪 Checklist

Unit tests added.
The documentation updated.
Change follows security best practices.

atobiszei · 2025-10-22T13:00:47Z

src/test/mediapipeflow_test.cpp

        "SerializationCalculator",
        "SetLandmarkVisibilityCalculator",
        "SidePacketToStreamCalculator",
+        "SpeechCalculator",


We should have 2 now?

atobiszei · 2025-10-23T06:47:45Z

src/audio/speech_to_text/stt_calculator.cc

+#include "../../http_payload.hpp"
+#include "../../logging.hpp"


Suggested change

-#include "../../http_payload.hpp"
-#include "../../logging.hpp"
+#include "src/http_payload.hpp"
+#include "src/logging.hpp"

atobiszei · 2025-10-23T06:48:47Z

src/audio/text_to_speech/tts_calculator.cc

+#pragma warning(push)
+#pragma warning(disable : 4245 4220)
+#include "dr_wav.h"  // NOLINT


Please use the new way of including libraries that need special compilation flags, so the code is not polluted with pragmas carrying enigmatic numbers hidden in several places.
Check:
https://github.com/openvinotoolkit/model_server/blob/main/src/port/rapidjson_document.hpp

dtrawins · 2025-10-23T08:50:14Z

demos/common/export_models/export_model.py

 parser_image_generation.add_argument('--default_num_inference_steps', type=int, default=0, help='Default number of inference steps when not specified by client', dest='default_num_inference_steps')
 parser_image_generation.add_argument('--max_num_inference_steps', type=int, default=0, help='Max allowed number of inference steps client is allowed to request for a given prompt', dest='max_num_inference_steps')
+
+parser_speech_generation = subparsers.add_parser('speech', help='export model for speech synthesis endpoint')


this should be renamed to text2speech

dtrawins · 2025-10-23T08:51:15Z

demos/common/export_models/export_model.py

+parser_speech_generation.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
+parser_speech_generation.add_argument('--vocoder', type=str, help='The vocoder model to use for speech synthesis. For example microsoft/speecht5_hifigan', dest='vocoder')
+
+parser_transcription_generation = subparsers.add_parser('transcription', help='export model for speech transcription endpoint')


this should be renamed to speech2text

dtrawins · 2025-10-23T08:55:37Z

demos/audio/openai_speech.py

+speech_file_path = Path(__file__).parent / "speech.wav"
+client = OpenAI(base_url=url, api_key="not_used")
+
+# with client.audio.speech.with_streaming_response.create(


Two scripts with args would be better here: one for text2speech and one for speech2text.

dtrawins · 2025-10-23T13:37:45Z

src/audio/speech_to_text/stt_calculator.cc

+    float input_rate,
+    float target_rate,
+    size_t* output_length) {
+    if (input_rate == target_rate) {


If resampling is not needed, why is this copy included?
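
One possible shape for an answer to this question, as a minimal sketch and not the PR's actual code: when the rates match, return a non-owning view of the caller's buffer and allocate only when resampling really happens. The names `Samples` and `resampleIfNeeded` are hypothetical, and the linear interpolation is only a stand-in for whatever resampler the PR uses.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical holder: borrows the caller's samples when no resampling is
// needed and owns a new buffer only when resampling actually happened.
struct Samples {
    const float* borrowed = nullptr;
    std::size_t borrowedSize = 0;
    std::vector<float> owned;

    bool ownsData() const { return borrowed == nullptr; }
    const float* data() const { return ownsData() ? owned.data() : borrowed; }
    std::size_t size() const { return ownsData() ? owned.size() : borrowedSize; }
};

Samples resampleIfNeeded(const float* input, std::size_t inputLength,
                         float inputRate, float targetRate) {
    assert(inputRate > 0.0f && targetRate > 0.0f);
    Samples result;
    if (inputRate == targetRate) {
        // Fast path: hand back a view of the caller's buffer, no allocation or copy.
        result.borrowed = input;
        result.borrowedSize = inputLength;
        return result;
    }
    // Stand-in resampler (linear interpolation), not the algorithm used in the PR.
    const double ratio = static_cast<double>(inputRate) / targetRate;
    const std::size_t outputLength = static_cast<std::size_t>(inputLength / ratio);
    result.owned.resize(outputLength);
    for (std::size_t i = 0; i < outputLength; ++i) {
        const double position = i * ratio;
        const std::size_t index = static_cast<std::size_t>(position);
        const double fraction = position - static_cast<double>(index);
        const float current = input[index];
        const float next = (index + 1 < inputLength) ? input[index + 1] : current;
        result.owned[i] = static_cast<float>(current + (next - current) * fraction);
    }
    return result;
}
```

The trade-off is lifetime: on the fast path the returned view is valid only as long as the caller's input buffer is alive, so the decoded audio must outlive the `Samples` object.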

dtrawins · 2025-10-23T13:40:45Z

src/audio/speech_to_text/stt_calculator.cc

+        return NULL;
+    }
+
+    for (size_t i = 0; i < *output_length; i++) {


What is the origin of that code?

dtrawins · 2025-10-23T13:42:37Z

src/audio/speech_to_text/stt_calculator.cc

+    }
+
+    size_t output_length;
+    auto buffer = resample_audio(reinterpret_cast<float*>(pcmf32.data()), pcmf32.size(), mp3.sampleRate, 16000, &output_length);


Add debug messages so we can determine how long it takes to read the file, convert it to a tensor, and resample.

atobiszei · 2025-10-24T10:07:00Z

src/port/dr_audio.hpp

+
+#pragma warning(push)
+#pragma warning(disable : 4245 4220)
+#include "dr_wav.h"  // NOLINT


If there are places where we include just one and not the other, split this into separate headers. Otherwise it can be left as is.

atobiszei · 2025-10-24T10:09:37Z

src/audio/speech_to_text/stt_calculator.cc

    }

    size_t output_length;
    startTime = std::chrono::high_resolution_clock::now();


You could just use src/timer.hpp
like here:
https://github.com/openvinotoolkit/model_server/blob/main/src/modelinstance.cpp#L1284
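
For illustration only, here is a self-contained stand-in for an indexed stage timer. It is not a copy of `src/timer.hpp`; the class name `StageTimer` and the `Stage` enum are hypothetical, and only the call shape (`start`/`stop` by index, `elapsed<Duration>` by index) mirrors the `timer.stop(RESAMPLING)` and `timer.elapsed<std::chrono::microseconds>(RESAMPLING)` calls quoted later in this review.

```cpp
#include <array>
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical stage identifiers for the audio path discussed in this review.
enum Stage : std::size_t { READ_FILE, TENSOR_CONVERSION, RESAMPLING, STAGE_COUNT };

// Stand-in for an indexed timer: one start/stop timestamp pair per stage.
template <std::size_t N>
class StageTimer {
    using Clock = std::chrono::steady_clock;
    std::array<Clock::time_point, N> started{};
    std::array<Clock::time_point, N> stopped{};

public:
    void start(std::size_t stage) {
        assert(stage < N);
        started[stage] = Clock::now();
    }
    void stop(std::size_t stage) {
        assert(stage < N);
        stopped[stage] = Clock::now();
    }
    // Elapsed time of one stage, expressed in whole units of Duration.
    template <typename Duration>
    double elapsed(std::size_t stage) const {
        assert(stage < N);
        const auto delta = stopped[stage] - started[stage];
        return static_cast<double>(std::chrono::duration_cast<Duration>(delta).count());
    }
};
```

A caller would wrap each step in `start(RESAMPLING)` / `stop(RESAMPLING)` and log `elapsed<std::chrono::microseconds>(RESAMPLING) / 1000` as milliseconds, which replaces the hand-written `std::chrono::high_resolution_clock::now()` bookkeeping in the hunk above.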

atobiszei · 2025-10-24T10:11:19Z

src/audio/speech_to_text/stt_calculator.cc

 #include "absl/strings/escaping.h"
 #include "absl/strings/str_cat.h"
 #pragma warning(pop)
+#define DR_WAV_IMPLEMENTATION


Is it intended to be always defined? If yes, we could move this to src/port/dr_audio.
Or it could be defined as local_defines in src/port/BUILD for the dr lib.

atobiszei · 2025-10-27T09:45:12Z

src/audio/audio_utils.hpp

+#include "openvino/genai/whisper_pipeline.hpp"
+#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
+
+#include <string>


Suggested change

-#include "openvino/genai/whisper_pipeline.hpp"
-#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
-#include <string>
+#include <string>
+#include "openvino/genai/whisper_pipeline.hpp"
+#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"

atobiszei · 2025-10-27T09:49:02Z

src/audio/audio_utils.cpp

+#define DR_WAV_IMPLEMENTATION
+#define DR_MP3_IMPLEMENTATION


Nitpick:
It would make sense to have a dr_audio.cpp with just those two defines instead of having them here. Then dr_lib is fully self-contained, which ensures that if it is included in the build, it is included only once.

atobiszei · 2025-10-27T09:51:50Z

src/audio/audio_utils.cpp

+    return output;
+}
+#pragma warning(pop)
+void prepareAudioOutput(void** ppData, size_t& pDataSize, uint16_t bitsPerSample, size_t speechSize, ov::Tensor& cpuTensor) {


Could we just pass a float* here and not depend on ov?
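
A rough sketch of what an ov-free interface could look like, under stated assumptions: the helper name `floatToPcm16` is hypothetical, it covers only a 16-bit sample conversion, and it does not reproduce what `prepareAudioOutput` does with the WAV writer. The point is only that the audio utility can take a raw pointer and a sample count, leaving the tensor type at the call site.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts normalized float samples to 16-bit PCM.
// It takes a plain pointer and length, so this translation unit needs no
// OpenVINO headers; the caller would pass something like the tensor's
// float data pointer and element count (an assumption about the call site).
std::vector<std::int16_t> floatToPcm16(const float* samples, std::size_t sampleCount) {
    assert(samples != nullptr || sampleCount == 0);
    std::vector<std::int16_t> pcm(sampleCount);
    for (std::size_t i = 0; i < sampleCount; ++i) {
        // Clamp to [-1, 1] so out-of-range samples saturate instead of wrapping.
        const float clamped = std::max(-1.0f, std::min(1.0f, samples[i]));
        pcm[i] = static_cast<std::int16_t>(std::lround(clamped * 32767.0f));
    }
    return pcm;
}
```

With a signature like this, the `//third_party:genai` dependency would be needed only where the tensor is produced, not in the audio utilities themselves.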

atobiszei · 2025-10-27T09:51:54Z

src/audio/audio_utils.cpp

+    timer.stop(RESAMPLING);
+    auto resamplingTime = (timer.elapsed<std::chrono::microseconds>(RESAMPLING)) / 1000;
+    SPDLOG_LOGGER_DEBUG(stt_calculator_logger, "Resampling time: {} ms", resamplingTime);
+    std::vector<float> output(buffer, buffer + outputLength);


Is ov::genai::rawSpeechInput the same as a vector? Is a copy happening here?

atobiszei · 2025-10-27T09:52:55Z

src/audio/BUILD

+        deps = [
+        "//src:libovmslogging",
+        "//src/port:dr_audio",
+        "//third_party:genai",


I am not sure if we really need to depend here on ov/genai.

michalkulakowski changed the title ~~Mkulakow/audio~~ Transcript&speech endpoints Oct 22, 2025

michalkulakowski force-pushed the mkulakow/audio branch from 605d6d5 to 284f4ce October 22, 2025 11:39

michalkulakowski requested review from atobiszei and dtrawins October 22, 2025 11:46

atobiszei reviewed Oct 22, 2025

michalkulakowski requested a review from atobiszei October 22, 2025 14:32

dtrawins requested a review from Copilot October 22, 2025 15:01

Copilot AI reviewed Oct 22, 2025

atobiszei reviewed Oct 23, 2025

dtrawins reviewed Oct 23, 2025

michalkulakowski requested review from atobiszei and dtrawins October 24, 2025 09:57

atobiszei reviewed Oct 24, 2025

michalkulakowski force-pushed the mkulakow/audio branch from 4b06972 to 3e2863b October 24, 2025 14:22

michalkulakowski requested a review from atobiszei October 24, 2025 14:22

dtrawins approved these changes Oct 24, 2025

atobiszei reviewed Oct 27, 2025

michalkulakowski added 6 commits October 28, 2025 09:33

Speech pipeline POC

ad4903e

fix

6eb0d4b

update

4a1d621

fix

ec744ef

fix

b188a34

fix

0727122

michalkulakowski and others added 22 commits October 28, 2025 09:35

cleanup

4cea415

style

359b9c9

fix

3e65037

Add calculator

288466c

fix

dc21e31

style

e4b5e9c

fix

4ed9ce4

fix

ec57fee

fix

9b9af0b

fix

4bf65f4

review

cc37fd2

review

0b50bf5

style

4519280

review

28ed221

style

a2f9e00

style

6bbbbbf

fix

60809ac

style

1e093c3

fix

d55f4b3

Update audio_utils.cpp

9c0e649

disable warning

Update audio_utils.cpp

74a9817

fix

c7ec173

michalkulakowski force-pushed the mkulakow/audio branch from ef18657 to c7ec173 October 28, 2025 08:35

michalkulakowski added 2 commits October 28, 2025 09:50

fix

fa28c80

style

0919ee2

michalkulakowski requested a review from atobiszei October 28, 2025 10:26

michalkulakowski added 2 commits October 28, 2025 12:00

fix

184da98

fix

054fef3

atobiszei approved these changes Oct 28, 2025

michalkulakowski merged commit 0d1cea9 into main Oct 28, 2025
1 check passed
