Skip to content

Conversation

@michalkulakowski
Copy link
Collaborator

@michalkulakowski michalkulakowski commented Oct 22, 2025

🛠 Summary

CVS-174567
CVS-174596

POC productization #3683

🧪 Checklist

  • Unit tests added.
  • The documentation updated.
  • Change follows security best practices.
    ``

@michalkulakowski michalkulakowski changed the title Mkulakow/audio Transcript&speech endpoints Oct 22, 2025
"SerializationCalculator",
"SetLandmarkVisibilityCalculator",
"SidePacketToStreamCalculator",
"SpeechCalculator",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should have 2 now?

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Comment on lines 27 to 28
#include "../../http_payload.hpp"
#include "../../logging.hpp"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "../../http_payload.hpp"
#include "../../logging.hpp"
#include "src/http_payload.hpp"
#include "src/logging.hpp"

Comment on lines 39 to 41
#pragma warning(push)
#pragma warning(disable : 4245 4220)
#include "dr_wav.h" // NOLINT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use new way of inclusion for libraries that need special compilation flags to not polute code with pragmas with enigmatic numbers hidden in several places.
Check:
https://github.com/openvinotoolkit/model_server/blob/main/src/port/rapidjson_document.hpp

parser_image_generation.add_argument('--default_num_inference_steps', type=int, default=0, help='Default number of inference steps when not specified by client', dest='default_num_inference_steps')
parser_image_generation.add_argument('--max_num_inference_steps', type=int, default=0, help='Max allowed number of inference steps client is allowed to request for a given prompt', dest='max_num_inference_steps')

parser_speech_generation = subparsers.add_parser('speech', help='export model for speech synthesis endpoint')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be renamed to text2speech

parser_speech_generation.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams')
parser_speech_generation.add_argument('--vocoder', type=str, help='The vocoder model to use for speech synthesis. For example microsoft/speecht5_hifigan', dest='vocoder')

parser_transcription_generation = subparsers.add_parser('transcription', help='export model for speech transcription endpoint')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be renamed to speech2text

speech_file_path = Path(__file__).parent / "speech.wav"
client = OpenAI(base_url=url, api_key="not_used")

# with client.audio.speech.with_streaming_response.create(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 scripts would with args would be better here - for text2speeech and speech2text

float input_rate,
float target_rate,
size_t* output_length) {
if (input_rate == target_rate) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the resampling is not needed, why this copy is included?

return NULL;
}

for (size_t i = 0; i < *output_length; i++) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the source of origin of that code?

}

size_t output_length;
auto buffer = resample_audio(reinterpret_cast<float*>(pcmf32.data()), pcmf32.size(), mp3.sampleRate, 16000, &output_length);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add debug messages so we could determine how long it takes to read file, convert it to tensor and resample


#pragma warning(push)
#pragma warning(disable : 4245 4220)
#include "dr_wav.h" // NOLINT
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there would be places where we would include just one not the other then split this into separate headers. Otherwise it could be left as is.

}

size_t output_length;
startTime = std::chrono::high_resolution_clock::now();
Copy link
Collaborator

@atobiszei atobiszei Oct 24, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#include "absl/strings/escaping.h"
#include "absl/strings/str_cat.h"
#pragma warning(pop)
#define DR_WAV_IMPLEMENTATION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it intended to be always defined? If yest we could move this to src/port/dr_audio.
Or it could be defined as local_defines in src/port/BUILD for dr lib

Comment on lines 19 to 22
#include "openvino/genai/whisper_pipeline.hpp"
#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"

#include <string>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
#include "openvino/genai/whisper_pipeline.hpp"
#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"
#include <string>
#include <string>
#include "openvino/genai/whisper_pipeline.hpp"
#include "openvino/genai/speech_generation/text2speech_pipeline.hpp"

Comment on lines +17 to +18
#define DR_WAV_IMPLEMENTATION
#define DR_MP3_IMPLEMENTATION
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpick:
it would make sense to have dr_audio.cpp with just those 2 defines instead of having it here. Then dr_lib is fully self-contained - it ensures that if it is included in build it is only once.

return output;
}
#pragma warning(pop)
void prepareAudioOutput(void** ppData, size_t& pDataSize, uint16_t bitsPerSample, size_t speechSize, ov::Tensor& cpuTensor) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just pass here float* and not depend on ov?

timer.stop(RESAMPLING);
auto resamplingTime = (timer.elapsed<std::chrono::microseconds>(RESAMPLING)) / 1000;
SPDLOG_LOGGER_DEBUG(stt_calculator_logger, "Resampling time: {} ms", resamplingTime);
std::vector<float> output(buffer, buffer + outputLength);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is ov::genai::rawSpeechInput the same as vector? Is copy happening here?

src/audio/BUILD Outdated
deps = [
"//src:libovmslogging",
"//src/port:dr_audio",
"//third_party:genai",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure if we really need to depend here on ov/genai.

@michalkulakowski michalkulakowski merged commit 0d1cea9 into main Oct 28, 2025
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants