-
Notifications
You must be signed in to change notification settings - Fork 232
Transcript&speech endpoints #3719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
605d6d5 to
284f4ce
Compare
src/test/mediapipeflow_test.cpp
Outdated
| "SerializationCalculator", | ||
| "SetLandmarkVisibilityCalculator", | ||
| "SidePacketToStreamCalculator", | ||
| "SpeechCalculator", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should have 2 now?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.
| #include "../../http_payload.hpp" | ||
| #include "../../logging.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #include "../../http_payload.hpp" | |
| #include "../../logging.hpp" | |
| #include "src/http_payload.hpp" | |
| #include "src/logging.hpp" |
| #pragma warning(push) | ||
| #pragma warning(disable : 4245 4220) | ||
| #include "dr_wav.h" // NOLINT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use new way of inclusion for libraries that need special compilation flags to not polute code with pragmas with enigmatic numbers hidden in several places.
Check:
https://github.com/openvinotoolkit/model_server/blob/main/src/port/rapidjson_document.hpp
| parser_image_generation.add_argument('--default_num_inference_steps', type=int, default=0, help='Default number of inference steps when not specified by client', dest='default_num_inference_steps') | ||
| parser_image_generation.add_argument('--max_num_inference_steps', type=int, default=0, help='Max allowed number of inference steps client is allowed to request for a given prompt', dest='max_num_inference_steps') | ||
|
|
||
| parser_speech_generation = subparsers.add_parser('speech', help='export model for speech synthesis endpoint') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be renamed to text2speech
| parser_speech_generation.add_argument('--num_streams', default=0, type=int, help='The number of parallel execution streams to use for the models in the pipeline.', dest='num_streams') | ||
| parser_speech_generation.add_argument('--vocoder', type=str, help='The vocoder model to use for speech synthesis. For example microsoft/speecht5_hifigan', dest='vocoder') | ||
|
|
||
| parser_transcription_generation = subparsers.add_parser('transcription', help='export model for speech transcription endpoint') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this should be renamed to speech2text
demos/audio/openai_speech.py
Outdated
| speech_file_path = Path(__file__).parent / "speech.wav" | ||
| client = OpenAI(base_url=url, api_key="not_used") | ||
|
|
||
| # with client.audio.speech.with_streaming_response.create( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2 scripts would with args would be better here - for text2speeech and speech2text
| float input_rate, | ||
| float target_rate, | ||
| size_t* output_length) { | ||
| if (input_rate == target_rate) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if the resampling is not needed, why this copy is included?
| return NULL; | ||
| } | ||
|
|
||
| for (size_t i = 0; i < *output_length; i++) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what is the source of origin of that code?
| } | ||
|
|
||
| size_t output_length; | ||
| auto buffer = resample_audio(reinterpret_cast<float*>(pcmf32.data()), pcmf32.size(), mp3.sampleRate, 16000, &output_length); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add debug messages so we could determine how long it takes to read file, convert it to tensor and resample
|
|
||
| #pragma warning(push) | ||
| #pragma warning(disable : 4245 4220) | ||
| #include "dr_wav.h" // NOLINT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there would be places where we would include just one not the other then split this into separate headers. Otherwise it could be left as is.
| } | ||
|
|
||
| size_t output_length; | ||
| startTime = std::chrono::high_resolution_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You could just use src/timer.hpp
like here:
https://github.com/openvinotoolkit/model_server/blob/main/src/modelinstance.cpp#L1284
| #include "absl/strings/escaping.h" | ||
| #include "absl/strings/str_cat.h" | ||
| #pragma warning(pop) | ||
| #define DR_WAV_IMPLEMENTATION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it intended to be always defined? If yest we could move this to src/port/dr_audio.
Or it could be defined as local_defines in src/port/BUILD for dr lib
4b06972 to
3e2863b
Compare
src/audio/audio_utils.hpp
Outdated
| #include "openvino/genai/whisper_pipeline.hpp" | ||
| #include "openvino/genai/speech_generation/text2speech_pipeline.hpp" | ||
|
|
||
| #include <string> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
| #include "openvino/genai/whisper_pipeline.hpp" | |
| #include "openvino/genai/speech_generation/text2speech_pipeline.hpp" | |
| #include <string> | |
| #include <string> | |
| #include "openvino/genai/whisper_pipeline.hpp" | |
| #include "openvino/genai/speech_generation/text2speech_pipeline.hpp" |
| #define DR_WAV_IMPLEMENTATION | ||
| #define DR_MP3_IMPLEMENTATION |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpick:
it would make sense to have dr_audio.cpp with just those 2 defines instead of having it here. Then dr_lib is fully self-contained - it ensures that if it is included in build it is only once.
src/audio/audio_utils.cpp
Outdated
| return output; | ||
| } | ||
| #pragma warning(pop) | ||
| void prepareAudioOutput(void** ppData, size_t& pDataSize, uint16_t bitsPerSample, size_t speechSize, ov::Tensor& cpuTensor) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we just pass here float* and not depend on ov?
src/audio/audio_utils.cpp
Outdated
| timer.stop(RESAMPLING); | ||
| auto resamplingTime = (timer.elapsed<std::chrono::microseconds>(RESAMPLING)) / 1000; | ||
| SPDLOG_LOGGER_DEBUG(stt_calculator_logger, "Resampling time: {} ms", resamplingTime); | ||
| std::vector<float> output(buffer, buffer + outputLength); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is ov::genai::rawSpeechInput the same as vector? Is copy happening here?
src/audio/BUILD
Outdated
| deps = [ | ||
| "//src:libovmslogging", | ||
| "//src/port:dr_audio", | ||
| "//third_party:genai", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not sure if we really need to depend here on ov/genai.
disable warning
ef18657 to
c7ec173
Compare
🛠 Summary
CVS-174567
CVS-174596
POC productization #3683
🧪 Checklist
``