Output transcription files such as .vtt and .srt contain the correct text, but the timestamps for when each line is spoken are wrong. A 3-second sentence is marked as lasting about 3 minutes. The error doesn't appear to be an exact seconds-to-minutes conversion, though.
What we're outputting
00:00.000 --> 03:36.000
- Things you prioritize, like the most useful
03:36.000 --> 07:44.000
and interesting conversations, it goes parties,
07:44.000 --> 10:52.000
then workshops, then conference session.
10:52.000 --> 15:28.000
- I'm sorry, but you asked to be roasted, I will roast.
15:28.000 --> 17:56.000
- I'm talking about the level of intelligence
17:56.000 --> 20:52.000
of the cat or dog, okay?
What yt-whisper outputs, and what we should be outputting instead
00:00.000 --> 00:06.500
Things you prioritize, like the most useful and interesting conversations, it goes parties, then workshops, then conference sessions.
00:06.500 --> 00:09.200
I'm sorry, but you asked to be roasted, I will roast.
00:09.200 --> 00:12.500
I'm talking about the level of intelligence of a cat or a dog, okay?
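To put a number on "not exact seconds to minutes", here is a minimal Python sketch that compares the cue end times from the two listings above at the three sentence boundaries. The helper name `to_seconds` and the pairing of cues are my own choices for illustration, not part of either tool.

```python
def to_seconds(stamp: str) -> float:
    """Convert an 'MM:SS.mmm' or 'HH:MM:SS.mmm' timestamp to seconds."""
    seconds = 0.0
    for part in stamp.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


# End times copied from the listings above, paired where a sentence ends.
observed_ends = ["10:52.000", "15:28.000", "20:52.000"]  # what we're outputting
expected_ends = ["00:06.500", "00:09.200", "00:12.500"]  # yt-whisper output

# Ratio of our end time to the expected end time for each sentence.
ratios = [
    to_seconds(ours) / to_seconds(theirs)
    for ours, theirs in zip(observed_ends, expected_ends)
]

for ours, theirs, ratio in zip(observed_ends, expected_ends, ratios):
    print(f"{ours} vs {theirs}: ratio {ratio:.1f}")
```

All three ratios come out close to 100 rather than 60, which would be consistent with times in hundredths of a second being formatted as if they were whole seconds. That is only a guess from three data points, not a confirmed cause.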