Speech recognition basics

Abstract concept

The process of speech recognition can be divided into several general steps:

Audio acquisition.
Audio processing.
Base units search.
Language model search.
Post-processing.

This wiki covers the first four, because the last step is usually carried out by users rather than by the speech recognition system.

Acquisition

Speech recognition distinguishes two modes of audio input: online and offline. In online mode the total amount of data is not known in advance, so the speech recognizer accepts the audio in small portions and offers its best recognition result at each step. In offline mode the whole recording is already available, so recognition can be somewhat more accurate.
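The difference between the two input modes can be sketched in a few lines of Python. This is only an illustration: `ToyRecognizer`, `iter_chunks`, `recognize_online` and `recognize_offline` are hypothetical names, and the "recognizer" merely counts the bytes it has seen instead of decoding speech.

```python
from typing import Iterable, Iterator, List


def iter_chunks(data: bytes, chunk_size: int) -> Iterator[bytes]:
    """Yield consecutive portions of `data`, the way a live stream delivers audio."""
    for start in range(0, len(data), chunk_size):
        yield data[start:start + chunk_size]


class ToyRecognizer:
    """Hypothetical stand-in for a recognizer; it only counts the bytes it has seen."""

    def __init__(self) -> None:
        self.seen = 0

    def accept(self, chunk: bytes) -> str:
        # Consume one portion of audio and report the best result so far.
        self.seen += len(chunk)
        return f"partial result after {self.seen} bytes"


def recognize_online(stream: Iterable[bytes]) -> List[str]:
    # Online mode: the total size is unknown, so every portion yields a new result.
    recognizer = ToyRecognizer()
    return [recognizer.accept(chunk) for chunk in stream]


def recognize_offline(data: bytes) -> str:
    # Offline mode: the whole recording is available in a single call.
    recognizer = ToyRecognizer()
    recognizer.accept(data)
    return f"final result over {recognizer.seen} bytes"
```

For example, `recognize_online(iter_chunks(audio, 4))` returns one partial result per 4-byte portion, while `recognize_offline(audio)` returns a single result for the whole buffer.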

TODO: CMVN computation

Processing

Audio data is represented as a sequence of real numbers, each denoting the signal amplitude (volume) at a given moment in time. The numbers can be stored with different precision (sample format); the most common format is 16-bit signed integer, little endian. The number of measurements taken per second also matters. This is called the sampling frequency or sampling rate. 16 kHz is the standard for everything except telephone networks, where frequencies above 4000 Hz are filtered out, so a rate of 8 kHz is used instead.

TODO: finish
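As a minimal sketch of the sample format and sampling rate, the snippet below decodes raw 16-bit little-endian samples into real numbers and derives the duration of a buffer. The function names are hypothetical, and mono audio with signed samples is assumed.

```python
import struct
from typing import List


def decode_pcm16le(raw: bytes) -> List[float]:
    """Convert raw 16-bit signed little-endian samples to floats in [-1.0, 1.0)."""
    if len(raw) % 2:
        raise ValueError("16-bit audio must contain an even number of bytes")
    count = len(raw) // 2
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    samples = struct.unpack(f"<{count}h", raw)
    return [sample / 32768.0 for sample in samples]


def duration_seconds(raw: bytes, sample_rate: int) -> float:
    """Duration of a mono 16-bit buffer: number of samples divided by the rate."""
    return (len(raw) // 2) / sample_rate
```

With these assumptions, 32000 bytes recorded at 16000 samples per second hold exactly one second of audio, and the same number of bytes recorded at 8000 samples per second hold two.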

Base units

Language model

The language model is a higher level of recognition. Consider this simple example to get an idea of why it is required. Try to break the string ""... A recognizer faces the same difficulty. Although it seems that people pause between words when they speak, in reality they do not: the end of one word becomes the beginning of the next. But people do not speak words randomly.
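The ambiguity of unbroken input can be illustrated with a small sketch that lists every way to split a string into known words. The function name `segmentations` and the vocabulary in the usage note are made up for this example; a real recognizer works on sounds, not letters, and uses a language model to rank such alternatives rather than enumerating them all.

```python
from typing import List, Set


def segmentations(text: str, vocabulary: Set[str]) -> List[List[str]]:
    """Return every split of `text` into words taken from `vocabulary`."""
    if not text:
        # An empty remainder has exactly one segmentation: no words at all.
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in vocabulary:
            # Keep this word and segment whatever follows it.
            for rest in segmentations(text[end:], vocabulary):
                results.append([word] + rest)
    return results
```

With the hypothetical vocabulary `{"no", "now", "where", "here", "nowhere"}`, the string `"nowhere"` splits as "no where", "now here" and "nowhere". Knowing which word sequences are likely is what allows one of them to be chosen.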

Fixed grammars

N-gram models

Post-processing

After a recognition hypothesis has been obtained, it can be post-processed to extract additional information or to enrich it with new information. Post-processing tasks include, but are not limited to:

Punctuation.
Number parsing: "one two three" -> "123"
Token parsing: ""
Sentiment analysis.
Emotion recognition.
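As an illustration of number parsing, the sketch below collapses runs of spoken English digit words into digit strings. It is a simplified assumption of how such a step could look: the name `parse_digit_words` is hypothetical, and compound numbers such as "twenty one" are not handled.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def parse_digit_words(text: str) -> str:
    """Replace each run of consecutive digit words with the digits they spell."""
    output = []
    run = []  # digits collected from the current run of digit words
    for token in text.split():
        digit = DIGIT_WORDS.get(token.lower())
        if digit is not None:
            run.append(digit)
            continue
        if run:
            # A non-digit word ends the run, so emit the collected digits.
            output.append("".join(run))
            run = []
        output.append(token)
    if run:
        output.append("".join(run))
    return " ".join(output)
```

Here `parse_digit_words("one two three")` yields `"123"`, and words that are not digits pass through unchanged.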
