Speech recognition basics
The process of speech recognition can be divided into several general steps:
This wiki covers the first four, because the last step is usually carried out by users rather than by the speech recognition system.
Speech recognition distinguishes two modes of audio input: online and offline. In online mode the amount of data is not known in advance, so the speech recognizer accepts audio in small portions and offers its best recognition result at each step. In offline mode the whole recording is already available, so recognition can be somewhat more accurate.
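The difference between the two modes can be sketched in a few lines of Python. This is only an illustration: `ToyRecognizer` and `iter_chunks` are hypothetical names invented for this example, and the "recognizer" does nothing but count the samples it has received.

```python
def iter_chunks(samples, chunk_size):
    """Yield consecutive slices of `samples`, each at most `chunk_size` long."""
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


class ToyRecognizer:
    """Hypothetical stand-in for a recognizer: it only tracks how much audio it has seen."""

    def __init__(self):
        self.samples_seen = 0

    def accept(self, chunk):
        # A real recognizer would update its hypothesis here.
        self.samples_seen += len(chunk)
        return f"partial result after {self.samples_seen} samples"


audio = list(range(10))  # placeholder "audio": ten dummy samples

# Online: the data arrives in small portions, with a result after each one.
online = ToyRecognizer()
partials = [online.accept(chunk) for chunk in iter_chunks(audio, 4)]

# Offline: the whole recording is handed over at once.
offline = ToyRecognizer()
final = offline.accept(audio)
```

In the online case a result is available after every portion, while in the offline case there is a single result computed with all of the data in view.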
TODO: CMVN computation
Audio data is represented as a sequence of numbers, each denoting the signal amplitude at a given moment in time. The numbers can be stored in different sample formats; the usual one is 16-bit signed integers in little-endian byte order. The number of measurements taken per second also matters. This is called the sampling frequency or sampling rate. 16 kHz is the standard for everything except telephone networks, where frequencies above 4000 Hz are filtered out, so a rate of 8 kHz is used instead. TODO: finish
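As a rough illustration of the sample format and sampling rate, the following Python sketch packs a short test tone into 16-bit little-endian bytes and reads it back using only the standard library. The function names and the 440 Hz tone are assumptions made for this example, not part of any recognizer.

```python
import math
import struct

SAMPLE_RATE = 16000  # samples per second (16 kHz)


def sine_samples(frequency_hz, duration_s, sample_rate=SAMPLE_RATE):
    """Generate a half-amplitude sine tone as a list of 16-bit integers."""
    count = round(duration_s * sample_rate)
    return [
        int(32767 * 0.5 * math.sin(2 * math.pi * frequency_hz * n / sample_rate))
        for n in range(count)
    ]


def encode_pcm16le(samples):
    """Pack integers as 16-bit signed little-endian values ('<' = little endian, 'h' = int16)."""
    return struct.pack(f"<{len(samples)}h", *samples)


def decode_pcm16le(raw):
    """Unpack 16-bit signed little-endian bytes into a list of integers."""
    count = len(raw) // 2  # two bytes per sample
    return list(struct.unpack(f"<{count}h", raw[:count * 2]))


samples = sine_samples(440, 0.01)
raw = encode_pcm16le(samples)
decoded = decode_pcm16le(raw)

duration_s = len(decoded) / SAMPLE_RATE  # number of samples divided by the rate
highest_frequency_hz = 8000 / 2          # an 8 kHz rate can represent content up to 4000 Hz
```

Each sample occupies two bytes, so 10 ms of audio at 16 kHz takes 320 bytes. The last line shows why 8 kHz is enough for telephone audio: a sampling rate can represent frequencies up to half its value.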
A language model is a higher level of recognition. Consider this simple example to see why it is required. Try to break the string ""... The recognizer meets the same difficulty. Although it seems that people pause between words when they speak, in reality they do not: the end of one word runs into the beginning of the next. But people do not speak words at random, and a language model captures which word sequences are likely.
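The segmentation problem can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the input string, the toy vocabulary, and the word probabilities are invented, and the scoring is a deliberately crude unigram product rather than a real language model.

```python
def segmentations(text, vocabulary):
    """Return every way to split `text` into words from `vocabulary`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        if prefix in vocabulary:
            for rest in segmentations(text[end:], vocabulary):
                results.append([prefix] + rest)
    return results


# Invented probabilities, for illustration only.
TOY_WORD_PROBABILITY = {
    "no": 0.02,
    "now": 0.01,
    "here": 0.01,
    "where": 0.005,
    "nowhere": 0.001,
}


def score(words):
    """Multiply the toy probabilities of the words in one candidate."""
    probability = 1.0
    for word in words:
        probability *= TOY_WORD_PROBABILITY[word]
    return probability


candidates = segmentations("nowhere", set(TOY_WORD_PROBABILITY))
best = max(candidates, key=score)
```

With no spaces to rely on, the same string has three valid readings ("no where", "now here", "nowhere"), and only some notion of which word sequences are likely can rank them. A real language model conditions on the surrounding words instead of scoring each word in isolation.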
After a recognition hypothesis has been obtained, it can be post-processed to extract additional information or to enrich it with new information. Post-processing steps include, but are not limited to:
- Punctuation.
 - Number parsing: "one two three" -> "123"
 - Token parsing: ""
 - Sentiment analysis.
 - Emotion recognition.
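As an illustration of the number parsing item, here is a minimal Python sketch that collapses runs of spoken digits into a single number string. It assumes the hypothesis is plain space-separated text and handles only the words "zero" to "nine"; `parse_digit_sequences` is a name invented for this example.

```python
DIGIT_WORDS = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}


def parse_digit_sequences(text):
    """Replace each run of consecutive digit words with the joined digits."""
    output = []
    digits = []  # digits collected from the current run
    for token in text.split():
        digit = DIGIT_WORDS.get(token.lower())
        if digit is not None:
            digits.append(digit)
            continue
        if digits:
            # The run ended: emit it as one number before the ordinary word.
            output.append("".join(digits))
            digits = []
        output.append(token)
    if digits:
        output.append("".join(digits))
    return " ".join(output)


parsed = parse_digit_sequences("one two three")
```

Words that are not digits pass through unchanged, so `parse_digit_sequences("call one two three now")` yields `"call 123 now"`. Forms such as "twenty three" or "one hundred" would need a fuller grammar.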