Speech recognition demo FAQ

1. What is speech recognition?

Speech recognition aims to convert speech to text using a computer program. See Wikipedia for more details and references.

2. How well does it work?

This is hard to answer without further information. Acoustic conditions, pronunciation and the style of language used all affect recognition quality. Using standard language and pronouncing clearly will give the best results. Currently we are getting letter error rates of around four percent in a multi-speaker setting with clean speech and standard language. This demo uses a 'baseline' system: single-pass decoding with no acoustic model or language model adaptation. See the next questions for more details, and of course you can now try it out for yourself :)
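As a rough illustration of what a letter error rate measures, the sketch below computes the character-level edit distance between a reference transcript and a recognizer output and divides it by the reference length. This is the common textbook definition. The exact scoring conventions used for the figure above (for example how whitespace and case are handled) are not stated here, so treat the function names and details as assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance: minimum number of character insertions,
    deletions and substitutions needed to turn ref into hyp."""
    prev = list(range(len(hyp) + 1))  # distances for the empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # turning i ref characters into an empty hyp costs i deletions
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def letter_error_rate(reference, hypothesis):
    """Edit distance normalised by the reference length (assumed convention)."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

With this definition, one wrong letter in a 25-letter reference gives an error rate of four percent.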

3. What kind of acoustic model is used?

The acoustic model is trained on a speech database containing around twenty hours of speech with relatively clean acoustics from several hundred male and female speakers. The model should thus be a good 'all-around' model that suits most speakers. Because the training data has clean acoustics and we are not currently using any noise reduction, performance will be weaker for noisy speech.

For the technically oriented: the model is a context-dependent, cross-word triphone Hidden Markov Model with decision-tree state clustering. The state density functions are Gaussian mixture models with diagonal covariances, coupled with a maximum likelihood linear transformation, and the model is trained using the maximum likelihood principle. The training is done using software developed in the speech recognition group.
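To make the "diagonal covariance Gaussian mixture" part more concrete, here is a minimal sketch of how the log-likelihood of one feature vector under such a mixture can be evaluated. It is not the group's software. The function names and the toy parameters in the usage note are made up for illustration, and the linear feature transformation and the HMM itself are left out.

```python
import math


def diag_gaussian_logpdf(x, mean, var):
    """Log density of a Gaussian with diagonal covariance.
    With a diagonal covariance the dimensions are independent,
    so the log density is a sum of one-dimensional terms."""
    return sum(-0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def gmm_loglik(x, weights, means, variances):
    """Log-likelihood of x under a mixture of diagonal Gaussians,
    using the log-sum-exp trick for numerical stability."""
    logs = [math.log(w) + diag_gaussian_logpdf(x, m, v)
            for w, m, v in zip(weights, means, variances)]
    top = max(logs)
    return top + math.log(sum(math.exp(l - top) for l in logs))
```

For example, `gmm_loglik([0.2, -0.1], [0.6, 0.4], [[0.0, 0.0], [1.0, 1.0]], [[1.0, 1.0], [0.5, 2.0]])` evaluates a two-component mixture on a two-dimensional vector. In an HMM recognizer, each clustered state would have its own mixture of this kind.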

4. What kind of language model is used?

The language model is trained on around 100 million sentences from a text database collected mainly from newspapers. Newspaper text is quite formal, so the resulting language model is best suited to standard-language sentences. Because Finnish is a highly inflected language with many different word forms, the words are split into statistical morphemes using the Morfessor toolkit developed in the laboratory, which gives good coverage of Finnish.

Technically, the language model is a morph-based N-gram model with N-grams of varying length, trained using the VariKN toolkit developed in the laboratory.
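The idea of a morph-based N-gram model can be sketched as follows: the text is turned into a sequence of morphs with explicit word boundaries, and N-grams are then counted over morphs instead of whole words. The segmentation below is written by hand for illustration and is not Morfessor output, the `<w>` boundary token is an assumed convention, and the code only counts N-grams. It does not reproduce the smoothing or model-size selection that VariKN performs.

```python
from collections import Counter


def ngram_counts(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token sequence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


# Hand-made segmentation of "talossa autossa" ("in the house, in the car");
# "<w>" marks word boundaries (hypothetical convention).
morphs = ["<w>", "talo", "ssa", "<w>", "auto", "ssa", "<w>"]
counts = ngram_counts(morphs, max_order=3)
```

Here the ending `ssa` is counted twice even though the two full word forms differ, which is how splitting into morphs lets the model share statistics across the many inflected forms of Finnish words.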