Synthetic phoneme prototypes and dynamic voice source adaptation in speech recognition
1993 (English)
In: STL-QPSR, Vol. 34, no 4, 97-140 p.
Article in journal (Refereed) Published
Abstract [en]

A speech production oriented technique for generating reference spectral data for speech recognition is presented as an alternative to training to natural speech. The potentials of this approach are discussed. In the presented recognition system, the vocabulary and grammar are described as a finite-state network. Phoneme templates are specified in terms of control parameters to a cascade formant synthesiser. Reduction and coarticulation modules modify initial phoneme target values and insert interpolated transition states at phoneme boundaries before computing spectral reference data. The recognition process uses a time-synchronous Viterbi search technique. The effect of voice source variation upon speech recognition performance is discussed. Experiments using synthetic data show that normal voice quality variation is large enough to cause vowel confusions in a recogniser using Bark cepstrum representation. An adaptation technique is proposed that models the deviation of the voice source from the reference quality in terms of amplitude and spectral balance. This technique reduces the recogniser sensitivity to voice source deviation. To account for medium-term time correlation, an algorithm for dynamic adaptation of mismatch between reference and test data is described. The algorithm has been implemented to adapt to voice source fluctuations in time. In an isolated-word recognition task, the average recognition for ten male speakers was 88% without dynamic source adaptation using a 26-word vocabulary. Adding the voice source adaptation function raised the performance to 96%. On a vocabulary of three connected digits, the digit recognition rate was maximally 96.1% as an average for six male speakers. Setting the reference voice source model to contain low high-frequency energy level, as in a breathy voice, resulted in a recognition rate of 73% without the source adaptation module. Including source adaptation raised the performance to 91%, showing the power of this component to compensate for this type of mismatch between reference and test data.
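
The abstract describes the source adaptation as modelling the voice source deviation "in terms of amplitude and spectral balance". One way to picture that idea, and not necessarily the method used in the article, is to fit an overall level offset and a spectral slope to the difference between a test spectrum and a reference spectrum, then remove the fitted line from the test spectrum. The sketch below assumes log-magnitude values (in dB) on equally spaced frequency bands rather than the Bark cepstrum mentioned in the abstract; the function names `fit_gain_and_tilt` and `compensate` and the example values are hypothetical.

```python
def fit_gain_and_tilt(reference_db, test_db):
    """Least-squares fit of (test - reference) to gain + tilt * (k - centre).

    gain: mean level difference in dB (the "amplitude" term).
    tilt: slope in dB per band (the "spectral balance" term).
    Both are illustrative assumptions, not the article's parametrisation.
    """
    n = len(reference_db)
    if n != len(test_db) or n < 2:
        raise ValueError("spectra must have equal length of at least 2 bands")
    diff = [t - r for r, t in zip(reference_db, test_db)]
    centre = (n - 1) / 2
    gain = sum(diff) / n
    num = sum((k - centre) * (d - gain) for k, d in enumerate(diff))
    den = sum((k - centre) ** 2 for k in range(n))
    return gain, num / den


def compensate(test_db, gain, tilt):
    """Remove the fitted level offset and slope from the test spectrum."""
    centre = (len(test_db) - 1) / 2
    return [t - gain - tilt * (k - centre) for k, t in enumerate(test_db)]


# Hypothetical five-band example: the test voice is 3 dB louder overall
# and loses 2 dB more per band towards high frequencies than the reference.
reference = [60.0, 55.0, 50.0, 45.0, 40.0]
test = [67.0, 60.0, 53.0, 46.0, 39.0]

gain, tilt = fit_gain_and_tilt(reference, test)
adapted = compensate(test, gain, tilt)
```

In this constructed case the mismatch is exactly a level offset plus a slope, so the compensated spectrum coincides with the reference. Real voice source differences would leave a residual, and the dynamic adaptation mentioned in the abstract, which tracks the mismatch over time, is not modelled here.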

Place, publisher, year, edition, pages
1993. Vol. 34, no 4, 97-140 p.
National Category
Computer and Information Science
URN: urn:nbn:se:kth:diva-91459
OAI: diva2:510356
NR 20140805
Available from: 2012-03-15 Created: 2012-03-15
Bibliographically approved

By author/editor
Blomberg, Mats