Training production parameters of context-dependent phones for speech recognition
KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH.
1994 (English) In: STL-QPSR, Vol. 35, no 1, 59-90 p. Article in journal (Refereed) Published
Abstract [en]

A form of representing acoustic information in a trained phone library at both the production-parametric and the spectral level is described. The phones are trained in the parametric domain and are transformed to the spectral domain by means of a synthesis procedure. With this twofold description, potentially more powerful procedures for speaker adaptation and generation of unseen triphones can be explored, while the more robust spectral representation can be used for recognition. Context-dependent phones are represented by control parameters to a cascade formant synthesiser. During training, the parameters are extracted using an analysis-by-synthesis technique and the trajectories are approximated by piece-wise linear segments. For recognition, the parameter tracks are transformed to a sequence of spectral subphone states, similar to a hidden Markov model. Recognition is performed by Viterbi search in a finite-state network. Recognition experiments have been performed on Swedish connected-digit strings pronounced by seven male speakers. In one experiment, unseen triphones were created by concatenating monophones and diphones and interpolating the parameter trajectories between line endpoints. In another, speaker adaptation was based on generalisation of differences of observed triphones from the phone library. With optimum weighting of duration information, the results for cross-speaker recognition, speaker adaptation, and multi-speaker training were 98.5%, 98.9% and 99.1% correct digit recognition, respectively. Preliminary experiments with created unseen triphones show no improvement. In informal listening tests of digit strings resynthesised by concatenating trained triphones, the speech was judged intelligible but far from natural.
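
As an illustration only, the piece-wise linear trajectory representation mentioned in the abstract can be sketched as follows. This is not the author's implementation: the breakpoint format, the function names `interpolate` and `subphone_states`, the equal-length state intervals and the example formant values are all assumptions made for the sketch, and the synthesis step that maps parameters to spectra is omitted entirely.

```python
from bisect import bisect_right


def interpolate(breakpoints, t):
    """Evaluate a piece-wise linear track at time t.

    breakpoints: list of (time_ms, value) pairs with strictly increasing
    times (a hypothetical format chosen for this sketch).
    """
    times = [p[0] for p in breakpoints]
    if t <= times[0]:
        return breakpoints[0][1]
    if t >= times[-1]:
        return breakpoints[-1][1]
    i = bisect_right(times, t)  # index of the first breakpoint after t
    (t0, v0), (t1, v1) = breakpoints[i - 1], breakpoints[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def subphone_states(breakpoints, n_states):
    """Sample the track at the centre of n_states equal-length intervals.

    Equal-length intervals are an assumption; the abstract does not say
    how the subphone states are placed in time.
    """
    start, end = breakpoints[0][0], breakpoints[-1][0]
    step = (end - start) / n_states
    return [interpolate(breakpoints, start + (k + 0.5) * step)
            for k in range(n_states)]


# Made-up second-formant track in Hz, for illustration only.
f2_track = [(0.0, 1200.0), (40.0, 1800.0), (100.0, 1500.0)]
states = subphone_states(f2_track, 4)
print(states)
```

In a system like the one described, each sampled parameter vector would then be passed through the synthesiser to obtain the spectral state used by the Viterbi search; that mapping is outside the scope of this sketch.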

Place, publisher, year, edition, pages
1994. Vol. 35, no 1, 59-90 p.
National Category
Computer and Information Science
URN: urn:nbn:se:kth:diva-91237
OAI: diva2:508934
NR 20140805
Available from: 2012-03-11 Created: 2012-03-11 Bibliographically approved

By author/editor
Blomberg, Mats