Training speech synthesis parameters of allophones for speech recognition
1994 (English) Conference paper (Refereed)
A technique for training a speech recognition system at a production parametric level is described. The approach offers potential advantages in the form of small training corpora and fast speaker adaptation. Triphones that have not occurred in the training data can be generated by concatenation and parametric interpolation of diphones or context-free phones. The triphones are represented by a piece-wise linear approximation of the production parameters. For recognition, these are converted to subphone spectral state sequences. A 97.6% connected-digit recognition rate has been achieved when training the system on one male speaker and performing recognition on six other male speakers. In preliminary experiments with generation of unseen triphones, the performance is still slightly lower than with seen diphones and context-free phones. Experiments with fast speaker adaptation are also in progress. Resynthesis of speech by concatenating triphones has been used to verify the quality of the triphone library.
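The piece-wise linear representation and the concatenation step mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `sample_track` and `concatenate_tracks`, the (time, value) breakpoint format, the averaging rule at the joint, and the example values are all assumptions made here for illustration.

```python
from bisect import bisect_right


def sample_track(breakpoints, t):
    """Evaluate a piecewise-linear parameter track at time t.

    `breakpoints` is a list of (time, value) pairs with strictly
    increasing times (an assumed format, not the paper's).
    """
    times = [p[0] for p in breakpoints]
    if t <= times[0]:
        return breakpoints[0][1]
    if t >= times[-1]:
        return breakpoints[-1][1]
    i = bisect_right(times, t)  # first breakpoint strictly after t
    (t0, v0), (t1, v1) = breakpoints[i - 1], breakpoints[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0)


def concatenate_tracks(left, right):
    """Join two tracks into one continuous track.

    `right` is shifted in time to start where `left` ends, and the two
    boundary values are replaced by their mean (an assumed smoothing
    rule; the paper does not specify one in the abstract).
    """
    offset = left[-1][0] - right[0][0]
    joint_value = 0.5 * (left[-1][1] + right[0][1])
    shifted = [(t + offset, v) for t, v in right[1:]]
    return left[:-1] + [(left[-1][0], joint_value)] + shifted


# Hypothetical values for one production parameter (e.g. a formant in Hz).
left_unit = [(0.0, 1200.0), (0.05, 1800.0)]
right_unit = [(0.0, 2000.0), (0.04, 1500.0)]
joined = concatenate_tracks(left_unit, right_unit)
midpoint_value = sample_track(joined, 0.025)
```

In this sketch, each production parameter would have its own track, and an unseen unit is built by joining the tracks of shorter stored units before sampling them at the frame rate needed for recognition.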
Place, publisher, year, edition, pages
1994. 18-21 p.
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-91238 OAI: oai:DiVA.org:kth-91238 DiVA: diva2:508935
Working papers from the 8th Swedish Phonetics Conference
NR 20140805 2012-03-11 2012-03-11 Bibliographically approved