A common phone model representation for speech recognition and synthesis
1994 (English). Conference paper (Refereed)
A combined representation of context-dependent phones at the production-parametric and spectral levels is described. The phones are trained in the production domain using analysis-by-synthesis and piecewise linear approximation of parameter trajectories. For recognition, this representation is transformed to spectral subphones using a cascade formant synthesis procedure. In a connected-digit recognition task, an average correct digit rate of 99.1% was achieved for a group of seven male speakers when, for each test speaker, training was done on the other six speakers. Simple rules for male-to-female transformation of the male phone library increased the performance for six female speakers from 88.9% without transformation to 96.3%. In informal listening tests of resynthesised digit strings, the speech was judged to be intelligible but far from natural.
Place, publisher, year, edition, pages
1994. 1875-1878 p.
Computer and Information Science
Identifiers
URN: urn:nbn:se:kth:diva-91236
OAI: oai:DiVA.org:kth-91236
DiVA: diva2:508933
Third International Conference on Spoken Language Processing (ICSLP 94)
NR 20140805. 2012-03-11. 2012-03-11. Bibliographically approved