In this paper we present two different methods for mapping auditory, telephone-quality speech to visual parameter trajectories, specifying the movements of an animated synthetic face. In the first method, Hidden Markov Models (HMMs) were used to obtain phoneme strings and time labels. These were then transformed by rules into parameter trajectories for visual speech synthesis. In the second method, Artificial Neural Networks (ANNs) were trained to map acoustic parameters directly to synthesis parameters. Speaker-independent HMMs were trained on a phonetically transcribed telephone speech database. Different underlying units of speech were modelled by the HMMs, such as monophones, diphones, triphones, and visemes. The ANNs were trained on male, female, and mixed speakers.

The HMM method and the ANN method were evaluated through audio-visual intelligibility tests with ten hearing-impaired persons, and compared to “ideal” articulations (where no recognition was involved), a natural face, and the intelligibility of the audio alone. It was found that the HMM method performs considerably better than the audio-alone condition (54% and 34% keywords correct, respectively), but not as well as the “ideal” articulating artificial face (64%). The intelligibility for the ANN method was 34% keywords correct.
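To make the rule-based stage of the HMM method concrete, the following is a minimal sketch, not the paper's actual rule system: it assumes a recogniser has emitted time-labelled phoneme segments, maps each phoneme through a hypothetical viseme table to target values for two illustrative visual parameters (jaw opening and lip rounding), and linearly interpolates between targets to produce a frame-rate trajectory. The phoneme set, target values, and parameter names are all invented for illustration.

```python
# Hypothetical sketch: phoneme string + time labels -> visual parameter
# trajectory. The viseme table and parameter values are illustrative
# assumptions, not the rules used in the paper.

# (phoneme, start_s, end_s) as an HMM recogniser might emit them
segments = [("s", 0.00, 0.10), ("i", 0.10, 0.25), ("p", 0.25, 0.35)]

# Made-up viseme targets: (jaw_opening, lip_rounding), each in [0, 1]
VISEME_TARGETS = {
    "s": (0.15, 0.10),
    "i": (0.35, 0.05),
    "p": (0.00, 0.30),
}

FRAME_RATE = 25.0  # video frames per second


def trajectory(segments, frame_rate=FRAME_RATE):
    """Piecewise-linear trajectory through targets placed at the
    midpoint of each phoneme segment; clamped at the ends."""
    # Key frames: (time, jaw, lips) at each segment midpoint
    keys = [((a + b) / 2, *VISEME_TARGETS[ph]) for ph, a, b in segments]
    t_end = segments[-1][2]
    frames = []
    for i in range(int(t_end * frame_rate) + 1):
        t = i / frame_rate
        if t <= keys[0][0]:          # before the first key frame
            frames.append(keys[0][1:])
        elif t >= keys[-1][0]:       # after the last key frame
            frames.append(keys[-1][1:])
        else:
            # find the surrounding key frames and interpolate
            for (t0, j0, l0), (t1, j1, l1) in zip(keys, keys[1:]):
                if t0 <= t <= t1:
                    w = (t - t0) / (t1 - t0)
                    frames.append((j0 + w * (j1 - j0),
                                   l0 + w * (l1 - l0)))
                    break
    return frames


frames = trajectory(segments)
```

The ANN method would replace this whole pipeline with a single learned mapping from acoustic feature frames to the same synthesis parameters, with no intermediate phoneme decision.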
KTH Royal Institute of Technology, 1999. Vol. 40, no. 1-2, pp. 45-50.