Engaging embodied conversational agents need to generate expressive behavior in order to be believable in socializing interactions. We present a system that can generate spontaneous speech with supporting lip movements. The neural conversational TTS voice is trained on a multi-style speech corpus that has been prosodically tagged (pitch and speaking rate) and transcribed (including tokens for breathing, fillers and laughter). We introduce a speech animation algorithm in which articulatory effort can be adjusted. The facial animation is driven by time-stamped phonemes and prominence estimates from the synthesised speech waveform, which modulate the lip and jaw movements accordingly. In objective evaluations we show that the system is able to generate speech and facial animation that vary in articulatory effort. In subjective evaluations we compare our conversational TTS system's capability to deliver jokes with that of a commercial TTS system. Both systems performed equally well.
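To make the described coupling concrete, the sketch below shows one plausible way that time-stamped phonemes and prominence estimates could drive jaw opening with an adjustable articulatory-effort factor. This is a minimal illustration of the idea stated in the abstract, not the paper's implementation: the Phoneme fields, the BASE_JAW_OPENING table, the jaw_opening function, and all parameter values are assumptions made for demonstration.

```python
from dataclasses import dataclass

@dataclass
class Phoneme:
    symbol: str        # phoneme label, e.g. "AA" (illustrative inventory)
    start: float       # onset time in seconds from the synthesised waveform
    end: float         # offset time in seconds
    prominence: float  # estimated prominence in [0, 1] (assumed scale)

# Hypothetical base jaw-opening targets per phoneme; values are placeholders,
# not taken from the paper.
BASE_JAW_OPENING = {"AA": 0.8, "IY": 0.3, "M": 0.0, "S": 0.1}

def jaw_opening(phonemes, effort=1.0, prominence_gain=0.5):
    """Scale per-phoneme jaw-opening targets by a global articulatory-effort
    factor and by local prominence, mirroring the modulation the abstract
    describes qualitatively.

    effort: global articulatory-effort multiplier (1.0 = neutral).
    prominence_gain: how strongly prominence boosts articulation.
    """
    targets = []
    for p in phonemes:
        base = BASE_JAW_OPENING.get(p.symbol, 0.4)
        # More prominent stretches get larger, hyper-articulated movements.
        scaled = base * effort * (1.0 + prominence_gain * p.prominence)
        targets.append((p.start, p.end, min(scaled, 1.0)))  # clamp to [0, 1]
    return targets

if __name__ == "__main__":
    utterance = [Phoneme("M", 0.00, 0.08, 0.2),
                 Phoneme("AA", 0.08, 0.25, 0.9),
                 Phoneme("S", 0.25, 0.33, 0.1)]
    for start, end, opening in jaw_opening(utterance, effort=1.2):
        print(f"{start:.2f}-{end:.2f}s  jaw opening {opening:.2f}")
```

Lowering the effort parameter would produce the hypo-articulated end of the range, raising it the hyper-articulated end; the per-phoneme targets would then feed whatever animation rig blends the lip and jaw poses.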