A Model for Multimodal Dialogue System Output Applied to an Animated Talking Head
Beskow, Jonas (KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT; KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0003-1399-6604)
Edlund, Jens (KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT; KTH, School of Computer Science and Communication (CSC), Speech, Music and Hearing, TMH, Speech Communication and Technology. ORCID iD: 0000-0001-9327-9482)
Nordstrand, Magnus (KTH, School of Computer Science and Communication (CSC), Centres, Centre for Speech Technology, CTT)
2005 (English). In: Spoken Multimodal Human-Computer Dialogue in Mobile Environments / [ed] Minker, Wolfgang; Bühler, Dirk; Dybkjær, Laila. Dordrecht: Springer, 2005, p. 93-113. Chapter in book (Refereed)
Abstract [en]

We present a formalism for specifying verbal and non-verbal output from a multimodal dialogue system. The output specification is XML-based and provides information about communicative functions of the output, without detailing the realisation of these functions. The aim is to let dialogue systems generate the same output for a wide variety of output devices and modalities. The formalism was developed and implemented in the multimodal spoken dialogue system AdApt. We also describe how facial gestures in the 3D-animated talking head used within this system are controlled through the formalism.
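The chapter itself defines the specification format; purely as an illustration of the idea of an XML output specification organised around communicative functions rather than their realisations, a sketch might look like the following. All element names, function labels and gesture descriptions below are invented for this example and are not taken from the actual GESOM/AdApt specification.

    # Illustrative sketch only: the element and attribute names below are
    # invented and are NOT the actual GESOM/AdApt schema.
    import xml.etree.ElementTree as ET

    # A dialogue system could emit a device-independent specification like
    # this, stating WHAT communicative functions to convey, not HOW to
    # realise them.
    spec = """
    <output>
      <utterance>
        <function type="turn_take"/>
        <function type="emphasis">the apartment on Hornsgatan</function>
        <function type="end_of_utterance"/>
      </utterance>
    </output>
    """

    # Each output device (talking head, audio-only TTS, on-screen text, ...)
    # maps the abstract functions onto its own modality-specific realisations.
    TALKING_HEAD_REALISATIONS = {
        "turn_take": "raise eyebrows, open mouth slightly",
        "emphasis": "eyebrow rise and small head nod on the stressed words",
        "end_of_utterance": "look at the user, relax facial muscles",
    }

    def realise(xml_spec, realisations):
        """Walk the specification and show how one device realises each function."""
        root = ET.fromstring(xml_spec)
        for func in root.iter("function"):
            kind = func.get("type")
            scope = (func.text or "").strip()
            action = realisations.get(kind, "no realisation defined")
            print(f"{kind}: {action}" + (f"  [scope: '{scope}']" if scope else ""))

    realise(spec, TALKING_HEAD_REALISATIONS)

The point of the indirection is that the same specification could be sent unchanged to a talking head, an audio-only synthesiser or a text display, each applying its own realisation table.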

Place, publisher, year, edition, pages
Dordrecht: Springer, 2005, p. 93-113
Series
Text, Speech and Language Technology, ISSN 1386-291X; 28
Keywords [en]
GESOM, AdApt, Standards, XML, 3D-animation, Gesture, Turn-taking, Lip synchronisation
National Category
Computer and Information Sciences; General Language Studies and Linguistics
Identifiers
URN: urn:nbn:se:kth:diva-12751
DOI: 10.1007/1-4020-3075-4_6
ISI: 000270447900008
ISBN: 978-1-4020-3075-8 (print)
OAI: oai:DiVA.org:kth-12751
DiVA, id: diva2:318657
Note
QC 20100510. ISCA Tutorial and Research Workshop on Multi-Modal Dialogue in Mobile Environments, Kloster Irsee, Germany, 2002.
Available from: 2010-05-10 Created: 2010-05-10 Last updated: 2022-06-25 Bibliographically approved
In thesis
1. Talking Heads - Models and Applications for Multimodal Speech Synthesis
2003 (English). Doctoral thesis, comprehensive summary (Other scientific)
Abstract [en]

This thesis presents work in the area of computer-animated talking heads. A system for multimodal speech synthesis has been developed, capable of generating audiovisual speech animations from arbitrary text, using parametrically controlled 3D models of the face and head. A speech-specific direct parameterisation of the movement of the visible articulators (lips, tongue and jaw) is suggested, along with a flexible scheme for parameterising facial surface deformations based on well-defined articulatory targets.

To improve the realism and validity of facial and intra-oral speech movements, measurements from real speakers have been incorporated from several types of static and dynamic data sources. These include ultrasound measurements of tongue surface shape, dynamic optical motion tracking of face points in 3D, as well as electromagnetic articulography (EMA) providing dynamic tongue movement data in 2D. Ultrasound data are used to estimate target configurations for a complex tongue model for a number of sustained articulations. Simultaneous optical and electromagnetic measurements are performed and the data are used to resynthesise facial and intra-oral articulation in the model. A robust resynthesis procedure, capable of animating facial geometries that differ in shape from the measured subject, is described.

To drive articulation from symbolic (phonetic) input, for example in the context of a text-to-speech system, both rule-based and data-driven articulatory control models have been developed. The rule-based model effectively handles forward and backward coarticulation by target under-specification, while the data-driven model uses ANNs to estimate articulatory parameter trajectories, trained on trajectories resynthesised from optical measurements. The articulatory control models are evaluated and compared against other data-driven models trained on the same data. Experiments with ANNs for driving the articulation of a talking head directly from acoustic speech input are also reported.
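As an aside, the target under-specification idea named above can be pictured with a minimal numerical sketch: a phone that does not constrain a given articulatory parameter leaves its target open, and the open slot is filled from the nearest specified neighbours, so context spreads both forwards and backwards. The phones, parameter values and interpolation rule below are invented for illustration and are not the thesis's actual rule-based model.

    # Minimal sketch of coarticulation by target under-specification.
    # Per-phone target for a single articulatory parameter (e.g. lip
    # rounding, 0..1). None = under-specified: the phone does not care.
    targets = {"s": 0.0, "k": None, "u": 1.0, "l": None, "a": 0.1}

    def fill_underspecified(phones, targets):
        """Fill None targets by linear interpolation between the nearest
        specified targets, so context propagates forwards and backwards."""
        values = [targets[p] for p in phones]
        specified = [i for i, v in enumerate(values) if v is not None]
        for i, v in enumerate(values):
            if v is not None:
                continue
            left = max((j for j in specified if j < i), default=None)
            right = min((j for j in specified if j > i), default=None)
            if left is None:            # no left context: copy from the right
                values[i] = values[right]
            elif right is None:         # no right context: copy from the left
                values[i] = values[left]
            else:                       # interpolate between the neighbours
                w = (i - left) / (right - left)
                values[i] = (1 - w) * values[left] + w * values[right]
        return values

    print(fill_underspecified(["s", "k", "u", "l", "a"], targets))
    # -> [0.0, 0.5, 1.0, 0.55, 0.1]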

A flexible strategy for generation of non-verbal facial gestures is presented. It is based on a gesture library organised by communicative function, where each function has multiple alternative realisations. The gestures can be used to signal e.g. turn-taking, back-channelling and prominence when the talking head is employed as output channel in a spoken dialogue system. A device-independent XML-based formalism for non-verbal and verbal output in multimodal dialogue systems is proposed, and it is described how the output specification is interpreted in the context of a talking head and converted into facial animation using the gesture library.
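A gesture library organised by communicative function, with multiple alternative realisations per function, can be pictured roughly as a lookup table from which one variant is chosen at output time. The sketch below assumes invented function names and gesture parameters; it is not the library actually used in the thesis.

    import random

    # Sketch of a gesture library keyed by communicative function. Each
    # function maps to several alternative realisations so that repeated use
    # of the same function does not always produce the identical gesture.
    # All gesture names and parameters are invented placeholders.
    GESTURE_LIBRARY = {
        "turn_taking": [
            {"gesture": "eyebrow_raise", "duration_ms": 300},
            {"gesture": "head_tilt", "duration_ms": 400},
        ],
        "back_channel": [
            {"gesture": "small_nod", "duration_ms": 250},
            {"gesture": "smile", "duration_ms": 500},
        ],
        "prominence": [
            {"gesture": "eyebrow_raise", "duration_ms": 200},
            {"gesture": "head_nod", "duration_ms": 200},
        ],
    }

    def realise_function(function, rng):
        """Pick one of the alternative realisations for a communicative function."""
        return rng.choice(GESTURE_LIBRARY[function])

    rng = random.Random(0)  # seeded so this sketch prints the same output each run
    for f in ["turn_taking", "prominence", "back_channel"]:
        print(f, "->", realise_function(f, rng))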

Through a series of audiovisual perceptual experiments with noise-degraded audio, it is demonstrated that the animated talking head provides significantly increased intelligibility over the audio-only case, in some cases not significantly below that provided by a natural face.

Finally, several projects and applications are presented, where the described talking head technology has been successfully employed. Four different multimodal spoken dialogue systems are outlined, and the role of the talking heads in each of the systems is discussed. A telecommunication application where the talking head functions as an aid for hearing-impaired users is also described, as well as a speech training application where talking heads and language technology are used with the purpose of improving speech production in profoundly deaf children.

Place, publisher, year, edition, pages
Institutionen för talöverföring och musikakustik, 2003. p. viii, 63
Series
Trita-TMH ; 2003:7
Keywords
Talking heads, facial animation, speech synthesis, coarticulation, intelligibility, embodied conversational agents
Identifiers
URN: urn:nbn:se:kth:diva-3561
ISBN: 91-7283-536-2
Public defence
2003-06-11, 00:00 (English)
Note
QC 20100506. Available from: 2003-06-26 Created: 2003-06-26 Last updated: 2022-06-22 Bibliographically approved

Open Access in DiVA

No full text in DiVA
